ADR: Model Download Strategy

Status: Draft - Investigation Complete, Solution Pending

Date: 2025-10-13

Context: Investigation into why setting the environment variable COMPONENTS=txt2img-flux,image-merge does not trigger model downloads during machine startup.


Problem Statement

Models are not being downloaded at machine startup even when the COMPONENTS environment variable is properly set. This breaks the component-based machine configuration system that was designed to automatically install required models and custom nodes.


Investigation Findings

System Architecture

The model download system relies on a chain of components:

  1. ComponentManagerService (component-manager.js)

    • Reads COMPONENTS environment variable
    • Fetches component requirements from API
    • Downloads models using wget
    • Installs custom nodes
  2. ComfyUIManagementClient (comfyui-management-client.js)

    • Instantiates ComponentManagerService (line 67)
    • Calls component-manager during its install() method
  3. EnhancedPM2EcosystemGenerator (enhanced-pm2-ecosystem-generator.js)

    • Configured to load ComfyUIManagementClient as installer for ComfyUI service
    • Has startDaemonService() method that calls installer.install() (line 246)
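
As an illustration of the first step in this chain, the sketch below shows one way a COMPONENTS value could be split into component names. `parseComponents` is a hypothetical helper, not code taken from component-manager.js, and the real service may normalize the value differently.

```javascript
// Hypothetical helper (not from component-manager.js): turn a COMPONENTS
// value such as "txt2img-flux,image-merge" into a list of component names.
function parseComponents(raw) {
  if (typeof raw !== 'string') return [];
  return raw
    .split(',')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// An unset or empty variable yields an empty list, so the caller can log
// that nothing was requested instead of silently doing nothing.
const requested = parseComponents(process.env.COMPONENTS);
console.log(
  `Requested components: ${requested.length ? requested.join(', ') : '(none)'}`
);
```

Returning an empty list for an unset variable keeps the "nothing requested" case distinguishable in logs from the "requested but never installed" case this ADR investigates.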

Root Cause

Lines 73-75 in enhanced-pm2-ecosystem-generator.js:

javascript
// Skip daemon services - now handled at system level in entrypoint
this.logger.log('⭐⭐⭐ [ECOSYSTEM-TRACE] Skipping daemon services (handled at system level)...');
this.logger.log('⭐⭐⭐ [ECOSYSTEM-TRACE] System-level services (Ollama, etc.) should already be running');

The daemon service installation (which includes ComfyUI and its installer) is skipped with a comment stating it's "handled at system level in entrypoint."

However, investigation of the entrypoint script (entrypoint-machine-final.sh) shows:

  • ❌ No calls to component-manager
  • ❌ No model download logic for ComfyUI
  • ❌ No COMPONENTS env var checking
  • ✅ Only Ollama model downloads via OLLAMA_DEFAULT_MODELS

Historical Context

Git history investigation:

  • Commit ca18e363 (Aug 18, 2025): "cleanup: remove legacy entrypoint hooks and ComfyUI installers"

    • Removed legacy hook system
    • Comment stated functionality was "replaced by new implementations"
  • Current state:

    • ComponentManagerService code exists and is maintained
    • Documentation (COMPONENT_CONFIGURATION.md) describes the feature as working
    • Service mapping configuration is correct
    • Integration point is broken

Configuration Status

Service Mapping (service-mapping.json lines 239-240):

json
"installer": "ComfyUIManagementClient",
"installer_filename": "./services/comfyui-management-client.js"

✅ Correctly configured

ComfyUIManagementClient:

javascript
this.componentManager = new ComponentManagerService({}, config);

✅ Correctly instantiated

Startup Flow (index-pm2.js):

  1. Initialize telemetry
  2. Generate PM2 ecosystem config  ← Skips daemon installers here
  3. Start PM2 services
  4. Start health server

❌ No component-manager execution


Current Model Download Systems

Working: Ollama Models

Trigger: OLLAMA_DEFAULT_MODELS environment variable
Location: entrypoint-machine-final.sh
Method: Direct ollama pull commands

bash
OLLAMA_DEFAULT_MODELS="llama2,codellama"
# Executes: ollama pull llama2 && ollama pull codellama

Broken: ComfyUI Models via Components

Trigger: COMPONENTS environment variable

Expected Flow:

  1. PM2 generator detects ComfyUI service needed
  2. Loads ComfyUIManagementClient installer
  3. Installer calls install() which triggers ComponentManagerService
  4. ComponentManager downloads models with wget

Actual Flow:

  1. PM2 generator skips daemon services ❌
  2. Installer never loaded ❌
  3. ComponentManager never runs ❌
  4. Models never download ❌
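
The difference between the two flows can be reproduced with stand-in classes. The sketch below is a simplified model, not the real code: the stub classes, the `installComponents` method name, and the `skipDaemonServices` flag are all assumptions made for illustration. The flag stands in for the skip at lines 73-75 of enhanced-pm2-ecosystem-generator.js.

```javascript
// Stand-ins for the real classes. Names echo the ADR; bodies are assumptions.
class StubComponentManager {
  constructor() {
    this.ran = false;
  }
  async installComponents() {
    this.ran = true; // the real service would download models here
  }
}

class StubManagementClient {
  constructor(componentManager) {
    this.componentManager = componentManager;
  }
  async install() {
    await this.componentManager.installComponents();
  }
}

// skipDaemonServices models the skip in the PM2 ecosystem generator.
async function generateEcosystem(installer, { skipDaemonServices }) {
  if (skipDaemonServices) return { installerCalled: false };
  await installer.install();
  return { installerCalled: true };
}

async function demo() {
  const actual = new StubComponentManager();
  await generateEcosystem(new StubManagementClient(actual), { skipDaemonServices: true });

  const expected = new StubComponentManager();
  await generateEcosystem(new StubManagementClient(expected), { skipDaemonServices: false });

  // actualFlowRan stays false: with the skip in place nothing reaches the manager.
  return { actualFlowRan: actual.ran, expectedFlowRan: expected.ran };
}

demo().then((result) => console.log(result));
```

With the skip enabled, nothing downstream of the generator executes, which matches the observation that a correctly configured installer has no effect.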

Solution Options

Option 1: Re-enable Daemon Installation in PM2 Generator

Approach: Remove the skip logic at lines 73-75 in enhanced-pm2-ecosystem-generator.js

Pros:

  • Uses existing architecture
  • ComfyUIManagementClient already integrates ComponentManager
  • Minimal code changes

Cons:

  • Slows down PM2 ecosystem generation
  • Mixes service installation with configuration generation
  • Potential timing issues with service startup
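
A minimal sketch of what Option 1 could look like, assuming the generator can iterate over daemon services and load an installer for each. `installDaemonServices`, `loadInstaller`, the service object shape, and the 30-minute default timeout are hypothetical and do not reflect the generator's real API. The timeout guard is one possible answer to the timing concern listed above.

```javascript
// Hypothetical sketch of Option 1. Rejects if `promise` does not settle
// within `ms` milliseconds; the timer is always cleared afterwards.
async function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Runs each service's installer in sequence and records the outcome, so a
// failed install is reported rather than aborting the remaining services.
async function installDaemonServices(services, loadInstaller, timeoutMs = 30 * 60 * 1000) {
  const results = [];
  for (const service of services) {
    if (!service.installer) continue; // no installer configured for this service
    const installer = loadInstaller(service);
    try {
      await withTimeout(installer.install(), timeoutMs, `${service.name} install`);
      results.push({ name: service.name, ok: true });
    } catch (error) {
      results.push({ name: service.name, ok: false, error: error.message });
    }
  }
  return results;
}
```

Collecting per-service results instead of throwing lets the caller decide whether a failed model download should block PM2 startup or only be logged.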

Option 2: Add Component-Manager to Entrypoint Script

Approach: Add direct component-manager execution in entrypoint-machine-final.sh before PM2 starts

Pros:

  • Clean separation: install dependencies before starting services
  • Follows same pattern as Ollama model downloads
  • Predictable execution order

Cons:

  • Requires invoking Node.js from bash script
  • Needs careful error handling
  • Duplicates some logic that exists in ComfyUIManagementClient
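
For Option 2, the Node.js side could be a one-shot script that the entrypoint runs before PM2 and whose exit code decides whether startup continues. The sketch below is an assumption-laden outline: `runComponentInstall`, the exit code constants, and the `manager.installComponents(names)` interface are hypothetical and are not the real ComponentManagerService API.

```javascript
// Hypothetical one-shot installer for Option 2.
const EXIT_OK = 0;
const EXIT_INSTALL_FAILED = 1;

// Returns an exit code instead of calling process.exit directly, which keeps
// the function testable. `manager` is any object exposing an assumed
// installComponents(names) method.
async function runComponentInstall(env, manager, log = console) {
  const names = (env.COMPONENTS || '')
    .split(',')
    .map((name) => name.trim())
    .filter(Boolean);

  if (names.length === 0) {
    log.info('COMPONENTS is empty; skipping component installation');
    return EXIT_OK;
  }

  try {
    log.info(`Installing components: ${names.join(', ')}`);
    await manager.installComponents(names);
    return EXIT_OK;
  } catch (error) {
    log.error(`Component installation failed: ${error.message}`);
    return EXIT_INSTALL_FAILED;
  }
}
```

A wrapper script would finish with something like `runComponentInstall(process.env, manager).then((code) => process.exit(code))`, and the entrypoint could then stop or continue based on that exit status.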

Option 3: Dedicated Pre-Start PM2 App

Approach: Create a PM2 app that runs component-manager once before other services

Pros:

  • Clean separation of concerns
  • PM2 handles process management
  • Can use existing ComponentManagerService code

Cons:

  • Adds complexity to PM2 ecosystem
  • One-shot apps are awkward in PM2
  • Still need to ensure it runs before ComfyUI
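
A one-shot PM2 app for Option 3 might be declared roughly as follows. The app name and script path are placeholders invented for this sketch. `autorestart` is the PM2 app option that controls whether a process is relaunched after it exits.

```javascript
// Hypothetical PM2 ecosystem fragment for Option 3.
const ecosystem = {
  apps: [
    {
      name: 'component-installer',       // placeholder app name
      script: './install-components.js', // placeholder script path
      autorestart: false,                // run once; do not relaunch on exit
      env: { COMPONENTS: process.env.COMPONENTS || '' },
    },
  ],
};

// In a real ecosystem file this object would be exported instead:
// module.exports = ecosystem;
console.log(JSON.stringify(ecosystem, null, 2));
```

This fragment does not address ordering. Something would still have to wait for the installer to exit before ComfyUI starts, which is the third con listed above.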

North Star Alignment

From CLAUDE.md:

Current: Models downloaded at startup based on component configuration
North Star: Predictive Model Placement with pool-specific baked containers

This investigation reveals the current system is broken, but the architecture is sound:

  • Component-based configuration ✅
  • API-driven model requirements ✅
  • Automated installation ✅
  • Integration point missing ❌

Short-term fix (this ADR): Restore working model downloads at startup

Long-term evolution (North Star):

  • Bake common models into container images per pool type
  • Eliminate first-user wait times
  • Reduce runtime model downloads to rare edge cases

Impact Analysis

Current Production Impact:

  • Machines start without required models
  • Jobs fail with "model not found" errors
  • Manual model installation required
  • Component-based configuration is non-functional

Documentation vs Reality Gap:


Decision

TO BE COMPLETED

Will evaluate the three solution options and select the best approach for:

  1. Immediate restoration of functionality
  2. Alignment with North Star architecture
  3. Minimal technical debt
  4. Clear error handling and observability

Implementation Plan

TO BE COMPLETED

Will include:

  1. Chosen solution approach
  2. Code changes required
  3. Testing strategy
  4. Rollback plan
  5. Documentation updates

References


Notes

  • Investigation completed: 2025-10-12
  • User confirmed system was working "within the last 2 months"
  • Git history shows cleanup commit removed hooks but claimed functionality was preserved
  • Functionality was NOT preserved - integration point was lost
