ADR: Model Download Strategy
Status: Draft - Investigation Complete, Solution Pending
Date: 2025-10-13
Context: Investigation into why setting the COMPONENTS=txt2img-flux,image-merge environment variable does not trigger model downloads during machine startup.
Problem Statement
Models are not being downloaded at machine startup even when the COMPONENTS environment variable is properly set. This breaks the component-based machine configuration system that was designed to automatically install required models and custom nodes.
Investigation Findings
System Architecture
The model download system relies on a chain of components:
ComponentManagerService (component-manager.js)
- Reads COMPONENTS environment variable
- Fetches component requirements from API
- Downloads models using wget
- Installs custom nodes
ComfyUIManagementClient (comfyui-management-client.js)
- Instantiates ComponentManagerService (line 67)
- Calls component-manager during its install() method
EnhancedPM2EcosystemGenerator (enhanced-pm2-ecosystem-generator.js)
- Configured to load ComfyUIManagementClient as installer for ComfyUI service
- Has startDaemonService() method that calls installer.install() (line 246)
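The responsibilities listed for ComponentManagerService can be sketched roughly as follows. This is an illustrative sketch, not the contents of component-manager.js: the helper names parseComponents and planDownloads, the injected fetchRequirements function, and the requirement shape { models: [{ url, dest }] } are all assumptions made for the example.

```javascript
// Hypothetical sketch of ComponentManagerService's first steps.
// None of these helper names come from component-manager.js.

// Split a COMPONENTS value such as "txt2img-flux,image-merge" into names.
function parseComponents(raw) {
  return (raw || "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Turn component requirements into wget argument lists (nothing is executed).
// `fetchRequirements` stands in for the API call; its return shape is assumed.
async function planDownloads(raw, fetchRequirements) {
  const commands = [];
  for (const component of parseComponents(raw)) {
    const requirements = await fetchRequirements(component);
    for (const model of requirements.models) {
      commands.push(["wget", "-O", model.dest, model.url]);
    }
  }
  return commands;
}
```

Keeping the planning step separate from execution makes it possible to log the intended downloads at startup, which would have made the missing integration visible sooner.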
Root Cause
Lines 73-75 in enhanced-pm2-ecosystem-generator.js:
// Skip daemon services - now handled at system level in entrypoint
this.logger.log('⭐⭐⭐ [ECOSYSTEM-TRACE] Skipping daemon services (handled at system level)...');
this.logger.log('⭐⭐⭐ [ECOSYSTEM-TRACE] System-level services (Ollama, etc.) should already be running');
The daemon service installation (which includes ComfyUI and its installer) is skipped with a comment stating it's "handled at system level in entrypoint."
However, investigation of the entrypoint script (entrypoint-machine-final.sh) shows:
- ❌ No calls to component-manager
- ❌ No model download logic for ComfyUI
- ❌ No COMPONENTS env var checking
- ✅ Only Ollama model downloads via OLLAMA_DEFAULT_MODELS
Historical Context
Git history investigation:
Commit ca18e363 (Aug 18, 2025): "cleanup: remove legacy entrypoint hooks and ComfyUI installers"
- Removed legacy hook system
- Comment stated functionality was "replaced by new implementations"
Current state:
- ComponentManagerService code exists and is maintained
- Documentation (COMPONENT_CONFIGURATION.md) describes the feature as working
- Service mapping configuration is correct
- Integration point is broken
Configuration Status
Service Mapping (service-mapping.json lines 239-240):
"installer": "ComfyUIManagementClient",
"installer_filename": "./services/comfyui-management-client.js"
✅ Correctly configured
ComfyUIManagementClient:
this.componentManager = new ComponentManagerService({}, config);
✅ Correctly instantiated
Startup Flow (index-pm2.js):
1. Initialize telemetry
2. Generate PM2 ecosystem config ← Skips daemon installers here
3. Start PM2 services
4. Start health server
❌ No component-manager execution
Current Model Download Systems
Working: Ollama Models
Trigger: OLLAMA_DEFAULT_MODELS environment variable
Location: entrypoint-machine-final.sh
Method: Direct ollama pull commands
OLLAMA_DEFAULT_MODELS="llama2,codellama"
# Executes: ollama pull llama2 && ollama pull codellama
Broken: ComfyUI Models via Components
Trigger: COMPONENTS environment variable
Expected Flow:
- PM2 generator detects ComfyUI service needed
- Loads ComfyUIManagementClient installer
- Installer calls install() which triggers ComponentManagerService
- ComponentManager downloads models with wget
Actual Flow:
- PM2 generator skips daemon services ❌
- Installer never loaded ❌
- ComponentManager never runs ❌
- Models never download ❌
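The difference between the expected and actual flows can be reproduced with a small model of the generator. This is a sketch under stated assumptions, not the real EnhancedPM2EcosystemGenerator: the generateEcosystem function, the skipDaemons flag, and the service object shape (name, type, installer) are hypothetical.

```javascript
// Hypothetical model of the generator's daemon handling; not the real class.
function generateEcosystem(services, { skipDaemons }) {
  const trace = [];
  for (const service of services) {
    if (service.type === "daemon" && skipDaemons) {
      trace.push(`skip ${service.name}`);
      continue; // the installer is never loaded, so install() never runs
    }
    if (service.installer) {
      service.installer.install();
      trace.push(`install ${service.name}`);
    }
    trace.push(`configure ${service.name}`);
  }
  return trace;
}

// Stub installer that records whether the component manager was reached.
let componentManagerRan = false;
const comfyui = {
  name: "comfyui",
  type: "daemon",
  installer: { install: () => { componentManagerRan = true; } },
};

// Actual flow: daemon services are skipped.
const actual = generateEcosystem([comfyui], { skipDaemons: true });
const ranWhenSkipped = componentManagerRan;

// Expected flow: the installer runs as part of generation.
const expected = generateEcosystem([comfyui], { skipDaemons: false });
const ranWhenEnabled = componentManagerRan;
```

In this model the skipped run never touches the installer, which mirrors the finding that ComponentManagerService is intact but unreachable.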
Solution Options
Option 1: Re-enable Daemon Installation in PM2 Generator
Approach: Remove the skip logic at lines 73-75 in enhanced-pm2-ecosystem-generator.js
Pros:
- Uses existing architecture
- ComfyUIManagementClient already integrates ComponentManager
- Minimal code changes
Cons:
- Slows down PM2 ecosystem generation
- Mixes service installation with configuration generation
- Potential timing issues with service startup
Option 2: Add Component-Manager to Entrypoint Script
Approach: Add direct component-manager execution in entrypoint-machine-final.sh before PM2 starts
Pros:
- Clean separation: install dependencies before starting services
- Follows same pattern as Ollama model downloads
- Predictable execution order
Cons:
- Requires invoking Node.js from bash script
- Needs careful error handling
- Duplicates some logic that exists in ComfyUIManagementClient
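One possible shape for Option 2 is a small Node.js runner that the entrypoint invokes before PM2 starts. The sketch below is illustrative only: the file name run-component-manager.js, the runPreStart function, the installAll method, and the exit codes are assumptions, and the manager is injected so the example does not guess at the real ComponentManagerService interface.

```javascript
// Hypothetical pre-start runner for entrypoint-machine-final.sh to invoke,
// e.g. `node run-component-manager.js` (file name assumed for illustration).
// `createManager` is injected so the sketch does not guess at the real
// ComponentManagerService constructor or method names.
async function runPreStart(env, createManager, log = console) {
  const raw = (env.COMPONENTS || "").trim();
  if (raw === "") {
    log.info("COMPONENTS not set; skipping component installation");
    return 0;
  }
  try {
    const manager = createManager(env);
    const names = raw.split(",").map((s) => s.trim()).filter(Boolean);
    await manager.installAll(names); // `installAll` is a placeholder name
    log.info(`Installed components: ${names.join(", ")}`);
    return 0;
  } catch (err) {
    // Fail loudly so the entrypoint can decide whether to abort startup.
    log.error(`Component installation failed: ${err.message}`);
    return 1;
  }
}
```

A script built on this would finish with something like runPreStart(process.env, factory).then((code) => process.exit(code)), letting the bash entrypoint check the exit status the same way it would for any other command.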
Option 3: Dedicated Pre-Start PM2 App
Approach: Create a PM2 app that runs component-manager once before other services
Pros:
- Clean separation of concerns
- PM2 handles process management
- Can use existing ComponentManagerService code
Cons:
- Adds complexity to PM2 ecosystem
- One-shot apps are awkward in PM2
- Still need to ensure it runs before ComfyUI
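For Option 3, a one-shot app could be declared in the PM2 ecosystem with autorestart disabled. The fragment below is a sketch: the app names and script paths are hypothetical, and it does not solve ordering, since listing an app first does not by itself make PM2 wait for it to finish before starting the next one.

```javascript
// Hypothetical ecosystem fragment; app names and script paths are assumed.
const ecosystem = {
  apps: [
    {
      name: "component-install",
      script: "./run-component-manager.js", // hypothetical one-shot script
      autorestart: false, // PM2 option: do not restart after the script exits
      env: { COMPONENTS: process.env.COMPONENTS || "" },
    },
    {
      name: "comfyui",
      script: "./start-comfyui.js", // placeholder for the real ComfyUI service
    },
  ],
};

// Names of apps PM2 would leave stopped after they exit.
const oneShotApps = ecosystem.apps
  .filter((app) => app.autorestart === false)
  .map((app) => app.name);
```

The ordering gap would still need an explicit mechanism, for example having the ComfyUI service wait on a marker that the install step writes when it completes.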
North Star Alignment
From CLAUDE.md:
Current: Models downloaded at startup based on component configuration
North Star: Predictive Model Placement with pool-specific baked containers
This investigation reveals the current system is broken, but the architecture is sound:
- Component-based configuration ✅
- API-driven model requirements ✅
- Automated installation ✅
- Integration point missing ❌
Short-term fix (this ADR): Restore working model downloads at startup
Long-term evolution (North Star):
- Bake common models into container images per pool type
- Eliminate first-user wait times
- Reduce runtime model downloads to rare edge cases
Impact Analysis
Current Production Impact:
- Machines start without required models
- Jobs fail with "model not found" errors
- Manual model installation required
- Component-based configuration is non-functional
Documentation vs Reality Gap:
- COMPONENT_CONFIGURATION.md describes working feature
- Code exists but is not executed
- Users expect COMPONENTS env var to work
Decision
TO BE COMPLETED
Will evaluate the three solution options and select the best approach for:
- Immediate restoration of functionality
- Alignment with North Star architecture
- Minimal technical debt
- Clear error handling and observability
Implementation Plan
TO BE COMPLETED
Will include:
- Chosen solution approach
- Code changes required
- Testing strategy
- Rollback plan
- Documentation updates
References
- CLAUDE.md - North Star Architecture
- COMPONENT_CONFIGURATION.md - Component System Docs
- component-manager.js:70-120 - Main installation logic
- comfyui-management-client.js:67 - ComponentManager integration
- enhanced-pm2-ecosystem-generator.js:73-75 - Skip logic that breaks the chain
- entrypoint-machine-final.sh - Machine startup
- Commit ca18e363: "cleanup: remove legacy entrypoint hooks and ComfyUI installers"
Notes
- Investigation completed: 2025-10-12
- User confirmed system was working "within the last 2 months"
- Git history shows cleanup commit removed hooks but claimed functionality was preserved
- Functionality was NOT preserved - integration point was lost
