EMP-Job-Queue Architecture Analysis & Docker Swarm Evaluation
Date: 2025-10-08 Purpose: Comprehensive analysis of current architecture, testing complexity, and Docker Swarm migration evaluation
Executive Summary
The emp-job-queue system is a sophisticated distributed AI workload broker designed for elastic scaling across ephemeral machines. The current architecture successfully solves complex problems around distributed job routing, model management, and multi-service orchestration.
Key Findings:
- Environment management is well-architected - Component-based system provides type-safe, flexible configuration
- Machine/worker architecture is production-ready - PM2-based orchestration works well for current needs
- Testing complexity stems from distribution, not tooling - Docker Swarm won't fundamentally reduce complexity
- Current pain points are solvable without migration - Quick wins available through better documentation and tooling
Recommendation: DEFER Docker Swarm migration. The migration would require 40-60 hours of engineering effort for marginal testing improvements. Better ROI from improving documentation, test tooling, and development workflows with current architecture.
Current Architecture Assessment
1. Environment Management System
Status: ✅ Well-designed, production-ready
Strengths:
- Type-safe service interfaces enforce required variables at build time
- Flexible component-based composition supports multiple deployment targets
- Secure automatic separation of public/secret variables
- DRY (Don't Repeat Yourself) - variables defined once, referenced everywhere
- Multi-environment - single codebase supports local/staging/production seamlessly
Architecture Pattern:
Component Files (.env) → Service Interfaces (.ts) → Generated .env Files
        ↓                        ↓                        ↓
Define capabilities      Define requirements      Per-service configs
Pain Points:
- Learning curve - New developers need to understand 3-layer system (components → interfaces → profiles)
- Debugging opacity - Variable resolution happens at build time, errors can be cryptic
- Documentation gap - System is powerful but under-documented (now resolved with new docs)
Recommendation:
- ✅ Keep current system - well-architected for monorepo microservices
- Add troubleshooting guide with common error patterns
- Create visual debugging tool to show variable resolution flow
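To make the three-layer pattern concrete, here is a minimal sketch of how a type-safe service interface can enforce required variables at build time. The names (`ServiceEnv`, `buildEnv`, `apiService`) are illustrative assumptions, not the project's actual API:

```typescript
// Hypothetical sketch of a type-safe service interface; names are illustrative.
interface ServiceEnv {
  /** Variables the service requires; the build fails fast if any is unresolved. */
  required: string[];
  /** Variables that may be omitted, with defaults applied at build time. */
  optional?: Record<string, string>;
  /** Names routed to the secret store, never to public .env output. */
  secrets?: string[];
}

const apiService: ServiceEnv = {
  required: ['REDIS_URL', 'API_PORT'],
  optional: { LOG_LEVEL: 'info' },
  secrets: ['REDIS_PASSWORD'],
};

// Build step: resolve variables from component files, failing fast on gaps.
function buildEnv(
  svc: ServiceEnv,
  components: Record<string, string>,
): Record<string, string> {
  const out: Record<string, string> = { ...svc.optional };
  for (const name of svc.required) {
    const value = components[name];
    if (value === undefined) {
      throw new Error(`Missing required variable: ${name}`);
    }
    out[name] = value;
  }
  return out;
}
```

The value of the pattern is the failure mode: a missing variable surfaces as a single clear build-time error instead of a cryptic runtime failure inside a container.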
2. Machine/Worker Architecture
Status: ✅ Production-capable, evolving toward north star
Current Design:
Docker Container (Machine)
├── PM2 Process Manager
│ ├── Service Manager (Node.js main process)
│ ├── ComfyUI Instances (per-GPU)
│ │ ├── comfyui-gpu0 (port 8188)
│ │ ├── comfyui-gpu1 (port 8189)
│ │ └── comfyui-gpuN...
│ ├── Ollama Daemon (optional)
│ └── Redis Workers (Node.js)
│ ├── Worker processes (connect to Redis)
│ └── Connector system (OpenAI, ComfyUI, Ollama, etc.)
├── Telemetry Stack
│ ├── Fluent Bit (log aggregation)
│ ├── OTEL Collector (traces/metrics)
│ └── Nginx (optional monitoring)
└── Entrypoint Scripts (startup orchestration)
Lifecycle Flow:
- Container Start → Entrypoint script (entrypoint-machine-final.sh)
- Environment Setup → Decrypt environment, set variables
- Worker Bundle → Download or copy bundled worker code
- System Services → Install/start Ollama if needed
- PM2 Ecosystem Generation → Generate PM2 config from WORKERS env var
- Service Manager Start → Node.js main process (index-pm2.js)
- Sequential Startup → ComfyUI → Custom Nodes → Models → Workers
- Machine Registration → Redis registration with capabilities
- Worker Polling → Start requesting jobs from Redis
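The PM2 ecosystem generation step can be sketched as follows. This assumes the WORKERS format documented in the appendix ("comfyui:2,ollama:1"); the `Pm2App` shape and script path are illustrative, not the real generator's output:

```typescript
// Illustrative sketch of PM2 ecosystem generation from the WORKERS env var.
// The real generator lives in the machine entrypoint; shapes here are assumptions.
interface Pm2App {
  name: string;
  script: string;
  env: Record<string, string>;
}

function generateEcosystem(workersSpec: string): Pm2App[] {
  // WORKERS is a comma-separated list of connector:count pairs, e.g. "comfyui:2,ollama:1"
  const apps: Pm2App[] = [];
  for (const pair of workersSpec.split(',')) {
    const [connector, countStr] = pair.trim().split(':');
    const count = Number(countStr ?? '1');
    for (let i = 0; i < count; i++) {
      apps.push({
        name: `worker-${connector}-${i}`,
        script: '/workspace/worker-bundled/index.js', // hypothetical bundle path
        env: { CONNECTOR: connector, WORKER_INDEX: String(i) },
      });
    }
  }
  return apps;
}
```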
Strengths:
- PM2 robustness - Process supervision, restart on failure, log rotation
- Multi-GPU support - Horizontal scaling within single container
- Worker bundle - Pre-bundled or runtime-downloaded worker code
- Component-based - API-driven custom node and model installation
- Telemetry-first - Built-in observability with structured logging
Pain Points:
- Complexity - Many moving parts (PM2, services, workers, telemetry)
- Debugging - Distributed logs across PM2 services, container logs, telemetry
- Startup time - Sequential startup can take 5-10 minutes for ComfyUI + models
- Testing isolation - Hard to test individual components without full stack
Evolution Path (Per CLAUDE.md):
- ✅ Phase 0 (Current): Redis job broker, PM2 orchestration, component-based setup
- → Phase 1: Pool separation (Fast Lane / Standard / Heavy)
- → Phase 2: Model intelligence (predictive placement, baked containers)
- → Phase 3: Advanced optimization (ML-based routing, specialty pools)
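Phase 1 pool separation could be as simple as routing on estimated duration. This is a speculative sketch of the north-star idea, not a committed design; the thresholds and pool names are assumptions:

```typescript
// Speculative sketch of Phase 1 pool routing; thresholds are assumptions.
type Pool = 'fast-lane' | 'standard' | 'heavy';

function routeJob(estimatedSeconds: number): Pool {
  if (estimatedSeconds <= 10) return 'fast-lane'; // quick inference, latency-sensitive
  if (estimatedSeconds <= 300) return 'standard'; // typical ComfyUI workflows
  return 'heavy';                                 // long renders, large models
}
```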
Testing Complexity Analysis
Current Pain Points
1. Environment Setup Complexity
- Problem: Developers need to build correct environment profile, ensure secrets file exists, understand component layering
- Impact: New developer onboarding takes hours
- Root Cause: Documentation gap, not architecture flaw
- Solution: ✅ Created comprehensive environment management docs
2. Multi-Service Dependencies
- Problem: Machine requires Redis, Worker requires Machine services, API requires Database
- Impact: Can't test components in isolation easily
- Root Cause: Distributed architecture (inherent to design, not solvable by orchestration change)
- Solution: Better test fixtures, mock services, documented test patterns
3. Log Aggregation
- Problem: Logs scattered across PM2 services, container stdout, telemetry streams
- Impact: Debugging requires checking multiple sources
- Root Cause: Multi-process PM2 architecture
- Solution: Centralized log viewer tool (already have telemetry stack, needs UI)
4. State Management
- Problem: Redis state persists between test runs, causing flaky tests
- Impact: Tests pass/fail inconsistently
- Root Cause: Shared Redis instance, no automatic cleanup
- Solution: Test isolation patterns, per-test Redis namespaces, cleanup fixtures
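The namespace idea can be reduced to two small pure helpers (names here are illustrative). One detail worth encoding in shared tooling: enumerate keys with SCAN rather than KEYS, since KEYS blocks Redis and can stall other tests sharing the instance:

```typescript
// Sketch of namespace-scoped test isolation; helper names are assumptions.
function testNamespace(testName: string, runId: string): string {
  // Unique per run, so parallel test workers never collide on keys.
  return `test:${testName}:${runId}`;
}

function keysToClean(namespace: string, allKeys: string[]): string[] {
  // Given keys enumerated via SCAN, keep only those owned by this test run.
  return allKeys.filter((key) => key.startsWith(`${namespace}:`));
}
```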
5. Integration Test Overhead
- Problem: Full stack tests take minutes to spin up (Redis + API + Machine + Workers)
- Impact: Slow development feedback loop
- Root Cause: Real services vs mocks (trade-off for high-fidelity testing)
- Solution: Tiered testing strategy (unit → integration → e2e), better mocking
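One way to encode the tiers with the project's existing runner is include/exclude globs in the Vitest config. A sketch, assuming a naming convention (`*.integration.test.ts`, `*.e2e.test.ts`) that this repo may or may not use:

```typescript
// vitest.config.ts — sketch of tier separation; globs and timeouts are assumptions.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Default (fast) tier: pure unit tests, no Docker required.
    include: ['**/__tests__/**/*.test.ts'],
    exclude: ['**/*.integration.test.ts', '**/*.e2e.test.ts'],
    testTimeout: 5_000,
  },
});

// Slower tiers then opt in explicitly, e.g. a separate
// vitest.integration.config.ts passed via `vitest run --config`.
```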
Testing Best Practices (What Works)
Unit Testing:
- ✅ Worker has excellent unit test coverage (apps/worker/src/__tests__/)
- ✅ Vitest provides fast, reliable unit test execution
- ✅ Mocking patterns work well (apps/worker/src/mocks/)
Integration Testing:
- ✅ Redis integration tests validate job matching (packages/core/src/redis-functions/__tests__/)
- ✅ Connector tests validate external API integrations
- ⚠️ Machine integration tests are minimal (opportunity for improvement)
E2E Testing:
- ⚠️ Full workflow tests exist but are complex to maintain
- ⚠️ Debugging failures requires deep system knowledge
- ⚠️ Flaky tests due to timing issues, resource contention
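Much of the timing flakiness comes from fixed sleeps. A bounded polling helper (a sketch; the name and defaults are assumptions) fails fast with a clear message instead of racing or hanging:

```typescript
// Bounded polling helper to replace fixed sleeps in e2e tests.
async function waitFor(
  condition: () => Promise<boolean> | boolean,
  { timeoutMs = 30_000, intervalMs = 250 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```

A test would then poll for an observable state change, e.g. `await waitFor(async () => (await getJobStatus(id)) === 'completed')` with whatever status accessor the test harness exposes.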
Docker Swarm Migration Evaluation
What Docker Swarm Would Provide
Service Definition:
# docker-compose.swarm.yml
version: '3.8'
services:
redis:
image: redis:7
deploy:
replicas: 1
placement:
constraints: [node.role == manager]
api:
image: emp/api:latest
deploy:
replicas: 2
update_config:
parallelism: 1
delay: 10s
environment:
- REDIS_URL=redis://redis:6379
machine-base:
image: emp/machine:base
deploy:
replicas: 0 # Template only
environment:
- WORKERS=simulation:1
machine-comfyui:
image: emp/machine:comfyui
deploy:
replicas: 3
placement:
constraints: [node.labels.gpu == true]
environment:
- WORKERS=comfyui:2
Swarm Features:
- Service Discovery - Services find each other by name (e.g., redis://redis:6379)
- Scaling - docker service scale machine-comfyui=5
- Rolling Updates - Zero-downtime deployments
- Health Checks - Automatic restart of unhealthy services
- Secrets Management - Docker secrets vs .env.secret files
- Load Balancing - Ingress routing mesh
- Constraints - Node placement rules (GPU nodes, memory requirements)
Benefits for Testing
Potential Improvements:
- Declarative Infrastructure - Test environments defined in docker-compose.test.yml
- Service Templating - Reuse machine service definition with different WORKERS env
- Network Isolation - Each test run gets isolated network namespace
- Resource Limits - Enforce CPU/memory limits to prevent resource starvation
- Parallel Testing - Spin up multiple isolated stacks simultaneously
Example Test Setup:
# docker-compose.test.yml
services:
test-redis:
image: redis:7
networks: [test-network]
test-api:
build: ./apps/api
environment:
- REDIS_URL=redis://test-redis:6379
- NODE_ENV=test
depends_on: [test-redis]
networks: [test-network]
test-machine:
build:
context: ./apps/machine
target: simulation
environment:
- HUB_REDIS_URL=redis://test-redis:6379
- WORKERS=simulation:1
depends_on: [test-api]
networks: [test-network]
networks:
test-network:
driver: overlay
Migration Effort Estimate
Phase 1: Swarm Setup (8-12 hours)
- Convert docker-compose.yml to Swarm-compatible format
- Set up multi-node Swarm cluster (manager + workers)
- Configure overlay networks
- Test service discovery patterns
Phase 2: Service Adaptation (16-24 hours)
- Update entrypoint scripts for Swarm lifecycle
- Convert .env.secret files to Docker secrets
- Update health checks for Swarm healthcheck format
- Migrate PM2 orchestration to Swarm services (or keep PM2 within containers)
Phase 3: CI/CD Integration (8-12 hours)
- Update deployment scripts for docker stack deploy
- Configure rolling update strategies
- Set up secrets injection pipeline
- Test production deployment workflow
Phase 4: Testing Framework (8-12 hours)
- Create test-specific Swarm configurations
- Implement test isolation (network namespaces, service naming)
- Update test scripts to use Swarm CLI
- Document test patterns and troubleshooting
Total: 40-60 hours (1-1.5 weeks of focused engineering)
Swarm Downsides
1. Increased Operational Complexity
- Current: Docker Compose up/down (simple)
- Swarm: Swarm init, join tokens, node management, stack deploy, service update
- Impact: Higher learning curve for team, more moving parts in production
2. PM2 vs Swarm Service Management
- Current: PM2 manages processes within container (comfyui-gpu0, comfyui-gpu1, workers)
- Swarm: Either keep PM2 (no benefit) or split into separate Swarm services (complexity explosion)
- Trade-off: Lose per-GPU PM2 orchestration or gain no benefit from Swarm
3. Local Development Friction
- Current: Docker Compose works identically on laptops and servers
- Swarm: Swarm mode requires docker swarm init on laptops, complicating simple dev workflows
- Impact: Developers need two workflows (Compose for dev, Swarm for staging/prod)
4. Debugging Complexity
- Current: docker logs container-name, docker exec -it container bash
- Swarm: docker service logs service-name (aggregated across replicas); docker exec requires finding the specific task ID
- Impact: Harder to debug individual instances
5. GPU Scheduling Challenges
- Current: PM2 creates comfyui-gpu0, comfyui-gpu1 within single container with GPU access
- Swarm: GPU constraints are node-level, not per-service. Would need different approach:
- Option A: One container per GPU (more containers, more overhead)
- Option B: Keep current PM2 approach (no Swarm benefit)
Critical Question: What Problem Does Swarm Solve?
Current Architecture Problems:
- ❌ Testing complexity → Swarm doesn't solve this (still need Redis, API, Machine stack)
- ❌ Environment configuration → Swarm doesn't simplify this (still need .env files or Swarm secrets)
- ❌ Log aggregation → Already have telemetry stack (Fluent Bit + OTEL)
- ❌ Multi-GPU orchestration → PM2 works well, Swarm complicates this
- ❌ State management → Redis state persists regardless of orchestrator
Swarm Solutions:
- ✅ Service discovery - Nice, but we already use explicit REDIS_URL, API_URL env vars
- ✅ Scaling - We scale by adding machines (SALAD/vast.ai), not replicating services
- ✅ Rolling updates - Useful for API/monitor, less so for ephemeral machines
- ⚠️ Secrets management - Docker secrets vs .env.secret files is a lateral move, not an improvement
Conclusion: Swarm solves problems we don't have, adds complexity to areas that work well.
Testing Improvements (Without Swarm)
Quick Wins (< 8 hours total)
1. Test Isolation Utilities (2 hours)
// packages/test-utils/src/redis-isolation.ts
import type { Redis } from 'ioredis';

export class RedisTestIsolation {
  private namespace: string;

  constructor(testName: string) {
    this.namespace = `test:${testName}:${Date.now()}`;
  }

  // Prefix all Redis keys with the test's namespace
  jobKey(id: string): string {
    return `${this.namespace}:job:${id}`;
  }

  // Cleanup after test (KEYS is fine at test scale; prefer SCAN for large keyspaces)
  async cleanup(redis: Redis): Promise<void> {
    const keys = await redis.keys(`${this.namespace}:*`);
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}
2. Environment Setup Script (2 hours)
#!/bin/bash
# tools/setup-test-env.sh
echo "🔧 Setting up test environment..."
# Check secrets file exists
if [[ ! -f "config/environments/secrets/.env.secrets.local" ]]; then
echo "❌ Missing secrets file"
echo "👉 Copy example: cp config/environments/secrets/.env.secrets.local.example config/environments/secrets/.env.secrets.local"
exit 1
fi
# Build test environment
pnpm env:build testrunner
# Start Redis
docker-compose -f docker-compose.test.yml up -d test-redis
# Wait for Redis
until docker exec test-redis redis-cli ping &> /dev/null; do
sleep 1
done
echo "✅ Test environment ready"
3. Log Aggregation Tool (3 hours)
// tools/log-viewer/index.ts
// Simple web UI to view logs from:
// - PM2 services (read from /workspace/logs/)
// - Container stdout (docker logs)
// - Telemetry streams (Redis streams)
// Features:
// - Real-time tail
// - Filtering by service/level
// - Search across all logs
// - Timeline view
4. Test Documentation (1 hour)
- Document standard test patterns
- Create testing troubleshooting guide
- Add examples of unit/integration/e2e test structure
Medium-term Improvements (8-16 hours)
1. Mock Service Generator (8 hours)
// Generate lightweight mock services for testing:
// - Mock Redis (in-memory, fast)
// - Mock API (stub endpoints)
// - Mock Machine (fake workers)
// Benefits:
// - Fast unit tests (no Docker)
// - Deterministic behavior
// - Easy to customize per test
2. Test Fixtures Library (4 hours)
// Pre-built test data:
// - Sample jobs (comfyui, ollama, openai)
// - Worker capabilities
// - Machine configurations
// Benefits:
// - Consistent test data
// - Reduce test boilerplate
// - Easy to extend
3. Dev Mode Optimizations (4 hours)
- Skip telemetry in test mode (faster startup)
- Parallel PM2 service startup (where possible)
- Cached model downloads (share across containers)
- Volume mounts for worker code (no rebuild needed)
Long-term Vision (16-24 hours)
1. Tiered Testing Strategy
Unit Tests (Fast - seconds)
↓ Use mocks, no Docker
↓ Run on every code change
Integration Tests (Medium - 30-60s)
↓ Real Redis, mocked external APIs
↓ Run on pre-commit
E2E Tests (Slow - 5-10 mins)
↓ Full stack, real services
↓ Run on PR, nightly
2. Test Infrastructure as Code
- Codify test environments in docker-compose files
- Document test database setup
- Automate test environment provisioning
3. Continuous Testing
- GitHub Actions workflows for PR testing
- Automated environment provisioning
- Test result visualization
Recommendations
Immediate Actions (This Week)
- ✅ DONE: Create comprehensive environment management documentation
- Create testing procedures doc - Document standard test setup, common patterns, troubleshooting
- Add test isolation utilities - RedisTestIsolation class for namespace-based cleanup
- Create setup-test-env.sh script - One-command test environment setup
Short-term (This Month)
- Improve machine documentation - Lifecycle flow, PM2 orchestration, debugging guide
- Build log aggregation UI - Simple web viewer for PM2 logs, container logs, telemetry
- Create mock service library - Lightweight mocks for fast unit testing
- Document test patterns - Examples of good unit/integration/e2e tests
Medium-term (Next Quarter)
- Tiered testing strategy - Separate fast/medium/slow tests, optimize CI/CD
- Dev mode optimizations - Faster startup, better caching, volume mounts
- Test fixtures library - Pre-built test data for common scenarios
- Monitoring dashboard - Real-time view of machines, workers, jobs (leverage existing telemetry)
Docker Swarm Decision
Recommendation: DEFER indefinitely
Reasons:
- Marginal benefits - Swarm solves problems we don't have (service discovery works, scaling is machine-based not replica-based)
- High migration cost - 40-60 hours for uncertain ROI
- Current architecture works - PM2 orchestration is production-ready
- Better alternatives exist - Quick wins address actual pain points (testing, documentation)
- Complexity trade-off - Swarm adds operational complexity without clear testing benefits
- GPU orchestration mismatch - Current PM2 approach (multiple processes per GPU) doesn't map well to Swarm services
When to Reconsider:
- If we need true multi-node orchestration (currently use SALAD/vast.ai for scaling, not self-hosted clusters)
- If we adopt Kubernetes (more powerful than Swarm, worth migration effort)
- If PM2 orchestration becomes bottleneck (not currently the case)
Key Takeaways
- Environment management is a strength - Well-designed, just needs better documentation (now provided)
- Machine/worker architecture is production-capable - PM2 orchestration works well for GPU workloads
- Testing complexity is inherent to distribution - No silver bullet; incremental improvements are the path
- Docker Swarm is not the answer - Would add complexity without solving real pain points
- Quick wins available - Test isolation, documentation, tooling can provide immediate relief
Focus areas for next month:
- ✅ Environment documentation (done)
- Testing documentation and utilities
- Log aggregation tooling
- Developer experience improvements
Avoid:
- Large architectural rewrites (Swarm migration)
- Over-engineering test infrastructure
- Solutions looking for problems
Appendices
A. Current File Structure
emerge-turbo/
├── apps/
│ ├── api/ # Job queue API (Redis orchestration)
│ ├── machine/ # Container deployment (PM2 orchestration)
│ │ ├── Dockerfile # Multi-stage: base, comfyui, ollama, simulation
│ │ ├── src/
│ │ │ ├── index-pm2.js # Main entry point
│ │ │ ├── services/ # Service management
│ │ │ │ ├── component-manager.js # API-driven custom nodes/models
│ │ │ │ ├── comfyui-management-client.js
│ │ │ │ ├── machine-status-aggregator.js
│ │ │ │ ├── redis-worker-service.js
│ │ │ │ └── sequential-startup-orchestrator.js
│ │ ├── scripts/
│ │ │ └── entrypoint-machine-final.sh # Container startup
│ │ └── worker-bundled/ # Pre-bundled worker code (local mode)
│ ├── worker/ # Worker processes (connect to Redis)
│ │ ├── src/
│ │ │ ├── redis-direct-worker-client.ts # Redis communication
│ │ │ ├── connector-manager.ts # Connector loading
│ │ │ └── connectors/ # Service integrations
│ │ └── __tests__/ # Excellent unit test coverage
│ ├── monitor/ # Real-time monitoring UI
│ └── emprops-api/ # EmProps platform API
├── config/
│ └── environments/
│ ├── components/ # Component .env files
│ ├── services/ # Service interface .ts files
│ ├── profiles/ # Environment profiles .json
│ └── secrets/ # .env.secrets.local (gitignored)
├── packages/
│ ├── env-management/ # Environment builder
│ ├── core/ # Shared types, Redis functions
│ ├── telemetry/ # Unified telemetry client
│ └── test-utils/ # (To be created)
└── tools/ # CLI tools, debugging utilities
B. Machine Startup Sequence
1. Container Start
└─> entrypoint-machine-final.sh
2. Environment Setup
├─> Decrypt .env.encrypted → environment variables
├─> Set MACHINE_ID, WORKER_ID
└─> Load service mappings
3. Directory Setup
├─> /workspace/.pm2 (PM2 home)
├─> /workspace/logs (logs)
└─> /workspace/ComfyUI (if comfyui profile)
4. Worker Bundle
├─> LOCAL MODE: Copy /service-manager/worker-bundled → /workspace/worker-bundled
└─> REMOTE MODE: Download from GitHub releases
5. System Services (if needed)
├─> Ollama: curl install.sh | sh
├─> Start ollama serve
└─> Pull default models (OLLAMA_DEFAULT_MODELS)
6. Telemetry Initialization
├─> Create telemetry client
├─> Add log file monitors (Winston, PM2, ComfyUI)
├─> Send machine.registered event
└─> Start telemetry pipelines (FluentBit, OTEL)
7. PM2 Ecosystem Generation
├─> Parse WORKERS env var (e.g., "comfyui:2,ollama:1")
├─> Generate PM2 config with services:
│ ├─> service-manager (main)
│ ├─> comfyui-gpu0, comfyui-gpu1 (if comfyui workers)
│ └─> worker-* (Redis workers)
└─> Write to /workspace/pm2-ecosystem.config.cjs
8. Service Manager Start
└─> PM2 starts index-pm2.js (main process)
9. Sequential Startup Orchestrator
├─> STEP 1: Health Server (port 9090)
├─> STEP 2-14: ComfyUI Installation (if comfyui profile)
│ ├─> Clone ComfyUI repo
│ ├─> Install Python dependencies
│ ├─> Start ComfyUI instances (PM2)
│ ├─> Wait for health checks
│ └─> Verify GPU access
├─> STEP 15-18: Component Manager
│ ├─> Fetch default custom nodes from API
│ ├─> Fetch workflow/collection dependencies (if COMPONENTS/COLLECTIONS env vars)
│ ├─> Install custom nodes (git clone, pip install)
│ └─> Download models (wget)
└─> STEP 19+: Worker Services
├─> Start Redis workers (PM2)
└─> Register with Redis hub
10. Machine Registration
├─> Send machine.startup event to Redis
├─> Workers register capabilities
└─> Status aggregator starts periodic updates
11. Ready for Jobs
└─> Workers poll Redis for matching jobs
C. Testing Matrix
| Test Type | Tools | Speed | Coverage | When to Run |
|---|---|---|---|---|
| Unit | Vitest | < 1s | High | Every code change |
| Integration | Vitest + Docker | 30-60s | Medium | Pre-commit |
| E2E | Custom scripts | 5-10min | Full stack | PR, nightly |
| Performance | Custom + metrics | 10-30min | Throughput | Weekly |
D. Key Metrics
Environment Build Times:
- local-dev: ~2s (15 component files, 8 service interfaces)
- staging: ~2s
- production: ~2s
Machine Startup Times:
- Simulation: 10-15s (no ComfyUI)
- ComfyUI (no models): 2-3 minutes
- ComfyUI (with models): 5-10 minutes (depends on model download)
Test Execution Times (Current):
- Worker unit tests: 5-10s (45+ test files)
- Redis integration tests: 15-30s
- Full E2E: 5-10 minutes
Docker Image Sizes:
- Base machine: ~15GB (PyTorch + Node + system deps)
- ComfyUI profile: +5GB (ComfyUI + custom nodes)
- Ollama profile: +2GB (Ollama binary)
Document Prepared By: Claude (Anthropic AI Assistant) Review Status: Draft for stakeholder review Next Steps: Review findings → Implement quick wins → Re-evaluate in Q1 2026
