EMP-Job-Queue Architecture Analysis & Docker Swarm Evaluation

Date: 2025-10-08
Purpose: Comprehensive analysis of current architecture, testing complexity, and Docker Swarm migration evaluation


Executive Summary

The emp-job-queue system is a sophisticated distributed AI workload broker designed for elastic scaling across ephemeral machines. The current architecture successfully solves complex problems around distributed job routing, model management, and multi-service orchestration.

Key Findings:

  1. Environment management is well-architected - Component-based system provides type-safe, flexible configuration
  2. Machine/worker architecture is production-ready - PM2-based orchestration works well for current needs
  3. Testing complexity stems from distribution, not tooling - Docker Swarm won't fundamentally reduce complexity
  4. Current pain points are solvable without migration - Quick wins available through better documentation and tooling

Recommendation: DEFER Docker Swarm migration. The migration would require 40-60 hours of engineering effort for marginal testing improvements. Better ROI from improving documentation, test tooling, and development workflows with current architecture.


Current Architecture Assessment

1. Environment Management System

Status: Well-designed, production-ready

Strengths:

  • Type-safe service interfaces enforce required variables at build time
  • Flexible component-based composition supports multiple deployment targets
  • Secure automatic separation of public/secret variables
  • DRY (Don't Repeat Yourself) - variables defined once, referenced everywhere
  • Multi-environment - single codebase supports local/staging/production seamlessly

Architecture Pattern:

Component Files (.env) → Service Interfaces (.ts) → Generated .env Files
     ↓                           ↓                          ↓
Define capabilities      Define requirements        Per-service configs
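
For illustration, the middle layer can be imagined as a plain TypeScript declaration of what a service needs; the names and fields below are assumptions for the sketch, not the actual interface definitions in config/environments/services/:

typescript
// Hypothetical shape of a service interface (sketch only).
export interface ServiceEnvSpec {
  required: string[];   // build fails if no component provides these
  optional?: string[];  // included when a component defines them
  secret: string[];     // routed to the secrets file, never committed
}

// Example: what an API service interface might declare.
export const apiService: ServiceEnvSpec = {
  required: ['REDIS_URL', 'API_PORT'],
  optional: ['LOG_LEVEL'],
  secret: ['REDIS_PASSWORD', 'OPENAI_API_KEY'],
};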

Pain Points:

  • Learning curve - New developers need to understand 3-layer system (components → interfaces → profiles)
  • Debugging opacity - Variable resolution happens at build time, errors can be cryptic
  • Documentation gap - System is powerful but under-documented (now resolved with new docs)

Recommendation:

  • Keep current system - well-architected for monorepo microservices
  • Add troubleshooting guide with common error patterns
  • Create visual debugging tool to show variable resolution flow

2. Machine/Worker Architecture

Status: Production-capable, evolving toward the north star

Current Design:

Docker Container (Machine)
├── PM2 Process Manager
│   ├── Service Manager (Node.js main process)
│   ├── ComfyUI Instances (per-GPU)
│   │   ├── comfyui-gpu0 (port 8188)
│   │   ├── comfyui-gpu1 (port 8189)
│   │   └── comfyui-gpuN...
│   ├── Ollama Daemon (optional)
│   └── Redis Workers (Node.js)
│       ├── Worker processes (connect to Redis)
│       └── Connector system (OpenAI, ComfyUI, Ollama, etc.)
├── Telemetry Stack
│   ├── Fluent Bit (log aggregation)
│   ├── OTEL Collector (traces/metrics)
│   └── Nginx (optional monitoring)
└── Entrypoint Scripts (startup orchestration)

Lifecycle Flow:

  1. Container Start → Entrypoint script (entrypoint-machine-final.sh)
  2. Environment Setup → Decrypt environment, set variables
  3. Worker Bundle → Download or copy bundled worker code
  4. System Services → Install/start Ollama if needed
  5. PM2 Ecosystem Generation → Generate PM2 config from WORKERS env var (sketched after this list)
  6. Service Manager Start → Node.js main process (index-pm2.js)
  7. Sequential Startup → ComfyUI → Custom Nodes → Models → Workers
  8. Machine Registration → Redis registration with capabilities
  9. Worker Polling → Start requesting jobs from Redis
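
A minimal sketch of step 5's WORKERS parsing, assuming a value like "comfyui:2,ollama:1" (the real generator lives in the machine app; the names and paths here are illustrative):

typescript
// Hypothetical sketch: expand WORKERS="comfyui:2,ollama:1" into PM2 app configs.
interface Pm2App {
  name: string;
  script: string;
  env: Record<string, string>;
}

function generateEcosystem(workersSpec: string): Pm2App[] {
  const apps: Pm2App[] = [
    { name: 'service-manager', script: 'index-pm2.js', env: {} },
  ];
  for (const entry of workersSpec.split(',')) {
    const [type, countStr] = entry.trim().split(':');
    const count = Number(countStr ?? '1');
    for (let i = 0; i < count; i++) {
      apps.push({
        name: `worker-${type}-${i}`,
        script: 'worker-bundled/index.js', // bundled worker entry (assumed path)
        env: { WORKER_TYPE: type, WORKER_INDEX: String(i) },
      });
    }
  }
  return apps;
}

// generateEcosystem('comfyui:2,ollama:1') → service-manager plus three worker apps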

Strengths:

  • PM2 robustness - Process supervision, restart on failure, log rotation
  • Multi-GPU support - Horizontal scaling within single container
  • Worker bundle - Pre-bundled or runtime-downloaded worker code
  • Component-based - API-driven custom node and model installation
  • Telemetry-first - Built-in observability with structured logging
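
To make the component-based strength concrete, the API-driven installation could conceptually look like the sketch below; the endpoint shape and field names are assumptions, not the actual component-manager.js implementation:

typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);

// Hypothetical shape of a component record returned by the API.
interface ComponentSpec {
  customNodes: { repo: string }[];
  models: { url: string; dest: string }[];
}

// Sketch: fetch a component's dependencies and install them into ComfyUI.
async function installComponent(apiUrl: string, id: string): Promise<void> {
  const res = await fetch(`${apiUrl}/components/${id}`);
  const spec = (await res.json()) as ComponentSpec;

  for (const node of spec.customNodes) {
    // The real flow also pip-installs each node's requirements.
    await exec('git', ['clone', node.repo], { cwd: '/workspace/ComfyUI/custom_nodes' });
  }
  for (const model of spec.models) {
    await exec('wget', ['-O', model.dest, model.url]);
  }
}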

Pain Points:

  • Complexity - Many moving parts (PM2, services, workers, telemetry)
  • Debugging - Distributed logs across PM2 services, container logs, telemetry
  • Startup time - Sequential startup can take 5-10 minutes for ComfyUI + models
  • Testing isolation - Hard to test individual components without full stack

Evolution Path (Per CLAUDE.md):

  • Phase 0 (Current): Redis job broker, PM2 orchestration, component-based setup
  • Phase 1: Pool separation (Fast Lane / Standard / Heavy)
  • Phase 2: Model intelligence (predictive placement, baked containers)
  • Phase 3: Advanced optimization (ML-based routing, specialty pools)

Testing Complexity Analysis

Current Pain Points

1. Environment Setup Complexity

  • Problem: Developers need to build correct environment profile, ensure secrets file exists, understand component layering
  • Impact: New developer onboarding takes hours
  • Root Cause: Documentation gap, not architecture flaw
  • Solution: ✅ Created comprehensive environment management docs

2. Multi-Service Dependencies

  • Problem: Machine requires Redis, Worker requires Machine services, API requires Database
  • Impact: Can't test components in isolation easily
  • Root Cause: Distributed architecture (inherent to design, not solvable by orchestration change)
  • Solution: Better test fixtures, mock services, documented test patterns

3. Log Aggregation

  • Problem: Logs scattered across PM2 services, container stdout, telemetry streams
  • Impact: Debugging requires checking multiple sources
  • Root Cause: Multi-process PM2 architecture
  • Solution: Centralized log viewer tool (already have telemetry stack, needs UI)

4. State Management

  • Problem: Redis state persists between test runs, causing flaky tests
  • Impact: Tests pass/fail inconsistently
  • Root Cause: Shared Redis instance, no automatic cleanup
  • Solution: Test isolation patterns, per-test Redis namespaces, cleanup fixtures

5. Integration Test Overhead

  • Problem: Full stack tests take minutes to spin up (Redis + API + Machine + Workers)
  • Impact: Slow development feedback loop
  • Root Cause: Real services vs mocks (trade-off for high-fidelity testing)
  • Solution: Tiered testing strategy (unit → integration → e2e), better mocking

Testing Best Practices (What Works)

Unit Testing:

  • ✅ Worker has excellent unit test coverage (apps/worker/src/__tests__/)
  • ✅ Vitest provides fast, reliable unit test execution
  • ✅ Mocking patterns work well (apps/worker/src/mocks/)

Integration Testing:

  • ✅ Redis integration tests validate job matching (packages/core/src/redis-functions/__tests__/)
  • ✅ Connector tests validate external API integrations
  • ⚠️ Machine integration tests are minimal (opportunity for improvement)

E2E Testing:

  • ⚠️ Full workflow tests exist but are complex to maintain
  • ⚠️ Debugging failures requires deep system knowledge
  • ⚠️ Flaky tests due to timing issues, resource contention
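
Much of that timing flakiness can be tamed by replacing fixed sleeps with explicit condition polling; a minimal, generic sketch:

typescript
// Poll a condition until it holds or a deadline passes, instead of fixed sleeps
// that race against variable startup times.
export async function waitFor(
  condition: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 500 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Example (hypothetical key): wait for a worker to register before asserting
// on job routing, rather than sleeping a fixed 10 seconds.
// await waitFor(async () => (await redis.scard('workers:active')) > 0);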

Docker Swarm Migration Evaluation

What Docker Swarm Would Provide

Service Definition:

yaml
# docker-compose.swarm.yml
version: '3.8'
services:
  redis:
    image: redis:7
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]

  api:
    image: emp/api:latest
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
    environment:
      - REDIS_URL=redis://redis:6379

  machine-base:
    image: emp/machine:base
    deploy:
      replicas: 0  # Template only
    environment:
      - WORKERS=simulation:1

  machine-comfyui:
    image: emp/machine:comfyui
    deploy:
      replicas: 3
      placement:
        constraints: [node.labels.gpu == true]
    environment:
      - WORKERS=comfyui:2

Swarm Features:

  1. Service Discovery - Services find each other by name (e.g., redis://redis:6379)
  2. Scaling - docker service scale machine-comfyui=5
  3. Rolling Updates - Zero-downtime deployments
  4. Health Checks - Automatic restart of unhealthy services
  5. Secrets Management - Docker secrets vs .env.secret files
  6. Load Balancing - Ingress routing mesh
  7. Constraints - Node placement rules (GPU nodes, memory requirements)

Benefits for Testing

Potential Improvements:

  1. Declarative Infrastructure - Test environments defined in docker-compose.test.yml
  2. Service Templating - Reuse machine service definition with different WORKERS env
  3. Network Isolation - Each test run gets isolated network namespace
  4. Resource Limits - Enforce CPU/memory limits to prevent resource starvation
  5. Parallel Testing - Spin up multiple isolated stacks simultaneously

Example Test Setup:

yaml
# docker-compose.test.yml
services:
  test-redis:
    image: redis:7
    networks: [test-network]

  test-api:
    build: ./apps/api
    environment:
      - REDIS_URL=redis://test-redis:6379
      - NODE_ENV=test
    depends_on: [test-redis]
    networks: [test-network]

  test-machine:
    build:
      context: ./apps/machine
      target: simulation
    environment:
      - HUB_REDIS_URL=redis://test-redis:6379
      - WORKERS=simulation:1
    depends_on: [test-api]
    networks: [test-network]

networks:
  test-network:
    driver: overlay

Migration Effort Estimate

Phase 1: Swarm Setup (8-12 hours)

  • Convert docker-compose.yml to Swarm-compatible format
  • Set up multi-node Swarm cluster (manager + workers)
  • Configure overlay networks
  • Test service discovery patterns

Phase 2: Service Adaptation (16-24 hours)

  • Update entrypoint scripts for Swarm lifecycle
  • Convert .env.secret files to Docker secrets
  • Update health checks for Swarm healthcheck format
  • Migrate PM2 orchestration to Swarm services (or keep PM2 within containers)

Phase 3: CI/CD Integration (8-12 hours)

  • Update deployment scripts for docker stack deploy
  • Configure rolling update strategies
  • Set up secrets injection pipeline
  • Test production deployment workflow

Phase 4: Testing Framework (8-12 hours)

  • Create test-specific Swarm configurations
  • Implement test isolation (network namespaces, service naming)
  • Update test scripts to use Swarm CLI
  • Document test patterns and troubleshooting

Total: 40-60 hours (1-1.5 weeks of focused engineering)

Swarm Downsides

1. Increased Operational Complexity

  • Current: Docker Compose up/down (simple)
  • Swarm: Swarm init, join tokens, node management, stack deploy, service update
  • Impact: Higher learning curve for team, more moving parts in production

2. PM2 vs Swarm Service Management

  • Current: PM2 manages processes within container (comfyui-gpu0, comfyui-gpu1, workers)
  • Swarm: Either keep PM2 (no benefit) or split into separate Swarm services (complexity explosion)
  • Trade-off: Lose per-GPU PM2 orchestration or gain no benefit from Swarm

3. Local Development Friction

  • Current: Docker Compose works identically on laptops and servers
  • Swarm: Swarm mode requires docker swarm init even on laptops, complicating simple dev workflows
  • Impact: Developers need two workflows (Compose for dev, Swarm for staging/prod)

4. Debugging Complexity

  • Current: docker logs container-name, docker exec -it container bash
  • Swarm: docker service logs service-name (aggregated across replicas); docker exec requires finding the specific task ID
  • Impact: Harder to debug individual instances

5. GPU Scheduling Challenges

  • Current: PM2 creates comfyui-gpu0, comfyui-gpu1 within single container with GPU access
  • Swarm: GPU constraints are node-level, not per-service, so a different approach would be needed:
    • Option A: One container per GPU (more containers, more overhead)
    • Option B: Keep current PM2 approach (no Swarm benefit)

Critical Question: What Problem Does Swarm Solve?

Current Architecture Problems:

  1. Testing complexity → Swarm doesn't solve this (still need Redis, API, Machine stack)
  2. Environment configuration → Swarm doesn't simplify this (still need .env files or Swarm secrets)
  3. Log aggregation → Already have telemetry stack (Fluent Bit + OTEL)
  4. Multi-GPU orchestration → PM2 works well, Swarm complicates this
  5. State management → Redis state persists regardless of orchestrator

Swarm Solutions:

  1. Service discovery - Nice, but we already use explicit REDIS_URL, API_URL env vars
  2. Scaling - We scale by adding machines (SALAD/vast.ai), not replicating services
  3. Rolling updates - Useful for API/monitor, less so for ephemeral machines
  4. ⚠️ Secrets management - Docker secrets vs .env.secret files is a lateral move, not an improvement

Conclusion: Swarm solves problems we don't have, adds complexity to areas that work well.


Testing Improvements (Without Swarm)

Quick Wins (< 8 hours total)

1. Test Isolation Utilities (2 hours)

typescript
// packages/test-utils/src/redis-isolation.ts
import type { Redis } from 'ioredis'; // type-only import; assumes ioredis as the client

export class RedisTestIsolation {
  private namespace: string;

  constructor(testName: string) {
    this.namespace = `test:${testName}:${Date.now()}`;
  }

  // Prefix all Redis keys with namespace
  jobKey(id: string): string {
    return `${this.namespace}:job:${id}`;
  }

  // Cleanup after test
  async cleanup(redis: Redis): Promise<void> {
    const keys = await redis.keys(`${this.namespace}:*`);
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}
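
Usage in a Vitest test could look like this (the package name and Redis URL fallback are assumptions):

typescript
import Redis from 'ioredis';
import { afterEach, beforeEach, expect, it } from 'vitest';
import { RedisTestIsolation } from '@emp/test-utils'; // hypothetical package name

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
let isolation: RedisTestIsolation;

beforeEach(() => {
  isolation = new RedisTestIsolation('job-matching');
});

afterEach(async () => {
  await isolation.cleanup(redis); // removes only this test's namespaced keys
});

it('stores a job under the namespaced key', async () => {
  await redis.set(isolation.jobKey('123'), 'queued');
  expect(await redis.get(isolation.jobKey('123'))).toBe('queued');
});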

2. Environment Setup Script (2 hours)

bash
#!/bin/bash
# tools/setup-test-env.sh
set -euo pipefail

echo "🔧 Setting up test environment..."

# Check secrets file exists
if [[ ! -f "config/environments/secrets/.env.secrets.local" ]]; then
  echo "❌ Missing secrets file"
  echo "👉 Copy example: cp config/environments/secrets/.env.secrets.local.example config/environments/secrets/.env.secrets.local"
  exit 1
fi

# Build test environment
pnpm env:build testrunner

# Start Redis (service is named test-redis in docker-compose.test.yml)
docker-compose -f docker-compose.test.yml up -d test-redis

# Wait for Redis to answer PING
until docker-compose -f docker-compose.test.yml exec -T test-redis redis-cli ping > /dev/null 2>&1; do
  sleep 1
done

echo "✅ Test environment ready"

3. Log Aggregation Tool (3 hours)

typescript
// tools/log-viewer/index.ts
// Simple web UI to view logs from:
// - PM2 services (read from /workspace/logs/)
// - Container stdout (docker logs)
// - Telemetry streams (Redis streams)

// Features:
// - Real-time tail
// - Filtering by service/level
// - Search across all logs
// - Timeline view
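
If telemetry events land in Redis streams, the real-time tail could be little more than a blocking XREAD loop; the stream key below is an assumption:

typescript
import Redis from 'ioredis';

// Sketch: tail a telemetry stream with blocking XREAD (hypothetical key name).
async function tailStream(redis: Redis, stream = 'telemetry:logs'): Promise<void> {
  let lastId = '$'; // start with only new entries
  for (;;) {
    const result = await redis.xread('BLOCK', 5000, 'STREAMS', stream, lastId);
    if (!result) continue; // block timed out; poll again
    for (const [, entries] of result) {
      for (const [id, fields] of entries) {
        lastId = id;
        console.log(id, fields); // a real viewer would filter/format here
      }
    }
  }
}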

4. Test Documentation (1 hour)

  • Document standard test patterns
  • Create testing troubleshooting guide
  • Add examples of unit/integration/e2e test structure

Medium-term Improvements (8-16 hours)

1. Mock Service Generator (8 hours)

typescript
// Generate lightweight mock services for testing:
// - Mock Redis (in-memory, fast)
// - Mock API (stub endpoints)
// - Mock Machine (fake workers)

// Benefits:
// - Fast unit tests (no Docker)
// - Deterministic behavior
// - Easy to customize per test
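
As a flavor of the approach, a hand-rolled in-memory stand-in for the handful of Redis commands a unit test touches might look like this (a sketch, not a drop-in ioredis replacement):

typescript
// Minimal in-memory mock covering only the commands a given test needs.
export class MockRedis {
  private store = new Map<string, string>();

  async set(key: string, value: string): Promise<'OK'> {
    this.store.set(key, value);
    return 'OK';
  }

  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }

  async del(...keys: string[]): Promise<number> {
    let removed = 0;
    for (const key of keys) if (this.store.delete(key)) removed++;
    return removed;
  }
}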

2. Test Fixtures Library (4 hours)

typescript
// Pre-built test data:
// - Sample jobs (comfyui, ollama, openai)
// - Worker capabilities
// - Machine configurations

// Benefits:
// - Consistent test data
// - Reduce test boilerplate
// - Easy to extend
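
A factory pattern keeps such fixtures terse; the field names below are assumptions based on the job types mentioned above:

typescript
// Hypothetical job fixture factory: sensible defaults, overridable per test.
interface JobFixture {
  id: string;
  service: 'comfyui' | 'ollama' | 'openai';
  priority: number;
  payload: Record<string, unknown>;
}

let counter = 0;

export function makeJob(overrides: Partial<JobFixture> = {}): JobFixture {
  return {
    id: `job-${++counter}`,
    service: 'comfyui',
    priority: 50,
    payload: { prompt: 'test prompt' },
    ...overrides,
  };
}

// makeJob({ service: 'ollama', priority: 100 }) → ollama job, everything else defaulted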

3. Dev Mode Optimizations (4 hours)

  • Skip telemetry in test mode (faster startup)
  • Parallel PM2 service startup (where possible)
  • Cached model downloads (share across containers)
  • Volume mounts for worker code (no rebuild needed)

Long-term Vision (16-24 hours)

1. Tiered Testing Strategy

Unit Tests (Fast - seconds)
↓ Use mocks, no Docker
↓ Run on every code change

Integration Tests (Medium - 30-60s)
↓ Real Redis, mocked external APIs
↓ Run on pre-commit

E2E Tests (Slow - 5-10 mins)
↓ Full stack, real services
↓ Run on PR, nightly
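
The tiers could be selected with an environment variable in the Vitest config; the file-glob conventions here are assumptions:

typescript
// vitest.config.ts (sketch): pick a tier via TEST_TIER, e.g. TEST_TIER=e2e vitest run
import { defineConfig } from 'vitest/config';

const tier = process.env.TEST_TIER ?? 'unit';

const include: Record<string, string[]> = {
  unit: ['**/*.unit.test.ts'],
  integration: ['**/*.int.test.ts'],
  e2e: ['**/*.e2e.test.ts'],
};

export default defineConfig({
  test: {
    include: include[tier],
    testTimeout: tier === 'e2e' ? 600_000 : 30_000, // e2e gets 10 minutes
  },
});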

2. Test Infrastructure as Code

  • Codify test environments in docker-compose files
  • Document test database setup
  • Automate test environment provisioning

3. Continuous Testing

  • GitHub Actions workflows for PR testing
  • Automated environment provisioning
  • Test result visualization

Recommendations

Immediate Actions (This Week)

  1. ✅ DONE: Create comprehensive environment management documentation
  2. Create testing procedures doc - Document standard test setup, common patterns, troubleshooting
  3. Add test isolation utilities - RedisTestIsolation class for namespace-based cleanup
  4. Create setup-test-env.sh script - One-command test environment setup

Short-term (This Month)

  1. Improve machine documentation - Lifecycle flow, PM2 orchestration, debugging guide
  2. Build log aggregation UI - Simple web viewer for PM2 logs, container logs, telemetry
  3. Create mock service library - Lightweight mocks for fast unit testing
  4. Document test patterns - Examples of good unit/integration/e2e tests

Medium-term (Next Quarter)

  1. Tiered testing strategy - Separate fast/medium/slow tests, optimize CI/CD
  2. Dev mode optimizations - Faster startup, better caching, volume mounts
  3. Test fixtures library - Pre-built test data for common scenarios
  4. Monitoring dashboard - Real-time view of machines, workers, jobs (leverage existing telemetry)

Docker Swarm Decision

Recommendation: DEFER indefinitely

Reasons:

  1. Marginal benefits - Swarm solves problems we don't have (service discovery works, scaling is machine-based not replica-based)
  2. High migration cost - 40-60 hours for uncertain ROI
  3. Current architecture works - PM2 orchestration is production-ready
  4. Better alternatives exist - Quick wins address actual pain points (testing, documentation)
  5. Complexity trade-off - Swarm adds operational complexity without clear testing benefits
  6. GPU orchestration mismatch - Current PM2 approach (multiple processes per GPU) doesn't map well to Swarm services

When to Reconsider:

  • If we need true multi-node orchestration (currently use SALAD/vast.ai for scaling, not self-hosted clusters)
  • If we adopt Kubernetes (more powerful than Swarm, worth migration effort)
  • If PM2 orchestration becomes bottleneck (not currently the case)

Key Takeaways

  1. Environment management is a strength - Well-designed, just needs better documentation (now provided)
  2. Machine/worker architecture is production-capable - PM2 orchestration works well for GPU workloads
  3. Testing complexity is inherent to distribution - No silver bullet; incremental improvements are the path
  4. Docker Swarm is not the answer - Would add complexity without solving real pain points
  5. Quick wins available - Test isolation, documentation, tooling can provide immediate relief

Focus areas for next month:

  • ✅ Environment documentation (done)
  • Testing documentation and utilities
  • Log aggregation tooling
  • Developer experience improvements

Avoid:

  • Large architectural rewrites (Swarm migration)
  • Over-engineering test infrastructure
  • Solutions looking for problems

Appendices

A. Current File Structure

emerge-turbo/
├── apps/
│   ├── api/                     # Job queue API (Redis orchestration)
│   ├── machine/                 # Container deployment (PM2 orchestration)
│   │   ├── Dockerfile          # Multi-stage: base, comfyui, ollama, simulation
│   │   ├── src/
│   │   │   ├── index-pm2.js    # Main entry point
│   │   │   ├── services/       # Service management
│   │   │   │   ├── component-manager.js  # API-driven custom nodes/models
│   │   │   │   ├── comfyui-management-client.js
│   │   │   │   ├── machine-status-aggregator.js
│   │   │   │   ├── redis-worker-service.js
│   │   │   │   └── sequential-startup-orchestrator.js
│   │   ├── scripts/
│   │   │   └── entrypoint-machine-final.sh  # Container startup
│   │   └── worker-bundled/     # Pre-bundled worker code (local mode)
│   ├── worker/                 # Worker processes (connect to Redis)
│   │   ├── src/
│   │   │   ├── redis-direct-worker-client.ts  # Redis communication
│   │   │   ├── connector-manager.ts           # Connector loading
│   │   │   └── connectors/                    # Service integrations
│   │   └── __tests__/          # Excellent unit test coverage
│   ├── monitor/                # Real-time monitoring UI
│   └── emprops-api/            # EmProps platform API
├── config/
│   └── environments/
│       ├── components/         # Component .env files
│       ├── services/           # Service interface .ts files
│       ├── profiles/           # Environment profiles .json
│       └── secrets/            # .env.secrets.local (gitignored)
├── packages/
│   ├── env-management/         # Environment builder
│   ├── core/                   # Shared types, Redis functions
│   ├── telemetry/              # Unified telemetry client
│   └── test-utils/             # (To be created)
└── tools/                      # CLI tools, debugging utilities

B. Machine Startup Sequence

1. Container Start
   └─> entrypoint-machine-final.sh

2. Environment Setup
   ├─> Decrypt .env.encrypted → environment variables
   ├─> Set MACHINE_ID, WORKER_ID
   └─> Load service mappings

3. Directory Setup
   ├─> /workspace/.pm2 (PM2 home)
   ├─> /workspace/logs (logs)
   └─> /workspace/ComfyUI (if comfyui profile)

4. Worker Bundle
   ├─> LOCAL MODE: Copy /service-manager/worker-bundled → /workspace/worker-bundled
   └─> REMOTE MODE: Download from GitHub releases

5. System Services (if needed)
   ├─> Ollama: curl install.sh | sh
   ├─> Start ollama serve
   └─> Pull default models (OLLAMA_DEFAULT_MODELS)

6. Telemetry Initialization
   ├─> Create telemetry client
   ├─> Add log file monitors (Winston, PM2, ComfyUI)
   ├─> Send machine.registered event
   └─> Start telemetry pipelines (FluentBit, OTEL)

7. PM2 Ecosystem Generation
   ├─> Parse WORKERS env var (e.g., "comfyui:2,ollama:1")
   ├─> Generate PM2 config with services:
   │   ├─> service-manager (main)
   │   ├─> comfyui-gpu0, comfyui-gpu1 (if comfyui workers)
   │   └─> worker-* (Redis workers)
   └─> Write to /workspace/pm2-ecosystem.config.cjs

8. Service Manager Start
   └─> PM2 starts index-pm2.js (main process)

9. Sequential Startup Orchestrator
   ├─> STEP 1: Health Server (port 9090)
   ├─> STEP 2-14: ComfyUI Installation (if comfyui profile)
   │   ├─> Clone ComfyUI repo
   │   ├─> Install Python dependencies
   │   ├─> Start ComfyUI instances (PM2)
   │   ├─> Wait for health checks
   │   └─> Verify GPU access
   ├─> STEP 15-18: Component Manager
   │   ├─> Fetch default custom nodes from API
   │   ├─> Fetch workflow/collection dependencies (if COMPONENTS/COLLECTIONS env vars)
   │   ├─> Install custom nodes (git clone, pip install)
   │   └─> Download models (wget)
   └─> STEP 19+: Worker Services
       ├─> Start Redis workers (PM2)
       └─> Register with Redis hub

10. Machine Registration
    ├─> Send machine.startup event to Redis
    ├─> Workers register capabilities
    └─> Status aggregator starts periodic updates

11. Ready for Jobs
    └─> Workers poll Redis for matching jobs

C. Testing Matrix

Test Type     Tools              Speed      Coverage     When to Run
Unit          Vitest             < 1s       High         Every code change
Integration   Vitest + Docker    30-60s     Medium       Pre-commit
E2E           Custom scripts     5-10 min   Full stack   PR, nightly
Performance   Custom + metrics   10-30 min  Throughput   Weekly

D. Key Metrics

Environment Build Times:

  • local-dev: ~2s (15 component files, 8 service interfaces)
  • staging: ~2s
  • production: ~2s

Machine Startup Times:

  • Simulation: 10-15s (no ComfyUI)
  • ComfyUI (no models): 2-3 minutes
  • ComfyUI (with models): 5-10 minutes (depends on model download)

Test Execution Times (Current):

  • Worker unit tests: 5-10s (45+ test files)
  • Redis integration tests: 15-30s
  • Full E2E: 5-10 minutes

Docker Image Sizes:

  • Base machine: ~15GB (PyTorch + Node + system deps)
  • ComfyUI profile: +5GB (ComfyUI + custom nodes)
  • Ollama profile: +2GB (Ollama binary)

Document Prepared By: Claude (Anthropic AI Assistant)
Review Status: Draft for stakeholder review
Next Steps: Review findings → Implement quick wins → Re-evaluate in Q1 2026
