EMP-Job-Queue Architecture Analysis & Docker Swarm Evaluation

Date: 2025-10-08
Purpose: Comprehensive analysis of current architecture, testing complexity, and Docker Swarm migration evaluation


Executive Summary

The emp-job-queue system is a sophisticated distributed AI workload broker designed for elastic scaling across ephemeral machines. The current architecture successfully solves complex problems around distributed job routing, model management, and multi-service orchestration.

Key Findings:

  1. Environment management is well-architected - Component-based system provides type-safe, flexible configuration
  2. Machine/worker architecture is production-ready - PM2-based orchestration works well for current needs
  3. Testing complexity stems from distribution, not tooling - Docker Swarm won't fundamentally reduce complexity
  4. Current pain points are solvable without migration - Quick wins available through better documentation and tooling

Recommendation: DEFER Docker Swarm migration. The migration would require 40-60 hours of engineering effort for marginal testing improvements. Better ROI from improving documentation, test tooling, and development workflows with current architecture.


Current Architecture Assessment

1. Environment Management System

Status: Well-designed, production-ready

Strengths:

  • Type-safe service interfaces enforce required variables at build time
  • Flexible component-based composition supports multiple deployment targets
  • Secure automatic separation of public/secret variables
  • DRY (Don't Repeat Yourself) - variables defined once, referenced everywhere
  • Multi-environment - single codebase supports local/staging/production seamlessly

Architecture Pattern:

Component Files (.env) → Service Interfaces (.ts) → Generated .env Files
     ↓                           ↓                          ↓
Define capabilities      Define requirements        Per-service configs
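
For illustration, the middle layer can be imagined as a plain TypeScript declaration of what a service needs; the names and fields below are assumptions for the sketch, not the actual interface definitions in config/environments/services/:

typescript
// Hypothetical shape of a service interface (sketch only).
export interface ServiceEnvSpec {
  required: string[];   // build fails if no component provides these
  optional?: string[];  // included when a component defines them
  secret: string[];     // routed to the secrets file, never committed
}

// Example: what an API service interface might declare.
export const apiService: ServiceEnvSpec = {
  required: ['REDIS_URL', 'API_PORT'],
  optional: ['LOG_LEVEL'],
  secret: ['REDIS_PASSWORD', 'OPENAI_API_KEY'],
};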

Pain Points:

  • Learning curve - New developers need to understand 3-layer system (components → interfaces → profiles)
  • Debugging opacity - Variable resolution happens at build time, errors can be cryptic
  • Documentation gap - System is powerful but under-documented (now resolved with new docs)

Recommendation:

  • Keep current system - well-architected for monorepo microservices
  • Add troubleshooting guide with common error patterns
  • Create visual debugging tool to show variable resolution flow

2. Machine/Worker Architecture

Status: Production-capable, evolving toward the north star

Current Design:

Docker Container (Machine)
├── PM2 Process Manager
│   ├── Service Manager (Node.js main process)
│   ├── ComfyUI Instances (per-GPU)
│   │   ├── comfyui-gpu0 (port 8188)
│   │   ├── comfyui-gpu1 (port 8189)
│   │   └── comfyui-gpuN...
│   ├── Ollama Daemon (optional)
│   └── Redis Workers (Node.js)
│       ├── Worker processes (connect to Redis)
│       └── Connector system (OpenAI, ComfyUI, Ollama, etc.)
├── Telemetry Stack
│   ├── Fluent Bit (log aggregation)
│   ├── OTEL Collector (traces/metrics)
│   └── Nginx (optional monitoring)
└── Entrypoint Scripts (startup orchestration)

Lifecycle Flow:

  1. Container Start → Entrypoint script (entrypoint-machine-final.sh)
  2. Environment Setup → Decrypt environment, set variables
  3. Worker Bundle → Download or copy bundled worker code
  4. System Services → Install/start Ollama if needed
  5. PM2 Ecosystem Generation → Generate PM2 config from WORKERS env var (sketched after this list)
  6. Service Manager Start → Node.js main process (index-pm2.js)
  7. Sequential Startup → ComfyUI → Custom Nodes → Models → Workers
  8. Machine Registration → Redis registration with capabilities
  9. Worker Polling → Start requesting jobs from Redis
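
A minimal sketch of step 5's WORKERS parsing, assuming a value like "comfyui:2,ollama:1" (the real generator lives in the machine app; the names and paths here are illustrative):

typescript
// Hypothetical sketch: expand WORKERS="comfyui:2,ollama:1" into PM2 app configs.
interface Pm2App {
  name: string;
  script: string;
  env: Record<string, string>;
}

function generateEcosystem(workersSpec: string): Pm2App[] {
  const apps: Pm2App[] = [
    { name: 'service-manager', script: 'index-pm2.js', env: {} },
  ];
  for (const entry of workersSpec.split(',')) {
    const [type, countStr] = entry.trim().split(':');
    const count = Number(countStr ?? '1');
    for (let i = 0; i < count; i++) {
      apps.push({
        name: `worker-${type}-${i}`,
        script: 'worker-bundled/index.js', // bundled worker entry (assumed path)
        env: { WORKER_TYPE: type, WORKER_INDEX: String(i) },
      });
    }
  }
  return apps;
}

// generateEcosystem('comfyui:2,ollama:1') → service-manager plus three worker apps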

Strengths:

  • PM2 robustness - Process supervision, restart on failure, log rotation
  • Multi-GPU support - Horizontal scaling within single container
  • Worker bundle - Pre-bundled or runtime-downloaded worker code
  • Component-based - API-driven custom node and model installation
  • Telemetry-first - Built-in observability with structured logging
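
To make the component-based strength concrete, the API-driven installation could conceptually look like the sketch below; the endpoint shape and field names are assumptions, not the actual component-manager.js implementation:

typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);

// Hypothetical shape of a component record returned by the API.
interface ComponentSpec {
  customNodes: { repo: string }[];
  models: { url: string; dest: string }[];
}

// Sketch: fetch a component's dependencies and install them into ComfyUI.
async function installComponent(apiUrl: string, id: string): Promise<void> {
  const res = await fetch(`${apiUrl}/components/${id}`);
  const spec = (await res.json()) as ComponentSpec;

  for (const node of spec.customNodes) {
    // The real flow also pip-installs each node's requirements.
    await exec('git', ['clone', node.repo], { cwd: '/workspace/ComfyUI/custom_nodes' });
  }
  for (const model of spec.models) {
    await exec('wget', ['-O', model.dest, model.url]);
  }
}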

Pain Points:

  • Complexity - Many moving parts (PM2, services, workers, telemetry)
  • Debugging - Distributed logs across PM2 services, container logs, telemetry
  • Startup time - Sequential startup can take 5-10 minutes for ComfyUI + models
  • Testing isolation - Hard to test individual components without full stack

Evolution Path (Per CLAUDE.md):

  • Phase 0 (Current): Redis job broker, PM2 orchestration, component-based setup
  • Phase 1: Pool separation (Fast Lane / Standard / Heavy)
  • Phase 2: Model intelligence (predictive placement, baked containers)
  • Phase 3: Advanced optimization (ML-based routing, specialty pools)

Testing Complexity Analysis

Current Pain Points

1. Environment Setup Complexity

  • Problem: Developers need to build correct environment profile, ensure secrets file exists, understand component layering
  • Impact: New developer onboarding takes hours
  • Root Cause: Documentation gap, not architecture flaw
  • Solution: ✅ Created comprehensive environment management docs

2. Multi-Service Dependencies

  • Problem: Machine requires Redis, Worker requires Machine services, API requires Database
  • Impact: Can't test components in isolation easily
  • Root Cause: Distributed architecture (inherent to design, not solvable by orchestration change)
  • Solution: Better test fixtures, mock services, documented test patterns

3. Log Aggregation

  • Problem: Logs scattered across PM2 services, container stdout, telemetry streams
  • Impact: Debugging requires checking multiple sources
  • Root Cause: Multi-process PM2 architecture
  • Solution: Centralized log viewer tool (already have telemetry stack, needs UI)

4. State Management

  • Problem: Redis state persists between test runs, causing flaky tests
  • Impact: Tests pass/fail inconsistently
  • Root Cause: Shared Redis instance, no automatic cleanup
  • Solution: Test isolation patterns, per-test Redis namespaces, cleanup fixtures

5. Integration Test Overhead

  • Problem: Full stack tests take minutes to spin up (Redis + API + Machine + Workers)
  • Impact: Slow development feedback loop
  • Root Cause: Real services vs mocks (trade-off for high-fidelity testing)
  • Solution: Tiered testing strategy (unit → integration → e2e), better mocking

Testing Best Practices (What Works)

Unit Testing:

  • ✅ Worker has excellent unit test coverage (apps/worker/src/__tests__/)
  • ✅ Vitest provides fast, reliable unit test execution
  • ✅ Mocking patterns work well (apps/worker/src/mocks/)

Integration Testing:

  • ✅ Redis integration tests validate job matching (packages/core/src/redis-functions/__tests__/)
  • ✅ Connector tests validate external API integrations
  • ⚠️ Machine integration tests are minimal (opportunity for improvement)

E2E Testing:

  • ⚠️ Full workflow tests exist but are complex to maintain
  • ⚠️ Debugging failures requires deep system knowledge
  • ⚠️ Flaky tests due to timing issues, resource contention
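
Much of that timing flakiness can be tamed by replacing fixed sleeps with explicit condition polling; a minimal, generic sketch:

typescript
// Poll a condition until it holds or a deadline passes, instead of fixed sleeps
// that race against variable startup times.
export async function waitFor(
  condition: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 500 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Example (hypothetical key): wait for a worker to register before asserting
// on job routing, rather than sleeping a fixed 10 seconds.
// await waitFor(async () => (await redis.scard('workers:active')) > 0);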

Docker Swarm Migration Evaluation

What Docker Swarm Would Provide

Service Definition:

yaml
# docker-compose.swarm.yml
version: '3.8'
services:
  redis:
    image: redis:7
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]

  api:
    image: emp/api:latest
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
    environment:
      - REDIS_URL=redis://redis:6379

  machine-base:
    image: emp/machine:base
    deploy:
      replicas: 0  # Template only
    environment:
      - WORKERS=simulation:1

  machine-comfyui:
    image: emp/machine:comfyui
    deploy:
      replicas: 3
      placement:
        constraints: [node.labels.gpu == true]
    environment:
      - WORKERS=comfyui:2

Swarm Features:

  1. Service Discovery - Services find each other by name (e.g., redis://redis:6379)
  2. Scaling - docker service scale machine-comfyui=5
  3. Rolling Updates - Zero-downtime deployments
  4. Health Checks - Automatic restart of unhealthy services
  5. Secrets Management - Docker secrets vs .env.secret files
  6. Load Balancing - Ingress routing mesh
  7. Constraints - Node placement rules (GPU nodes, memory requirements)

Benefits for Testing

Potential Improvements:

  1. Declarative Infrastructure - Test environments defined in docker-compose.test.yml
  2. Service Templating - Reuse machine service definition with different WORKERS env
  3. Network Isolation - Each test run gets isolated network namespace
  4. Resource Limits - Enforce CPU/memory limits to prevent resource starvation
  5. Parallel Testing - Spin up multiple isolated stacks simultaneously

Example Test Setup:

yaml
# docker-compose.test.yml
services:
  test-redis:
    image: redis:7
    networks: [test-network]

  test-api:
    build: ./apps/api
    environment:
      - REDIS_URL=redis://test-redis:6379
      - NODE_ENV=test
    depends_on: [test-redis]
    networks: [test-network]

  test-machine:
    build:
      context: ./apps/machine
      target: simulation
    environment:
      - HUB_REDIS_URL=redis://test-redis:6379
      - WORKERS=simulation:1
    depends_on: [test-api]
    networks: [test-network]

networks:
  test-network:
    driver: overlay

Migration Effort Estimate

Phase 1: Swarm Setup (8-12 hours)

  • Convert docker-compose.yml to Swarm-compatible format
  • Set up multi-node Swarm cluster (manager + workers)
  • Configure overlay networks
  • Test service discovery patterns

Phase 2: Service Adaptation (16-24 hours)

  • Update entrypoint scripts for Swarm lifecycle
  • Convert .env.secret files to Docker secrets
  • Update health checks for Swarm healthcheck format
  • Migrate PM2 orchestration to Swarm services (or keep PM2 within containers)

Phase 3: CI/CD Integration (8-12 hours)

  • Update deployment scripts for docker stack deploy
  • Configure rolling update strategies
  • Set up secrets injection pipeline
  • Test production deployment workflow

Phase 4: Testing Framework (8-12 hours)

  • Create test-specific Swarm configurations
  • Implement test isolation (network namespaces, service naming)
  • Update test scripts to use Swarm CLI
  • Document test patterns and troubleshooting

Total: 40-60 hours (1-1.5 weeks of focused engineering)

Swarm Downsides

1. Increased Operational Complexity

  • Current: Docker Compose up/down (simple)
  • Swarm: Swarm init, join tokens, node management, stack deploy, service update
  • Impact: Higher learning curve for team, more moving parts in production

2. PM2 vs Swarm Service Management

  • Current: PM2 manages processes within container (comfyui-gpu0, comfyui-gpu1, workers)
  • Swarm: Either keep PM2 (no benefit) or split into separate Swarm services (complexity explosion)
  • Trade-off: Lose per-GPU PM2 orchestration or gain no benefit from Swarm

3. Local Development Friction

  • Current: Docker Compose works identically on laptops and servers
  • Swarm: Swarm mode requires docker swarm init even on laptops, complicating simple dev workflows
  • Impact: Developers need two workflows (Compose for dev, Swarm for staging/prod)

4. Debugging Complexity

  • Current: docker logs container-name, docker exec -it container bash
  • Swarm: docker service logs service-name (aggregated across replicas); docker exec requires finding the specific task ID
  • Impact: Harder to debug individual instances

5. GPU Scheduling Challenges

  • Current: PM2 creates comfyui-gpu0, comfyui-gpu1 within single container with GPU access
  • Swarm: GPU constraints are node-level, not per-service, so a different approach would be needed:
    • Option A: One container per GPU (more containers, more overhead)
    • Option B: Keep current PM2 approach (no Swarm benefit)

Critical Question: What Problem Does Swarm Solve?

Current Architecture Problems:

  1. Testing complexity → Swarm doesn't solve this (still need Redis, API, Machine stack)
  2. Environment configuration → Swarm doesn't simplify this (still need .env files or Swarm secrets)
  3. Log aggregation → Already have telemetry stack (Fluent Bit + OTEL)
  4. Multi-GPU orchestration → PM2 works well, Swarm complicates this
  5. State management → Redis state persists regardless of orchestrator

Swarm Solutions:

  1. Service discovery - Nice, but we already use explicit REDIS_URL, API_URL env vars
  2. Scaling - We scale by adding machines (SALAD/vast.ai), not replicating services
  3. Rolling updates - Useful for API/monitor, less so for ephemeral machines
  4. ⚠️ Secrets management - Docker secrets vs .env.secret files is a lateral move, not an improvement

Conclusion: Swarm solves problems we don't have, adds complexity to areas that work well.


Testing Improvements (Without Swarm)

Quick Wins (< 8 hours total)

1. Test Isolation Utilities (2 hours)

typescript
// packages/test-utils/src/redis-isolation.ts
import type { Redis } from 'ioredis'; // type-only import; assumes ioredis as the client

export class RedisTestIsolation {
  private namespace: string;

  constructor(testName: string) {
    this.namespace = `test:${testName}:${Date.now()}`;
  }

  // Prefix all Redis keys with namespace
  jobKey(id: string): string {
    return `${this.namespace}:job:${id}`;
  }

  // Cleanup after test
  async cleanup(redis: Redis): Promise<void> {
    const keys = await redis.keys(`${this.namespace}:*`);
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}
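
Usage in a Vitest test could look like this (the package name and Redis URL fallback are assumptions):

typescript
import Redis from 'ioredis';
import { afterEach, beforeEach, expect, it } from 'vitest';
import { RedisTestIsolation } from '@emp/test-utils'; // hypothetical package name

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
let isolation: RedisTestIsolation;

beforeEach(() => {
  isolation = new RedisTestIsolation('job-matching');
});

afterEach(async () => {
  await isolation.cleanup(redis); // removes only this test's namespaced keys
});

it('stores a job under the namespaced key', async () => {
  await redis.set(isolation.jobKey('123'), 'queued');
  expect(await redis.get(isolation.jobKey('123'))).toBe('queued');
});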

2. Environment Setup Script (2 hours)

bash
#!/bin/bash
# tools/setup-test-env.sh
set -euo pipefail

echo "🔧 Setting up test environment..."

# Check secrets file exists
if [[ ! -f "config/environments/secrets/.env.secrets.local" ]]; then
  echo "❌ Missing secrets file"
  echo "👉 Copy example: cp config/environments/secrets/.env.secrets.local.example config/environments/secrets/.env.secrets.local"
  exit 1
fi

# Build test environment
pnpm env:build testrunner

# Start Redis (service is named test-redis in docker-compose.test.yml)
docker-compose -f docker-compose.test.yml up -d test-redis

# Wait for Redis to answer PING
until docker-compose -f docker-compose.test.yml exec -T test-redis redis-cli ping > /dev/null 2>&1; do
  sleep 1
done

echo "✅ Test environment ready"

3. Log Aggregation Tool (3 hours)

typescript
// tools/log-viewer/index.ts
// Simple web UI to view logs from:
// - PM2 services (read from /workspace/logs/)
// - Container stdout (docker logs)
// - Telemetry streams (Redis streams)

// Features:
// - Real-time tail
// - Filtering by service/level
// - Search across all logs
// - Timeline view
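
If telemetry events land in Redis streams, the real-time tail could be little more than a blocking XREAD loop; the stream key below is an assumption:

typescript
import Redis from 'ioredis';

// Sketch: tail a telemetry stream with blocking XREAD (hypothetical key name).
async function tailStream(redis: Redis, stream = 'telemetry:logs'): Promise<void> {
  let lastId = '$'; // start with only new entries
  for (;;) {
    const result = await redis.xread('BLOCK', 5000, 'STREAMS', stream, lastId);
    if (!result) continue; // block timed out; poll again
    for (const [, entries] of result) {
      for (const [id, fields] of entries) {
        lastId = id;
        console.log(id, fields); // a real viewer would filter/format here
      }
    }
  }
}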

4. Test Documentation (1 hour)

  • Document standard test patterns
  • Create testing troubleshooting guide
  • Add examples of unit/integration/e2e test structure

Medium-term Improvements (8-16 hours)

1. Mock Service Generator (8 hours)

typescript
// Generate lightweight mock services for testing:
// - Mock Redis (in-memory, fast)
// - Mock API (stub endpoints)
// - Mock Machine (fake workers)

// Benefits:
// - Fast unit tests (no Docker)
// - Deterministic behavior
// - Easy to customize per test
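
As a flavor of the approach, a hand-rolled in-memory stand-in for the handful of Redis commands a unit test touches might look like this (a sketch, not a drop-in ioredis replacement):

typescript
// Minimal in-memory mock covering only the commands a given test needs.
export class MockRedis {
  private store = new Map<string, string>();

  async set(key: string, value: string): Promise<'OK'> {
    this.store.set(key, value);
    return 'OK';
  }

  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }

  async del(...keys: string[]): Promise<number> {
    let removed = 0;
    for (const key of keys) if (this.store.delete(key)) removed++;
    return removed;
  }
}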

2. Test Fixtures Library (4 hours)

typescript
// Pre-built test data:
// - Sample jobs (comfyui, ollama, openai)
// - Worker capabilities
// - Machine configurations

// Benefits:
// - Consistent test data
// - Reduce test boilerplate
// - Easy to extend
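
A factory pattern keeps such fixtures terse; the field names below are assumptions based on the job types mentioned above:

typescript
// Hypothetical job fixture factory: sensible defaults, overridable per test.
interface JobFixture {
  id: string;
  service: 'comfyui' | 'ollama' | 'openai';
  priority: number;
  payload: Record<string, unknown>;
}

let counter = 0;

export function makeJob(overrides: Partial<JobFixture> = {}): JobFixture {
  return {
    id: `job-${++counter}`,
    service: 'comfyui',
    priority: 50,
    payload: { prompt: 'test prompt' },
    ...overrides,
  };
}

// makeJob({ service: 'ollama', priority: 100 }) → ollama job, everything else defaulted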

3. Dev Mode Optimizations (4 hours)

  • Skip telemetry in test mode (faster startup)
  • Parallel PM2 service startup (where possible)
  • Cached model downloads (share across containers)
  • Volume mounts for worker code (no rebuild needed)

Long-term Vision (16-24 hours)

1. Tiered Testing Strategy

Unit Tests (Fast - seconds)
↓ Use mocks, no Docker
↓ Run on every code change

Integration Tests (Medium - 30-60s)
↓ Real Redis, mocked external APIs
↓ Run on pre-commit

E2E Tests (Slow - 5-10 mins)
↓ Full stack, real services
↓ Run on PR, nightly
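
The tiers could be selected with an environment variable in the Vitest config; the file-glob conventions here are assumptions:

typescript
// vitest.config.ts (sketch): pick a tier via TEST_TIER, e.g. TEST_TIER=e2e vitest run
import { defineConfig } from 'vitest/config';

const tier = process.env.TEST_TIER ?? 'unit';

const include: Record<string, string[]> = {
  unit: ['**/*.unit.test.ts'],
  integration: ['**/*.int.test.ts'],
  e2e: ['**/*.e2e.test.ts'],
};

export default defineConfig({
  test: {
    include: include[tier],
    testTimeout: tier === 'e2e' ? 600_000 : 30_000, // e2e gets 10 minutes
  },
});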

2. Test Infrastructure as Code

  • Codify test environments in docker-compose files
  • Document test database setup
  • Automate test environment provisioning

3. Continuous Testing

  • GitHub Actions workflows for PR testing
  • Automated environment provisioning
  • Test result visualization

Recommendations

Immediate Actions (This Week)

  1. ✅ DONE: Create comprehensive environment management documentation
  2. Create testing procedures doc - Document standard test setup, common patterns, troubleshooting
  3. Add test isolation utilities - RedisTestIsolation class for namespace-based cleanup
  4. Create setup-test-env.sh script - One-command test environment setup

Short-term (This Month)

  1. Improve machine documentation - Lifecycle flow, PM2 orchestration, debugging guide
  2. Build log aggregation UI - Simple web viewer for PM2 logs, container logs, telemetry
  3. Create mock service library - Lightweight mocks for fast unit testing
  4. Document test patterns - Examples of good unit/integration/e2e tests

Medium-term (Next Quarter)

  1. Tiered testing strategy - Separate fast/medium/slow tests, optimize CI/CD
  2. Dev mode optimizations - Faster startup, better caching, volume mounts
  3. Test fixtures library - Pre-built test data for common scenarios
  4. Monitoring dashboard - Real-time view of machines, workers, jobs (leverage existing telemetry)

Docker Swarm Decision

Recommendation: DEFER indefinitely

Reasons:

  1. Marginal benefits - Swarm solves problems we don't have (service discovery works, scaling is machine-based not replica-based)
  2. High migration cost - 40-60 hours for uncertain ROI
  3. Current architecture works - PM2 orchestration is production-ready
  4. Better alternatives exist - Quick wins address actual pain points (testing, documentation)
  5. Complexity trade-off - Swarm adds operational complexity without clear testing benefits
  6. GPU orchestration mismatch - Current PM2 approach (multiple processes per GPU) doesn't map well to Swarm services

When to Reconsider:

  • If we need true multi-node orchestration (currently use SALAD/vast.ai for scaling, not self-hosted clusters)
  • If we adopt Kubernetes (more powerful than Swarm, worth migration effort)
  • If PM2 orchestration becomes bottleneck (not currently the case)

Key Takeaways

  1. Environment management is a strength - Well-designed, just needs better documentation (now provided)
  2. Machine/worker architecture is production-capable - PM2 orchestration works well for GPU workloads
  3. Testing complexity is inherent to distribution - No silver bullet; incremental improvements are the path
  4. Docker Swarm is not the answer - Would add complexity without solving real pain points
  5. Quick wins available - Test isolation, documentation, tooling can provide immediate relief

Focus areas for next month:

  • ✅ Environment documentation (done)
  • Testing documentation and utilities
  • Log aggregation tooling
  • Developer experience improvements

Avoid:

  • Large architectural rewrites (Swarm migration)
  • Over-engineering test infrastructure
  • Solutions looking for problems

Appendices

A. Current File Structure

emerge-turbo/
├── apps/
│   ├── api/                     # Job queue API (Redis orchestration)
│   ├── machine/                 # Container deployment (PM2 orchestration)
│   │   ├── Dockerfile          # Multi-stage: base, comfyui, ollama, simulation
│   │   ├── src/
│   │   │   ├── index-pm2.js    # Main entry point
│   │   │   ├── services/       # Service management
│   │   │   │   ├── component-manager.js  # API-driven custom nodes/models
│   │   │   │   ├── comfyui-management-client.js
│   │   │   │   ├── machine-status-aggregator.js
│   │   │   │   ├── redis-worker-service.js
│   │   │   │   └── sequential-startup-orchestrator.js
│   │   ├── scripts/
│   │   │   └── entrypoint-machine-final.sh  # Container startup
│   │   └── worker-bundled/     # Pre-bundled worker code (local mode)
│   ├── worker/                 # Worker processes (connect to Redis)
│   │   ├── src/
│   │   │   ├── redis-direct-worker-client.ts  # Redis communication
│   │   │   ├── connector-manager.ts           # Connector loading
│   │   │   └── connectors/                    # Service integrations
│   │   └── __tests__/          # Excellent unit test coverage
│   ├── monitor/                # Real-time monitoring UI
│   └── emprops-api/            # EmProps platform API
├── config/
│   └── environments/
│       ├── components/         # Component .env files
│       ├── services/           # Service interface .ts files
│       ├── profiles/           # Environment profiles .json
│       └── secrets/            # .env.secrets.local (gitignored)
├── packages/
│   ├── env-management/         # Environment builder
│   ├── core/                   # Shared types, Redis functions
│   ├── telemetry/              # Unified telemetry client
│   └── test-utils/             # (To be created)
└── tools/                      # CLI tools, debugging utilities

B. Machine Startup Sequence

1. Container Start
   └─> entrypoint-machine-final.sh

2. Environment Setup
   ├─> Decrypt .env.encrypted → environment variables
   ├─> Set MACHINE_ID, WORKER_ID
   └─> Load service mappings

3. Directory Setup
   ├─> /workspace/.pm2 (PM2 home)
   ├─> /workspace/logs (logs)
   └─> /workspace/ComfyUI (if comfyui profile)

4. Worker Bundle
   ├─> LOCAL MODE: Copy /service-manager/worker-bundled → /workspace/worker-bundled
   └─> REMOTE MODE: Download from GitHub releases

5. System Services (if needed)
   ├─> Ollama: curl install.sh | sh
   ├─> Start ollama serve
   └─> Pull default models (OLLAMA_DEFAULT_MODELS)

6. Telemetry Initialization
   ├─> Create telemetry client
   ├─> Add log file monitors (Winston, PM2, ComfyUI)
   ├─> Send machine.registered event
   └─> Start telemetry pipelines (FluentBit, OTEL)

7. PM2 Ecosystem Generation
   ├─> Parse WORKERS env var (e.g., "comfyui:2,ollama:1")
   ├─> Generate PM2 config with services:
   │   ├─> service-manager (main)
   │   ├─> comfyui-gpu0, comfyui-gpu1 (if comfyui workers)
   │   └─> worker-* (Redis workers)
   └─> Write to /workspace/pm2-ecosystem.config.cjs

8. Service Manager Start
   └─> PM2 starts index-pm2.js (main process)

9. Sequential Startup Orchestrator
   ├─> STEP 1: Health Server (port 9090)
   ├─> STEP 2-14: ComfyUI Installation (if comfyui profile)
   │   ├─> Clone ComfyUI repo
   │   ├─> Install Python dependencies
   │   ├─> Start ComfyUI instances (PM2)
   │   ├─> Wait for health checks
   │   └─> Verify GPU access
   ├─> STEP 15-18: Component Manager
   │   ├─> Fetch default custom nodes from API
   │   ├─> Fetch workflow/collection dependencies (if COMPONENTS/COLLECTIONS env vars)
   │   ├─> Install custom nodes (git clone, pip install)
   │   └─> Download models (wget)
   └─> STEP 19+: Worker Services
       ├─> Start Redis workers (PM2)
       └─> Register with Redis hub

10. Machine Registration
    ├─> Send machine.startup event to Redis
    ├─> Workers register capabilities
    └─> Status aggregator starts periodic updates

11. Ready for Jobs
    └─> Workers poll Redis for matching jobs

C. Testing Matrix

Test Type     Tools              Speed      Coverage     When to Run
Unit          Vitest             < 1s       High         Every code change
Integration   Vitest + Docker    30-60s     Medium       Pre-commit
E2E           Custom scripts     5-10 min   Full stack   PR, nightly
Performance   Custom + metrics   10-30 min  Throughput   Weekly

D. Key Metrics

Environment Build Times:

  • local-dev: ~2s (15 component files, 8 service interfaces)
  • staging: ~2s
  • production: ~2s

Machine Startup Times:

  • Simulation: 10-15s (no ComfyUI)
  • ComfyUI (no models): 2-3 minutes
  • ComfyUI (with models): 5-10 minutes (depends on model download)

Test Execution Times (Current):

  • Worker unit tests: 5-10s (45+ test files)
  • Redis integration tests: 15-30s
  • Full E2E: 5-10 minutes

Docker Image Sizes:

  • Base machine: ~15GB (PyTorch + Node + system deps)
  • ComfyUI profile: +5GB (ComfyUI + custom nodes)
  • Ollama profile: +2GB (Ollama binary)

Document Prepared By: Claude (Anthropic AI Assistant)
Review Status: Draft for stakeholder review
Next Steps: Review findings → Implement quick wins → Re-evaluate in Q1 2026
