
Infrastructure Reference

The EMP Job Queue infrastructure is designed for elastic scaling across ephemeral machines (SALAD, vast.ai) with no shared storage.

Core Constraints

Infrastructure Reality

  • Distributed Machines: SALAD/vast.ai - geographically distributed, no shared storage
  • Ephemeral Scaling: 10 → 50 → 10 machines daily, spot instances
  • No Persistence: Machines spin up/down constantly
  • Fast Startup Required: Seconds, not minutes (baked containers)

Current vs Target Architecture

| Current State | North Star Target |
| --- | --- |
| Uniform machines | Specialized Pools (Fast Lane/Standard/Heavy) |
| Reactive model downloads | Predictive Model Placement |
| Python asset downloader | TypeScript Model Intelligence Service |
| Manual routing | Multi-dimensional Job Router |
| Runtime installations | Baked Container Images per Pool |

Architecture Components

Machine Types

Current deployment and future pool specifications.

  • Current: basic_machine (PM2-managed containers)
  • Future: Fast Lane (CPU-optimized) / Standard (Balanced GPU) / Heavy (High-end GPU)

Docker Images

Container build strategy and model baking.

  • Multi-stage builds with dependency caching
  • ComfyUI custom node installation
  • Model pre-loading strategies

Deployment Patterns

SALAD/vast.ai deployment and environment configuration.

  • Service mapping system
  • Environment profiles (local-dev, production, testrunner)
  • Health check and readiness probes
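
The health and readiness probes could be as simple as the sketch below, assuming an Express-style HTTP server inside the worker container; the route names, port, and the `isRedisConnected`/`modelsReady` probes are illustrative assumptions, not the actual basic_machine implementation.

```typescript
// Sketch of liveness/readiness endpoints a worker container could expose for
// SALAD/vast.ai health checks. Dependency probes are placeholders.
import express from "express";

const app = express();

// Liveness: the process is up and able to answer HTTP.
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok", uptimeSec: process.uptime() });
});

// Readiness: only report ready once dependencies are usable.
app.get("/ready", async (_req, res) => {
  const checks = {
    redis: await isRedisConnected(), // hypothetical connectivity probe
    models: await modelsReady(),     // hypothetical "models on disk" probe
  };
  const ready = Object.values(checks).every(Boolean);
  res.status(ready ? 200 : 503).json({ ready, checks });
});

app.listen(9090);

// Placeholder implementations so the sketch is self-contained.
async function isRedisConnected(): Promise<boolean> { return true; }
async function modelsReady(): Promise<boolean> { return true; }
```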

Infrastructure Services

Redis (Primary Queue Storage)

  • Purpose: Job queue, worker registry, ephemeral state
  • Provider: Upstash Redis (serverless, Redis-compatible)
  • URL: Configured via REDIS_SERVER_URL + REDIS_SERVER_TOKEN
  • Key Features:
    • Atomic job matching via Lua functions
    • Pub/sub for real-time events
    • TTL-based cleanup of completed jobs
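
A minimal sketch of the atomic claim pattern described above, using ioredis and a Lua script; the key names (`jobs:pending`, `jobs:active`), event channel, and payload shape are assumptions for illustration, not the queue's actual schema.

```typescript
// Sketch of atomic job claiming with a Lua script via ioredis.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_SERVER_URL ?? "redis://localhost:6379");

// Atomically pop the highest-priority pending job and record the claiming
// worker, so two workers can never claim the same job.
const CLAIM_JOB = `
  local jobId = redis.call('ZPOPMIN', KEYS[1])[1]
  if not jobId then return nil end
  redis.call('HSET', KEYS[2], jobId, ARGV[1])
  return jobId
`;

export async function claimJob(workerId: string): Promise<string | null> {
  const jobId = (await redis.eval(
    CLAIM_JOB,
    2,
    "jobs:pending",
    "jobs:active",
    workerId,
  )) as string | null;

  if (jobId) {
    // Publish a real-time event so subscribers (e.g. the monitor) update immediately.
    await redis.publish(
      "job-events",
      JSON.stringify({ type: "claimed", jobId, workerId }),
    );
  }
  return jobId;
}
```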

PostgreSQL (Persistent State)

  • Purpose: Job records, user data, collection definitions (EmProps API)
  • Provider: Neon PostgreSQL (serverless)
  • URL: DATABASE_URL environment variable
  • Migration: Prisma CLI (npx prisma migrate deploy)
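
A minimal sketch of reading persistent job records through the generated Prisma client, which picks up DATABASE_URL from the environment; the `job` model and its `createdAt` field are assumptions for illustration, not the actual EmProps API schema.

```typescript
// Sketch of querying persistent state via Prisma (model names are illustrative).
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function recentJobs(limit = 20) {
  return prisma.job.findMany({
    orderBy: { createdAt: "desc" },
    take: limit,
  });
}
```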

Storage (Asset Files)

  • Purpose: Model files, generated outputs, instruction sets
  • Providers:
    • AWS S3 (production)
    • Cloudflare R2 (alternative, configured via STORAGE_PROVIDER)
  • Access: Pre-signed URLs for secure downloads
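
A sketch of generating a pre-signed download URL with the AWS SDK v3; the same client can target Cloudflare R2 through its S3-compatible endpoint. The `R2_ENDPOINT` and `AWS_REGION` variable names are assumptions, only STORAGE_PROVIDER comes from the configuration above.

```typescript
// Sketch of pre-signed URL generation for S3 or R2 (env var names are illustrative).
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const isR2 = process.env.STORAGE_PROVIDER === "r2";

const s3 = new S3Client(
  isR2
    ? { region: "auto", endpoint: process.env.R2_ENDPOINT } // R2 S3-compatible API
    : { region: process.env.AWS_REGION ?? "us-east-1" },
);

export async function downloadUrl(bucket: string, key: string): Promise<string> {
  const command = new GetObjectCommand({ Bucket: bucket, Key: key });
  // The signed URL expires after 15 minutes.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```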

Monitoring Database

  • Purpose: Direct SQL queries for monitoring UI
  • Provider: Neon PostgreSQL (same as EmProps API)
  • Connection: Minimal pool (max: 2) for read-only analytics
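
A minimal sketch of that read-only pool using node-postgres; the `jobs` table and `status` column in the query are assumptions for illustration.

```typescript
// Sketch of the monitor's small read-only connection pool.
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 2,                    // small pool: analytics queries only
  idleTimeoutMillis: 30_000, // release idle connections quickly
});

export async function jobCountsByStatus() {
  const { rows } = await pool.query(
    "SELECT status, COUNT(*) AS count FROM jobs GROUP BY status",
  );
  return rows;
}
```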

Deployment Targets

Local Development

```bash
# Environment: local-dev
REDIS_SERVER_URL=redis://localhost:6379
DATABASE_URL=postgresql://localhost:5432/emprops_dev
EMPROPS_API_URL=http://localhost:3001
```

Services:

  • Redis: Local Docker container
  • PostgreSQL: Local Docker container
  • API: pnpm dev:local-redis (apps/api)
  • EmProps API: pnpm dev (apps/emprops-api)
  • Monitor: pnpm dev (apps/monitor)

Production (Railway)

```bash
# Environment: production
REDIS_SERVER_URL=redis://upstash-redis.railway.app:6379
DATABASE_URL=postgresql://neon.tech:5432/emprops_prod
EMPROPS_API_URL=https://api.emprops.com
```

Deployment:

  • API: Railway service (auto-deploy from master)
  • Workers: SALAD/vast.ai containers
  • Machines: Ephemeral, auto-scale based on queue depth
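
A sketch of what a queue-depth scaling rule could look like; the jobs-per-machine target is illustrative, and the 10/50 bounds simply echo the daily 10 → 50 → 10 range noted above rather than actual production settings.

```typescript
// Sketch of a queue-depth autoscaling rule (targets and bounds are illustrative).
interface ScaleDecision {
  current: number;
  desired: number;
}

export function desiredMachineCount(
  queueDepth: number,
  current: number,
  jobsPerMachine = 5, // assumed target concurrency per machine
  min = 10,
  max = 50,
): ScaleDecision {
  const desired = Math.min(max, Math.max(min, Math.ceil(queueDepth / jobsPerMachine)));
  return { current, desired };
}

// Example: 180 queued jobs with 10 machines running scales toward 36 machines.
// desiredMachineCount(180, 10) -> { current: 10, desired: 36 }
```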

Test Runner

```bash
# Environment: testrunner
# Isolated environment for integration tests
```

Scaling Strategy

Current (Uniform Machines)

  • All machines have identical capabilities
  • Every job type competes for the same resources
  • 1-second Ollama jobs queue alongside 10-minute video jobs

Phase 1: Pool Separation (Months 1-2)

Goal: Eliminate performance interference between short and long jobs

  • Fast Lane Pool:
    • CPU-optimized (minimal GPU)
    • 20-40GB storage
    • Text/simple image processing
    • <10 second job duration
  • Standard Pool:
    • Balanced GPU (RTX 3060-4070)
    • 80-120GB storage
    • Typical ComfyUI workflows
    • 10 second - 5 minute jobs
  • Heavy Pool:
    • High-end GPU (RTX 4090+)
    • 150-300GB storage
    • Video/complex processing
    • 5+ minute jobs

Routing: Duration-based prediction, pool-specific containers
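
A sketch of that duration-based routing, with thresholds mirroring the pool definitions above (<10 s, 10 s to 5 min, 5 min+); the pool identifiers and the idea of a single predicted-duration input are assumptions, not the router's final interface.

```typescript
// Sketch of Phase 1 duration-based pool routing (thresholds from the pool specs above).
type Pool = "fast-lane" | "standard" | "heavy";

export function routeByDuration(predictedSeconds: number): Pool {
  if (predictedSeconds < 10) return "fast-lane";
  if (predictedSeconds < 300) return "standard";
  return "heavy";
}

// Example: a 1-second Ollama job goes to fast-lane, a 10-minute video job to heavy.
// routeByDuration(1)   === "fast-lane"
// routeByDuration(600) === "heavy"
```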

Phase 2: Model Intelligence (Months 3-4)

Goal: Eliminate first-user wait times

  • Predictive model placement
  • Bake common models into containers
  • TypeScript model manager service
  • 80% reduction in download wait times
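
One way the Model Intelligence Service could rank models for predictive placement is sketched below: models with high recent demand but low on-disk coverage get placed or baked first. The scoring formula and field names are illustrative, not the service's actual algorithm.

```typescript
// Sketch of a predictive placement score (weighting and fields are illustrative).
interface ModelStats {
  name: string;
  requestsLast24h: number;   // demand signal
  machinesWithModel: number; // how many machines already hold the model
  totalMachines: number;
}

export function placementScore(m: ModelStats): number {
  const coverage = m.machinesWithModel / Math.max(1, m.totalMachines);
  // High demand and low coverage -> high score -> pre-place or bake first.
  return m.requestsLast24h * (1 - coverage);
}

export function placementOrder(models: ModelStats[]): ModelStats[] {
  return [...models].sort((a, b) => placementScore(b) - placementScore(a));
}
```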

Phase 3: Advanced Optimization (Months 5-6)

Goal: Resource optimization and specialization

  • ML-based demand prediction
  • Specialty routing (LoRA, ControlNet, etc.)
  • 95% optimal job routing

File Locations

  • Machines: /apps/machines/basic_machine/
  • Docker: /apps/api/Dockerfile, /apps/emprops-api/Dockerfile
  • Environment Config: /config/environments/
  • Infrastructure Scripts: /apps/api/entrypoint-api.sh
