
Infrastructure Reference

The EMP Job Queue infrastructure is designed for elastic scaling across ephemeral machines (SALAD, vast.ai) with no shared storage.

Core Constraints

Infrastructure Reality

  • Distributed Machines: SALAD/vast.ai - geographically distributed, no shared storage
  • Ephemeral Scaling: 10 → 50 → 10 machines daily, spot instances
  • No Persistence: Machines spin up/down constantly
  • Fast Startup Required: Seconds, not minutes (baked containers)

Current vs Target Architecture

| Current State | North Star Target |
| --- | --- |
| Uniform machines | Specialized Pools (Fast Lane/Standard/Heavy) |
| Reactive model downloads | Predictive Model Placement |
| Python asset downloader | TypeScript Model Intelligence Service |
| Manual routing | Multi-dimensional Job Router |
| Runtime installations | Baked Container Images per Pool |

Architecture Components

Machine Types

Current deployment and future pool specifications.

  • Current: basic_machine (PM2-managed containers)
  • Future: Fast Lane (CPU-optimized) / Standard (Balanced GPU) / Heavy (High-end GPU)

Docker Images

Container build strategy and model baking.

  • Multi-stage builds with dependency caching
  • ComfyUI custom node installation
  • Model pre-loading strategies

Deployment Patterns

SALAD/vast.ai deployment and environment configuration.

  • Service mapping system
  • Environment profiles (local-dev, production, testrunner)
  • Health check and readiness probes
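
The health and readiness probes could be as simple as the sketch below, assuming an Express-style HTTP server inside the worker container; the route names, port, and the `isRedisConnected`/`modelsReady` probes are illustrative assumptions, not the actual basic_machine implementation.

```typescript
// Sketch of liveness/readiness endpoints a worker container could expose for
// SALAD/vast.ai health checks. Dependency probes are placeholders.
import express from "express";

const app = express();

// Liveness: the process is up and able to answer HTTP.
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok", uptimeSec: process.uptime() });
});

// Readiness: only report ready once dependencies are usable.
app.get("/ready", async (_req, res) => {
  const checks = {
    redis: await isRedisConnected(), // hypothetical connectivity probe
    models: await modelsReady(),     // hypothetical "models on disk" probe
  };
  const ready = Object.values(checks).every(Boolean);
  res.status(ready ? 200 : 503).json({ ready, checks });
});

app.listen(9090);

// Placeholder implementations so the sketch is self-contained.
async function isRedisConnected(): Promise<boolean> { return true; }
async function modelsReady(): Promise<boolean> { return true; }
```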

Infrastructure Services

Redis (Primary Queue Storage)

  • Purpose: Job queue, worker registry, ephemeral state
  • Provider: Upstash Redis (serverless, Redis-compatible)
  • URL: Configured via REDIS_SERVER_URL + REDIS_SERVER_TOKEN
  • Key Features:
    • Atomic job matching via Lua functions
    • Pub/sub for real-time events
    • TTL-based cleanup of completed jobs
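
A minimal sketch of the atomic claim pattern described above, using ioredis and a Lua script; the key names (`jobs:pending`, `jobs:active`), event channel, and payload shape are assumptions for illustration, not the queue's actual schema.

```typescript
// Sketch of atomic job claiming with a Lua script via ioredis.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_SERVER_URL ?? "redis://localhost:6379");

// Atomically pop the highest-priority pending job and record the claiming
// worker, so two workers can never claim the same job.
const CLAIM_JOB = `
  local jobId = redis.call('ZPOPMIN', KEYS[1])[1]
  if not jobId then return nil end
  redis.call('HSET', KEYS[2], jobId, ARGV[1])
  return jobId
`;

export async function claimJob(workerId: string): Promise<string | null> {
  const jobId = (await redis.eval(
    CLAIM_JOB,
    2,
    "jobs:pending",
    "jobs:active",
    workerId,
  )) as string | null;

  if (jobId) {
    // Publish a real-time event so subscribers (e.g. the monitor) update immediately.
    await redis.publish(
      "job-events",
      JSON.stringify({ type: "claimed", jobId, workerId }),
    );
  }
  return jobId;
}
```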

PostgreSQL (Persistent State)

  • Purpose: Job records, user data, collection definitions (EmProps API)
  • Provider: Neon PostgreSQL (serverless)
  • URL: DATABASE_URL environment variable
  • Migration: Prisma CLI (npx prisma migrate deploy)
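
A minimal sketch of reading persistent job records through the generated Prisma client, which picks up DATABASE_URL from the environment; the `job` model and its `createdAt` field are assumptions for illustration, not the actual EmProps API schema.

```typescript
// Sketch of querying persistent state via Prisma (model names are illustrative).
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function recentJobs(limit = 20) {
  return prisma.job.findMany({
    orderBy: { createdAt: "desc" },
    take: limit,
  });
}
```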

Storage (Asset Files)

  • Purpose: Model files, generated outputs, instruction sets
  • Providers:
    • AWS S3 (production)
    • Cloudflare R2 (alternative, configured via STORAGE_PROVIDER)
  • Access: Pre-signed URLs for secure downloads
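
A sketch of generating a pre-signed download URL with the AWS SDK v3; the same client can target Cloudflare R2 through its S3-compatible endpoint. The `R2_ENDPOINT` and `AWS_REGION` variable names are assumptions, only STORAGE_PROVIDER comes from the configuration above.

```typescript
// Sketch of pre-signed URL generation for S3 or R2 (env var names are illustrative).
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const isR2 = process.env.STORAGE_PROVIDER === "r2";

const s3 = new S3Client(
  isR2
    ? { region: "auto", endpoint: process.env.R2_ENDPOINT } // R2 S3-compatible API
    : { region: process.env.AWS_REGION ?? "us-east-1" },
);

export async function downloadUrl(bucket: string, key: string): Promise<string> {
  const command = new GetObjectCommand({ Bucket: bucket, Key: key });
  // The signed URL expires after 15 minutes.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```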

Monitoring Database

  • Purpose: Direct SQL queries for monitoring UI
  • Provider: Neon PostgreSQL (same as EmProps API)
  • Connection: Minimal pool (max: 2) for read-only analytics
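
A minimal sketch of that read-only pool using node-postgres; the `jobs` table and `status` column in the query are assumptions for illustration.

```typescript
// Sketch of the monitor's small read-only connection pool.
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 2,                    // small pool: analytics queries only
  idleTimeoutMillis: 30_000, // release idle connections quickly
});

export async function jobCountsByStatus() {
  const { rows } = await pool.query(
    "SELECT status, COUNT(*) AS count FROM jobs GROUP BY status",
  );
  return rows;
}
```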

Deployment Targets

Local Development

```bash
# Environment: local-dev
REDIS_SERVER_URL=redis://localhost:6379
DATABASE_URL=postgresql://localhost:5432/emprops_dev
EMPROPS_API_URL=http://localhost:3001
```

Services:

  • Redis: Local Docker container
  • PostgreSQL: Local Docker container
  • API: pnpm dev:local-redis (apps/api)
  • EmProps API: pnpm dev (apps/emprops-api)
  • Monitor: pnpm dev (apps/monitor)

Production (Railway)

```bash
# Environment: production
REDIS_SERVER_URL=redis://upstash-redis.railway.app:6379
DATABASE_URL=postgresql://neon.tech:5432/emprops_prod
EMPROPS_API_URL=https://api.emprops.com
```

Deployment:

  • API: Railway service (auto-deploy from master)
  • Workers: SALAD/vast.ai containers
  • Machines: Ephemeral, auto-scale based on queue depth
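
A sketch of what a queue-depth scaling rule could look like; the jobs-per-machine target is illustrative, and the 10/50 bounds simply echo the daily 10 → 50 → 10 range noted above rather than actual production settings.

```typescript
// Sketch of a queue-depth autoscaling rule (targets and bounds are illustrative).
interface ScaleDecision {
  current: number;
  desired: number;
}

export function desiredMachineCount(
  queueDepth: number,
  current: number,
  jobsPerMachine = 5, // assumed target concurrency per machine
  min = 10,
  max = 50,
): ScaleDecision {
  const desired = Math.min(max, Math.max(min, Math.ceil(queueDepth / jobsPerMachine)));
  return { current, desired };
}

// Example: 180 queued jobs with 10 machines running scales toward 36 machines.
// desiredMachineCount(180, 10) -> { current: 10, desired: 36 }
```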

Test Runner

```bash
# Environment: testrunner
# Isolated environment for integration tests
```

Scaling Strategy

Current (Uniform Machines)

  • All machines have identical capabilities
  • Every job type competes for the same resources
  • 1-second Ollama jobs queue alongside 10-minute video jobs

Phase 1: Pool Separation (Months 1-2)

Goal: Eliminate performance interference between short and long jobs

  • Fast Lane Pool:
    • CPU-optimized (minimal GPU)
    • 20-40GB storage
    • Text/simple image processing
    • <10 second job duration
  • Standard Pool:
    • Balanced GPU (RTX 3060-4070)
    • 80-120GB storage
    • Typical ComfyUI workflows
    • 10 second - 5 minute jobs
  • Heavy Pool:
    • High-end GPU (RTX 4090+)
    • 150-300GB storage
    • Video/complex processing
    • 5+ minute jobs

Routing: Duration-based prediction, pool-specific containers
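
A sketch of that duration-based routing, with thresholds mirroring the pool definitions above (<10 s, 10 s to 5 min, 5 min+); the pool identifiers and the idea of a single predicted-duration input are assumptions, not the router's final interface.

```typescript
// Sketch of Phase 1 duration-based pool routing (thresholds from the pool specs above).
type Pool = "fast-lane" | "standard" | "heavy";

export function routeByDuration(predictedSeconds: number): Pool {
  if (predictedSeconds < 10) return "fast-lane";
  if (predictedSeconds < 300) return "standard";
  return "heavy";
}

// Example: a 1-second Ollama job goes to fast-lane, a 10-minute video job to heavy.
// routeByDuration(1)   === "fast-lane"
// routeByDuration(600) === "heavy"
```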

Phase 2: Model Intelligence (Months 3-4)

Goal: Eliminate first-user wait times

  • Predictive model placement
  • Bake common models into containers
  • TypeScript model manager service
  • 80% reduction in download wait times
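
One way the Model Intelligence Service could rank models for predictive placement is sketched below: models with high recent demand but low on-disk coverage get placed or baked first. The scoring formula and field names are illustrative, not the service's actual algorithm.

```typescript
// Sketch of a predictive placement score (weighting and fields are illustrative).
interface ModelStats {
  name: string;
  requestsLast24h: number;   // demand signal
  machinesWithModel: number; // how many machines already hold the model
  totalMachines: number;
}

export function placementScore(m: ModelStats): number {
  const coverage = m.machinesWithModel / Math.max(1, m.totalMachines);
  // High demand and low coverage -> high score -> pre-place or bake first.
  return m.requestsLast24h * (1 - coverage);
}

export function placementOrder(models: ModelStats[]): ModelStats[] {
  return [...models].sort((a, b) => placementScore(b) - placementScore(a));
}
```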

Phase 3: Advanced Optimization (Months 5-6)

Goal: Resource optimization and specialization

  • ML-based demand prediction
  • Specialty routing (LoRA, ControlNet, etc.)
  • 95% optimal job routing

File Locations

  • Machines: /apps/machines/basic_machine/
  • Docker: /apps/api/Dockerfile, /apps/emprops-api/Dockerfile
  • Environment Config: /config/environments/
  • Infrastructure Scripts: /apps/api/entrypoint-api.sh
