# Infrastructure Reference
The EMP Job Queue infrastructure is designed for elastic scaling across ephemeral machines (SALAD, vast.ai) with no shared storage.
## Core Constraints
### Infrastructure Reality
- Distributed Machines: SALAD/vast.ai hosts are geographically distributed with no shared storage
- Ephemeral Scaling: fleets grow and shrink daily (e.g. 10 → 50 → 10 machines) on spot instances
- No Persistence: machines spin up and down constantly
- Fast Startup Required: seconds, not minutes (baked container images)
### Current vs Target Architecture
| Current State | North Star Target |
|---|---|
| Uniform machines | Specialized Pools (Fast Lane/Standard/Heavy) |
| Reactive model downloads | Predictive Model Placement |
| Python asset downloader | TypeScript Model Intelligence Service |
| Manual routing | Multi-dimensional Job Router |
| Runtime installations | Baked Container Images per Pool |
## Architecture Components
### Machine Types
Current deployment and future pool specifications.
- Current: basic_machine (PM2-managed containers)
- Future: Fast Lane (CPU-optimized) / Standard (Balanced GPU) / Heavy (High-end GPU)
### Docker Images
Container build strategy and model baking.
- Multi-stage builds with dependency caching
- ComfyUI custom node installation
- Model pre-loading strategies
### Deployment Patterns
SALAD/vast.ai deployment and environment configuration.
- Service mapping system
- Environment profiles (local-dev, production, testrunner)
- Health check and readiness probes
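Readiness probes typically aggregate several dependency checks into one verdict. A minimal sketch of that aggregation (the `readiness` function and check names are illustrative, not from the codebase):

```typescript
// Aggregate individual dependency checks into a single readiness verdict,
// as a /ready probe handler might before reporting to the orchestrator.
type CheckResult = { name: string; ok: boolean };

function readiness(checks: CheckResult[]): { ready: boolean; failing: string[] } {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  return { ready: failing.length === 0, failing };
}

// Example: Redis reachable, database not yet migrated → not ready.
const verdict = readiness([
  { name: "redis", ok: true },
  { name: "postgres", ok: false },
]);
```

A probe built this way fails fast on any single broken dependency while reporting which one, which matters on ephemeral machines where startup diagnostics are otherwise lost.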
## Infrastructure Services
### Redis (Primary Queue Storage)
- Purpose: Job queue, worker registry, ephemeral state
- Provider: Upstash Redis (serverless, Redis-compatible)
- URL: Configured via `REDIS_SERVER_URL` + `REDIS_SERVER_TOKEN`
- Key Features:
  - Atomic job matching via Lua functions
  - Pub/sub for real-time events
  - TTL-based cleanup of completed jobs
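The matching predicate those Lua functions apply can be sketched in TypeScript (field names here are illustrative; the real job/worker schemas live in the codebase):

```typescript
// Sketch of capability-based job matching. In production this scan-and-claim
// runs server-side as an atomic Redis Lua script so that two workers can
// never claim the same job; this pure function only models the predicate.
interface Job {
  id: string;
  requiredCapabilities: string[]; // e.g. ["comfyui", "gpu:24gb"]
}

interface Worker {
  id: string;
  capabilities: string[];
}

// Return the first queued job this worker can serve, or null.
function matchJob(worker: Worker, queue: Job[]): Job | null {
  const caps = new Set(worker.capabilities);
  for (const job of queue) {
    if (job.requiredCapabilities.every((c) => caps.has(c))) return job;
  }
  return null;
}
```

Doing the match inside Redis (rather than in the worker) is what makes claiming atomic: the script reads the queue, picks a job, and marks it claimed in one uninterruptible step.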
### PostgreSQL (Persistent State)
- Purpose: Job records, user data, collection definitions (EmProps API)
- Provider: Neon PostgreSQL (serverless)
- URL: `DATABASE_URL` environment variable
- Migration: Prisma CLI (`npx prisma migrate deploy`)
### Storage (Asset Files)
- Purpose: Model files, generated outputs, instruction sets
- Providers:
  - AWS S3 (production)
  - Cloudflare R2 (alternative, configured via `STORAGE_PROVIDER`)
- Access: Pre-signed URLs for secure downloads
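The idea behind pre-signed URLs can be illustrated with a simplified HMAC scheme (real S3/R2 pre-signing uses AWS Signature V4 via the provider SDK; the functions below are only a sketch of why such URLs are safe to hand out):

```typescript
import { createHmac } from "node:crypto";

// A pre-signed URL embeds an expiry plus a signature over path + expiry,
// so the link cannot be altered, and stops working once the expiry passes.
function presign(path: string, secret: string, expiresAtSec: number): string {
  const payload = `${path}?expires=${expiresAtSec}`;
  const sig = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}&signature=${sig}`;
}

function verify(url: string, secret: string, nowSec: number): boolean {
  const m = url.match(/^(.*)\?expires=(\d+)&signature=([0-9a-f]+)$/);
  if (!m) return false;
  const [, path, expires, sig] = m;
  if (nowSec > Number(expires)) return false; // link has expired
  const expected = createHmac("sha256", secret)
    .update(`${path}?expires=${expires}`)
    .digest("hex");
  return sig === expected;
}
```

This is why workers on untrusted spot instances can download model files without holding long-lived storage credentials: the URL itself carries a scoped, expiring grant.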
### Monitoring Database
- Purpose: Direct SQL queries for monitoring UI
- Provider: Neon PostgreSQL (same as EmProps API)
- Connection: Minimal pool (max: 2) for read-only analytics
## Deployment Targets
### Local Development
```bash
# Environment: local-dev
REDIS_SERVER_URL=redis://localhost:6379
DATABASE_URL=postgresql://localhost:5432/emprops_dev
EMPROPS_API_URL=http://localhost:3001
```

Services:
- Redis: Local Docker container
- PostgreSQL: Local Docker container
- API: `pnpm dev:local-redis` (apps/api)
- EmProps API: `pnpm dev` (apps/emprops-api)
- Monitor: `pnpm dev` (apps/monitor)
### Production (Railway)
```bash
# Environment: production
REDIS_SERVER_URL=redis://upstash-redis.railway.app:6379
DATABASE_URL=postgresql://neon.tech:5432/emprops_prod
EMPROPS_API_URL=https://api.emprops.com
```

Deployment:
- API: Railway service (auto-deploy from master)
- Workers: SALAD/vast.ai containers
- Machines: Ephemeral, auto-scaled based on queue depth
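A queue-depth autoscaling rule can be as simple as targeting one machine per N queued jobs, clamped to a floor and ceiling. A minimal sketch (the function name and all parameter values are assumptions, not production settings):

```typescript
// Target machine count from queue depth: one machine per `jobsPerMachine`
// queued jobs, clamped to [minMachines, maxMachines].
function targetMachineCount(
  queueDepth: number,
  jobsPerMachine: number,
  minMachines: number,
  maxMachines: number,
): number {
  const wanted = Math.ceil(queueDepth / jobsPerMachine);
  return Math.min(maxMachines, Math.max(minMachines, wanted));
}

// e.g. 120 queued jobs at 5 jobs/machine, bounded to 10..50 → 24 machines
```

The floor keeps a warm baseline for latency; the ceiling caps spot-instance spend during demand spikes.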
### Test Runner
```bash
# Environment: testrunner
# Isolated environment for integration tests
```

## Scaling Strategy
### Current (Uniform Machines)
- All machines have identical capabilities
- All jobs compete for the same resources
- 1-second Ollama jobs queue alongside 10-minute video jobs
### Phase 1: Pool Separation (Months 1-2)
**Goal:** Eliminate performance heterogeneity
**Fast Lane Pool:**
- CPU-optimized (minimal GPU)
- 20-40GB storage
- Text/simple image processing
- <10 second job duration
**Standard Pool:**
- Balanced GPU (RTX 3060-4070)
- 80-120GB storage
- Typical ComfyUI workflows
- 10-second to 5-minute jobs
**Heavy Pool:**
- High-end GPU (RTX 4090+)
- 150-300GB storage
- Video/complex processing
- 5+ minute jobs
**Routing:** Duration-based prediction, pool-specific containers
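The routing rule implied by the pool specifications above can be sketched directly from the duration thresholds (the function and pool identifiers are illustrative):

```typescript
// Route a job to a pool by predicted duration, using the thresholds
// from the Phase 1 pool definitions: <10s, 10s-5min, 5min+.
type Pool = "fast-lane" | "standard" | "heavy";

function routeByDuration(predictedSeconds: number): Pool {
  if (predictedSeconds < 10) return "fast-lane"; // text / simple images
  if (predictedSeconds <= 300) return "standard"; // typical ComfyUI workflows
  return "heavy"; // video / complex processing
}
```

The hard part in practice is predicting the duration, not applying the thresholds; a misprediction only costs latency, since any pool can still run the job.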
### Phase 2: Model Intelligence (Months 3-4)
**Goal:** Eliminate first-user wait times
- Predictive model placement
- Bake common models into containers
- TypeScript model manager service
- 80% reduction in download wait times
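A first-cut placement heuristic is purely frequency-based: bake the most-requested models into the pool's image. A sketch (entirely illustrative; the real Model Intelligence Service would weigh size, pool, and demand trends):

```typescript
// Rank models by recent request counts and return the top N candidates
// for baking into a pool's container image, so first users on a fresh
// machine don't wait on multi-gigabyte downloads.
function modelsToBake(
  usageCounts: Record<string, number>,
  topN: number,
): string[] {
  return Object.entries(usageCounts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([model]) => model);
}
```

The trade-off is image size versus cold-start latency: every baked model slows container pulls slightly but removes an on-demand download for the machines that need it.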
### Phase 3: Advanced Optimization (Months 5-6)
**Goal:** Resource optimization and specialization
- ML-based demand prediction
- Specialty routing (LoRA, ControlNet, etc.)
- 95% optimal job routing
## File Locations
- Machines: `/apps/machines/basic_machine/`
- Docker: `/apps/api/Dockerfile`, `/apps/emprops-api/Dockerfile`
- Environment Config: `/config/environments/`
- Infrastructure Scripts: `/apps/api/entrypoint-api.sh`
## See Also
- Machine Types - Detailed machine specifications
- Docker Images - Container build process
- Deployment Patterns - Deployment strategies
- Architecture Overview - System design
