Architecture Decision Records (ADRs)
Architecture Decision Records document significant architectural decisions made in the EMP-Job-Queue system, including the context, decision rationale, consequences, and alternatives considered.
Active ADRs
ADR-001: Encrypted Environment Variables for Ephemeral Containers
Status: ✅ Accepted (Machine Services) | 🤔 Proposed (Universal Adoption) Date: 2025-10-08
Documents the encrypted environment variable system used to securely deploy containerized services to ephemeral hosting platforms (SALAD, vast.ai, RunPod) where runtime secret injection is impractical. Covers build-time encryption with AES-256-CBC, runtime decryption, security analysis, and recommendations for universal adoption.
Key Topics:
- Build-time encryption with AES-256-CBC + HMAC authentication
- Runtime decryption in container entrypoints
- Security analysis and threat model
- Service-level adoption matrix (Machine ✅ | API/Webhook ❌)
- Universal adoption analysis and recommendations
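The build-time encryption and runtime decryption steps listed above can be illustrated with a minimal encrypt-then-MAC sketch using Node's built-in `crypto` module. This is an assumption-laden illustration, not the implementation described in ADR-001: the function names, the envelope layout (IV, then ciphertext, then HMAC-SHA256 tag), and the use of two separate keys are all hypothetical.

```typescript
import {
  createCipheriv,
  createDecipheriv,
  createHmac,
  randomBytes,
  timingSafeEqual,
} from "node:crypto";

const IV_BYTES = 16; // AES block size
const TAG_BYTES = 32; // HMAC-SHA256 output size

// Build time: encrypt with AES-256-CBC, then authenticate IV + ciphertext.
// Hypothetical envelope layout: iv | ciphertext | hmac.
function encryptEnv(plaintext: string, encKey: Buffer, macKey: Buffer): Buffer {
  const iv = randomBytes(IV_BYTES);
  const cipher = createCipheriv("aes-256-cbc", encKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const body = Buffer.concat([iv, ciphertext]);
  const tag = createHmac("sha256", macKey).update(body).digest();
  return Buffer.concat([body, tag]);
}

// Runtime (container entrypoint): verify the HMAC first, then decrypt.
function decryptEnv(envelope: Buffer, encKey: Buffer, macKey: Buffer): string {
  if (envelope.length < IV_BYTES + 16 + TAG_BYTES) {
    throw new Error("envelope too short");
  }
  const body = envelope.subarray(0, envelope.length - TAG_BYTES);
  const tag = envelope.subarray(envelope.length - TAG_BYTES);
  const expected = createHmac("sha256", macKey).update(body).digest();
  if (!timingSafeEqual(tag, expected)) {
    throw new Error("HMAC verification failed");
  }
  const iv = body.subarray(0, IV_BYTES);
  const decipher = createDecipheriv("aes-256-cbc", encKey, iv);
  const plaintext = Buffer.concat([decipher.update(body.subarray(IV_BYTES)), decipher.final()]);
  return plaintext.toString("utf8");
}
```

Checking the HMAC before decrypting means a tampered envelope is rejected without ever reaching the CBC padding check, which is the usual reason for pairing CBC with a MAC.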
ADR-002: Pre-Release Testing Strategy for Production Deployments
Status: 🤔 Proposed Date: 2025-10-08
Comprehensive pre-release testing strategy to prevent production breakage from untested code. Implements test pyramid model with CI/CD gates blocking Railway.app deployments on test failures. Includes unit tests, integration tests, E2E tests, and build verification with < 35 minute total execution time.
Key Topics:
- Test pyramid strategy (unit → integration → E2E → build verification)
- GitHub Actions CI/CD integration with test gates
- Redis function testing, API integration, worker execution validation
- Phased rollout plan (4 weeks to full coverage)
- Success metrics and production impact analysis
ADR-003: Service Heartbeat and Liveness Telemetry
Status: 🤔 Proposed Date: 2025-10-09
Standardized service heartbeat system using OpenTelemetry Asynchronous Gauge metrics to enable automatic service discovery, uptime monitoring, and service map visualization in Dash0. Services send periodic "I am here" signals every 15 seconds via OTLP, allowing immediate service presence detection and proactive failure monitoring.
Key Topics:
- Asynchronous Gauge metrics for heartbeat (every 15s)
- Automatic service map population via `service.name` attributes
- Uptime tracking and liveness monitoring
- Dash0 integration and visualization patterns
- Universal rollout strategy across all services (API, webhook, workers, machines)
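To make the liveness idea concrete, the sketch below shows one way a consumer of these heartbeats might classify a service from the timestamp of its last signal. Only the 15-second interval comes from the summary above; the three-missed-beats threshold, the state names, and both function names are assumptions for illustration, and in the proposed design this evaluation would happen in Dash0 rather than in application code.

```typescript
// Interval taken from the ADR summary: one heartbeat every 15 seconds.
const HEARTBEAT_INTERVAL_MS = 15_000;
// Assumption: treat a service as down after three missed heartbeats.
const MISSED_BEATS_BEFORE_DOWN = 3;

type Liveness = "alive" | "late" | "down";

// Classify a single service from the time of its last heartbeat.
function classifyLiveness(lastHeartbeatMs: number, nowMs: number): Liveness {
  const elapsed = nowMs - lastHeartbeatMs;
  if (elapsed <= HEARTBEAT_INTERVAL_MS) return "alive";
  if (elapsed <= HEARTBEAT_INTERVAL_MS * MISSED_BEATS_BEFORE_DOWN) return "late";
  return "down";
}

// Build a liveness report for every known service name.
function livenessReport(lastSeen: Map<string, number>, nowMs: number): Map<string, Liveness> {
  const report = new Map<string, Liveness>();
  lastSeen.forEach((timestamp, service) => {
    report.set(service, classifyLiveness(timestamp, nowMs));
  });
  return report;
}
```

A multiple of the heartbeat interval is used as the threshold so that a single delayed export does not immediately flag a service as down.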
ADR-008: EmProps Open Interface UI Integration into Monorepo
Status: 🤔 Proposed Date: 2025-10-18
Proposes integrating the emprops-open-interface (Next.js UI application) from its standalone repository into the emp-job-queue Turborepo monorepo as apps/emprops-studio. Integration consolidates the entire EmProps stack - from job queue and ComfyUI workers to the user-facing studio interface - into a single, unified codebase for improved type safety, developer experience, and deployment simplification.
Key Topics:
- Full-stack type safety with shared `@emp/core` types
- Migration from standalone repo to `apps/emprops-studio`
- Package manager conversion (Yarn → pnpm)
- API client integration and WebSocket connections
- Dependency analysis and conflict resolution
- Six-phase implementation strategy
ADR-009: Database-Driven Error Pattern Classification
Status: 🤔 Proposed Date: 2025-01-08
Implements a database-driven error pattern classification system to replace hardcoded error catalogs in worker connectors. Enables hot-fixing production error classifications without code deployment, supports non-engineer pattern management via admin UI, and provides connector-specific pattern isolation. Maintains sub-millisecond performance through in-memory pattern caching with background refresh.
Key Topics:
- Database schema for connector-specific error patterns
- Performance-optimized in-memory pattern cache (zero DB queries during jobs)
- Hot-fix production error classifications (5-minute propagation)
- Admin UI for non-engineer pattern management
- Pattern analytics and match frequency tracking
- Four-phase implementation plan (database → worker → observability → UI)
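The caching behaviour summarized above (patterns held in memory, refreshed in the background, never queried during a job) could look roughly like the following sketch. The row shape, class name, method names, and the `"unclassified"` fallback are hypothetical; the 5-minute default refresh mirrors the propagation time mentioned above, and the loader stands in for whatever database query the ADR actually specifies.

```typescript
// Hypothetical row shape; the schema proposed in ADR-009 may differ.
interface ErrorPatternRow {
  connector: string;
  pattern: string; // regular expression source
  classification: string;
}

interface CompiledPattern {
  regex: RegExp;
  classification: string;
}

class ErrorPatternCache {
  private byConnector = new Map<string, CompiledPattern[]>();
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly loader: () => Promise<ErrorPatternRow[]>;
  private readonly refreshMs: number;

  constructor(loader: () => Promise<ErrorPatternRow[]>, refreshMs: number = 5 * 60_000) {
    this.loader = loader;
    this.refreshMs = refreshMs;
  }

  // Compile all rows into a new snapshot, then swap it in with one assignment.
  replace(rows: ErrorPatternRow[]): void {
    const next = new Map<string, CompiledPattern[]>();
    for (const row of rows) {
      const list = next.get(row.connector) ?? [];
      list.push({ regex: new RegExp(row.pattern, "i"), classification: row.classification });
      next.set(row.connector, list);
    }
    this.byConnector = next;
  }

  // Reload from the database; on any failure keep serving the previous snapshot.
  async refresh(): Promise<void> {
    try {
      this.replace(await this.loader());
    } catch {
      // Intentionally ignored: stale patterns are better than none.
    }
  }

  async start(): Promise<void> {
    await this.refresh();
    this.timer = setInterval(() => { void this.refresh(); }, this.refreshMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  // Hot path: pure in-memory lookup, scoped to a single connector.
  classify(connector: string, message: string): string {
    const patterns = this.byConnector.get(connector) ?? [];
    for (const p of patterns) {
      if (p.regex.test(message)) return p.classification;
    }
    return "unclassified";
  }
}
```

Keeping `classify` synchronous and reading only from the in-memory map is what keeps job processing free of database round trips; the trade-off is that a pattern change takes up to one refresh interval to reach every worker.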
ADR-010: LoRA User Storage and Affinity Routing
Status: 🤔 Proposed Date: 2025-11-14
Enable users to upload and store custom LoRA models in Azure Blob Storage with intelligent just-in-time downloading and cache-aware job routing. Combines user storage infrastructure with affinity-based job claiming to minimize model download times and improve job execution performance. Implements scoring-based routing that prefers workers with cached LoRAs while maintaining non-blocking fallback.
Key Topics:
- User-owned LoRA storage in Azure Blob Storage via flat_file table
- Just-in-time downloads with LRU + time-based cache eviction (50GB, 7-day TTL)
- Scoring-based affinity routing in Redis Lua function (+10 user LoRAs, +5 shared)
- Non-blocking design with graceful degradation
- Six-phase implementation plan (5-6 weeks total)
- North Star Phase 2: Model Intelligence advancement
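The scoring idea behind the affinity routing can be sketched as follows. The ADR places this logic in a Redis Lua function; this TypeScript version only mirrors the arithmetic for illustration. The +10 and +5 weights come from the summary above, while the type names, the per-LoRA accumulation, and the tie-breaking by queue order are assumptions.

```typescript
// Weights from the ADR summary: +10 for a cached user LoRA, +5 for a cached shared LoRA.
const USER_LORA_SCORE = 10;
const SHARED_LORA_SCORE = 5;

// Hypothetical shapes for illustration only.
interface LoraRef {
  id: string;
  userOwned: boolean;
}

interface PendingJob {
  id: string;
  loras: LoraRef[];
}

// Assumption: the score accumulates once per required LoRA already in the worker's cache.
function affinityScore(required: LoraRef[], cachedIds: ReadonlySet<string>): number {
  let score = 0;
  for (const lora of required) {
    if (cachedIds.has(lora.id)) {
      score += lora.userOwned ? USER_LORA_SCORE : SHARED_LORA_SCORE;
    }
  }
  return score;
}

// Pick the pending job with the highest affinity for this worker.
// Ties keep queue order, and a job scoring 0 is still claimable,
// so a worker with nothing cached is never blocked.
function pickJob(queue: PendingJob[], cachedIds: ReadonlySet<string>): PendingJob | undefined {
  let best: PendingJob | undefined;
  let bestScore = -1;
  for (const job of queue) {
    const score = affinityScore(job.loras, cachedIds);
    if (score > bestScore) {
      best = job;
      bestScore = score;
    }
  }
  return best;
}
```

Because the score only reorders candidates and never excludes them, a cache miss costs a download rather than a stalled queue, which is the graceful degradation the summary describes.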
gRPC Transition for Service-to-Service Communication
Status: 🤔 Proposed Date: 2025-12-01
Proposes adopting gRPC as the primary protocol for service-to-service communication, replacing HTTP REST for internal APIs. gRPC provides strong typing via Protocol Buffers, native bidirectional streaming, better performance through binary serialization and HTTP/2 multiplexing, and compile-time type safety. Redis remains the event bus for asynchronous events; gRPC handles synchronous commands.
Key Topics:
- Protocol Buffer definitions for Job, Machine, and Worker services
- Bidirectional streaming for worker communication
- Server streaming for real-time job progress
- Code generation pipeline with buf and Connect
- Hybrid architecture: gRPC for commands, Redis for events
- Five-phase implementation plan (5-6 weeks total)
- Performance targets: ~30% latency reduction, ~5x smaller payloads
Analysis Documents
Docker Swarm Migration Analysis
Status: Deferred Date: 2025-10-08
Comprehensive analysis of current architecture, testing complexity, and Docker Swarm migration evaluation. Recommends deferring Docker Swarm migration indefinitely due to marginal benefits and high migration cost (40-60 hours). Proposes quick wins through documentation, test isolation utilities, and developer experience improvements.
Key Findings:
- Environment management system is well-designed
- PM2 orchestration is production-ready
- Testing complexity stems from distribution, not tooling
- Docker Swarm doesn't solve actual pain points
ADR Index by Status
✅ Accepted
- ADR-001: Encrypted Environment Variables - Machine services only
🤔 Proposed
- ADR-001: Universal Encryption Adoption - Should all services use encryption?
- ADR-002: Pre-Release Testing Strategy - Automated testing gates for production deployments
- ADR-003: Service Heartbeat and Liveness Telemetry - OpenTelemetry-based service discovery and uptime monitoring
- ADR-008: EmProps UI Monorepo Integration - Integrate emprops-open-interface as `apps/emprops-studio`
- ADR-009: Database-Driven Error Pattern Classification - Hot-fixable error patterns with in-memory caching
- ADR-010: LoRA User Storage and Affinity Routing - User-owned LoRAs with cache-aware job routing
- gRPC Transition - gRPC for service-to-service communication
⏸️ Deferred
- Docker Swarm Migration - Deferred indefinitely
ADR Process
When to Create an ADR
Create an ADR for:
- ✅ Significant architectural decisions affecting multiple services
- ✅ Technology choices with long-term implications
- ✅ Security or compliance decisions
- ✅ Trade-offs between competing approaches
- ✅ Decisions that require team consensus
Do not create ADRs for:
- ❌ Minor implementation details
- ❌ Reversible code changes
- ❌ Obvious or uncontroversial choices
ADR Template
# ADR-XXX: [Title]
**Date:** YYYY-MM-DD
**Status:** Proposed | Accepted | Deprecated | Superseded
**Decision Makers:** [Team/Role]
## Context
[What is the issue we're facing? What factors influence this decision?]
## Decision
[What decision did we make? What approach did we choose?]
## Consequences
[What are the positive and negative consequences of this decision?]
## Alternatives Considered
[What other approaches did we consider? Why were they rejected?]
Status Definitions
| Status | Meaning |
|---|---|
| Proposed | Under discussion, not yet decided |
| Accepted | Approved and actively implemented |
| Deprecated | No longer recommended, being phased out |
| Superseded | Replaced by a newer ADR |
| Deferred | Postponed for later consideration |
Related Documentation
- Environment Management Guide - Component-based configuration system
- CLAUDE.md - North star architecture and development workflow
- Testing Procedures - Standard testing procedures and commands
Contributing
To propose a new ADR:
- Copy the ADR template above
- Create a new file: `adr/adr-XXX-title.md` (use the next available number)
- Fill in the template with context, decision, consequences, and alternatives
- Submit a pull request for team review
- Update this index after approval
Questions? Contact the Architecture Team or post in the #architecture Slack channel.
