
Architecture Decision Records (ADRs)

Architecture Decision Records document significant architectural decisions made in the EMP-Job-Queue system, including the context, decision rationale, consequences, and alternatives considered.

Active ADRs

ADR-001: Encrypted Environment Variables for Ephemeral Containers

Status: ✅ Accepted (Machine Services) | 🤔 Proposed (Universal Adoption)
Date: 2025-10-08

Documents the encrypted environment variable system used to securely deploy containerized services to ephemeral hosting platforms (SALAD, vast.ai, RunPod) where runtime secret injection is impractical. Covers build-time encryption with AES-256-CBC, runtime decryption, security analysis, and recommendations for universal adoption.

Read ADR-001 →

Key Topics:

  • Build-time encryption with AES-256-CBC + HMAC authentication
  • Runtime decryption in container entrypoints
  • Security analysis and threat model
  • Service-level adoption matrix (Machine ✅ | API/Webhook ❌)
  • Universal adoption analysis and recommendations
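
The encrypt-then-authenticate flow named in the topics above can be sketched with Node's built-in crypto module. This is a minimal illustration, not the ADR's implementation: the function names `encryptEnv` and `decryptEnv`, the choice of HMAC-SHA256, the separate encryption and MAC keys, and the `IV || ciphertext || tag` layout are all assumptions.

```typescript
import { createCipheriv, createDecipheriv, createHmac, randomBytes, timingSafeEqual } from "node:crypto";

const IV_BYTES = 16;  // AES block size
const TAG_BYTES = 32; // HMAC-SHA256 output size (hash choice is an assumption)

// Build time: encrypt the env file contents, then authenticate IV + ciphertext.
function encryptEnv(plaintext: string, encKey: Buffer, macKey: Buffer): string {
  const iv = randomBytes(IV_BYTES);
  const cipher = createCipheriv("aes-256-cbc", encKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = createHmac("sha256", macKey).update(iv).update(ciphertext).digest();
  // Assumed layout: IV || ciphertext || tag, base64-encoded.
  return Buffer.concat([iv, ciphertext, tag]).toString("base64");
}

// Runtime (container entrypoint): verify the tag first, decrypt only if it matches.
function decryptEnv(blob: string, encKey: Buffer, macKey: Buffer): string {
  const raw = Buffer.from(blob, "base64");
  if (raw.length < IV_BYTES + 16 + TAG_BYTES) {
    throw new Error("encrypted blob is too short");
  }
  const iv = raw.subarray(0, IV_BYTES);
  const ciphertext = raw.subarray(IV_BYTES, raw.length - TAG_BYTES);
  const tag = raw.subarray(raw.length - TAG_BYTES);
  const expected = createHmac("sha256", macKey).update(iv).update(ciphertext).digest();
  if (!timingSafeEqual(tag, expected)) {
    throw new Error("HMAC verification failed");
  }
  const decipher = createDecipheriv("aes-256-cbc", encKey, iv);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Checking the HMAC before decrypting means a tampered blob is rejected before any ciphertext reaches the CBC padding check.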

ADR-002: Pre-Release Testing Strategy for Production Deployments

Status: 🤔 Proposed
Date: 2025-10-08

Comprehensive pre-release testing strategy to prevent production breakage from untested code. Implements a test pyramid model with CI/CD gates that block Railway.app deployments when tests fail. Includes unit tests, integration tests, E2E tests, and build verification, with a total execution time under 35 minutes.

Read ADR-002 →

Key Topics:

  • Test pyramid strategy (unit → integration → E2E → build verification)
  • GitHub Actions CI/CD integration with test gates
  • Redis function testing, API integration, worker execution validation
  • Phased rollout plan (4 weeks to full coverage)
  • Success metrics and production impact analysis

ADR-003: Service Heartbeat and Liveness Telemetry

Status: 🤔 Proposed
Date: 2025-10-09

Standardized service heartbeat system using OpenTelemetry Asynchronous Gauge metrics to enable automatic service discovery, uptime monitoring, and service map visualization in Dash0. Services send periodic "I am here" signals every 15 seconds via OTLP, allowing immediate service presence detection and proactive failure monitoring.

Read ADR-003 →

Key Topics:

  • Asynchronous Gauge metrics for heartbeat (every 15s)
  • Automatic service map population via service.name attributes
  • Uptime tracking and liveness monitoring
  • Dash0 integration and visualization patterns
  • Universal rollout strategy across all services (API, webhook, workers, machines)

ADR-008: EmProps Open Interface UI Integration into Monorepo

Status: 🤔 Proposed
Date: 2025-10-18

Proposes integrating emprops-open-interface (the Next.js UI application) from its standalone repository into the emp-job-queue Turborepo monorepo as apps/emprops-studio. The integration consolidates the entire EmProps stack (from the job queue and ComfyUI workers to the user-facing studio interface) into a single codebase, improving type safety and developer experience and simplifying deployment.

Read ADR-008 →

Key Topics:

  • Full-stack type safety with shared @emp/core types
  • Migration from standalone repo to apps/emprops-studio
  • Package manager conversion (Yarn → pnpm)
  • API client integration and WebSocket connections
  • Dependency analysis and conflict resolution
  • Six-phase implementation strategy

ADR-009: Database-Driven Error Pattern Classification

Status: 🤔 Proposed
Date: 2025-01-08

Implements a database-driven error pattern classification system to replace hardcoded error catalogs in worker connectors. Enables hot-fixing production error classifications without code deployment, supports non-engineer pattern management via admin UI, and provides connector-specific pattern isolation. Maintains sub-millisecond performance through in-memory pattern caching with background refresh.

Read ADR-009 →

Key Topics:

  • Database schema for connector-specific error patterns
  • Performance-optimized in-memory pattern cache (zero DB queries during jobs)
  • Hot-fix production error classifications (5-minute propagation)
  • Admin UI for non-engineer pattern management
  • Pattern analytics and match frequency tracking
  • Four-phase implementation plan (database → worker → observability → UI)
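
The in-memory cache with background refresh can be pictured roughly as follows. This is a sketch under assumptions, not the ADR's design: the `ErrorPattern` row shape, the `PatternCache` class, the regex-based matching, and the `"unknown"` fallback label are all hypothetical. The 5-minute default interval mirrors the propagation figure listed above.

```typescript
// Hypothetical row shape; the real schema is defined in ADR-009.
interface ErrorPattern {
  connector: string;      // connector the pattern belongs to
  regex: string;          // expression matched against the error message
  classification: string; // label assigned when the pattern matches
}

interface CompiledPattern {
  re: RegExp;
  classification: string;
}

class PatternCache {
  private byConnector = new Map<string, CompiledPattern[]>();
  private timer: ReturnType<typeof setInterval> | undefined = undefined;
  private readonly fetchRows: () => Promise<ErrorPattern[]>;

  constructor(fetchRows: () => Promise<ErrorPattern[]>) {
    this.fetchRows = fetchRows;
  }

  // Compile rows into a fresh map, then swap it in with a single assignment
  // so classify() never sees a half-built cache.
  apply(rows: ErrorPattern[]): void {
    const next = new Map<string, CompiledPattern[]>();
    for (const row of rows) {
      const list = next.get(row.connector) ?? [];
      list.push({ re: new RegExp(row.regex, "i"), classification: row.classification });
      next.set(row.connector, list);
    }
    this.byConnector = next;
  }

  async refresh(): Promise<void> {
    this.apply(await this.fetchRows());
  }

  // Background refresh; a failed refresh keeps the previous snapshot in place.
  start(intervalMs: number = 5 * 60 * 1000): void {
    this.stop();
    this.timer = setInterval(() => {
      this.refresh().catch(() => undefined);
    }, intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  // Synchronous, memory-only lookup: no database access on the job path.
  classify(connector: string, message: string): string {
    for (const p of this.byConnector.get(connector) ?? []) {
      if (p.re.test(message)) return p.classification;
    }
    return "unknown";
  }
}
```

Keying the map by connector gives the connector-specific isolation described above: a pattern registered for one connector is never tested against another connector's errors.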

ADR-010: LoRA User Storage and Affinity Routing

Status: 🤔 Proposed
Date: 2025-11-14

Enables users to upload and store custom LoRA models in Azure Blob Storage with intelligent just-in-time downloading and cache-aware job routing. Combines user storage infrastructure with affinity-based job claiming to minimize model download times and improve job execution performance. Implements scoring-based routing that prefers workers with cached LoRAs while maintaining non-blocking fallback.
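
As a rough illustration of scoring-based claiming with a non-blocking fallback, the sketch below scores each queued job for one worker and picks the best match. The ADR places this logic in a Redis Lua function; the TypeScript here only illustrates the idea. The `LoraJob` and `WorkerState` shapes are hypothetical, and applying the bonus once per cached LoRA (using the +10 and +5 weights from the Key Topics) is an assumption.

```typescript
// Hypothetical shapes; field names are assumptions, not the ADR's schema.
interface LoraJob {
  userLoras: string[];   // user-owned LoRAs the job needs
  sharedLoras: string[]; // shared LoRAs the job needs
}

interface WorkerState {
  id: string;
  cachedLoras: Set<string>; // LoRAs already on the worker's disk
}

const USER_LORA_BONUS = 10;
const SHARED_LORA_BONUS = 5;

// Sum a bonus for every required LoRA the worker already has cached.
function affinityScore(job: LoraJob, worker: WorkerState): number {
  let score = 0;
  for (const lora of job.userLoras) {
    if (worker.cachedLoras.has(lora)) score += USER_LORA_BONUS;
  }
  for (const lora of job.sharedLoras) {
    if (worker.cachedLoras.has(lora)) score += SHARED_LORA_BONUS;
  }
  return score;
}

// Pick the queued job with the highest affinity for this worker.
// A job scoring 0 is still claimable (non-blocking fallback), and ties
// keep queue order because only a strictly higher score replaces the pick.
function pickJob<T extends LoraJob>(queue: T[], worker: WorkerState): T | undefined {
  let best: T | undefined = undefined;
  let bestScore = -1;
  for (const job of queue) {
    const score = affinityScore(job, worker);
    if (score > bestScore) {
      best = job;
      bestScore = score;
    }
  }
  return best;
}
```

Because the score only reorders candidates and never filters them, a worker with an empty cache still claims the oldest job instead of waiting for a better match.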

Read ADR-010 →

Key Topics:

  • User-owned LoRA storage in Azure Blob Storage via flat_file table
  • Just-in-time downloads with LRU + time-based cache eviction (50GB, 7-day TTL)
  • Scoring-based affinity routing in Redis Lua function (+10 user LoRAs, +5 shared)
  • Non-blocking design with graceful degradation
  • Six-phase implementation plan (5-6 weeks total)
  • North Star Phase 2: Model Intelligence advancement
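
The combined LRU and time-based eviction policy listed above might look like the following. This is a sketch only: the `CacheEntry` shape and `selectEvictions` function are hypothetical, measuring the 7-day TTL from last use (rather than from download) is an assumption, and so is treating 50GB as 50 GiB.

```typescript
// Hypothetical record for one cached LoRA file.
interface CacheEntry {
  path: string;
  sizeBytes: number;
  lastUsedMs: number; // epoch milliseconds of the most recent use
}

const MAX_CACHE_BYTES = 50 * 1024 * 1024 * 1024; // 50GB cap, read here as GiB
const MAX_IDLE_MS = 7 * 24 * 60 * 60 * 1000;     // 7-day TTL

// Returns the entries to delete: everything past the TTL, then the
// least recently used survivors until the total fits under the cap.
function selectEvictions(
  entries: CacheEntry[],
  nowMs: number,
  maxBytes: number = MAX_CACHE_BYTES,
  maxIdleMs: number = MAX_IDLE_MS,
): CacheEntry[] {
  const evict: CacheEntry[] = [];
  const keep: CacheEntry[] = [];
  for (const entry of entries) {
    if (nowMs - entry.lastUsedMs > maxIdleMs) {
      evict.push(entry);
    } else {
      keep.push(entry);
    }
  }
  keep.sort((a, b) => a.lastUsedMs - b.lastUsedMs); // oldest use first
  let total = keep.reduce((sum, entry) => sum + entry.sizeBytes, 0);
  for (const entry of keep) {
    if (total <= maxBytes) break;
    evict.push(entry);
    total -= entry.sizeBytes;
  }
  return evict;
}
```

Returning the eviction list instead of deleting files directly keeps the policy easy to test and leaves the actual file removal to the caller.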

gRPC Transition for Service-to-Service Communication

Status: 🤔 Proposed
Date: 2025-12-01

Proposes adopting gRPC as the primary protocol for service-to-service communication, replacing HTTP REST for internal APIs. gRPC provides compile-time type safety via Protocol Buffers, native bidirectional streaming, and better performance through binary serialization and HTTP/2 multiplexing. Redis remains the event bus for asynchronous events; gRPC handles synchronous commands.

Read gRPC Transition ADR →

Key Topics:

  • Protocol Buffer definitions for Job, Machine, and Worker services
  • Bidirectional streaming for worker communication
  • Server streaming for real-time job progress
  • Code generation pipeline with buf and Connect
  • Hybrid architecture: gRPC for commands, Redis for events
  • Five-phase implementation plan (5-6 weeks total)
  • Performance targets: ~30% latency reduction, ~5x smaller payloads

Analysis Documents

Docker Swarm Migration Analysis

Status: Deferred
Date: 2025-10-08

Comprehensive analysis of current architecture, testing complexity, and Docker Swarm migration evaluation. Recommends deferring Docker Swarm migration indefinitely due to marginal benefits and high migration cost (40-60 hours). Proposes quick wins through documentation, test isolation utilities, and developer experience improvements.

Read Swarm Analysis →

Key Findings:

  • Environment management system is well-designed
  • PM2 orchestration is production-ready
  • Testing complexity stems from distribution, not tooling
  • Docker Swarm doesn't solve actual pain points

ADR Index by Status

✅ Accepted

  • ADR-001: Encrypted Environment Variables for Ephemeral Containers (Machine Services)

🤔 Proposed

  • ADR-001: Encrypted Environment Variables for Ephemeral Containers (Universal Adoption)
  • ADR-002: Pre-Release Testing Strategy for Production Deployments
  • ADR-003: Service Heartbeat and Liveness Telemetry
  • ADR-008: EmProps Open Interface UI Integration into Monorepo
  • ADR-009: Database-Driven Error Pattern Classification
  • ADR-010: LoRA User Storage and Affinity Routing
  • gRPC Transition for Service-to-Service Communication

⏸️ Deferred

  • Docker Swarm Migration Analysis


ADR Process

When to Create an ADR

Create an ADR for:

  • ✅ Significant architectural decisions affecting multiple services
  • ✅ Technology choices with long-term implications
  • ✅ Security or compliance decisions
  • ✅ Trade-offs between competing approaches
  • ✅ Decisions that require team consensus

Do not create ADRs for:

  • ❌ Minor implementation details
  • ❌ Reversible code changes
  • ❌ Obvious or uncontroversial choices

ADR Template

```markdown
# ADR-XXX: [Title]

**Date:** YYYY-MM-DD
**Status:** Proposed | Accepted | Deprecated | Superseded
**Decision Makers:** [Team/Role]

## Context
[What is the issue we're facing? What factors influence this decision?]

## Decision
[What decision did we make? What approach did we choose?]

## Consequences
[What are the positive and negative consequences of this decision?]

## Alternatives Considered
[What other approaches did we consider? Why were they rejected?]
```

Status Definitions

| Status | Meaning |
|--------|---------|
| Proposed | Under discussion, not yet decided |
| Accepted | Approved and actively implemented |
| Deprecated | No longer recommended, being phased out |
| Superseded | Replaced by a newer ADR |
| Deferred | Postponed for later consideration |


Contributing

To propose a new ADR:

  1. Copy the ADR template above
  2. Create a new file: adr/adr-XXX-title.md (use next available number)
  3. Fill in the template with context, decision, consequences, and alternatives
  4. Submit a pull request for team review
  5. Update this index after approval
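
For step 2, one way to find the next available number is to scan the existing filenames. The helper below is a hypothetical sketch: `nextAdrNumber` is not part of the repository, and it reads "next available" as the highest existing number plus one, so gaps in the current numbering are not reused.

```typescript
// Given filenames such as "adr-010-some-title.md", return the next
// zero-padded ADR number. Names that do not match the pattern are ignored.
function nextAdrNumber(filenames: string[]): string {
  let highest = 0;
  for (const name of filenames) {
    const match = /^adr-(\d+)-/.exec(name);
    const digits = match?.[1];
    if (digits !== undefined) {
      highest = Math.max(highest, Number.parseInt(digits, 10));
    }
  }
  return String(highest + 1).padStart(3, "0");
}
```

The result can be dropped into the `adr/adr-XXX-title.md` pattern from step 2 in place of `XXX`.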

Questions? Contact the Architecture Team or post in the #architecture Slack channel.

Released under the MIT License.