ADR-002: Pre-Release Testing Strategy for Production Deployments

  • Date: 2025-10-08
  • Status: 🤔 Proposed
  • Decision Makers: Engineering Team
  • Approval Required: Before implementing CI/CD test gates
  • Related ADRs: ADR-001: Encrypted Environment Variables, Docker Swarm Migration Analysis


Executive Summary

This ADR proposes a comprehensive pre-release testing strategy to prevent production breakage from untested code. The system will gate all production deployments (Railway.app services) behind an automated test suite running in GitHub Actions.

Current Risk:

  • 81 commits ahead of master on staging branch
  • No automated testing before releases
  • Production deployments triggered directly on git tags
  • ⚠️ Production breakage risk from untested changes

Proposed Solution:

  • 🎯 Test pyramid strategy with unit, integration, and E2E tests
  • 🛡️ CI/CD test gate blocking releases on test failures
  • < 35 minute total CI execution time
  • 📊 Comprehensive coverage across critical production paths

Impact:

  • Before: Manual testing, uncertain production readiness, reactive bug fixes
  • After: Automated validation, objective deployment criteria, proactive quality assurance

Table of Contents

  1. Context
  2. Problem Statement
  3. Decision
  4. Test Architecture
  5. Implementation Strategy
  6. Test Coverage Requirements
  7. CI/CD Integration
  8. Phased Rollout Plan
  9. Success Metrics
  10. Consequences
  11. Alternatives Considered
  12. Open Questions

Context

Current State Analysis

Existing Test Infrastructure:

✅ Vitest configured across all packages
✅ 55+ test files across monorepo
✅ Test scripts in package.json (pnpm test, pnpm test:api, etc.)
✅ Turbo.json configured with test task dependencies
✅ Release workflow exists (.github/workflows/release.yml)

Test Coverage by Package:

| Package | Test Files | Coverage Areas |
| --- | --- | --- |
| packages/core/ | Redis functions integration tests | Job matching, capability matching, atomic claiming |
| apps/api/ | Integration + E2E tests | Job submission, connector integration, telemetry pipeline |
| apps/worker/ | Failure classification, attestation | Error handling, retry logic, failure classification |
| apps/webhook-service/ | Telemetry pipeline E2E | Event delivery, retry logic, OTLP integration |
| apps/monitor/ | Component tests | Redis connections, machine status, attestation system |
| packages/telemetry/ | OTLP integration | Telemetry client, event formatting, Dash0 integration |
| apps/emprops-api/ | Unit tests | Linter, pseudorandom, art-gen nodes |

Current Release Workflow:

yaml
# .github/workflows/release.yml
on:
  push:
    tags: ["v*"]  # Triggers on git tags
  workflow_dispatch:

jobs:
  build-and-release:
    steps:
      - Checkout code
      - Build worker bundle
      - Create GitHub release
      - Deploy to Railway (production or staging)

Critical Production Services (from release.yml):

  • Production: q-emprops-api, q-job-api, q-webhook, q-telcollect, openai-machine, openai-response, gemini
  • Staging: stg.* versions of the same services

Current Deployment Model:

Git Tag (v1.2.3) → Build Worker Bundle → Create GitHub Release → Deploy to Railway.app → PRODUCTION (no test gate!)

Infrastructure Reality

Distributed Architecture:

  • API Server: Lightweight Redis orchestration (Railway.app)
  • Webhook Service: Event delivery, telemetry pipeline (Railway.app)
  • Telemetry Collector: OTLP bridge, Dash0 integration (Railway.app)
  • EmProps API: Art generation, job evaluation (Railway.app)
  • Worker Machines: Ephemeral GPU compute (SALAD, vast.ai, RunPod)

Production Constraints:

  • Real-time systems: WebSocket-based monitoring, job progress updates
  • Paying customers: Production breakage affects revenue and reputation
  • Distributed state: Redis-based coordination, event-driven architecture
  • External dependencies: OpenAI, Anthropic, ComfyUI, Dash0

Business Impact

Current Risk:

  • 81 commits ahead of master without comprehensive testing
  • Production deployments lack objective quality criteria
  • Debugging production issues wastes engineering time
  • Customer-facing outages damage trust and revenue

Cost of No Testing:

  • Time: Hours spent debugging production issues
  • 💰 Revenue: Customer churn from unreliable service
  • 😰 Stress: On-call firefighting instead of feature development
  • 📉 Velocity: Fear of shipping slows innovation

Problem Statement

Requirements

R1: Pre-Release Validation

  • All production releases must pass automated tests before deployment
  • Tests must validate critical user flows and system integrity
  • Test failures must block deployment automatically

R2: Comprehensive Coverage

  • Unit tests for business logic (< 5 minutes)
  • Integration tests for component interactions (< 10 minutes)
  • E2E tests for critical user flows (< 15 minutes)
  • Build verification for type safety and linting (< 5 minutes)

R3: Fast Feedback

  • Total CI execution time < 35 minutes (acceptable for release gate)
  • Parallel test execution where possible
  • Clear failure reporting with actionable errors

R4: Maintainability

  • Tests run reliably in CI environment
  • No flaky tests blocking releases
  • Clear documentation for adding new tests

R5: Developer Experience

  • Tests run locally before commit (pre-commit hooks)
  • Visual test UI available (vitest --ui)
  • Incremental adoption without blocking current workflow

Non-Requirements

  • 100% code coverage: Focus on critical paths, not every line
  • Mutation testing: Overkill for current maturity level
  • Performance benchmarks: Separate concern, not blocking releases
  • Cross-browser testing: Backend services, not applicable

Decision

Adopt Test Pyramid Strategy with CI/CD Gates

We will implement a comprehensive pre-release testing strategy based on the test pyramid model, with automated gates in GitHub Actions blocking production deployments on test failures.

Core Principles:

  1. Test Pyramid: Many fast unit tests, fewer integration tests, minimal E2E tests
  2. CI/CD Gates: All tests must pass before deployment proceeds
  3. Incremental Adoption: Phase in coverage over 4 weeks
  4. Fast Feedback: Total CI time < 35 minutes
  5. Actionable Failures: Clear error messages guide debugging

Test Levels:

        /\
       /E2E\         ← Slow (15 min), few tests, high confidence
      /------\          Critical user flows, telemetry pipeline
     /Integration\   ← Medium (10 min), moderate coverage
    /------------\      Redis functions, API endpoints, connectors
   /  Unit Tests  \  ← Fast (5 min), many tests, quick feedback
  /----------------\    Business logic, failure classification, utilities

Implementation Approach:

  • Option A (RECOMMENDED): Pre-release job in release.yml blocking deployment
  • Phase 1 (Week 1): Foundation - Unit tests + build verification
  • Phase 2 (Week 2): Integration tests for Redis and API
  • Phase 3 (Week 3): E2E tests for critical user flows
  • Phase 4 (Week 4): Optimization and documentation

Test Architecture

Test Pyramid Breakdown

Level 1: Unit Tests (< 5 minutes total)

Purpose: Fast feedback on business logic correctness

Scope:

  • Failure classification logic (apps/worker/src/__tests__/failure-classification.test.ts)
  • Attestation generation (apps/worker/src/__tests__/failure-attestation.test.ts)
  • Retry count extraction (apps/worker/src/__tests__/retry-count-extraction.test.ts)
  • EmProps API nodes (apps/emprops-api/src/modules/art-gen/nodes/*.test.ts)
  • Utility functions (linter, pseudorandom, etc.)

Characteristics:

  • No external dependencies (mocked Redis, HTTP clients)
  • Deterministic inputs and outputs
  • Isolated test cases (no shared state)
  • Execute in parallel across packages

Example Test:

typescript
// apps/worker/src/__tests__/failure-classification.test.ts
import { describe, it, expect } from 'vitest';
import { FailureClassifier, FailureType, FailureReason } from '../types/failure-classification.js';

describe('Failure Classification System', () => {
  it('should classify HTTP authentication errors correctly', () => {
    const error = 'Request failed with status code 401 - Invalid API key';
    const context = { httpStatus: 401, serviceType: 'openai_responses' };

    const result = FailureClassifier.classify(error, context);

    expect(result.failure_type).toBe(FailureType.AUTH_ERROR);
    expect(result.failure_reason).toBe(FailureReason.INVALID_API_KEY);
  });
});

Turbo Command:

bash
turbo run test --filter='!apps/api' --filter='!apps/webhook-service'

Level 2: Integration Tests (< 10 minutes total)

Purpose: Validate component interactions with real dependencies

Scope:

  • Redis function integration (packages/core/src/redis-functions/__tests__/integration.test.ts)
  • API job submission (apps/api/src/__tests__/connector-integration.e2e.test.ts)
  • Worker job processing (apps/worker/ integration tests)
  • Webhook delivery flow (apps/webhook-service/__tests__/webhook-server.test.ts)

Characteristics:

  • Real Redis instance (Docker container or in-memory)
  • Real HTTP servers (test instances)
  • Controlled external dependencies (mock external APIs)
  • Shared test infrastructure (setup/teardown)

Example Test:

typescript
// packages/core/src/redis-functions/__tests__/integration.test.ts
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import Redis from 'ioredis';
import { RedisFunctionInstaller } from '../installer.js';

describe('Redis Function Integration Tests', () => {
  let redis: Redis;

  beforeAll(async () => {
    redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
    const installer = new RedisFunctionInstaller(redis);
    await installer.installOrUpdate();
  });

  it('should match worker with compatible service', async () => {
    // Setup job and worker in Redis
    const jobId = 'test-job-1';
    await redis.hmset(`job:${jobId}`, {
      id: jobId,
      service_required: 'comfyui',
      priority: '100',
      status: 'pending'
    });
    await redis.zadd('jobs:pending', 100, jobId);

    const worker = {
      worker_id: 'worker-1',
      job_service_required_map: ['comfyui', 'a1111']
    };

    // Test Redis function
    const result = await redis.fcall('findMatchingJob', 0, JSON.stringify(worker), '10');

    expect(result).not.toBeNull();
    expect(JSON.parse(result).jobId).toBe(jobId);
  });
});

Turbo Command:

bash
turbo run test:integration

Level 3: E2E Tests (< 15 minutes total)

Purpose: Validate critical user flows end-to-end

Scope:

  • Job submission → Worker execution → Webhook delivery
  • Telemetry pipeline: API → Telemetry Collector → Dash0
  • Machine registration → Job claiming → Progress updates
  • Workflow execution with multiple steps

Characteristics:

  • Full system integration (API + Worker + Webhook + Telemetry)
  • Real external services (mocked APIs via nock)
  • Real-time event verification (WebSocket, Redis streams)
  • Longer execution times (10-15 seconds per test)

Example Test:

typescript
// apps/webhook-service/src/__tests__/telemetry-pipeline.e2e.test.ts
import { describe, it, expect } from 'vitest';
import fetch from 'node-fetch';
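// API_URL, DASH0_AUTH_TOKEN, DASH0_DATASET, startTime/endTime, and the
// findEventByWorkflowId helper are assumed to come from the shared E2E test setup
// (not shown in this excerpt).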

describe('Telemetry Pipeline E2E', () => {
  it('should deliver job.received event to Dash0', async () => {
    const workflowId = `e2e-test-${Date.now()}`;

    // STEP 1: Submit job to API
    const response = await fetch(`${API_URL}/api/jobs`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        workflow_id: workflowId,
        service_required: 'comfyui',
        payload: { prompt: 'test' }
      })
    });

    expect(response.ok).toBe(true);

    // STEP 2: Wait for telemetry processing
    await new Promise(resolve => setTimeout(resolve, 5000));

    // STEP 3: Verify event in Dash0
    const dash0Response = await fetch('https://api.us-west-2.aws.dash0.com/api/spans', {
      method: 'POST',
      headers: {
        'authorization': `Bearer ${DASH0_AUTH_TOKEN}`,
        'content-type': 'application/json'
      },
      body: JSON.stringify({
        timeRange: { from: startTime, to: endTime },
        dataset: DASH0_DATASET
      })
    });

    const events = await dash0Response.json();
    const jobReceivedEvent = findEventByWorkflowId(events, workflowId);

    expect(jobReceivedEvent).toBeDefined();
    expect(jobReceivedEvent.name).toBe('job.received');
  });
});

Turbo Command:

bash
turbo run test:e2e

Level 4: Build Verification (< 5 minutes total)

Purpose: Ensure code compiles and meets quality standards

Scope:

  • TypeScript compilation: turbo run typecheck
  • Lint checks: turbo run lint
  • Production builds: turbo run build

Characteristics:

  • No runtime execution
  • Fast parallel execution
  • Catches type errors and style violations

Turbo Commands:

bash
turbo run typecheck  # ~2 minutes
turbo run lint       # ~1 minute
turbo run build      # ~2 minutes

Total CI Time Budget

| Test Level | Time Budget | Parallelization |
| --- | --- | --- |
| Unit Tests | < 5 min | ✅ Parallel across packages |
| Integration Tests | < 10 min | ⚠️ Sequential (shared Redis) |
| E2E Tests | < 15 min | ⚠️ Sequential (shared services) |
| Build Verification | < 5 min | ✅ Parallel (typecheck, lint, build) |
| TOTAL | < 35 min | Mixed strategy |

Optimization Strategies:

  • Cache pnpm dependencies between runs
  • Cache Turbo build outputs
  • Run unit tests + build verification in parallel
  • Run integration + E2E sequentially (shared dependencies)

Implementation Strategy

Approach: Add pre-release test job to existing release.yml workflow

Workflow Structure:

yaml
# .github/workflows/release.yml
name: Release Worker Bundle

on:
  push:
    tags: ["v*"]
  workflow_dispatch:

jobs:
  # NEW: Pre-release test gate
  test:
    name: Pre-Release Test Suite
    runs-on: ubuntu-latest

    services:
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Enable pnpm
        run: corepack enable  # pnpm must be available before setup-node can use cache: 'pnpm'

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'pnpm'

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Typecheck
        run: pnpm typecheck

      - name: Lint
        run: pnpm lint

      - name: Unit Tests
        run: pnpm test
        env:
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

      - name: Build
        run: pnpm build

      - name: Upload test results
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: |
            **/coverage/
            **/.turbo/

    env:
      REDIS_URL: redis://localhost:6379
      NODE_ENV: test

  # MODIFIED: Block on test success
  build-and-release:
    needs: [test]  # ← Deployment blocked if tests fail
    runs-on: ubuntu-latest

    steps:
      # ... existing build and release steps
      - name: Build and package worker
        run: |
          # ... existing worker build logic

      - name: Deploy to Railway
        run: |
          # ... existing Railway deployment

Key Changes:

  1. New test job: Runs all pre-release tests
  2. Dependency: build-and-release needs test to succeed
  3. Redis service: Provides test Redis instance
  4. Environment variables: REDIS_URL, NODE_ENV for test execution
  5. Artifact upload: Preserve test results on failure

GitHub Secrets Required

Test Execution:

  • REDIS_TEST_URL - Test Redis instance (can use service container)
  • NODE_ENV=test - Environment marker

External API Mocking (Phase 2+):

  • HF_TOKEN - For model download tests (optional, can mock)
  • OPENAI_API_KEY - For OpenAI connector tests (optional, can mock)
  • CIVITAI_TOKEN - For CivitAI model tests (optional, can mock)

Telemetry Testing (Phase 3):

  • DASH0_AUTH_TOKEN - For E2E telemetry verification
  • DASH0_DATASET - Test dataset for span verification

Note: Store these via GitHub Actions secrets management; never hard-code secret values in the repository.

Test Infrastructure Setup

Redis Test Container

GitHub Actions Service:

yaml
services:
  redis:
    image: redis:7-alpine
    ports:
      - 6379:6379
    options: >-
      --health-cmd "redis-cli ping"
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

Local Development:

bash
# Use local Redis (already running via dev:local-redis)
pnpm dev:local-redis

# Run tests
pnpm test

Environment Configuration

CI Environment Variables:

yaml
env:
  REDIS_URL: redis://localhost:6379
  NODE_ENV: test
  CI: true

Test Configuration (vitest.config.ts):

typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    environment: 'node',
    globals: true,
    setupFiles: ['./test/setup.ts'],
    testTimeout: 30000,  // 30 seconds for integration tests
    hookTimeout: 30000,
    pool: 'forks',       // Isolate test processes
    poolOptions: {
      forks: {
        singleFork: false,
        isolate: true
      }
    }
  }
});
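
The setupFiles entry above points at ./test/setup.ts. The following is a minimal sketch of what such a setup file could contain, assuming ioredis and a dedicated Redis database for tests; the database index and cleanup strategy are assumptions, not the current implementation.

typescript
// test/setup.ts (illustrative sketch; DB index and cleanup strategy are assumptions)
import Redis from 'ioredis';
import { beforeAll, afterEach, afterAll } from 'vitest';

// Use a dedicated Redis database so cleanup cannot touch real data.
const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379/15');

beforeAll(async () => {
  // Fail fast if the test Redis container is not reachable.
  await redis.ping();
});

afterEach(async () => {
  // Remove keys written by the previous test to keep cases isolated.
  await redis.flushdb();
});

afterAll(async () => {
  await redis.quit();
});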

Test Coverage Requirements

Priority 0: Block Release (Must Pass)

Redis Job Matching (packages/core)

  • ✅ Atomic job claiming (no race conditions; see the concurrency sketch after this list)
  • ✅ Worker capability matching (service compatibility)
  • ✅ Priority ordering (highest priority first)
  • ✅ Customer isolation (strict/loose modes)
  • ✅ Model requirements matching
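
As a sketch of how the atomic-claiming requirement could be exercised, the test below races two workers for a single pending job, reusing the findMatchingJob Redis function from the integration example earlier. It assumes the function both matches and claims the job atomically and returns null to the losing worker; that return shape is an assumption, not confirmed by this document.

typescript
import { describe, it, expect } from 'vitest';
import Redis from 'ioredis';

describe('Atomic job claiming', () => {
  it('should hand a single pending job to at most one worker', async () => {
    const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');

    // One pending job, two workers racing for it.
    await redis.hmset('job:race-job-1', {
      id: 'race-job-1',
      service_required: 'comfyui',
      priority: '100',
      status: 'pending'
    });
    await redis.zadd('jobs:pending', 100, 'race-job-1');

    const claim = (workerId: string) =>
      redis.fcall(
        'findMatchingJob',
        0,
        JSON.stringify({ worker_id: workerId, job_service_required_map: ['comfyui'] }),
        '10'
      );

    // Fire both claims concurrently; atomic claiming means only one should win.
    const results = await Promise.all([claim('worker-a'), claim('worker-b')]);
    const wins = results.filter(r => r !== null);

    expect(wins).toHaveLength(1);
    await redis.quit();
  });
});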

API Health (apps/api)

  • ✅ Job submission endpoint (/api/jobs POST); a smoke-test sketch follows this list
  • ✅ Health check endpoint (/health GET)
  • ✅ WebSocket connection stability
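
A hedged smoke-test sketch for the first two endpoints above. The API_URL variable and its default port are assumptions; the endpoint paths and job payload shape follow the E2E example later in this ADR.

typescript
import { describe, it, expect } from 'vitest';

const API_URL = process.env.API_URL || 'http://localhost:3000'; // assumed default port

describe('API health smoke tests', () => {
  it('should report healthy on GET /health', async () => {
    const res = await fetch(`${API_URL}/health`);
    expect(res.ok).toBe(true);
  });

  it('should accept a well-formed job on POST /api/jobs', async () => {
    const res = await fetch(`${API_URL}/api/jobs`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        workflow_id: `smoke-${Date.now()}`,
        service_required: 'comfyui',
        payload: { prompt: 'smoke test' }
      })
    });
    expect(res.ok).toBe(true);
  });
});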

Worker Execution (apps/worker)

  • ✅ Job processing flow (claim → execute → complete)
  • ✅ Failure classification (auth, rate limit, generation refusal)
  • ✅ Attestation generation (workflow-aware)
  • ✅ Retry logic (transient vs permanent failures)

Build Verification

  • ✅ All packages compile (no TypeScript errors)
  • ✅ No critical lint errors
  • ✅ Production builds succeed

Priority 1: Warning (Should Pass)

Webhook Delivery (apps/webhook-service)

  • ⚠️ Event delivery to webhook URLs
  • ⚠️ Retry logic on transient failures (sketched below)
  • ⚠️ Telemetry event formatting
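
One way to cover the retry item above without a real receiver is a nock interceptor that fails the first delivery and accepts the second. The deliverWebhook helper, its import path, and its options are hypothetical stand-ins for the service's actual delivery code, and this assumes the HTTP client in use is one nock can intercept (e.g. node-fetch or axios).

typescript
import { describe, it, expect } from 'vitest';
import nock from 'nock';
// Hypothetical import; the real delivery function lives somewhere in apps/webhook-service.
import { deliverWebhook } from '../src/delivery.js';

describe('Webhook retry on transient failure', () => {
  it('should retry after a 500 and succeed on the second attempt', async () => {
    const receiver = nock('https://example-customer.test')
      .post('/hooks/jobs')
      .reply(500)                // first attempt fails transiently
      .post('/hooks/jobs')
      .reply(200, { ok: true }); // retry succeeds

    await deliverWebhook({
      url: 'https://example-customer.test/hooks/jobs',
      event: { name: 'job.completed', job_id: 'job-1' },
      maxRetries: 2
    });

    expect(receiver.isDone()).toBe(true); // both interceptors were consumed
  });
});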

Telemetry Pipeline (apps/telemetry-collector)

  • ⚠️ Redis stream → OTLP conversion
  • ⚠️ Dash0 integration (mocked in CI)
  • ⚠️ Event validation and filtering

Priority 2: Nice to Have (Future)

Environment Management

  • 📋 Environment composition
  • 📋 Service discovery
  • 📋 Docker Compose generation

Machine Management

  • 📋 PM2 service orchestration
  • 📋 ComfyUI custom node installation
  • 📋 Health check reporting

EmProps API

  • 📋 Art generation nodes
  • 📋 Job evaluation logic
  • 📋 Pseudorandom utilities

Coverage Gap Analysis

Current Gaps (55 test files, but missing):

  1. ❌ Environment management system (newly documented)
  2. ❌ Docker build process
  3. ❌ PM2 service management
  4. ❌ ComfyUI custom node installation
  5. ❌ Machine registration flow
  6. ❌ WebSocket event flow (monitor)

Prioritization:

  • P0 (Block release): Critical production paths only
  • P1 (Warning): Important but non-blocking features
  • P2 (Nice to have): Developer experience, tooling

CI/CD Integration

Pre-Release Test Job (Detailed)

Complete GitHub Actions Configuration:

yaml
name: Pre-Release Testing
on:
  push:
    tags: ["v*"]
  workflow_dispatch:

jobs:
  test:
    name: Pre-Release Test Suite
    runs-on: ubuntu-latest
    timeout-minutes: 40  # Safety timeout

    services:
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Enable pnpm
        run: corepack enable  # pnpm must be available before setup-node can use cache: 'pnpm'

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'pnpm'

      - name: Get pnpm store directory
        id: pnpm-cache
        shell: bash
        run: |
          echo "STORE_PATH=$(pnpm store path)" >> $GITHUB_OUTPUT

      - name: Setup pnpm cache
        uses: actions/cache@v3
        with:
          path: ${{ steps.pnpm-cache.outputs.STORE_PATH }}
          key: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}
          restore-keys: |
            ${{ runner.os }}-pnpm-store-

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: TypeScript Compilation Check
        run: pnpm typecheck

      - name: Lint Check
        run: pnpm lint

      - name: Unit Tests
        run: pnpm test
        env:
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

      - name: Build Verification
        run: pnpm build
        env:
          NODE_ENV: production

      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: |
            **/coverage/
            **/.turbo/
          retention-days: 7

      - name: Test Summary
        if: success()
        run: |
          echo "## Test Results" >> $GITHUB_STEP_SUMMARY
          echo "✅ All pre-release tests passed" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "### Test Execution" >> $GITHUB_STEP_SUMMARY
          echo "- TypeScript compilation: ✅" >> $GITHUB_STEP_SUMMARY
          echo "- Lint checks: ✅" >> $GITHUB_STEP_SUMMARY
          echo "- Unit tests: ✅" >> $GITHUB_STEP_SUMMARY
          echo "- Build verification: ✅" >> $GITHUB_STEP_SUMMARY

    env:
      REDIS_URL: redis://localhost:6379
      NODE_ENV: test
      CI: true
      TURBO_ENV_MODE: loose  # Allow Turbo to access environment variables

  release:
    needs: [test]  # Blocks deployment if tests fail
    runs-on: ubuntu-latest
    steps:
      # ... existing release steps (unchanged)
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Build and release worker
        run: |
          # ... existing worker build logic

      - name: Deploy to Railway
        run: |
          # ... existing Railway deployment
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}

Test Execution Flow

Failure Handling

Test Failure Scenarios:

  1. TypeScript Compilation Fails

    • Error: Type errors in code
    • Action: Block deployment, show compilation errors
    • Fix: Address type errors before retrying
  2. Lint Check Fails

    • Error: Code style violations
    • Action: Block deployment, show lint errors
    • Fix: Run pnpm lint --fix or address manually
  3. Unit Tests Fail

    • Error: Business logic regression
    • Action: Block deployment, upload test results
    • Fix: Debug failing tests locally with pnpm test:ui
  4. Build Fails

    • Error: Production build errors
    • Action: Block deployment, show build logs
    • Fix: Address build errors (missing dependencies, etc.)

Notification Strategy:

  • GitHub Actions summary shows failure details
  • Slack notification on failure (optional Phase 2)
  • Email to release tag creator (optional Phase 2)

Phased Rollout Plan

Phase 1: Foundation (Week 1)

Goal: Establish CI infrastructure and basic test coverage

Tasks:

  1. ✅ Create pre-release test job in release.yml
  2. ✅ Configure Redis test container
  3. ✅ Add pnpm caching for faster CI
  4. ✅ Run existing unit tests (55 test files)
  5. ✅ Run typecheck and lint
  6. ✅ Run build verification

Deliverables:

  • Working CI pipeline with test gate
  • < 10 minute execution time (unit tests only)
  • Clear failure reporting

Success Criteria:

  • CI passes on current staging branch
  • No flaky tests (3 consecutive successful runs)
  • Documentation for running tests locally

Rollout:

  • Deploy to staging first (v1.2.3-staging tag)
  • Monitor for false positives
  • Enable for production after 3 successful staging releases

Phase 2: Integration Coverage (Week 2)

Goal: Add integration tests for critical components

Tasks:

  1. ✅ Run Redis function integration tests
  2. ✅ Run API integration tests (job submission)
  3. ✅ Run worker integration tests
  4. ✅ Configure test environment secrets
  5. ⚠️ Add coverage reporting (optional)

Deliverables:

  • Integration tests running in CI
  • < 20 minute total execution time
  • Coverage reports uploaded

Success Criteria:

  • Integration tests pass reliably
  • No Redis connection issues in CI
  • Clear error messages on failures

Challenges:

  • Shared Redis instance (sequential tests)
  • External API mocking (nock configuration)
  • Test data cleanup between runs

Phase 3: E2E Coverage (Week 3)

Goal: Validate critical user flows end-to-end

Tasks:

  1. ✅ Add job submission → worker execution E2E test
  2. ✅ Add telemetry pipeline E2E test
  3. ⚠️ Add webhook delivery E2E test
  4. ⚠️ Configure Dash0 test environment
  5. ⚠️ Add machine registration flow test

Deliverables:

  • E2E tests for critical paths
  • < 35 minute total execution time
  • Real-time event verification

Success Criteria:

  • E2E tests detect real regressions
  • No false positives from timing issues
  • Clear test output showing flow progression

Challenges:

  • Timing-sensitive tests (use explicit waits, not sleeps)
  • External service mocking (Dash0, OpenAI)
  • Test isolation (parallel execution conflicts)

Phase 4: Optimization (Week 4)

Goal: Improve CI speed and developer experience

Tasks:

  1. ✅ Parallelize independent test suites
  2. ✅ Add test result caching
  3. ✅ Document test writing guidelines
  4. ✅ Add visual test UI documentation
  5. ✅ Create troubleshooting guide

Deliverables:

  • Optimized CI execution (target < 30 minutes)
  • Comprehensive testing documentation
  • Developer onboarding guide

Success Criteria:

  • < 30 minute CI execution time
  • Developers can add tests without assistance
  • Zero flaky tests (1 week observation)

Success Metrics

Before Implementation

Current State:

  • ❌ 0% automated test coverage in CI
  • ❌ Manual testing before releases (inconsistent)
  • ❌ Unknown production readiness
  • ⏰ Hours spent debugging production issues
  • 😰 Fear of shipping (slows innovation)

After Implementation

Phase 1 (Foundation):

  • ✅ 100% of releases gated by unit tests
  • ✅ < 10 minute CI feedback time
  • ✅ Typecheck + lint + build verification automated
  • 📊 Baseline metrics established

Phase 2 (Integration):

  • ✅ Redis function coverage (atomic job matching)
  • ✅ API endpoint coverage (job submission)
  • ✅ Worker integration coverage (job processing)
  • 📊 < 20 minute CI execution time

Phase 3 (E2E):

  • ✅ Critical user flow coverage
  • ✅ Telemetry pipeline validation
  • ✅ Webhook delivery verification
  • 📊 < 35 minute CI execution time

Phase 4 (Optimization):

  • ✅ < 30 minute CI execution time
  • ✅ Zero flaky tests (1 week observation)
  • ✅ Developer documentation complete
  • 📊 Production incidents -50% vs baseline

Ongoing Metrics

Test Reliability:

  • Test pass rate: Target > 95% (excluding legitimate failures)
  • Flaky test rate: Target < 2% (tests that fail randomly)
  • CI execution time: Target < 30 minutes

Production Impact:

  • Incidents caused by releases: Target -50% reduction
  • Mean time to detect (MTTD): Target < 5 minutes (CI feedback)
  • Mean time to resolve (MTTR): Target -30% reduction

Developer Experience:

  • Time to add new test: Target < 30 minutes
  • Test documentation completeness: Target 100%
  • Developer satisfaction: Survey after 1 month

Dashboard Metrics (future):

  • Test execution trends (speed over time)
  • Coverage trends (% of code covered)
  • Failure trends (which tests fail most)

Consequences

Positive

1. Production Safety

  • ✅ No untested code reaches production
  • ✅ Objective pass/fail criteria for releases
  • ✅ Early detection of regressions
  • ✅ Reduced customer-facing incidents

2. Developer Confidence

  • ✅ Safe refactoring with test coverage
  • ✅ Fast feedback on changes (local + CI)
  • ✅ Clear error messages guide debugging
  • ✅ Less fear of breaking production

3. Team Velocity

  • ✅ Automated testing faster than manual
  • ✅ Parallel development without conflicts
  • ✅ Onboarding documentation (tests as examples)
  • ✅ Less time firefighting production issues

4. Engineering Culture

  • ✅ Quality-first mindset
  • ✅ Documentation as code (tests document behavior)
  • ✅ Continuous improvement (add tests for bugs)
  • ✅ Reduced technical debt

5. Business Impact

  • ✅ Customer trust from reliable service
  • ✅ Revenue protection (fewer outages)
  • ✅ Competitive advantage (ship faster safely)
  • ✅ Engineering reputation

Negative

1. Initial Investment

  • ❌ 1-2 weeks to implement (40-80 hours)
  • ❌ Learning curve for test writing
  • ❌ CI setup complexity
  • Mitigation: Phased rollout, pair programming, documentation

2. Ongoing Maintenance

  • ❌ Tests need updating with code changes
  • ❌ Flaky tests can block releases
  • ❌ CI costs (GitHub Actions minutes)
  • Mitigation: Fix flaky tests immediately, optimize CI, monitor costs

3. False Negatives

  • ❌ Flaky tests block valid releases
  • ❌ Environment differences (CI vs production)
  • ❌ Timing-sensitive tests fail randomly
  • Mitigation: Retry logic, explicit waits, test isolation

4. Developer Workflow

  • ❌ Slower releases (35 min CI vs immediate)
  • ❌ Test failures require debugging
  • ❌ Pre-commit hooks slow local workflow
  • Mitigation: Fast local tests, clear errors, optional pre-commit

5. Coverage Gaps

  • ❌ Cannot test every scenario
  • ❌ Integration tests miss production differences
  • ❌ E2E tests miss edge cases
  • Mitigation: Focus on critical paths, production monitoring, iterative improvement

Risk Mitigation Strategies

Flaky Tests:

  • Problem: Random failures block releases
  • Solution:
    • Explicit waits instead of sleep() (see the polling helper sketched below)
    • Test isolation (no shared state)
    • Retry logic for network-dependent tests
    • Quarantine flaky tests until fixed
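
One way to implement the "explicit waits" item above is a small polling helper that retries a check until it yields a value or a deadline passes. This is a generic sketch, not existing project code.

typescript
// Poll an async condition until it returns a truthy value or the timeout expires.
// Replaces fixed sleep() calls in timing-sensitive tests.
export async function waitFor<T>(
  check: () => Promise<T | null | undefined>,
  { timeoutMs = 15000, intervalMs = 250 } = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;

  while (Date.now() < deadline) {
    try {
      const value = await check();
      if (value !== null && value !== undefined) return value;
    } catch (err) {
      lastError = err; // keep polling; surface the last error only on timeout
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`waitFor timed out after ${timeoutMs}ms: ${String(lastError ?? 'condition never met')}`);
}

// Example: wait for a job status to reach 'completed' instead of sleeping 5 seconds.
// const status = await waitFor(async () => {
//   const value = await redis.hget('job:test-job-1', 'status');
//   return value === 'completed' ? value : null;
// });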

CI Costs:

  • Problem: GitHub Actions minutes usage
  • Solution:
    • Cache dependencies aggressively
    • Parallelize independent tests
    • Monitor usage with cost alerts
    • Self-hosted runners if cost prohibitive

Developer Frustration:

  • Problem: Tests seen as blocker
  • Solution:
    • Fast local test execution
    • Clear error messages
    • Visual test UI (vitest --ui)
    • Pair programming for test writing

Alternatives Considered

Alternative 1: No Automated Testing ❌

Approach: Continue manual testing before releases

Pros:

  • No initial investment
  • No CI setup complexity
  • No flaky test issues

Cons:

  • High risk: 81 commits untested
  • Slow: Manual testing takes hours
  • Incomplete: Cannot test all scenarios
  • Unreliable: Human error, inconsistent coverage
  • Not scalable: Slows as codebase grows

Verdict: REJECTED - Too risky with significant changes pending

Alternative 2: Post-Deployment Testing Only ⚠️

Approach: Deploy to staging, run tests, promote to production

Pros:

  • Tests run in production-like environment
  • Catches environment-specific issues
  • No CI setup required

Cons:

  • ⚠️ Customers affected: Staging breakage delays production
  • ⚠️ Slower: Deploy → test → rollback → fix cycle
  • ⚠️ Manual: Requires human intervention
  • ⚠️ Partial: Doesn't catch pre-deployment issues

Verdict: REJECTED - Useful as smoke tests but insufficient alone

Alternative 3: Staging Environment Testing Only ⚠️

Approach: Require staging deployment before production

Pros:

  • Real environment testing
  • Catches configuration issues
  • Minimal CI setup

Cons:

  • ⚠️ Drift risk: Staging ≠ production (different configs)
  • ⚠️ Manual: Human validation required
  • ⚠️ Slow: Deploy → test → deploy cycle
  • ⚠️ Incomplete: Doesn't test all code paths

Verdict: PARTIAL - Good practice but not sufficient, use alongside automated tests

Alternative 4: Feature Flags + Gradual Rollout ✅

Approach: Use feature flags to control new features, gradual rollout

Pros:

  • ✅ Control blast radius (limit affected users)
  • ✅ A/B testing capability
  • ✅ Quick rollback (disable flag)
  • ✅ Production testing with real traffic

Cons:

  • ⚠️ Complexity: Flag management overhead
  • ⚠️ Technical debt: Old flags linger
  • ⚠️ Not comprehensive: Doesn't replace testing

Verdict: COMPLEMENTARY - Use alongside testing, not instead of

Alternative 5: Contract Testing (Consumer-Driven) 🤔

Approach: Define contracts between services, test independently

Pros:

  • Decoupled service testing
  • Parallel development
  • Clear interface definitions

Cons:

  • ⚠️ Complexity: Contract management overhead
  • ⚠️ Tooling: Pact.js setup required
  • ⚠️ Incomplete: Doesn't test full integration

Verdict: FUTURE - Good for microservices, overkill for current maturity

Alternative 6: Mutation Testing 🤔

Approach: Modify code (mutants) to verify tests catch bugs

Pros:

  • Verifies test quality
  • Catches weak assertions
  • Improves coverage

Cons:

  • ⚠️ Slow: 10x longer execution time
  • ⚠️ Overkill: Current maturity doesn't justify
  • ⚠️ Diminishing returns: High cost for marginal benefit

Verdict: FUTURE - Consider after baseline coverage established

Decision Matrix

| Alternative | Safety | Speed | Cost | Complexity | Verdict |
| --- | --- | --- | --- | --- | --- |
| Test Pyramid (Chosen) | ✅ High | ✅ Fast | ⚠️ Medium | ⚠️ Medium | RECOMMENDED |
| No Testing | ❌ Low | ✅ Fast | ✅ Low | ✅ Low | ❌ Rejected |
| Post-Deployment | ⚠️ Medium | ❌ Slow | ✅ Low | ✅ Low | ❌ Rejected |
| Staging Only | ⚠️ Medium | ❌ Slow | ✅ Low | ✅ Low | ⚠️ Partial |
| Feature Flags | ✅ High | ✅ Fast | ⚠️ Medium | ❌ High | ✅ Complementary |
| Contract Testing | ✅ High | ✅ Fast | ⚠️ Medium | ❌ High | 🤔 Future |
| Mutation Testing | ✅ High | ❌ Slow | ❌ High | ❌ High | 🤔 Future |

Open Questions

Q1: Test Coverage Thresholds

Question: Should we enforce minimum code coverage percentages (e.g., 80%)?

Considerations:

  • Pros: Objective quality metric, forces coverage
  • Cons: Encourages low-quality tests, diminishing returns
  • Alternative: Focus on critical path coverage, not percentage

Recommendation: No coverage thresholds initially. Focus on quality over quantity. Revisit after baseline coverage established.


Q2: PR Testing vs Release Testing

Question: Should tests run on every PR to master, or only on releases?

Options:

Option A: PR Testing Only

  • Tests run on pull_request to master
  • Feedback before merge
  • No release-time testing

Option B: Release Testing Only

  • Tests run on git tags
  • Fast PR merges
  • Risk of broken master

Option C: Both PR + Release (Recommended)

  • Tests on PR (fast subset)
  • Full tests on release (comprehensive)
  • Best safety, some duplication

Recommendation: Option C - Run fast tests on PR (unit + typecheck), full suite on releases.


Q3: External API Handling

Question: How do we test integrations with external APIs (OpenAI, Anthropic, Dash0)?

Options:

Option A: Mock All External APIs

  • Use nock to intercept HTTP calls
  • Fast, deterministic tests
  • Risk: Mocks drift from reality

Option B: Test Against Real APIs

  • Use test accounts/keys
  • Real integration validation
  • Risk: Slow, flaky, costs money

Option C: Hybrid (Recommended)

  • Mock in CI (fast, deterministic)
  • Real API testing in staging
  • Best of both worlds

Recommendation: Option C - Mock external APIs in CI with nock, run real API tests in staging environment.
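
A sketch of the hybrid approach: in CI the OpenAI endpoint is intercepted with nock, while a staging run opts into the real API via a flag. The USE_REAL_APIS flag, endpoint pattern, and canned response body are illustrative assumptions.

typescript
import nock from 'nock';
import { beforeAll, afterAll } from 'vitest';

const useRealApis = process.env.USE_REAL_APIS === 'true'; // set only in the staging test run

beforeAll(() => {
  if (useRealApis) return; // staging: exercise the real integration

  // CI: intercept OpenAI calls with a canned, deterministic response.
  nock('https://api.openai.com')
    .persist()
    .post(/\/v1\/.*/)
    .reply(200, { id: 'mock-response', status: 'completed' }); // simplified placeholder body
});

afterAll(() => {
  if (!useRealApis) nock.cleanAll();
});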


Q4: Test Retry Logic

Question: Should we automatically retry flaky tests in CI?

Considerations:

  • Pros: Reduces false negatives from timing issues
  • Cons: Masks real problems, slower CI
  • Alternative: Fix flaky tests immediately

Recommendation: Limited retries - Retry network-dependent tests (max 2 attempts), no retries for unit tests. Track retry rate as metric.
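
Vitest supports per-test retries, which keeps the limited-retry policy above local to network-dependent tests instead of applying it globally. The sketch below reuses the Dash0 endpoint and environment variables shown earlier; the query helper is a simplified stand-in, not the real E2E helper.

typescript
import { describe, it, expect } from 'vitest';

// Minimal stand-in for a Dash0 query; the real E2E helper would filter spans by workflow_id.
async function queryDash0Spans(): Promise<Response> {
  return fetch('https://api.us-west-2.aws.dash0.com/api/spans', {
    method: 'POST',
    headers: {
      authorization: `Bearer ${process.env.DASH0_AUTH_TOKEN}`,
      'content-type': 'application/json'
    },
    body: JSON.stringify({ dataset: process.env.DASH0_DATASET })
  });
}

describe('Dash0 span query (network-dependent)', () => {
  // Allow up to 2 retries for this network call; unit tests keep the default of 0 retries.
  it('should reach the Dash0 spans endpoint', { retry: 2 }, async () => {
    const res = await queryDash0Spans();
    expect(res.ok).toBe(true);
  });
});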


Q5: Parallel Test Execution

Question: Should we invest in parallel test execution now or later?

Current State:

  • Sequential execution: ~35 minutes
  • Parallel potential: ~20 minutes (estimate)
  • Cost: 1-2 days implementation

Recommendation: Later - Optimize during Phase 4 if CI time exceeds 35 minutes. Focus on coverage first, speed second.


Q6: Self-Hosted Runners

Question: Should we use GitHub self-hosted runners instead of GitHub-hosted?

Considerations:

GitHub-Hosted (Current):

  • ✅ Zero maintenance
  • ✅ Always available
  • ❌ Slower startup
  • ❌ Monthly cost

Self-Hosted:

  • ✅ Faster execution
  • ✅ Lower cost at scale
  • ❌ Maintenance overhead
  • ❌ Security concerns

Recommendation: GitHub-hosted initially - Monitor costs, switch to self-hosted if costs exceed $100/month or CI time exceeds 40 minutes.



Implementation Checklist

Phase 1: Foundation (Week 1)

  • [ ] Create test job in .github/workflows/release.yml
  • [ ] Configure Redis service container
  • [ ] Add pnpm caching
  • [ ] Run existing unit tests (turbo run test)
  • [ ] Add typecheck step
  • [ ] Add lint step
  • [ ] Add build verification step
  • [ ] Test on staging branch (v1.2.3-staging)
  • [ ] Document CI setup in README
  • [ ] Update ADR index with this ADR

Phase 2: Integration (Week 2)

  • [ ] Add Redis function integration tests to CI
  • [ ] Add API integration tests to CI
  • [ ] Configure test environment secrets
  • [ ] Add coverage reporting (optional)
  • [ ] Document integration test patterns
  • [ ] Fix any flaky integration tests

Phase 3: E2E (Week 3)

  • [ ] Add job submission E2E test
  • [ ] Add telemetry pipeline E2E test
  • [ ] Add webhook delivery E2E test
  • [ ] Configure Dash0 test environment
  • [ ] Document E2E test patterns
  • [ ] Fix timing-sensitive test issues

Phase 4: Optimization (Week 4)

  • [ ] Parallelize independent test suites
  • [ ] Add test result caching
  • [ ] Document test writing guidelines
  • [ ] Create troubleshooting guide
  • [ ] Monitor and optimize CI time
  • [ ] Collect baseline metrics

Post-Implementation

  • [ ] Review success metrics after 1 month
  • [ ] Survey developer satisfaction
  • [ ] Identify coverage gaps
  • [ ] Plan next iteration improvements

Approval and Next Steps

Approval Required From:

  • [ ] Engineering Team Lead
  • [ ] DevOps/Infrastructure Lead
  • [ ] QA Lead (if applicable)

Next Steps After Approval:

  1. Create GitHub issue tracking implementation
  2. Assign Phase 1 tasks to engineer(s)
  3. Schedule kickoff meeting
  4. Begin Phase 1 implementation
  5. Review and iterate based on feedback

Questions or Feedback: Contact Architecture Team or post in #architecture Slack channel.


Document Version: 1.0 Last Updated: 2025-10-08 Author: Claude Code (AI Agent) Reviewers: (to be added after team review)
