ADR-002: Pre-Release Testing Strategy for Production Deployments
Date: 2025-10-08
Status: 🤔 Proposed
Decision Makers: Engineering Team
Approval Required: Before implementing CI/CD test gates
Related ADRs: ADR-001: Encrypted Environment Variables, Docker Swarm Migration Analysis
Executive Summary
This ADR proposes a comprehensive pre-release testing strategy to prevent production breakage from untested code. The system will gate all production deployments (Railway.app services) behind an automated test suite running in GitHub Actions.
Current Risk:
- ✅ 81 commits ahead of master on staging branch
- ❌ No automated testing before releases
- ❌ Production deployments triggered directly on git tags
- ⚠️ Production breakage risk from untested changes
Proposed Solution:
- 🎯 Test pyramid strategy with unit, integration, and E2E tests
- 🛡️ CI/CD test gate blocking releases on test failures
- ⚡ < 35 minute total CI execution time
- 📊 Comprehensive coverage across critical production paths
Impact:
- Before: Manual testing, uncertain production readiness, reactive bug fixes
- After: Automated validation, objective deployment criteria, proactive quality assurance
Table of Contents
- Context
- Problem Statement
- Decision
- Test Architecture
- Implementation Strategy
- Test Coverage Requirements
- CI/CD Integration
- Phased Rollout Plan
- Success Metrics
- Consequences
- Alternatives Considered
- Open Questions
Context
Current State Analysis
Existing Test Infrastructure:
- ✅ Vitest configured across all packages
- ✅ 55+ test files across the monorepo
- ✅ Test scripts in package.json (pnpm test, pnpm test:api, etc.)
- ✅ turbo.json configured with test task dependencies
- ✅ Release workflow exists (.github/workflows/release.yml)
Test Coverage by Package:
| Package | Test Files | Coverage Areas |
|---|---|---|
| packages/core/ | Redis functions integration tests | Job matching, capability matching, atomic claiming |
| apps/api/ | Integration + E2E tests | Job submission, connector integration, telemetry pipeline |
| apps/worker/ | Failure classification, attestation | Error handling, retry logic, failure classification |
| apps/webhook-service/ | Telemetry pipeline E2E | Event delivery, retry logic, OTLP integration |
| apps/monitor/ | Component tests | Redis connections, machine status, attestation system |
| packages/telemetry/ | OTLP integration | Telemetry client, event formatting, Dash0 integration |
| apps/emprops-api/ | Unit tests | Linter, pseudorandom, art-gen nodes |
Current Release Workflow:
```yaml
# .github/workflows/release.yml
on:
  push:
    tags: ["v*"]  # Triggers on git tags
  workflow_dispatch:

jobs:
  build-and-release:
    steps:
      - Checkout code
      - Build worker bundle
      - Create GitHub release
      - Deploy to Railway (production or staging)
```
Critical Production Services (from release.yml):
- Production: q-emprops-api, q-job-api, q-webhook, q-telcollect, openai-machine, openai-response, gemini
- Staging: stg.* versions of the same services
Current Deployment Model:
```
Git Tag (v1.2.3)
      ↓
Build Worker Bundle
      ↓
Create GitHub Release
      ↓
Deploy to Railway.app
      ↓
PRODUCTION (no test gate!)
```
Infrastructure Reality
Distributed Architecture:
- API Server: Lightweight Redis orchestration (Railway.app)
- Webhook Service: Event delivery, telemetry pipeline (Railway.app)
- Telemetry Collector: OTLP bridge, Dash0 integration (Railway.app)
- EmProps API: Art generation, job evaluation (Railway.app)
- Worker Machines: Ephemeral GPU compute (SALAD, vast.ai, RunPod)
Production Constraints:
- Real-time systems: WebSocket-based monitoring, job progress updates
- Paying customers: Production breakage affects revenue and reputation
- Distributed state: Redis-based coordination, event-driven architecture
- External dependencies: OpenAI, Anthropic, ComfyUI, Dash0
Business Impact
Current Risk:
- 81 commits ahead of master without comprehensive testing
- Production deployments lack objective quality criteria
- Debugging production issues wastes engineering time
- Customer-facing outages damage trust and revenue
Cost of No Testing:
- ⏰ Time: Hours spent debugging production issues
- 💰 Revenue: Customer churn from unreliable service
- 😰 Stress: On-call firefighting instead of feature development
- 📉 Velocity: Fear of shipping slows innovation
Problem Statement
Requirements
R1: Pre-Release Validation
- All production releases must pass automated tests before deployment
- Tests must validate critical user flows and system integrity
- Test failures must block deployment automatically
R2: Comprehensive Coverage
- Unit tests for business logic (< 5 minutes)
- Integration tests for component interactions (< 10 minutes)
- E2E tests for critical user flows (< 15 minutes)
- Build verification for type safety and linting (< 5 minutes)
R3: Fast Feedback
- Total CI execution time < 35 minutes (acceptable for release gate)
- Parallel test execution where possible
- Clear failure reporting with actionable errors
R4: Maintainability
- Tests run reliably in CI environment
- No flaky tests blocking releases
- Clear documentation for adding new tests
R5: Developer Experience
- Tests run locally before commit (pre-commit hooks)
- Visual test UI available (vitest --ui)
- Incremental adoption without blocking current workflow
Non-Requirements
- ❌ 100% code coverage: Focus on critical paths, not every line
- ❌ Mutation testing: Overkill for current maturity level
- ❌ Performance benchmarks: Separate concern, not blocking releases
- ❌ Cross-browser testing: Backend services, not applicable
Decision
Adopt Test Pyramid Strategy with CI/CD Gates
We will implement a comprehensive pre-release testing strategy based on the test pyramid model, with automated gates in GitHub Actions blocking production deployments on test failures.
Core Principles:
- Test Pyramid: Many fast unit tests, fewer integration tests, minimal E2E tests
- CI/CD Gates: All tests must pass before deployment proceeds
- Incremental Adoption: Phase in coverage over 4 weeks
- Fast Feedback: Total CI time < 35 minutes
- Actionable Failures: Clear error messages guide debugging
Test Levels:
```
          /\
         /E2E\          ← Slow (15 min), few tests, high confidence
        /------\           Critical user flows, telemetry pipeline
      /Integration\     ← Medium (10 min), moderate coverage
     /------------\        Redis functions, API endpoints, connectors
    /  Unit Tests  \    ← Fast (5 min), many tests, quick feedback
   /----------------\      Business logic, failure classification, utilities
```
Implementation Approach:
- Option A (RECOMMENDED): Pre-release job in release.yml blocking deployment
- Phase 1 (Week 1): Foundation - Unit tests + build verification
- Phase 2 (Week 2): Integration tests for Redis and API
- Phase 3 (Week 3): E2E tests for critical user flows
- Phase 4 (Week 4): Optimization and documentation
Test Architecture
Test Pyramid Breakdown
Level 1: Unit Tests (< 5 minutes total)
Purpose: Fast feedback on business logic correctness
Scope:
- Failure classification logic (`apps/worker/src/__tests__/failure-classification.test.ts`)
- Attestation generation (`apps/worker/src/__tests__/failure-attestation.test.ts`)
- Retry count extraction (`apps/worker/src/__tests__/retry-count-extraction.test.ts`)
- EmProps API nodes (`apps/emprops-api/src/modules/art-gen/nodes/*.test.ts`)
- Utility functions (linter, pseudorandom, etc.)
Characteristics:
- No external dependencies (mocked Redis, HTTP clients; see the mocking sketch below)
- Deterministic inputs and outputs
- Isolated test cases (no shared state)
- Execute in parallel across packages
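Where a unit touches Redis or HTTP, the dependency can be mocked at module level so no real connection is opened. A minimal sketch using Vitest's vi.mock (the mocked methods and return values are illustrative, not existing project code):

```typescript
// Hypothetical module-level mock: unit tests never open a real Redis connection.
import { describe, it, expect, vi } from 'vitest';
import Redis from 'ioredis';

// vi.mock is hoisted above the imports, so `Redis` below resolves to this stand-in.
vi.mock('ioredis', () => ({
  default: vi.fn().mockImplementation(() => ({
    hgetall: vi.fn().mockResolvedValue({ status: 'pending' }),
    quit: vi.fn().mockResolvedValue('OK'),
  })),
}));

describe('unit test with mocked Redis', () => {
  it('reads job state without a live connection', async () => {
    const redis = new Redis();
    await expect(redis.hgetall('job:test')).resolves.toEqual({ status: 'pending' });
  });
});
```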
Example Test:
```typescript
// apps/worker/src/__tests__/failure-classification.test.ts
import { describe, it, expect } from 'vitest';
import { FailureClassifier, FailureType, FailureReason } from '../types/failure-classification.js';

describe('Failure Classification System', () => {
  it('should classify HTTP authentication errors correctly', () => {
    const error = 'Request failed with status code 401 - Invalid API key';
    const context = { httpStatus: 401, serviceType: 'openai_responses' };

    const result = FailureClassifier.classify(error, context);

    expect(result.failure_type).toBe(FailureType.AUTH_ERROR);
    expect(result.failure_reason).toBe(FailureReason.INVALID_API_KEY);
  });
});
```

Turbo Command:

```bash
turbo run test --filter='!apps/api' --filter='!apps/webhook-service'
```

Level 2: Integration Tests (< 10 minutes total)
Purpose: Validate component interactions with real dependencies
Scope:
- Redis function integration (`packages/core/src/redis-functions/__tests__/integration.test.ts`)
- API job submission (`apps/api/src/__tests__/connector-integration.e2e.test.ts`)
- Worker job processing (`apps/worker/` integration tests)
- Webhook delivery flow (`apps/webhook-service/__tests__/webhook-server.test.ts`)
Characteristics:
- Real Redis instance (Docker container or in-memory)
- Real HTTP servers (test instances)
- Controlled external dependencies (mock external APIs)
- Shared test infrastructure (setup/teardown)
Example Test:
```typescript
// packages/core/src/redis-functions/__tests__/integration.test.ts
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import Redis from 'ioredis';
import { RedisFunctionInstaller } from '../installer.js';

describe('Redis Function Integration Tests', () => {
  let redis: Redis;

  beforeAll(async () => {
    redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
    const installer = new RedisFunctionInstaller(redis);
    await installer.installOrUpdate();
  });

  afterAll(async () => {
    await redis.quit();
  });

  it('should match worker with compatible service', async () => {
    // Setup job and worker in Redis
    const jobId = 'test-job-1';
    await redis.hmset(`job:${jobId}`, {
      id: jobId,
      service_required: 'comfyui',
      priority: '100',
      status: 'pending'
    });
    await redis.zadd('jobs:pending', 100, jobId);

    const worker = {
      worker_id: 'worker-1',
      job_service_required_map: ['comfyui', 'a1111']
    };

    // Test Redis function
    const result = await redis.fcall('findMatchingJob', 0, JSON.stringify(worker), '10');

    expect(result).not.toBeNull();
    expect(JSON.parse(result).jobId).toBe(jobId);
  });
});
```

Turbo Command:

```bash
turbo run test:integration
```

Level 3: E2E Tests (< 15 minutes total)
Purpose: Validate critical user flows end-to-end
Scope:
- Job submission → Worker execution → Webhook delivery
- Telemetry pipeline: API → Telemetry Collector → Dash0
- Machine registration → Job claiming → Progress updates
- Workflow execution with multiple steps
Characteristics:
- Full system integration (API + Worker + Webhook + Telemetry)
- Real external services (mocked APIs via nock)
- Real-time event verification (WebSocket, Redis streams)
- Longer execution times (10-15 seconds per test)
Example Test:
```typescript
// apps/webhook-service/src/__tests__/telemetry-pipeline.e2e.test.ts
import { describe, it, expect } from 'vitest';
import fetch from 'node-fetch';

describe('Telemetry Pipeline E2E', () => {
  it('should deliver job.received event to Dash0', async () => {
    const workflowId = `e2e-test-${Date.now()}`;

    // STEP 1: Submit job to API
    const response = await fetch(`${API_URL}/api/jobs`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        workflow_id: workflowId,
        service_required: 'comfyui',
        payload: { prompt: 'test' }
      })
    });
    expect(response.ok).toBe(true);

    // STEP 2: Wait for telemetry processing
    await new Promise(resolve => setTimeout(resolve, 5000));

    // STEP 3: Verify event in Dash0
    const dash0Response = await fetch('https://api.us-west-2.aws.dash0.com/api/spans', {
      method: 'POST',
      headers: {
        'authorization': `Bearer ${DASH0_AUTH_TOKEN}`,
        'content-type': 'application/json'
      },
      body: JSON.stringify({
        timeRange: { from: startTime, to: endTime },
        dataset: DASH0_DATASET
      })
    });

    const events = await dash0Response.json();
    const jobReceivedEvent = findEventByWorkflowId(events, workflowId);

    expect(jobReceivedEvent).toBeDefined();
    expect(jobReceivedEvent.name).toBe('job.received');
  });
});
```

Turbo Command:

```bash
turbo run test:e2e
```

Level 4: Build Verification (< 5 minutes total)
Purpose: Ensure code compiles and meets quality standards
Scope:
- TypeScript compilation: `turbo run typecheck`
- Lint checks: `turbo run lint`
- Production builds: `turbo run build`
Characteristics:
- No runtime execution
- Fast parallel execution
- Catches type errors and style violations
Turbo Commands:
```bash
turbo run typecheck   # ~2 minutes
turbo run lint        # ~1 minute
turbo run build       # ~2 minutes
```
Total CI Time Budget
| Test Level | Time Budget | Parallelization |
|---|---|---|
| Unit Tests | < 5 min | ✅ Parallel across packages |
| Integration Tests | < 10 min | ⚠️ Sequential (shared Redis) |
| E2E Tests | < 15 min | ⚠️ Sequential (shared services) |
| Build Verification | < 5 min | ✅ Parallel (typecheck, lint, build) |
| TOTAL | < 35 min | Mixed strategy |
Optimization Strategies:
- Cache pnpm dependencies between runs
- Cache Turbo build outputs
- Run unit tests + build verification in parallel
- Run integration + E2E sequentially (shared dependencies)
Implementation Strategy
CI/CD Integration (Option A: Recommended)
Approach: Add pre-release test job to existing release.yml workflow
Workflow Structure:
```yaml
# .github/workflows/release.yml
name: Release Worker Bundle

on:
  push:
    tags: ["v*"]
  workflow_dispatch:

jobs:
  # NEW: Pre-release test gate
  test:
    name: Pre-Release Test Suite
    runs-on: ubuntu-latest
    env:
      REDIS_URL: redis://localhost:6379
      NODE_ENV: test
    services:
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Enable pnpm
        run: corepack enable  # pnpm must be on PATH before setup-node's cache: 'pnpm'
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'pnpm'
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Typecheck
        run: pnpm typecheck
      - name: Lint
        run: pnpm lint
      - name: Unit Tests
        run: pnpm test
        env:
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test
      - name: Build
        run: pnpm build
      - name: Upload test results
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: |
            **/coverage/
            **/.turbo/

  # MODIFIED: Block on test success
  build-and-release:
    needs: [test]  # ← Deployment blocked if tests fail
    runs-on: ubuntu-latest
    steps:
      # ... existing build and release steps
      - name: Build and package worker
        run: |
          # ... existing worker build logic
      - name: Deploy to Railway
        run: |
          # ... existing Railway deployment
```
Key Changes:
- New `test` job: Runs all pre-release tests
- Dependency: `build-and-release` needs `test` to succeed
- Redis service: Provides a test Redis instance
- Environment variables: REDIS_URL, NODE_ENV for test execution
- Artifact upload: Preserve test results on failure
GitHub Secrets Required
Test Execution:
- `REDIS_TEST_URL` - Test Redis instance (can use the service container)
- `NODE_ENV=test` - Environment marker
External API Mocking (Phase 2+):
- `HF_TOKEN` - For model download tests (optional, can mock)
- `OPENAI_API_KEY` - For OpenAI connector tests (optional, can mock)
- `CIVITAI_TOKEN` - For CivitAI model tests (optional, can mock)
Telemetry Testing (Phase 3):
- `DASH0_AUTH_TOKEN` - For E2E telemetry verification
- `DASH0_DATASET` - Test dataset for span verification
Note: Use GitHub Actions secrets management, not repository secrets in code.
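Optional suites that depend on these secrets can be gated on their presence so CI stays green when a token is not configured. A minimal sketch using Vitest's skipIf (the suite and assertion are illustrative):

```typescript
// Hypothetical guard: skip Dash0-dependent specs when the secrets are not configured.
import { describe, it, expect } from 'vitest';

const hasDash0 = Boolean(process.env.DASH0_AUTH_TOKEN && process.env.DASH0_DATASET);

// describe.skipIf marks the whole suite as skipped instead of failing on a missing secret.
describe.skipIf(!hasDash0)('Dash0 telemetry verification (requires secrets)', () => {
  it('reads credentials from the environment', () => {
    expect(process.env.DASH0_AUTH_TOKEN).toBeDefined();
  });
});
```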
Test Infrastructure Setup
Redis Test Container
GitHub Actions Service:
```yaml
services:
  redis:
    image: redis:7-alpine
    ports:
      - 6379:6379
    options: >-
      --health-cmd "redis-cli ping"
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
```
Local Development:
```bash
# Use local Redis (already running via dev:local-redis)
pnpm dev:local-redis

# Run tests
pnpm test
```
Environment Configuration
CI Environment Variables:
```yaml
env:
  REDIS_URL: redis://localhost:6379
  NODE_ENV: test
  CI: true
```
Test Configuration (vitest.config.ts):
```typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    environment: 'node',
    globals: true,
    setupFiles: ['./test/setup.ts'],
    testTimeout: 30000, // 30 seconds for integration tests
    hookTimeout: 30000,
    pool: 'forks', // Isolate test processes
    poolOptions: {
      forks: {
        singleFork: false,
        isolate: true
      }
    }
  }
});
```
Test Coverage Requirements
Priority 0: Block Release (Must Pass)
Redis Job Matching (packages/core)
- ✅ Atomic job claiming (no race conditions)
- ✅ Worker capability matching (service compatibility)
- ✅ Priority ordering (highest priority first)
- ✅ Customer isolation (strict/loose modes)
- ✅ Model requirements matching
API Health (apps/api)
- ✅ Job submission endpoint (/api/jobs POST)
- ✅ Health check endpoint (/health GET)
- ✅ WebSocket connection stability
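A smoke test covering these endpoints might look like the following sketch (the API_URL fallback and response expectations are assumptions, not the real API contract):

```typescript
// Hypothetical smoke test for the P0 API endpoints listed above.
import { describe, it, expect } from 'vitest';

const API_URL = process.env.API_URL ?? 'http://localhost:3000';

describe('API health (P0)', () => {
  it('responds on /health', async () => {
    const res = await fetch(`${API_URL}/health`);
    expect(res.ok).toBe(true);
  });

  it('accepts a job submission on /api/jobs', async () => {
    const res = await fetch(`${API_URL}/api/jobs`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ service_required: 'comfyui', payload: { prompt: 'smoke test' } })
    });
    expect(res.status).toBeLessThan(500); // assumption: any non-5xx means the endpoint is alive
  });
});
```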
Worker Execution (apps/worker)
- ✅ Job processing flow (claim → execute → complete)
- ✅ Failure classification (auth, rate limit, generation refusal)
- ✅ Attestation generation (workflow-aware)
- ✅ Retry logic (transient vs permanent failures)
Build Verification
- ✅ All packages compile (no TypeScript errors)
- ✅ No critical lint errors
- ✅ Production builds succeed
Priority 1: Warning (Should Pass)
Webhook Delivery (apps/webhook-service)
- ⚠️ Event delivery to webhook URLs
- ⚠️ Retry logic on transient failures
- ⚠️ Telemetry event formatting
Telemetry Pipeline (apps/telemetry-collector)
- ⚠️ Redis stream → OTLP conversion
- ⚠️ Dash0 integration (mocked in CI)
- ⚠️ Event validation and filtering
Priority 2: Nice to Have (Future)
Environment Management
- 📋 Environment composition
- 📋 Service discovery
- 📋 Docker Compose generation
Machine Management
- 📋 PM2 service orchestration
- 📋 ComfyUI custom node installation
- 📋 Health check reporting
EmProps API
- 📋 Art generation nodes
- 📋 Job evaluation logic
- 📋 Pseudorandom utilities
Coverage Gap Analysis
Current Gaps (55 test files, but missing):
- ❌ Environment management system (newly documented)
- ❌ Docker build process
- ❌ PM2 service management
- ❌ ComfyUI custom node installation
- ❌ Machine registration flow
- ❌ WebSocket event flow (monitor)
Prioritization:
- P0 (Block release): Critical production paths only
- P1 (Warning): Important but non-blocking features
- P2 (Nice to have): Developer experience, tooling
CI/CD Integration
Pre-Release Test Job (Detailed)
Complete GitHub Actions Configuration:
```yaml
name: Pre-Release Testing

on:
  push:
    tags: ["v*"]
  workflow_dispatch:

jobs:
  test:
    name: Pre-Release Test Suite
    runs-on: ubuntu-latest
    timeout-minutes: 40  # Safety timeout
    env:
      REDIS_URL: redis://localhost:6379
      NODE_ENV: test
      CI: true
      TURBO_ENV_MODE: loose  # Allow Turbo to access environment variables
    services:
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Enable pnpm
        run: corepack enable  # pnpm must be on PATH before setup-node's cache: 'pnpm'

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'pnpm'

      - name: Get pnpm store directory
        id: pnpm-cache
        shell: bash
        run: |
          echo "STORE_PATH=$(pnpm store path)" >> $GITHUB_OUTPUT

      - name: Setup pnpm cache
        uses: actions/cache@v3
        with:
          path: ${{ steps.pnpm-cache.outputs.STORE_PATH }}
          key: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}
          restore-keys: |
            ${{ runner.os }}-pnpm-store-

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: TypeScript Compilation Check
        run: pnpm typecheck

      - name: Lint Check
        run: pnpm lint

      - name: Unit Tests
        run: pnpm test
        env:
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

      - name: Build Verification
        run: pnpm build
        env:
          NODE_ENV: production

      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: |
            **/coverage/
            **/.turbo/
          retention-days: 7

      - name: Test Summary
        if: success()
        run: |
          echo "## Test Results" >> $GITHUB_STEP_SUMMARY
          echo "✅ All pre-release tests passed" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "### Test Execution" >> $GITHUB_STEP_SUMMARY
          echo "- TypeScript compilation: ✅" >> $GITHUB_STEP_SUMMARY
          echo "- Lint checks: ✅" >> $GITHUB_STEP_SUMMARY
          echo "- Unit tests: ✅" >> $GITHUB_STEP_SUMMARY
          echo "- Build verification: ✅" >> $GITHUB_STEP_SUMMARY

  release:
    needs: [test]  # Blocks deployment if tests fail
    runs-on: ubuntu-latest
    steps:
      # ... existing release steps (unchanged)
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Build and release worker
        run: |
          # ... existing worker build logic
      - name: Deploy to Railway
        run: |
          # ... existing Railway deployment
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
```
Test Execution Flow
Failure Handling
Test Failure Scenarios:
TypeScript Compilation Fails
- Error: Type errors in code
- Action: Block deployment, show compilation errors
- Fix: Address type errors before retrying
Lint Check Fails
- Error: Code style violations
- Action: Block deployment, show lint errors
- Fix: Run `pnpm lint --fix` or address manually
Unit Tests Fail
- Error: Business logic regression
- Action: Block deployment, upload test results
- Fix: Debug failing tests locally with `pnpm test:ui`
Build Fails
- Error: Production build errors
- Action: Block deployment, show build logs
- Fix: Address build errors (missing dependencies, etc.)
Notification Strategy:
- GitHub Actions summary shows failure details
- Slack notification on failure (optional Phase 2)
- Email to release tag creator (optional Phase 2)
Phased Rollout Plan
Phase 1: Foundation (Week 1)
Goal: Establish CI infrastructure and basic test coverage
Tasks:
- ✅ Create pre-release test job in release.yml
- ✅ Configure Redis test container
- ✅ Add pnpm caching for faster CI
- ✅ Run existing unit tests (55 test files)
- ✅ Run typecheck and lint
- ✅ Run build verification
Deliverables:
- Working CI pipeline with test gate
- < 10 minute execution time (unit tests only)
- Clear failure reporting
Success Criteria:
- CI passes on current staging branch
- No flaky tests (3 consecutive successful runs)
- Documentation for running tests locally
Rollout:
- Deploy to staging first (v1.2.3-staging tag)
- Monitor for false positives
- Enable for production after 3 successful staging releases
Phase 2: Integration Coverage (Week 2)
Goal: Add integration tests for critical components
Tasks:
- ✅ Run Redis function integration tests
- ✅ Run API integration tests (job submission)
- ✅ Run worker integration tests
- ✅ Configure test environment secrets
- ⚠️ Add coverage reporting (optional)
Deliverables:
- Integration tests running in CI
- < 20 minute total execution time
- Coverage reports uploaded
Success Criteria:
- Integration tests pass reliably
- No Redis connection issues in CI
- Clear error messages on failures
Challenges:
- Shared Redis instance (sequential tests)
- External API mocking (nock configuration)
- Test data cleanup between runs (see the sketch below)
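One way to handle the cleanup challenge is to namespace test keys and flush them between tests. A minimal sketch of a shared setup file (the `test:` prefix and helper are assumptions, not existing project conventions):

```typescript
// Hypothetical shared setup for integration tests: namespace keys and clean up between tests.
import { beforeEach, afterAll } from 'vitest';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
export const TEST_PREFIX = 'test:'; // assumed convention for CI-only keys

beforeEach(async () => {
  // Delete only namespaced keys so a shared Redis instance stays usable across suites.
  const keys = await redis.keys(`${TEST_PREFIX}*`);
  if (keys.length > 0) {
    await redis.del(...keys);
  }
});

afterAll(async () => {
  await redis.quit();
});
```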
Phase 3: E2E Coverage (Week 3)
Goal: Validate critical user flows end-to-end
Tasks:
- ✅ Add job submission → worker execution E2E test
- ✅ Add telemetry pipeline E2E test
- ⚠️ Add webhook delivery E2E test
- ⚠️ Configure Dash0 test environment
- ⚠️ Add machine registration flow test
Deliverables:
- E2E tests for critical paths
- < 35 minute total execution time
- Real-time event verification
Success Criteria:
- E2E tests detect real regressions
- No false positives from timing issues
- Clear test output showing flow progression
Challenges:
- Timing-sensitive tests (use explicit waits, not sleeps; see the sketch below)
- External service mocking (Dash0, OpenAI)
- Test isolation (parallel execution conflicts)
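For the timing-sensitive cases, a small polling helper replaces fixed sleeps such as the 5-second wait in the E2E example above. A sketch (the waitFor name, defaults, and the helpers in the usage comment are illustrative):

```typescript
// Hypothetical polling helper: retry a check until it returns a value or a deadline is hit.
export async function waitFor<T>(
  check: () => Promise<T | undefined>,
  { timeoutMs = 15000, intervalMs = 500 } = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const value = await check();
    if (value !== undefined) return value;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`waitFor timed out after ${timeoutMs}ms`);
}

// Usage in an E2E test: poll Dash0 for the event instead of sleeping a fixed 5 seconds.
// const event = await waitFor(async () => findEventByWorkflowId(await fetchSpans(), workflowId));
```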
Phase 4: Optimization (Week 4)
Goal: Improve CI speed and developer experience
Tasks:
- ✅ Parallelize independent test suites
- ✅ Add test result caching
- ✅ Document test writing guidelines
- ✅ Add visual test UI documentation
- ✅ Create troubleshooting guide
Deliverables:
- Optimized CI execution (target < 30 minutes)
- Comprehensive testing documentation
- Developer onboarding guide
Success Criteria:
- < 30 minute CI execution time
- Developers can add tests without assistance
- Zero flaky tests (1 week observation)
Success Metrics
Before Implementation
Current State:
- ❌ 0% automated test coverage in CI
- ❌ Manual testing before releases (inconsistent)
- ❌ Unknown production readiness
- ⏰ Hours spent debugging production issues
- 😰 Fear of shipping (slows innovation)
After Implementation
Phase 1 (Foundation):
- ✅ 100% of releases gated by unit tests
- ✅ < 10 minute CI feedback time
- ✅ Typecheck + lint + build verification automated
- 📊 Baseline metrics established
Phase 2 (Integration):
- ✅ Redis function coverage (atomic job matching)
- ✅ API endpoint coverage (job submission)
- ✅ Worker integration coverage (job processing)
- 📊 < 20 minute CI execution time
Phase 3 (E2E):
- ✅ Critical user flow coverage
- ✅ Telemetry pipeline validation
- ✅ Webhook delivery verification
- 📊 < 35 minute CI execution time
Phase 4 (Optimization):
- ✅ < 30 minute CI execution time
- ✅ Zero flaky tests (1 week observation)
- ✅ Developer documentation complete
- 📊 Production incidents -50% vs baseline
Ongoing Metrics
Test Reliability:
- Test pass rate: Target > 95% (excluding legitimate failures)
- Flaky test rate: Target < 2% (tests that fail randomly)
- CI execution time: Target < 30 minutes
Production Impact:
- Incidents caused by releases: Target -50% reduction
- Mean time to detect (MTTD): Target < 5 minutes (CI feedback)
- Mean time to resolve (MTTR): Target -30% reduction
Developer Experience:
- Time to add new test: Target < 30 minutes
- Test documentation completeness: Target 100%
- Developer satisfaction: Survey after 1 month
Dashboard Metrics (future):
- Test execution trends (speed over time)
- Coverage trends (% of code covered)
- Failure trends (which tests fail most)
Consequences
Positive
1. Production Safety
- ✅ No untested code reaches production
- ✅ Objective pass/fail criteria for releases
- ✅ Early detection of regressions
- ✅ Reduced customer-facing incidents
2. Developer Confidence
- ✅ Safe refactoring with test coverage
- ✅ Fast feedback on changes (local + CI)
- ✅ Clear error messages guide debugging
- ✅ Less fear of breaking production
3. Team Velocity
- ✅ Automated testing faster than manual
- ✅ Parallel development without conflicts
- ✅ Onboarding documentation (tests as examples)
- ✅ Less time firefighting production issues
4. Engineering Culture
- ✅ Quality-first mindset
- ✅ Documentation as code (tests document behavior)
- ✅ Continuous improvement (add tests for bugs)
- ✅ Reduced technical debt
5. Business Impact
- ✅ Customer trust from reliable service
- ✅ Revenue protection (fewer outages)
- ✅ Competitive advantage (ship faster safely)
- ✅ Engineering reputation
Negative
1. Initial Investment
- ❌ 1-2 weeks to implement (40-80 hours)
- ❌ Learning curve for test writing
- ❌ CI setup complexity
- Mitigation: Phased rollout, pair programming, documentation
2. Ongoing Maintenance
- ❌ Tests need updating with code changes
- ❌ Flaky tests can block releases
- ❌ CI costs (GitHub Actions minutes)
- Mitigation: Fix flaky tests immediately, optimize CI, monitor costs
3. False Negatives
- ❌ Flaky tests block valid releases
- ❌ Environment differences (CI vs production)
- ❌ Timing-sensitive tests fail randomly
- Mitigation: Retry logic, explicit waits, test isolation
4. Developer Workflow
- ❌ Slower releases (35 min CI vs immediate)
- ❌ Test failures require debugging
- ❌ Pre-commit hooks slow local workflow
- Mitigation: Fast local tests, clear errors, optional pre-commit
5. Coverage Gaps
- ❌ Cannot test every scenario
- ❌ Integration tests miss production differences
- ❌ E2E tests miss edge cases
- Mitigation: Focus on critical paths, production monitoring, iterative improvement
Risk Mitigation Strategies
Flaky Tests:
- Problem: Random failures block releases
- Solution:
- Explicit waits instead of sleep()
- Test isolation (no shared state)
- Retry logic for network-dependent tests (see the sketch below)
- Quarantine flaky tests until fixed
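For the network-dependent case, a small retry wrapper keeps the policy explicit and out of the test runner. A sketch (the helper name and back-off values are assumptions):

```typescript
// Hypothetical retry wrapper for network-dependent test steps (e.g. polling an external API).
export async function withRetry<T>(fn: () => Promise<T>, attempts = 2, backoffMs = 1000): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Linear back-off between attempts; enough for transient network blips.
        await new Promise(resolve => setTimeout(resolve, backoffMs * attempt));
      }
    }
  }
  throw lastError;
}
```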
CI Costs:
- Problem: GitHub Actions minutes usage
- Solution:
- Cache dependencies aggressively
- Parallelize independent tests
- Monitor usage with cost alerts
- Self-hosted runners if cost prohibitive
Developer Frustration:
- Problem: Tests seen as blocker
- Solution:
- Fast local test execution
- Clear error messages
- Visual test UI (vitest --ui)
- Pair programming for test writing
Alternatives Considered
Alternative 1: No Automated Testing ❌
Approach: Continue manual testing before releases
Pros:
- No initial investment
- No CI setup complexity
- No flaky test issues
Cons:
- ❌ High risk: 81 commits untested
- ❌ Slow: Manual testing takes hours
- ❌ Incomplete: Cannot test all scenarios
- ❌ Unreliable: Human error, inconsistent coverage
- ❌ Not scalable: Slows as codebase grows
Verdict: REJECTED - Too risky with significant changes pending
Alternative 2: Post-Deployment Testing Only ⚠️
Approach: Deploy to staging, run tests, promote to production
Pros:
- Tests run in production-like environment
- Catches environment-specific issues
- No CI setup required
Cons:
- ⚠️ Customers affected: Staging breakage delays production
- ⚠️ Slower: Deploy → test → rollback → fix cycle
- ⚠️ Manual: Requires human intervention
- ⚠️ Partial: Doesn't catch pre-deployment issues
Verdict: REJECTED - Useful as smoke tests but insufficient alone
Alternative 3: Staging Environment Testing Only ⚠️
Approach: Require staging deployment before production
Pros:
- Real environment testing
- Catches configuration issues
- Minimal CI setup
Cons:
- ⚠️ Drift risk: Staging ≠ production (different configs)
- ⚠️ Manual: Human validation required
- ⚠️ Slow: Deploy → test → deploy cycle
- ⚠️ Incomplete: Doesn't test all code paths
Verdict: PARTIAL - Good practice but not sufficient, use alongside automated tests
Alternative 4: Feature Flags + Gradual Rollout ✅
Approach: Use feature flags to control new features, gradual rollout
Pros:
- ✅ Control blast radius (limit affected users)
- ✅ A/B testing capability
- ✅ Quick rollback (disable flag)
- ✅ Production testing with real traffic
Cons:
- ⚠️ Complexity: Flag management overhead
- ⚠️ Technical debt: Old flags linger
- ⚠️ Not comprehensive: Doesn't replace testing
Verdict: COMPLEMENTARY - Use alongside testing, not instead of
Alternative 5: Contract Testing (Consumer-Driven) 🤔
Approach: Define contracts between services, test independently
Pros:
- Decoupled service testing
- Parallel development
- Clear interface definitions
Cons:
- ⚠️ Complexity: Contract management overhead
- ⚠️ Tooling: Pact.js setup required
- ⚠️ Incomplete: Doesn't test full integration
Verdict: FUTURE - Good for microservices, overkill for current maturity
Alternative 6: Mutation Testing 🤔
Approach: Modify code (mutants) to verify tests catch bugs
Pros:
- Verifies test quality
- Catches weak assertions
- Improves coverage
Cons:
- ⚠️ Slow: 10x longer execution time
- ⚠️ Overkill: Current maturity doesn't justify
- ⚠️ Diminishing returns: High cost for marginal benefit
Verdict: FUTURE - Consider after baseline coverage established
Decision Matrix
| Alternative | Safety | Speed | Cost | Complexity | Verdict |
|---|---|---|---|---|---|
| Test Pyramid (Chosen) | ✅ High | ✅ Fast | ⚠️ Medium | ⚠️ Medium | ✅ RECOMMENDED |
| No Testing | ❌ Low | ✅ Fast | ✅ Low | ✅ Low | ❌ Rejected |
| Post-Deployment | ⚠️ Medium | ❌ Slow | ✅ Low | ✅ Low | ❌ Rejected |
| Staging Only | ⚠️ Medium | ❌ Slow | ✅ Low | ✅ Low | ⚠️ Partial |
| Feature Flags | ✅ High | ✅ Fast | ⚠️ Medium | ❌ High | ✅ Complementary |
| Contract Testing | ✅ High | ✅ Fast | ⚠️ Medium | ❌ High | 🤔 Future |
| Mutation Testing | ✅ High | ❌ Slow | ❌ High | ❌ High | 🤔 Future |
Open Questions
Q1: Test Coverage Thresholds
Question: Should we enforce minimum code coverage percentages (e.g., 80%)?
Considerations:
- Pros: Objective quality metric, forces coverage
- Cons: Encourages low-quality tests, diminishing returns
- Alternative: Focus on critical path coverage, not percentage
Recommendation: No coverage thresholds initially. Focus on quality over quantity. Revisit after baseline coverage established.
Q2: PR Testing vs Release Testing
Question: Should tests run on every PR to master, or only on releases?
Options:
Option A: PR Testing Only
- Tests run on pull_request to master
- Feedback before merge
- No release-time testing
Option B: Release Testing Only
- Tests run on git tags
- Fast PR merges
- Risk of broken master
Option C: Both PR + Release (Recommended)
- Tests on PR (fast subset)
- Full tests on release (comprehensive)
- Best safety, some duplication
Recommendation: Option C - Run fast tests on PR (unit + typecheck), full suite on releases.
Q3: External API Handling
Question: How do we test integrations with external APIs (OpenAI, Anthropic, Dash0)?
Options:
Option A: Mock All External APIs
- Use nock to intercept HTTP calls
- Fast, deterministic tests
- Risk: Mocks drift from reality
Option B: Test Against Real APIs
- Use test accounts/keys
- Real integration validation
- Risk: Slow, flaky, costs money
Option C: Hybrid (Recommended)
- Mock in CI (fast, deterministic)
- Real API testing in staging
- Best of both worlds
Recommendation: Option C - Mock external APIs in CI with nock, run real API tests in staging environment.
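A minimal sketch of the CI-side mocking with nock (the intercepted path and reply body are illustrative, not the real OpenAI contract; node-fetch is used because it routes through Node's http module, which nock intercepts):

```typescript
// Hypothetical nock-based mock for an OpenAI call made by a connector under test.
import { describe, it, expect, afterEach } from 'vitest';
import nock from 'nock';
import fetch from 'node-fetch';

afterEach(() => nock.cleanAll());

describe('OpenAI connector (mocked in CI)', () => {
  it('handles a successful response', async () => {
    nock('https://api.openai.com')
      .post('/v1/responses') // illustrative endpoint path
      .reply(200, { id: 'resp_123', status: 'completed' });

    const res = await fetch('https://api.openai.com/v1/responses', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ input: 'test' })
    });

    expect(res.status).toBe(200);
  });
});
```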
Q4: Test Retry Logic
Question: Should we automatically retry flaky tests in CI?
Considerations:
- Pros: Reduces false negatives from timing issues
- Cons: Masks real problems, slower CI
- Alternative: Fix flaky tests immediately
Recommendation: Limited retries - Retry network-dependent tests (max 2 attempts), no retries for unit tests. Track retry rate as metric.
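If adopted, the limited-retry policy can be expressed per test in Vitest rather than globally. A sketch (the suite, stub helper, and URL fallback are illustrative):

```typescript
// Hypothetical example of limited retries on a network-dependent test (max 2 extra attempts).
import { describe, it, expect } from 'vitest';

// Assumed helper: a real test would hit the webhook endpoint; stubbed here to stay self-contained.
async function checkDeliveryAcknowledgement(): Promise<boolean> {
  const res = await fetch(`${process.env.WEBHOOK_URL ?? 'http://localhost:3332'}/health`);
  return res.ok;
}

describe('webhook delivery (network-dependent)', () => {
  // Vitest re-runs the test up to `retry` extra times before marking it failed.
  it('eventually acknowledges delivery', { retry: 2 }, async () => {
    expect(await checkDeliveryAcknowledgement()).toBe(true);
  });
});
```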
Q5: Parallel Test Execution
Question: Should we invest in parallel test execution now or later?
Current State:
- Sequential execution: ~35 minutes
- Parallel potential: ~20 minutes (estimate)
- Cost: 1-2 days implementation
Recommendation: Later - Optimize during Phase 4 if CI time exceeds 35 minutes. Focus on coverage first, speed second.
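When Phase 4 arrives, much of the parallel potential is available through Vitest's worker pool settings. A sketch of the relevant knobs (values are illustrative, not a tuned configuration):

```typescript
// vitest.config.ts sketch: run test files in parallel forks (illustrative values).
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    pool: 'forks',
    poolOptions: {
      forks: {
        maxForks: 4, // cap parallelism so the shared Redis container is not overwhelmed
        minForks: 1,
      },
    },
  },
});
```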
Q6: Self-Hosted Runners
Question: Should we use GitHub self-hosted runners instead of GitHub-hosted?
Considerations:
GitHub-Hosted (Current):
- ✅ Zero maintenance
- ✅ Always available
- ❌ Slower startup
- ❌ Monthly cost
Self-Hosted:
- ✅ Faster execution
- ✅ Lower cost at scale
- ❌ Maintenance overhead
- ❌ Security concerns
Recommendation: GitHub-hosted initially - Monitor costs, switch to self-hosted if costs exceed $100/month or CI time exceeds 40 minutes.
Related Documentation
ADRs:
- ADR-001: Encrypted Environment Variables - Environment management for tests
- Docker Swarm Migration Analysis - Testing complexity analysis
Guides:
- Testing Procedures - Standard testing commands and procedures
- Environment Management Guide - Component-based configuration
- CLAUDE.md - Development workflow and QA agent guidelines
Infrastructure:
- .github/workflows/release.yml - Current release workflow
- turbo.json - Turbo build configuration
- package.json - Test scripts and dependencies
Implementation Checklist
Phase 1: Foundation (Week 1)
- [ ] Create test job in .github/workflows/release.yml
- [ ] Configure Redis service container
- [ ] Add pnpm caching
- [ ] Run existing unit tests (turbo run test)
- [ ] Add typecheck step
- [ ] Add lint step
- [ ] Add build verification step
- [ ] Test on staging branch (v1.2.3-staging)
- [ ] Document CI setup in README
- [ ] Update ADR index with this ADR
Phase 2: Integration (Week 2)
- [ ] Add Redis function integration tests to CI
- [ ] Add API integration tests to CI
- [ ] Configure test environment secrets
- [ ] Add coverage reporting (optional)
- [ ] Document integration test patterns
- [ ] Fix any flaky integration tests
Phase 3: E2E (Week 3)
- [ ] Add job submission E2E test
- [ ] Add telemetry pipeline E2E test
- [ ] Add webhook delivery E2E test
- [ ] Configure Dash0 test environment
- [ ] Document E2E test patterns
- [ ] Fix timing-sensitive test issues
Phase 4: Optimization (Week 4)
- [ ] Parallelize independent test suites
- [ ] Add test result caching
- [ ] Document test writing guidelines
- [ ] Create troubleshooting guide
- [ ] Monitor and optimize CI time
- [ ] Collect baseline metrics
Post-Implementation
- [ ] Review success metrics after 1 month
- [ ] Survey developer satisfaction
- [ ] Identify coverage gaps
- [ ] Plan next iteration improvements
Approval and Next Steps
Approval Required From:
- [ ] Engineering Team Lead
- [ ] DevOps/Infrastructure Lead
- [ ] QA Lead (if applicable)
Next Steps After Approval:
- Create GitHub issue tracking implementation
- Assign Phase 1 tasks to engineer(s)
- Schedule kickoff meeting
- Begin Phase 1 implementation
- Review and iterate based on feedback
Questions or Feedback: Contact Architecture Team or post in #architecture Slack channel.
Document Version: 1.0
Last Updated: 2025-10-08
Author: Claude Code (AI Agent)
Reviewers: (to be added after team review)
