ADR-012: Health Check Monitoring System
Date: 2025-10-29
Status: 🤔 Proposed
Decision Makers: Engineering Team
Approval Required: Before production implementation
Related ADRs: ADR-003 (Service Heartbeat Telemetry)
Executive Summary
This ADR proposes a dedicated health check monitoring service that actively probes critical infrastructure components and reports availability metrics to Dash0 via OTLP. This provides active monitoring complementary to passive heartbeat telemetry.
Current Gap:
- ❌ No active service monitoring - Only passive heartbeats (services self-report health)
- ❌ No external availability validation - Can't detect hung services that still send heartbeats
- ❌ No CDN/external service monitoring - CDN, databases, Redis only discovered when used
- ❌ Unclear alerting thresholds - "Missing heartbeat" is ambiguous vs. "3 failed health checks"
Proposed Solution:
- 🏥 Dedicated health-check service running on telemetry-collector machine
- 🎯 Active probing of all critical services every 30-60 seconds
- 📊 OTLP metrics sent to collector → Dash0 for alerting
- 🚨 Clear alert thresholds (e.g., "3 consecutive failures = down")
Impact:
- Before: Service appears healthy (heartbeat) but isn't serving traffic (hung)
- After: Health checks catch hung services, external service issues, network problems
Table of Contents
- Context
- Problem Statement
- Decision
- Technical Design
- CDN Health Check Strategy
- Implementation Pattern
- Alerting Strategy
- Rollout Strategy
- Consequences
- Alternatives Considered
Context
Current Monitoring: Passive Heartbeats
What we have (ADR-003):
// Services send periodic heartbeats
setInterval(() => {
telemetryClient.recordHeartbeat();
}, 15000);
Limitations:
- Self-reported - Service must be healthy enough to send heartbeat
- Can't detect hung services - Process alive but not serving requests
- No external service monitoring - CDN, external APIs, databases not checked
- Telemetry-dependent - If OTLP pipeline breaks, we lose all monitoring
The Missing Piece: Active Health Checks
What we need:
┌──────────────────────────────────────────────────────┐
│ Telemetry Collector Machine │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Health Check Service │ │
│ │ - Polls /health endpoints │ ◄─ ACTIVE │
│ │ - Tests CDN asset delivery │ PROBING │
│ │ - Checks Redis connectivity │ │
│ │ - Reports to Dash0 via OTLP │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ Services (API, Webhook, EmProps) │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Service Heartbeats │ │
│ │ - Self-reported health │ ◄─ PASSIVE │
│ │ - Telemetry continuity │ REPORTING │
│ │ - Resource metrics │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
Complementary approach:
- Heartbeats: Prove telemetry pipeline works + service is alive
- Health checks: Prove service is reachable + responding correctly
Problem Statement
Real-World Failure Scenarios
Scenario 1: Hung Service
❌ Problem: API process alive, sending heartbeats, but HTTP server frozen
✅ Heartbeat: "Service alive" (misleading)
✅ Health Check: "Service unreachable" (accurate)
Scenario 2: Network Partition
❌ Problem: Service running but not reachable from external network
✅ Heartbeat: "Service alive" (from internal perspective)
✅ Health Check: "Service unreachable" (from external perspective)
Scenario 3: CDN Degradation
❌ Problem: CDN serving stale content or experiencing high latency
✅ Heartbeat: N/A (CDN doesn't send heartbeats)
✅ Health Check: "CDN slow" or "CDN serving wrong content"
Scenario 4: OTLP Pipeline Break
❌ Problem: Collector down, no telemetry reaching Dash0
✅ Heartbeat: Lost (we're blind)
✅ Health Check: Still working (independent monitoring path)
Why Both Are Needed
| Aspect | Heartbeats | Health Checks |
|---|---|---|
| Perspective | Internal (self-report) | External (independent) |
| Detects hung services | ❌ No | ✅ Yes |
| Tests request/response | ❌ No | ✅ Yes |
| Monitors external services | ❌ No | ✅ Yes |
| Independent of telemetry | ❌ No | ✅ Yes |
| Alert clarity | "No data in X minutes" | "3 consecutive failures" |
Decision
We will implement a dedicated health-check service that:
- Runs on telemetry-collector machine (or separate monitor)
- Actively probes all critical services every 30-60 seconds
- Sends results as OTLP metrics to Dash0 via collector
- Enables clear alerting (e.g., "3 consecutive failures")
- Starts with CDN monitoring, then expands to all services
Scope (Initial Implementation):
- ✅ CDN health checks (asset-based)
- ✅ API /health endpoint
- ✅ Webhook service /health endpoint
- ✅ EmProps API /health endpoint
- ✅ Redis connectivity check
- 🔮 Future: Database, external APIs, worker connectivity
Technical Design
Health Check Service Architecture
// Location: apps/health-check-service/src/index.ts
interface HealthCheckTarget {
name: string;
type: 'http' | 'cdn-asset' | 'redis' | 'postgres';
url?: string;
expectedStatus?: number;
expectedContent?: string; // For CDN content verification
timeout: number;
interval: number; // Check every X seconds
}
const targets: HealthCheckTarget[] = [
{
name: 'cdn-test-asset',
type: 'cdn-asset',
url: 'https://cdn.emprops.ai/.well-known/health-check.json',
expectedStatus: 200,
expectedContent: '{"status":"ok","version":"1.0.0"}', // must byte-for-byte match the deployed asset (see Test Asset Requirements)
timeout: 5000,
interval: 60, // Check every minute
},
{
name: 'api-health',
type: 'http',
url: 'http://api-service:3331/health',
expectedStatus: 200,
timeout: 3000,
interval: 30,
},
{
name: 'webhook-health',
type: 'http',
url: 'http://webhook-service:3332/health',
expectedStatus: 200,
timeout: 3000,
interval: 30,
},
{
name: 'emprops-api-health',
type: 'http',
url: 'http://emprops-api:3335/health',
expectedStatus: 200,
timeout: 3000,
interval: 30,
},
];
// Shape of a single check result, as used by the checks below
interface HealthCheckResult {
  name: string;
  success: boolean;
  latency: number;
  statusCode?: number;
  contentValid?: boolean;
  error?: string;
}
class HealthCheckService {
  private telemetryClient: EmpTelemetryClient;
  private checkHistory: Map<string, boolean[]> = new Map(); // Last N results per target
  constructor(telemetryClient: EmpTelemetryClient) {
    this.telemetryClient = telemetryClient;
  }
async checkTarget(target: HealthCheckTarget): Promise<HealthCheckResult> {
const startTime = Date.now();
try {
if (target.type === 'cdn-asset') {
return await this.checkCdnAsset(target, startTime);
} else if (target.type === 'http') {
return await this.checkHttp(target, startTime);
}
// ... other types (redis, postgres)
throw new Error(`Unsupported check type: ${target.type}`); // caught below and reported as a failed check
} catch (error) {
return {
name: target.name,
success: false,
latency: Date.now() - startTime,
error: error instanceof Error ? error.message : String(error),
};
}
}
async checkCdnAsset(target: HealthCheckTarget, startTime: number) {
const response = await fetch(target.url!, {
signal: AbortSignal.timeout(target.timeout),
});
const latency = Date.now() - startTime;
const content = await response.text();
const contentMatches = content === target.expectedContent;
return {
name: target.name,
success: response.status === target.expectedStatus && contentMatches,
latency,
statusCode: response.status,
contentValid: contentMatches,
};
}
reportToOtel(result: HealthCheckResult) {
// Send as OTLP metric
this.telemetryClient.recordMetric('health_check.status', {
value: result.success ? 1 : 0,
attributes: {
'check.name': result.name,
'check.type': 'active_probe',
},
});
this.telemetryClient.recordMetric('health_check.latency', {
value: result.latency,
attributes: {
'check.name': result.name,
},
});
}
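  // Scheduling sketch (an assumption, not spelled out in this ADR): run each target on its
  // own interval, keep a rolling window of results in checkHistory for local logging, and
  // report every result to OTLP. Dash0 remains the source of truth for alert thresholds.
  start() {
    for (const target of targets) {
      setInterval(async () => {
        const result = await this.checkTarget(target);
        const history = this.checkHistory.get(target.name) ?? [];
        history.push(result.success);
        if (history.length > 10) history.shift(); // keep only recent results (window size is arbitrary here)
        this.checkHistory.set(target.name, history);
        this.reportToOtel(result);
      }, target.interval * 1000);
    }
  }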
}
CDN Health Check Strategy
Why Asset-Based Checks (Not Pings)
❌ Ping Approach:
# This doesn't work for CDNs
ping cdn.emprops.ai
# Problems:
# - CDNs often don't respond to ICMP ping
# - Doesn't test actual content delivery
# - Doesn't verify CDN is serving correct content
✅ Asset-Based Approach:
// 1. Store a small, immutable test asset on CDN
// Location: https://cdn.emprops.ai/.well-known/health-check.json
{
"status": "ok",
"version": "1.0.0",
"timestamp": "2025-10-29T00:00:00Z"
}
// 2. Fetch it regularly
const response = await fetch('https://cdn.emprops.ai/.well-known/health-check.json');
// 3. Verify:
// - Status code: 200
// - Content matches expected
// - Response time < 2000ms
// - Headers correct (cache-control, etc.)
Test Asset Requirements
File: /.well-known/health-check.json
Size: ~100 bytes (small, fast)
Location: Root of CDN
Content: Predictable JSON for validation
Immutability: Never changes (can cache indefinitely)
Why .well-known/?
- Standard RFC 8615 location for site metadata
- Easy to remember and document
- Won't conflict with user content
CDN Check Metrics
interface CdnHealthMetrics {
'cdn.availability': 0 | 1; // Asset reachable
'cdn.latency.ms': number; // Response time
'cdn.content_valid': 0 | 1; // Content matches expected
'cdn.status_code': number; // HTTP status
'cdn.cache_status': 'HIT' | 'MISS'; // CDN cache status
}
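The cdn.cache_status metric assumes the CDN exposes its cache result in a response header. The header name varies by provider (for example, x-cache on CloudFront or cf-cache-status on Cloudflare), so the names below are assumptions rather than confirmed configuration. A minimal sketch of deriving the status from the fetch response:
// Sketch: derive cache status from a provider-specific header (header names are placeholders).
function extractCacheStatus(response: Response): 'HIT' | 'MISS' | 'UNKNOWN' {
  const header =
    response.headers.get('x-cache') ?? response.headers.get('cf-cache-status');
  if (!header) return 'UNKNOWN';
  return header.toUpperCase().includes('HIT') ? 'HIT' : 'MISS';
}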
Alert Conditions
Critical: CDN Down
cdn.availability = 0 for 3 consecutive checks (3 minutes)
→ Alert: "CDN unreachable - user impact likely"
Warning: CDN Slow
cdn.latency.ms > 2000 for 5 consecutive checks (5 minutes)
→ Alert: "CDN degraded performance"
Warning: Content Mismatch
cdn.content_valid = 0 for 1 check
→ Alert: "CDN serving incorrect content - cache corruption?"
Implementation Pattern
Phase 1: CDN Monitoring (Week 1)
Steps:
- Create apps/health-check-service/ with TypeScript + Express
- Add CDN_URL to environment configuration
- Upload .well-known/health-check.json to CDN
- Implement CDN asset check with content verification
- Send metrics to OTLP collector
- Create Dash0 dashboard + alerts
Deliverables:
- ✅ Health check service running
- ✅ CDN monitored every 60 seconds
- ✅ Dash0 alert: "CDN down for 3 minutes"
Phase 2: Service Health Endpoints (Week 2)
Add /health endpoints to:
- API service
- Webhook service
- EmProps API service
Endpoint spec:
// GET /health
{
"status": "ok" | "degraded" | "down",
"timestamp": "2025-10-29T18:30:00Z",
"checks": {
"redis": "ok",
"database": "ok",
"telemetry": "ok"
},
"version": "1.0.0",
"uptime": 3600 // seconds
}
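A minimal sketch of a handler returning this shape, assuming Express; checkRedis, checkDatabase, and APP_VERSION are hypothetical placeholders for whatever probes and build metadata each service actually has:
// Sketch of a /health handler (assumes Express; probe helpers are placeholders).
import express from 'express';

// Placeholder dependency probes; real implementations would ping Redis / run a trivial DB query.
async function checkRedis(): Promise<boolean> { return true; }
async function checkDatabase(): Promise<boolean> { return true; }

const app = express();
const startedAt = Date.now();

app.get('/health', async (_req, res) => {
  const checks = {
    redis: (await checkRedis()) ? 'ok' : 'down',
    database: (await checkDatabase()) ? 'ok' : 'down',
  };
  const allOk = Object.values(checks).every((c) => c === 'ok');
  // Simplified to ok/degraded; a fuller version could distinguish "down".
  res.status(allOk ? 200 : 503).json({
    status: allOk ? 'ok' : 'degraded',
    timestamp: new Date().toISOString(),
    checks,
    version: process.env.APP_VERSION ?? 'unknown', // APP_VERSION is a hypothetical env var
    uptime: Math.floor((Date.now() - startedAt) / 1000), // seconds
  });
});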
Phase 3: Extended Checks (Week 3-4)
Add checks for (Redis and database probe sketches follow this list):
- Redis connectivity (can read/write)
- Database connectivity (can query)
- Worker connectivity (workers registered)
- External APIs (OpenAI, etc.)
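As a rough sketch of what the Redis and database probes inside the health-check service could look like, assuming ioredis and pg as the client libraries (connection strings would come from environment configuration):
// Sketch: dependency probes for the health-check service (assumes ioredis and pg).
import Redis from 'ioredis';
import { Pool } from 'pg';

async function checkRedisConnectivity(url: string): Promise<boolean> {
  const redis = new Redis(url, { connectTimeout: 3000, lazyConnect: true });
  try {
    await redis.connect();
    return (await redis.ping()) === 'PONG'; // round-trip proves the connection works
  } catch {
    return false;
  } finally {
    redis.disconnect();
  }
}

async function checkDatabaseConnectivity(connectionString: string): Promise<boolean> {
  const pool = new Pool({ connectionString, connectionTimeoutMillis: 3000 });
  try {
    await pool.query('SELECT 1'); // trivial query proves connectivity
    return true;
  } catch {
    return false;
  } finally {
    await pool.end();
  }
}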
Alerting Strategy
Alert Definitions
Service Down (Critical)
health_check.status{check.name="api-health"} = 0
for 3 consecutive checks (90 seconds)
Action: Page on-call engineer
CDN Down (Critical)
health_check.status{check.name="cdn-test-asset"} = 0
for 3 consecutive checks (3 minutes)
Action: Page on-call engineer
Service Slow (Warning)
health_check.latency{check.name="api-health"} > 1000ms
for 5 consecutive checks (150 seconds)
Action: Notify team Slack channel
Telemetry Pipeline Broken (Warning)
No health_check metrics received for 5 minutes
Action: Notify DevOps channel (health-check service down)
Alert Recovery
Auto-resolve when:
health_check.status = 1 for 2 consecutive checks
→ "Service recovered"
Rollout Strategy
Week 1: CDN Monitoring
Day 1-2:
- [ ] Create health-check-service boilerplate
- [ ] Add CDN_URL to telemetry-collector environment
- [ ] Implement CDN asset check
Day 3-4:
- [ ] Upload test asset to CDN
- [ ] Send OTLP metrics to collector
- [ ] Verify metrics in Dash0
Day 5:
- [ ] Create Dash0 dashboard for CDN health
- [ ] Configure alerts
Week 2: Service Endpoints
Day 1-3:
- [ ] Add /health to API, webhook, emprops-api
- [ ] Health check service polls endpoints
- [ ] Configure service alerts
Day 4-5:
- [ ] Test failure scenarios
- [ ] Document runbooks for alerts
Week 3-4: Extended Checks
- [ ] Redis connectivity checks
- [ ] Database connectivity checks
- [ ] Worker registry checks
Consequences
Positive
- ✅ Catch hung services - Active probing detects services that are "alive" but not responding
- ✅ External perspective - Independent view of service availability
- ✅ Clear alerts - "3 consecutive failures" is unambiguous
- ✅ Monitor external dependencies - CDN, external APIs, databases
- ✅ Decoupled from telemetry - Health checks work even if OTLP breaks
- ✅ Better user experience - Catch issues before users report them
Negative
- ⚠️ Additional service - One more component to maintain
- ⚠️ Network overhead - Constant probing generates traffic
- ⚠️ False positives possible - Network blips could trigger alerts
- ⚠️ Configuration complexity - Need to maintain check targets
Mitigation Strategies
False Positives:
- Require 3 consecutive failures before alerting
- Increase timeout for slower services
- Add retry logic with backoff (see the sketch below)
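One possible shape for that retry logic, wrapping a single probe in a small exponential backoff before the result counts as a failure (attempt counts and delays are illustrative):
// Sketch: retry a probe with exponential backoff before counting it as a failure.
async function probeWithRetry(
  probe: () => Promise<boolean>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await probe()) return true;
    // Wait 500ms, 1000ms, 2000ms, ... between attempts (illustrative values).
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
  }
  return false;
}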
Network Overhead:
- Small test assets (< 1KB)
- Reasonable intervals (30-60s, not 1s)
- Only check critical services
Maintenance Burden:
- Auto-discovery of services (future)
- Configuration via environment variables
- Clear runbooks for common issues
Alternatives Considered
Alternative 1: Rely Only on Heartbeats
Rejected because:
- ❌ Can't detect hung services
- ❌ No external perspective
- ❌ Can't monitor CDN/external services
- ❌ Telemetry-dependent
Alternative 2: Use External Monitoring Service (Pingdom, Datadog)
Pros:
- ✅ Proven solution
- ✅ Multiple geographic check points
- ✅ Managed service
Cons:
- ❌ Additional cost
- ❌ Can't check internal services (Redis, workers)
- ❌ Less control over check logic
- ❌ Need to integrate alerts with existing systems
Decision: Start with internal health checks, consider external service for public endpoints later.
Alternative 3: Health Checks in Each Service
Each service checks its dependencies:
// In API service:
setInterval(async () => {
const redisOk = await checkRedis();
const dbOk = await checkDatabase();
telemetry.record({ redisOk, dbOk });
}, 30000);
Rejected because:
- ❌ Distributed logic (harder to maintain)
- ❌ Can't check if service HTTP server is hung
- ❌ Duplicated code across services
- ❌ No centralized dashboard
Decision: Centralized health-check service is cleaner.
Success Metrics
Week 1 Success:
- [ ] CDN monitored every 60 seconds
- [ ] Dash0 dashboard showing CDN metrics
- [ ] Alert configured and tested
Week 4 Success:
- [ ] All critical services monitored
- [ ] Alerts firing correctly
- [ ] Zero undetected outages
- [ ] Mean time to detection (MTTD) < 2 minutes
Questions & Decisions
Q1: Should health checks run on telemetry-collector or separate service?
Decision: Start on telemetry-collector machine, move to separate if needed.
Reasoning:
- ✅ Simpler deployment initially
- ✅ Telemetry-collector already has network access to all services
- ⚠️ If the collector machine goes down, the health-check service goes down with it (acceptable risk for now)
Q2: What's the right check interval?
Decision:
- CDN: Every 60 seconds (less critical, stable)
- Services: Every 30 seconds (more critical, dynamic)
Reasoning:
- Balance detection speed vs. network overhead
- 30s = 2-minute worst-case detection with 3-failure threshold
- Can tune per-service based on SLA requirements
Q3: Should we verify CDN content hash or exact string match?
Decision: Exact string match for now.
Reasoning:
- ✅ Simpler implementation
- ✅ Faster check (no hash computation)
- ✅ Detects content corruption
- 🔮 Future: Add hash verification if exact match proves too brittle (see the sketch below)
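If exact matching does prove too brittle, hash comparison would be a small change; a sketch using Node's built-in crypto, with the expected hash assumed to live in the target configuration:
// Sketch: compare a SHA-256 hash of the fetched asset instead of the exact string.
import { createHash } from 'node:crypto';

function contentHashMatches(content: string, expectedSha256: string): boolean {
  const actual = createHash('sha256').update(content).digest('hex');
  return actual === expectedSha256;
}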
Next Steps:
- Get team approval on ADR
- Create apps/health-check-service/ boilerplate
- Implement CDN asset check
- Deploy to telemetry-collector
- Configure Dash0 alerts
