ADR-012: Health Check Monitoring System
Date: 2025-10-29
Status: 🤔 Proposed
Decision Makers: Engineering Team
Approval Required: Before production implementation
Related ADRs: ADR-003 (Service Heartbeat Telemetry)
Executive Summary
This ADR proposes a dedicated health check monitoring service that actively probes critical infrastructure components and reports availability metrics to Dash0 via OTLP. This provides active monitoring complementary to passive heartbeat telemetry.
Current Gap:
- ❌ No active service monitoring - Only passive heartbeats (services self-report health)
- ❌ No external availability validation - Can't detect hung services that still send heartbeats
- ❌ No CDN/external service monitoring - CDN, databases, Redis only discovered when used
- ❌ Unclear alerting thresholds - "Missing heartbeat" is ambiguous vs. "3 failed health checks"
Proposed Solution:
- 🏥 Dedicated health-check service running on telemetry-collector machine
- 🎯 Active probing of all critical services every 30-60 seconds
- 📊 OTLP metrics sent to collector → Dash0 for alerting
- 🚨 Clear alert thresholds (e.g., "3 consecutive failures = down")
Impact:
- Before: Service appears healthy (heartbeat) but isn't serving traffic (hung)
- After: Health checks catch hung services, external service issues, network problems
Table of Contents
- Context
- Problem Statement
- Decision
- Technical Design
- CDN Health Check Strategy
- Implementation Pattern
- Alerting Strategy
- Rollout Strategy
- Consequences
- Alternatives Considered
Context
Current Monitoring: Passive Heartbeats
What we have (ADR-003):
// Services send periodic heartbeats
setInterval(() => {
telemetryClient.recordHeartbeat();
}, 15000);
Limitations:
- Self-reported - Service must be healthy enough to send heartbeat
- Can't detect hung services - Process alive but not serving requests
- No external service monitoring - CDN, external APIs, databases not checked
- Telemetry-dependent - If OTLP pipeline breaks, we lose all monitoring
The Missing Piece: Active Health Checks
What we need:
┌──────────────────────────────────────────────────────┐
│ Telemetry Collector Machine │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Health Check Service │ │
│ │ - Polls /health endpoints │ ◄─ ACTIVE │
│ │ - Tests CDN asset delivery │ PROBING │
│ │ - Checks Redis connectivity │ │
│ │ - Reports to Dash0 via OTLP │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ Services (API, Webhook, EmProps) │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Service Heartbeats │ │
│ │ - Self-reported health │ ◄─ PASSIVE │
│ │ - Telemetry continuity │ REPORTING │
│ │ - Resource metrics │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
Complementary approach:
- Heartbeats: Prove telemetry pipeline works + service is alive
- Health checks: Prove service is reachable + responding correctly
Problem Statement
Real-World Failure Scenarios
Scenario 1: Hung Service
❌ Problem: API process alive, sending heartbeats, but HTTP server frozen
✅ Heartbeat: "Service alive" (misleading)
✅ Health Check: "Service unreachable" (accurate)
Scenario 2: Network Partition
❌ Problem: Service running but not reachable from external network
✅ Heartbeat: "Service alive" (from internal perspective)
✅ Health Check: "Service unreachable" (from external perspective)
Scenario 3: CDN Degradation
❌ Problem: CDN serving stale content or experiencing high latency
✅ Heartbeat: N/A (CDN doesn't send heartbeats)
✅ Health Check: "CDN slow" or "CDN serving wrong content"
Scenario 4: OTLP Pipeline Break
❌ Problem: Collector down, no telemetry reaching Dash0
✅ Heartbeat: Lost (we're blind)
✅ Health Check: Still working (independent monitoring path)
Why Both Are Needed
| Aspect | Heartbeats | Health Checks |
|---|---|---|
| Perspective | Internal (self-report) | External (independent) |
| Detects hung services | ❌ No | ✅ Yes |
| Tests request/response | ❌ No | ✅ Yes |
| Monitors external services | ❌ No | ✅ Yes |
| Independent of telemetry | ❌ No | ✅ Yes |
| Alert clarity | "No data in X minutes" | "3 consecutive failures" |
Decision
We will implement a dedicated health-check service that:
- Runs on telemetry-collector machine (or separate monitor)
- Actively probes all critical services every 30-60 seconds
- Sends results as OTLP metrics to Dash0 via collector
- Enables clear alerting (e.g., "3 consecutive failures")
- Starts with CDN monitoring, then expands to all services
Scope (Initial Implementation):
- ✅ CDN health checks (asset-based)
- ✅ API /health endpoint
- ✅ Webhook service /health endpoint
- ✅ EmProps API /health endpoint
- ✅ Redis connectivity check
- 🔮 Future: Database, external APIs, worker connectivity
Technical Design
Health Check Service Architecture
// Location: apps/health-check-service/src/index.ts
interface HealthCheckTarget {
name: string;
type: 'http' | 'cdn-asset' | 'redis' | 'postgres';
url?: string;
expectedStatus?: number;
expectedContent?: string; // For CDN content verification
timeout: number;
interval: number; // Check every X seconds
}
const targets: HealthCheckTarget[] = [
{
name: 'cdn-test-asset',
type: 'cdn-asset',
url: 'https://cdn.emprops.ai/.well-known/health-check.json',
expectedStatus: 200,
expectedContent: '{"status":"ok","version":"1.0.0"}', // must byte-for-byte match the deployed asset (see Test Asset Requirements)
timeout: 5000,
interval: 60, // Check every minute
},
{
name: 'api-health',
type: 'http',
url: 'http://api-service:3331/health',
expectedStatus: 200,
timeout: 3000,
interval: 30,
},
{
name: 'webhook-health',
type: 'http',
url: 'http://webhook-service:3332/health',
expectedStatus: 200,
timeout: 3000,
interval: 30,
},
{
name: 'emprops-api-health',
type: 'http',
url: 'http://emprops-api:3335/health',
expectedStatus: 200,
timeout: 3000,
interval: 30,
},
];
// Shape of a single check result, as used by the checks below
interface HealthCheckResult {
  name: string;
  success: boolean;
  latency: number;
  statusCode?: number;
  contentValid?: boolean;
  error?: string;
}
class HealthCheckService {
  private telemetryClient: EmpTelemetryClient;
  private checkHistory: Map<string, boolean[]> = new Map(); // Last N results per target
  constructor(telemetryClient: EmpTelemetryClient) {
    this.telemetryClient = telemetryClient;
  }
async checkTarget(target: HealthCheckTarget): Promise<HealthCheckResult> {
const startTime = Date.now();
try {
if (target.type === 'cdn-asset') {
return await this.checkCdnAsset(target, startTime);
} else if (target.type === 'http') {
return await this.checkHttp(target, startTime);
}
// ... other types (redis, postgres)
throw new Error(`Unsupported check type: ${target.type}`); // caught below and reported as a failed check
} catch (error) {
return {
name: target.name,
success: false,
latency: Date.now() - startTime,
error: error instanceof Error ? error.message : String(error),
};
}
}
async checkCdnAsset(target: HealthCheckTarget, startTime: number) {
const response = await fetch(target.url!, {
signal: AbortSignal.timeout(target.timeout),
});
const latency = Date.now() - startTime;
const content = await response.text();
const contentMatches = content === target.expectedContent;
return {
name: target.name,
success: response.status === target.expectedStatus && contentMatches,
latency,
statusCode: response.status,
contentValid: contentMatches,
};
}
reportToOtel(result: HealthCheckResult) {
// Send as OTLP metric
this.telemetryClient.recordMetric('health_check.status', {
value: result.success ? 1 : 0,
attributes: {
'check.name': result.name,
'check.type': 'active_probe',
},
});
this.telemetryClient.recordMetric('health_check.latency', {
value: result.latency,
attributes: {
'check.name': result.name,
},
});
}
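  // Scheduling sketch (an assumption, not spelled out in this ADR): run each target on its
  // own interval, keep a rolling window of results in checkHistory for local logging, and
  // report every result to OTLP. Dash0 remains the source of truth for alert thresholds.
  start() {
    for (const target of targets) {
      setInterval(async () => {
        const result = await this.checkTarget(target);
        const history = this.checkHistory.get(target.name) ?? [];
        history.push(result.success);
        if (history.length > 10) history.shift(); // keep only recent results (window size is arbitrary here)
        this.checkHistory.set(target.name, history);
        this.reportToOtel(result);
      }, target.interval * 1000);
    }
  }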
}
CDN Health Check Strategy
Why Asset-Based Checks (Not Pings)
❌ Ping Approach:
# This doesn't work for CDNs
ping cdn.emprops.ai
# Problems:
# - CDNs often don't respond to ICMP ping
# - Doesn't test actual content delivery
# - Doesn't verify CDN is serving correct content
✅ Asset-Based Approach:
// 1. Store a small, immutable test asset on CDN
// Location: https://cdn.emprops.ai/.well-known/health-check.json
{
"status": "ok",
"version": "1.0.0",
"timestamp": "2025-10-29T00:00:00Z"
}
// 2. Fetch it regularly
const response = await fetch('https://cdn.emprops.ai/.well-known/health-check.json');
// 3. Verify:
// - Status code: 200
// - Content matches expected
// - Response time < 2000ms
// - Headers correct (cache-control, etc.)
Test Asset Requirements
File: /.well-known/health-check.json
Size: ~100 bytes (small, fast)
Location: Root of CDN
Content: Predictable JSON for validation
Immutability: Never changes (can cache indefinitely)
Why .well-known/?
- Standard RFC 8615 location for site metadata
- Easy to remember and document
- Won't conflict with user content
CDN Check Metrics
interface CdnHealthMetrics {
'cdn.availability': 0 | 1; // Asset reachable
'cdn.latency.ms': number; // Response time
'cdn.content_valid': 0 | 1; // Content matches expected
'cdn.status_code': number; // HTTP status
'cdn.cache_status': 'HIT' | 'MISS'; // CDN cache status
}
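The cdn.cache_status metric assumes the CDN exposes its cache result in a response header. The header name varies by provider (for example, x-cache on CloudFront or cf-cache-status on Cloudflare), so the names below are assumptions rather than confirmed configuration. A minimal sketch of deriving the status from the fetch response:
// Sketch: derive cache status from a provider-specific header (header names are placeholders).
function extractCacheStatus(response: Response): 'HIT' | 'MISS' | 'UNKNOWN' {
  const header =
    response.headers.get('x-cache') ?? response.headers.get('cf-cache-status');
  if (!header) return 'UNKNOWN';
  return header.toUpperCase().includes('HIT') ? 'HIT' : 'MISS';
}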
Alert Conditions
Critical: CDN Down
cdn.availability = 0 for 3 consecutive checks (3 minutes)
→ Alert: "CDN unreachable - user impact likely"
Warning: CDN Slow
cdn.latency.ms > 2000 for 5 consecutive checks (5 minutes)
→ Alert: "CDN degraded performance"
Warning: Content Mismatch
cdn.content_valid = 0 for 1 check
→ Alert: "CDN serving incorrect content - cache corruption?"
Implementation Pattern
Phase 1: CDN Monitoring (Week 1)
Steps:
- Create apps/health-check-service/ with TypeScript + Express
- Add CDN_URL to environment configuration
- Upload .well-known/health-check.json to CDN
- Implement CDN asset check with content verification
- Send metrics to OTLP collector
- Create Dash0 dashboard + alerts
Deliverables:
- ✅ Health check service running
- ✅ CDN monitored every 60 seconds
- ✅ Dash0 alert: "CDN down for 3 minutes"
Phase 2: Service Health Endpoints (Week 2)
Add /health endpoints to:
- API service
- Webhook service
- EmProps API service
Endpoint spec:
// GET /health
{
"status": "ok" | "degraded" | "down",
"timestamp": "2025-10-29T18:30:00Z",
"checks": {
"redis": "ok",
"database": "ok",
"telemetry": "ok"
},
"version": "1.0.0",
"uptime": 3600 // seconds
}
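A minimal sketch of a handler returning this shape, assuming Express; checkRedis, checkDatabase, and APP_VERSION are hypothetical placeholders for whatever probes and build metadata each service actually has:
// Sketch of a /health handler (assumes Express; probe helpers are placeholders).
import express from 'express';

// Placeholder dependency probes; real implementations would ping Redis / run a trivial DB query.
async function checkRedis(): Promise<boolean> { return true; }
async function checkDatabase(): Promise<boolean> { return true; }

const app = express();
const startedAt = Date.now();

app.get('/health', async (_req, res) => {
  const checks = {
    redis: (await checkRedis()) ? 'ok' : 'down',
    database: (await checkDatabase()) ? 'ok' : 'down',
  };
  const allOk = Object.values(checks).every((c) => c === 'ok');
  // Simplified to ok/degraded; a fuller version could distinguish "down".
  res.status(allOk ? 200 : 503).json({
    status: allOk ? 'ok' : 'degraded',
    timestamp: new Date().toISOString(),
    checks,
    version: process.env.APP_VERSION ?? 'unknown', // APP_VERSION is a hypothetical env var
    uptime: Math.floor((Date.now() - startedAt) / 1000), // seconds
  });
});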
Phase 3: Extended Checks (Week 3-4)
Add checks for (Redis and database probe sketches follow this list):
- Redis connectivity (can read/write)
- Database connectivity (can query)
- Worker connectivity (workers registered)
- External APIs (OpenAI, etc.)
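As a rough sketch of what the Redis and database probes inside the health-check service could look like, assuming ioredis and pg as the client libraries (connection strings would come from environment configuration):
// Sketch: dependency probes for the health-check service (assumes ioredis and pg).
import Redis from 'ioredis';
import { Pool } from 'pg';

async function checkRedisConnectivity(url: string): Promise<boolean> {
  const redis = new Redis(url, { connectTimeout: 3000, lazyConnect: true });
  try {
    await redis.connect();
    return (await redis.ping()) === 'PONG'; // round-trip proves the connection works
  } catch {
    return false;
  } finally {
    redis.disconnect();
  }
}

async function checkDatabaseConnectivity(connectionString: string): Promise<boolean> {
  const pool = new Pool({ connectionString, connectionTimeoutMillis: 3000 });
  try {
    await pool.query('SELECT 1'); // trivial query proves connectivity
    return true;
  } catch {
    return false;
  } finally {
    await pool.end();
  }
}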
Alerting Strategy
Alert Definitions
Service Down (Critical)
health_check.status{check.name="api-health"} = 0
for 3 consecutive checks (90 seconds)
Action: Page on-call engineer
CDN Down (Critical)
health_check.status{check.name="cdn-test-asset"} = 0
for 3 consecutive checks (3 minutes)
Action: Page on-call engineer
Service Slow (Warning)
health_check.latency{check.name="api-health"} > 1000ms
for 5 consecutive checks (150 seconds)
Action: Notify team Slack channel
Telemetry Pipeline Broken (Warning)
No health_check metrics received for 5 minutes
Action: Notify DevOps channel (health-check service down)
Alert Recovery
Auto-resolve when:
health_check.status = 1 for 2 consecutive checks
→ "Service recovered"
Rollout Strategy
Week 1: CDN Monitoring
Day 1-2:
- [ ] Create health-check-service boilerplate
- [ ] Add CDN_URL to telemetry-collector environment
- [ ] Implement CDN asset check
Day 3-4:
- [ ] Upload test asset to CDN
- [ ] Send OTLP metrics to collector
- [ ] Verify metrics in Dash0
Day 5:
- [ ] Create Dash0 dashboard for CDN health
- [ ] Configure alerts
Week 2: Service Endpoints
Day 1-3:
- [ ] Add /health to API, webhook, emprops-api
- [ ] Health check service polls endpoints
- [ ] Configure service alerts
Day 4-5:
- [ ] Test failure scenarios
- [ ] Document runbooks for alerts
Week 3-4: Extended Checks
- [ ] Redis connectivity checks
- [ ] Database connectivity checks
- [ ] Worker registry checks
Consequences
Positive
- ✅ Catch hung services - Active probing detects services that are "alive" but not responding
- ✅ External perspective - Independent view of service availability
- ✅ Clear alerts - "3 consecutive failures" is unambiguous
- ✅ Monitor external dependencies - CDN, external APIs, databases
- ✅ Decoupled from telemetry - Health checks work even if OTLP breaks
- ✅ Better user experience - Catch issues before users report them
Negative
- ⚠️ Additional service - One more component to maintain
- ⚠️ Network overhead - Constant probing generates traffic
- ⚠️ False positives possible - Network blips could trigger alerts
- ⚠️ Configuration complexity - Need to maintain check targets
Mitigation Strategies
False Positives:
- Require 3 consecutive failures before alerting
- Increase timeout for slower services
- Add retry logic with backoff (see the sketch below)
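One possible shape for that retry logic, wrapping a single probe in a small exponential backoff before the result counts as a failure (attempt counts and delays are illustrative):
// Sketch: retry a probe with exponential backoff before counting it as a failure.
async function probeWithRetry(
  probe: () => Promise<boolean>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await probe()) return true;
    // Wait 500ms, 1000ms, 2000ms, ... between attempts (illustrative values).
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
  }
  return false;
}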
Network Overhead:
- Small test assets (< 1KB)
- Reasonable intervals (30-60s, not 1s)
- Only check critical services
Maintenance Burden:
- Auto-discovery of services (future)
- Configuration via environment variables
- Clear runbooks for common issues
Alternatives Considered
Alternative 1: Rely Only on Heartbeats
Rejected because:
- ❌ Can't detect hung services
- ❌ No external perspective
- ❌ Can't monitor CDN/external services
- ❌ Telemetry-dependent
Alternative 2: Use External Monitoring Service (Pingdom, Datadog)
Pros:
- ✅ Proven solution
- ✅ Multiple geographic check points
- ✅ Managed service
Cons:
- ❌ Additional cost
- ❌ Can't check internal services (Redis, workers)
- ❌ Less control over check logic
- ❌ Need to integrate alerts with existing systems
Decision: Start with internal health checks, consider external service for public endpoints later.
Alternative 3: Health Checks in Each Service
Each service checks its dependencies:
// In API service:
setInterval(async () => {
const redisOk = await checkRedis();
const dbOk = await checkDatabase();
telemetry.record({ redisOk, dbOk });
}, 30000);
Rejected because:
- ❌ Distributed logic (harder to maintain)
- ❌ Can't check if service HTTP server is hung
- ❌ Duplicated code across services
- ❌ No centralized dashboard
Decision: Centralized health-check service is cleaner.
Success Metrics
Week 1 Success:
- [ ] CDN monitored every 60 seconds
- [ ] Dash0 dashboard showing CDN metrics
- [ ] Alert configured and tested
Week 4 Success:
- [ ] All critical services monitored
- [ ] Alerts firing correctly
- [ ] Zero undetected outages
- [ ] Mean time to detection (MTTD) < 2 minutes
Questions & Decisions
Q1: Should health checks run on telemetry-collector or separate service?
Decision: Start on telemetry-collector machine, move to separate if needed.
Reasoning:
- ✅ Simpler deployment initially
- ✅ Telemetry-collector already has network access to all services
- ⚠️ If the collector machine goes down, the health-check service goes down with it (acceptable risk for now)
Q2: What's the right check interval?
Decision:
- CDN: Every 60 seconds (less critical, stable)
- Services: Every 30 seconds (more critical, dynamic)
Reasoning:
- Balance detection speed vs. network overhead
- 30s = 2-minute worst-case detection with 3-failure threshold
- Can tune per-service based on SLA requirements
Q3: Should we verify CDN content hash or exact string match?
Decision: Exact string match for now.
Reasoning:
- ✅ Simpler implementation
- ✅ Faster check (no hash computation)
- ✅ Detects content corruption
- 🔮 Future: Add hash verification if exact match proves too brittle (see the sketch below)
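If exact matching does prove too brittle, hash comparison would be a small change; a sketch using Node's built-in crypto, with the expected hash assumed to live in the target configuration:
// Sketch: compare a SHA-256 hash of the fetched asset instead of the exact string.
import { createHash } from 'node:crypto';

function contentHashMatches(content: string, expectedSha256: string): boolean {
  const actual = createHash('sha256').update(content).digest('hex');
  return actual === expectedSha256;
}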
Next Steps:
- Get team approval on ADR
- Create apps/health-check-service/ boilerplate
- Implement CDN asset check
- Deploy to telemetry-collector
- Configure Dash0 alerts
