ADR-003: Service Heartbeat and Liveness Telemetry
- Date: 2025-10-09
- Status: Proposed
- Decision Makers: Engineering Team
- Approval Required: Before universal implementation across services
- Related ADRs: None
Executive Summary

This ADR proposes a standardized service heartbeat and liveness telemetry system using OpenTelemetry metrics to enable automatic service discovery, uptime monitoring, and service map visualization in Dash0.

Current Gap:
- ❌ No automatic service discovery - Services exist but aren't visible until they process requests
- ❌ No uptime tracking - Can't distinguish between "idle" and "down" services
- ❌ No service map - No visual topology of running services and their dependencies
- ❌ Delayed failure detection - Service crashes only noticed when users report issues

Proposed Solution:
- 🎯 Periodic heartbeat metrics (every 15 seconds) sent via OpenTelemetry
- 🗺️ Automatic service map population using `service.name` resource attributes
- ⏱️ Uptime tracking with asynchronous gauge metrics
- 📊 Dash0 visualization showing live/dead services in real-time
Impact:
- Before: Services are invisible until used, failures discovered reactively
- After: All services visible immediately on startup, proactive failure detection
Table of Contents
- Context
- Problem Statement
- Decision
- Technical Design
- Implementation Pattern
- Service Map Architecture
- Dash0 Integration
- Rollout Strategy
- Consequences
- Alternatives Considered
Context

Current State

Existing Telemetry Infrastructure:
- ✅ @emp/telemetry package with EmpTelemetryClient
- ✅ OTLP gRPC collector (port 4317)
- ✅ Collector → Dash0 export pipeline
- ✅ Trace-based instrumentation in API, webhook-service
- ✅ service.name resource attributes configured

Current Service Discovery Model:

```
Service starts
  ↓
No signal sent (invisible)
  ↓
First request arrives
  ↓
Trace created → Dash0
  ↓
Service appears on map (REACTIVE)
```

Problem:
- Services are "dark" until they receive traffic
- Idle services appear as "down" (indistinguishable from crashed services)
- No way to track service uptime or availability
- Service map incomplete without active traffic
OpenTelemetry Research Findings

Service Discovery via Traces: OpenTelemetry platforms automatically build service maps from trace data by:
- Analyzing parent-child span relationships
- Using `service.name` resource attributes to identify services
- Inferring service dependencies from request flows
- Building real-time topology graphs
Heartbeat Patterns: From OpenTelemetry best practices and industry research:
Asynchronous Gauge Metrics (Recommended for periodic values)
- Designed for values that aren't measured continuously
- Perfect for periodic health checks (every 5-60 seconds)
- Low overhead - callback-based, no active polling
Health Check Extension (Collector-level)
- Provides HTTP endpoint for liveness/readiness checks
- Used for monitoring the collector itself
- Not suitable for application-level service discovery
Custom Metrics with PeriodicExportingMetricReader
- Send metrics on fixed intervals (15-60 seconds recommended)
- Includes uptime, process stats, custom health indicators
- Standard OpenTelemetry pattern for service monitoring
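For concreteness, a minimal sketch of the recommended asynchronous-gauge pattern using the OpenTelemetry JS API is shown below; the meter name and reported value are illustrative, and the callback only runs when the configured metric reader collects.

```typescript
import { metrics } from '@opentelemetry/api';

// Minimal sketch of the asynchronous-gauge pattern: the callback is invoked
// only when the SDK's metric reader collects, so there is no polling loop.
const meter = metrics.getMeter('heartbeat-example');

const heartbeat = meter.createObservableGauge('service.heartbeat', {
  description: 'Service liveness indicator (1 = alive)',
  unit: '1',
});

heartbeat.addCallback((observableResult) => {
  observableResult.observe(1); // reported once per export interval
});
```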
Service Map in Dash0

Service maps in OpenTelemetry-compatible platforms (Dash0, Grafana Tempo, Jaeger) work by:
- Span Analysis: Parent-child relationships show request flows
- Service Identification: `service.name` attribute identifies each service
- Dependency Inference: Service A → Service B connections from spans
- Real-time Updates: Maps update as new telemetry arrives
Key Insight: Service maps are automatically generated from telemetry data. No manual registration required - just send signals with proper resource attributes.
Problem Statement

User Story

As a platform operator, I want to see all running services on a service map and know their health status, so I can detect failures proactively and understand system topology at a glance.

Requirements

Functional:
- ✅ Services appear on Dash0 service map within 15 seconds of startup
- ✅ Service uptime tracked and visible in metrics
- ✅ Service liveness status (alive/dead) determinable from recent signals
- ✅ Service dependencies inferred from actual request traces

Non-Functional:
- ✅ Minimal overhead (< 0.1% CPU, < 5MB memory per service)
- ✅ Works with existing OTLP collector infrastructure
- ✅ No changes to Dash0 configuration required
- ✅ Graceful degradation if collector is temporarily unavailable
Decision

We will implement a standardized service heartbeat system using OpenTelemetry Asynchronous Gauge metrics, sent every 15 seconds to the OTLP collector.

Core Components

1. Heartbeat Metric (`service.heartbeat`)
   - Type: Asynchronous Gauge
   - Value: Always `1` (alive indicator)
   - Attributes: `service.name`, `service.version`, `environment`
   - Frequency: Every 15 seconds

2. Uptime Metric (`service.uptime_seconds`)
   - Type: Asynchronous Gauge
   - Value: Seconds since service started
   - Attributes: Same as heartbeat
   - Frequency: Every 15 seconds (same callback)

3. Resource Attributes (already configured)
   - `service.name`: e.g., "emp-api", "emp-webhook-service"
   - `service.version`: From package.json or env var
   - `service.namespace`: Optional (e.g., "production", "staging")
   - `deployment.environment`: From NODE_ENV
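As a rough illustration (not the final implementation), these attributes map onto an OpenTelemetry Resource roughly as sketched below; the literal values and env-var fallbacks are placeholders, and the exact Resource construction API varies by SDK version.

```typescript
import { Resource } from '@opentelemetry/resources';

// Illustrative only: the attribute values below are placeholders.
const resource = new Resource({
  'service.name': 'emp-api',
  'service.version': process.env.SERVICE_VERSION ?? '1.0.0',
  'service.namespace': 'production',
  'deployment.environment': process.env.NODE_ENV ?? 'development',
});
```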
Why Asynchronous Gauge?

OpenTelemetry documentation states:

"Use Asynchronous Gauge to collect metrics that you can't (or don't want) to monitor continuously, for example, if you only want to check a server's CPU usage every five minutes."

This perfectly matches our use case:
- ✅ Periodic signal (not continuous monitoring)
- ✅ Callback-based (no polling overhead)
- ✅ Automatic export with configured interval
- ✅ Standard OpenTelemetry pattern
Technical Design

Architecture Diagram

```
┌─ Service (emp-api) ──────────────────────────────────────────
│  EmpTelemetryClient (initialized)
│    └─ PeriodicExportingMetricReader (15s interval)
│         Asynchronous Gauge Callback:
│           - service.heartbeat = 1
│           - service.uptime_seconds = (now - startTime)
└──────────────────────┬───────────────────────────────────────
                       │ gRPC (every 15s)
                       ▼
┌─ OTLP Collector (localhost:4317) ────────────────────────────
│  Metrics Pipeline
│    - Batch processor
│    - Attribute enrichment
└──────────────────────┬───────────────────────────────────────
                       │ OTLP/gRPC
                       ▼
┌─ Dash0 (SaaS) ───────────────────────────────────────────────
│  Service Map Visualization
│    [emp-api] ───── [webhook] ───── [machines]
│     (live)          (live)          (live)
│
│  Uptime Tracking:
│    - emp-api: 3h 24m (last heartbeat: 12s ago)
│    - webhook: 1h 15m (last heartbeat: 10s ago)
└──────────────────────────────────────────────────────────────
```

Service Lifecycle
```
Service Startup
  ↓
Initialize EmpTelemetryClient
  ↓
Register heartbeat gauge callback
  ↓
Start service (Express, Fastify, etc.)
  ↓
[Every 15 seconds]
  ↓
Callback invoked
  ↓
Metrics exported to collector
  ↓
Dash0 receives metrics
  ↓
Service appears on map (within 15s)
```

Implementation Pattern
Standard Service Initialization

```typescript
// apps/api/src/index.ts (EXAMPLE - already has telemetryClient)
import { createTelemetryClient } from '@emp/telemetry';

// (Optional) local start time - the enhanced telemetry client records its own
// start time internally, so this is only needed if the service wants its own
// uptime value for other purposes.
const serviceStartTime = Date.now();

// Initialize telemetry with heartbeat
const telemetryClient = createTelemetryClient({
  serviceName: 'emp-api',
  serviceVersion: '1.0.0',
  collectorEndpoint: process.env.OTEL_COLLECTOR_ENDPOINT!,
  environment: process.env.NODE_ENV!,
  enableHeartbeat: true,    // NEW: Enable automatic heartbeat
  heartbeatInterval: 15000  // NEW: 15 seconds (optional, default)
});

// The telemetry client will automatically:
// 1. Register an asynchronous gauge for the heartbeat
// 2. Register an asynchronous gauge for uptime
// 3. Export metrics every 15 seconds via PeriodicExportingMetricReader
// 4. Include service.name resource attributes automatically
```

Telemetry Client Enhancement
```typescript
// packages/telemetry/src/index.ts (CHANGES NEEDED)
import { Meter } from '@opentelemetry/api';

export interface TelemetryConfig {
  serviceName: string;
  serviceVersion: string;
  collectorEndpoint: string;
  environment: string;
  enableHeartbeat?: boolean;   // NEW: Default true
  heartbeatInterval?: number;  // NEW: Default 15000ms
}

export class EmpTelemetryClient {
  private serviceStartTime: number;
  private meter: Meter;

  constructor(config: TelemetryConfig) {
    this.serviceStartTime = Date.now();

    // ... existing initialization ...

    // NEW: Set up heartbeat if enabled (defaults to enabled)
    if (config.enableHeartbeat !== false) {
      this.setupHeartbeat(config.heartbeatInterval || 15000);
    }
  }

  private setupHeartbeat(interval: number) {
    const meter = this.meterProvider.getMeter('emp-heartbeat');

    // Heartbeat gauge - always 1 when alive
    const heartbeatGauge = meter.createObservableGauge('service.heartbeat', {
      description: 'Service liveness indicator (1 = alive)',
      unit: '1'
    });

    // Uptime gauge - seconds since start
    const uptimeGauge = meter.createObservableGauge('service.uptime_seconds', {
      description: 'Service uptime in seconds',
      unit: 's'
    });

    // Register a single batch callback that reports both metrics on each export
    meter.addBatchObservableCallback(
      async (observableResult) => {
        const uptime = (Date.now() - this.serviceStartTime) / 1000;

        observableResult.observe(heartbeatGauge, 1, {
          'service.status': 'running'
        });

        observableResult.observe(uptimeGauge, uptime, {
          'service.status': 'running'
        });
      },
      [heartbeatGauge, uptimeGauge]
    );

    logger.info(`Service heartbeat enabled (interval: ${interval}ms)`);
  }
}
```

Metric Reader Configuration
The `PeriodicExportingMetricReader` already configured in @emp/telemetry will handle:
- Automatic metric collection every `exportIntervalMillis` (15s recommended)
- Batching and export to the OTLP collector
- Resource attribute injection (`service.name`, etc.)
- Retry logic for failed exports

No additional configuration needed - just register the callbacks.
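For reference, the wiring inside @emp/telemetry presumably looks something like the sketch below; the endpoint and API shape are assumptions, and older SDK versions register the reader via `meterProvider.addMetricReader(...)` instead of the `readers` option.

```typescript
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { Resource } from '@opentelemetry/resources';

// Assumed sketch of the existing reader setup: collect registered observable
// callbacks every 15 seconds and export them to the local OTLP collector.
const meterProvider = new MeterProvider({
  resource: new Resource({ 'service.name': 'emp-api' }),
  readers: [
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: 'http://localhost:4317' }),
      exportIntervalMillis: 15_000,
    }),
  ],
});
```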
Service Map Architecture

How Service Maps Work

OpenTelemetry service maps are automatically generated from telemetry data:

1. Service Identification
   - Each service has a unique `service.name` resource attribute
   - Example: `emp-api`, `emp-webhook-service`, `emp-worker-gpu0`

2. Dependency Detection (from traces)
   - Parent span (service A) → child span (service B) = A calls B
   - HTTP client spans show outbound requests
   - Server spans show inbound requests
   - Example: `emp-api` → `emp-webhook-service` (HTTP POST /webhook)

3. Service Presence (from metrics/traces)
   - Service sends ANY telemetry → appears on map
   - Heartbeat metrics provide guaranteed presence even when idle
   - Last signal timestamp determines alive/dead status

4. Topology Visualization
   - Dash0 builds graph from service relationships
   - Nodes = services (sized by traffic volume)
   - Edges = dependencies (sized by request count)
   - Color = health status (green = healthy, red = errors, gray = idle)
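To make the dependency-detection step concrete, here is an illustrative sketch of the kind of CLIENT span that produces an emp-api → emp-webhook-service edge. In practice the HTTP auto-instrumentation usually creates these spans automatically; the function, URL, and span name below are hypothetical.

```typescript
import { trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('emp-api');

// Hypothetical outbound call: the CLIENT span created here becomes the parent
// of the receiving service's SERVER span once context propagation is enabled,
// which is the relationship Dash0 uses to draw the dependency edge.
export async function notifyWebhook(payload: unknown): Promise<void> {
  await tracer.startActiveSpan('POST /webhook', { kind: SpanKind.CLIENT }, async (span) => {
    try {
      await fetch('http://webhook-service/webhook', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      });
    } finally {
      span.end();
    }
  });
}
```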
Example Service Map Flow

```
Startup: emp-api sends heartbeat
  ↓
Dash0: Creates node "emp-api" (green, idle)
  ↓
User request: POST /jobs
  ↓
emp-api creates trace with parent span
  ↓
emp-api calls Redis (child span with service.name="redis")
  ↓
Dash0: Creates node "redis", edge "emp-api → redis"
  ↓
15s later: emp-api sends heartbeat again
  ↓
Dash0: Updates "emp-api" last-seen timestamp
```

Result: Service map shows emp-api and redis, even during idle periods.
Dash0 Integration

Metric Visualization

Dash0 will automatically receive and display:

1. Service Inventory
   - All services with `service.heartbeat = 1` in the last 60 seconds
   - Service name, version, environment visible in resource attributes
   - Uptime displayed from `service.uptime_seconds`

2. Service Map
   - Nodes auto-created from `service.name` attributes
   - Edges auto-created from trace parent-child relationships
   - Heartbeat ensures nodes persist even without traffic

3. Alerts (can be configured in Dash0)
   - Alert if `service.heartbeat` not received for > 45 seconds (3x interval)
   - Alert if `service.uptime_seconds` drops (service restart)
   - Alert if error rate > threshold (from trace spans)
Query Examples

```
# Check which services are alive (last 60s)
service.heartbeat{service.name=~"emp-.*"} > 0

# Get service uptime
service.uptime_seconds{service.name="emp-api"}

# Detect service restarts (uptime decreases)
decrease(service.uptime_seconds{service.name="emp-api"}[5m])
```
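A hedged sketch of the liveness alert described in the previous section, in the same query style as above; the exact function names and window syntax depend on Dash0's query dialect.

```
# Fire when no heartbeat has been received for 45 seconds (3x interval)
absent_over_time(service.heartbeat{service.name="emp-api"}[45s])
```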
Rollout Strategy

Phase 1: API Service (Week 1)
Goal: Prove heartbeat concept and validate Dash0 visualization
- ✅ Enhance `@emp/telemetry` with heartbeat support
- ✅ Enable heartbeat in `emp-api` service
- ✅ Verify service appears on Dash0 map within 15s of startup
- ✅ Confirm uptime metric accuracy
- ✅ Test failure detection (kill service, verify alert)
Success Criteria:
- emp-api visible on Dash0 map immediately on startup
- Uptime metric matches actual service runtime
- Service disappears from "live" list within 60s of crash
Phase 2: Core Services (Week 2)

Goal: Extend to all production services

Services to instrument:
- ✅ emp-webhook-service
- ✅ emp-telemetry-collector (self-monitoring)
- ✅ emp-worker (redis-direct-worker)
- ✅ emp-machines (basic-machine supervisor)
Success Criteria:
- Full service topology visible in Dash0
- Dependencies correctly inferred from traces
- All services show accurate uptime
Phase 3: Specialized Services (Week 3)

Goal: Complete coverage
- ✅ emp-monitor (internal dashboard)
- ✅ emp-emprops-api (legacy API)
- ✅ Any future services
Success Criteria:
- 100% service coverage
- Standardized heartbeat pattern across all services
- Documentation updated with implementation guide
Consequences

Positive

✅ Proactive Failure Detection
- Services visible immediately on startup
- Crashes detected within 45 seconds (3x heartbeat interval)
- No reliance on user reports
✅ Service Map Visualization
- Automatic topology mapping
- Real-time dependency graph
- Easier debugging of distributed system issues
✅ Uptime Tracking
- Historical uptime data in Dash0
- Service reliability metrics
- Restart detection and alerting
✅ Operational Visibility
- Know which services are running at any time
- Distinguish idle vs. crashed services
- Service inventory for capacity planning
✅ Standard OpenTelemetry Pattern
- Industry best practice
- Well-documented approach
- Compatible with any OTLP backend (not locked to Dash0)
✅ Minimal Overhead
- Callback-based (no polling threads)
- 15-second interval (low frequency)
- Estimated < 0.1% CPU, < 5MB memory per service
Negative

⚠️ Slightly Increased Telemetry Volume
- ~4 heartbeat signals per minute per service
- With 10 services: 40 metric data points/min = 57,600/day
- Mitigation: Metrics are small (just gauge value + attributes), negligible cost
⚠️ False Positives During Deploys
- Service restart = brief "dead" period (15-45 seconds)
- Mitigation: Configure Dash0 alerts with 60s grace period
- Mitigation: Blue-green deployment minimizes downtime
⚠️ Dependency on Collector
- Heartbeat requires collector availability
- Mitigation: Telemetry client has retry logic and queuing
- Mitigation: Service functions normally even if collector is down
Neutral

➡️ New Metric Type
- Adds asynchronous gauges to telemetry stack
- Developers must understand gauge vs. counter semantics
- Action: Update telemetry documentation
➡️ Dash0 Configuration
- May need to configure custom dashboards for heartbeat metrics
- Alert rules for service liveness
- Action: Provide Dash0 dashboard templates
Alternatives Considered

Alternative 1: Trace-Only Service Discovery

Approach: Rely solely on traces (spans) for service discovery, no heartbeat metrics.

Pros:
- ✅ No additional implementation needed
- ✅ Zero metric overhead
- ✅ Services appear on map from actual traffic

Cons:
- ❌ Services invisible until first request
- ❌ Idle services appear "dead"
- ❌ No uptime tracking
- ❌ Slow failure detection (only noticed when requests fail)

Verdict: ❌ Rejected - Reactive discovery insufficient for operational needs.
Alternative 2: HTTP Health Check Endpoints

Approach: Each service exposes a /health endpoint, and an external monitoring system polls it.

Pros:
- ✅ Simple implementation (Express middleware)
- ✅ Standard pattern (Kubernetes liveness probes)
- ✅ Works without telemetry infrastructure

Cons:
- ❌ Separate infrastructure needed (Prometheus, Datadog agent, etc.)
- ❌ Duplicate observability stack (OTLP + HTTP polling)
- ❌ No automatic service map integration
- ❌ Additional network overhead (HTTP requests)
- ❌ Doesn't integrate with Dash0

Verdict: ❌ Rejected - Adds complexity, doesn't leverage existing OTLP pipeline.
Alternative 3: Log-Based Liveness

Approach: Services write "heartbeat" log lines every 15s, and a log aggregator detects liveness.

Pros:
- ✅ Uses existing logging infrastructure
- ✅ Easy to implement (setInterval + logger.info)

Cons:
- ❌ Logs != metrics (wrong abstraction)
- ❌ Expensive to query at scale (grep through logs)
- ❌ No structured data for dashboards
- ❌ Doesn't populate service map
- ❌ Log volume increases significantly

Verdict: ❌ Rejected - Logs are for diagnostics, not operational metrics.
Alternative 4: Synchronous Counter (Anti-Pattern)

Approach: Use a synchronous counter incremented every 15s via setInterval.

Pros:
- ✅ Simple implementation

Cons:
- ❌ Wrong metric type - counters are for cumulative values, not periodic signals
- ❌ Requires active polling (setInterval timer)
- ❌ Not idiomatic OpenTelemetry
- ❌ Confusing semantics (a counter should always increase)

Verdict: ❌ Rejected - Violates OpenTelemetry best practices.
Alternative 5: Manual Service Registration

Approach: Services POST to a /register endpoint on startup, and a separate service maintains the registry.

Pros:
- ✅ Explicit service inventory
- ✅ Can include metadata (version, capabilities)

Cons:
- ❌ Separate service registry infrastructure
- ❌ Single point of failure
- ❌ Stale data if a service crashes without deregistering
- ❌ Doesn't integrate with Dash0
- ❌ Duplicates what OTLP already provides

Verdict: ❌ Rejected - Reinventing OTLP service discovery.
Open Questions

Q1: Should heartbeat interval be configurable per service?
Answer: Yes, but default to 15 seconds.
- API/webhook: 15s (user-facing, need fast detection)
- Workers: 30s (background processing, less critical)
- Machines: 60s (long-running, stable)
Action: Add heartbeatInterval to TelemetryConfig (optional).
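A quick sketch of what a per-service override might look like with the proposed config fields (the worker service name and values are illustrative):

```typescript
import { createTelemetryClient } from '@emp/telemetry';

// Hypothetical worker configuration: background workers use a 30s interval
// instead of the 15s default, per the guidance above.
const workerTelemetry = createTelemetryClient({
  serviceName: 'emp-worker-gpu0',
  serviceVersion: '1.0.0',
  collectorEndpoint: process.env.OTEL_COLLECTOR_ENDPOINT!,
  environment: process.env.NODE_ENV!,
  enableHeartbeat: true,
  heartbeatInterval: 30_000,
});
```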
Q2: What happens if the collector is down?
Answer: Graceful degradation:
- Telemetry client queues metrics locally (in-memory buffer)
- Retries export with exponential backoff
- Drops oldest metrics if buffer fills (prevents memory leak)
- Service continues functioning normally
Action: Document retry behavior, add monitoring for telemetry export failures.
Q3: Should we send additional health metrics?
Ideas:
- Memory usage
- CPU usage
- Request rate
- Error rate
Answer: Future enhancement, not in initial implementation.
- Heartbeat + uptime are sufficient for service discovery
- System metrics (CPU/memory) can be added via OpenTelemetry Host Metrics (see the sketch below)
- Request/error rates already captured in trace spans
Action: Create follow-up ADR for comprehensive service health metrics.
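If that follow-up is pursued, a hedged sketch of the host-metrics option could look like this; it assumes the globally registered meter provider is the one @emp/telemetry configures, and the instrumentation name is illustrative:

```typescript
import { metrics } from '@opentelemetry/api';
import { HostMetrics } from '@opentelemetry/host-metrics';

// Assumption: @emp/telemetry registers its MeterProvider globally, so the
// host-metrics package can reuse it to report CPU/memory gauges.
const hostMetrics = new HostMetrics({
  meterProvider: metrics.getMeterProvider(),
  name: 'emp-host-metrics',
});
hostMetrics.start();
```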
Q4: How do we handle service restarts?

Behavior:
- Service restarts → `service.uptime_seconds` resets to 0
- Old service instance stops sending heartbeats
- New instance starts sending heartbeats
- Dash0 shows brief gap (15-45 seconds)
Action: Configure Dash0 alerts with 60s threshold to avoid false positives during deploys.
Q5: Does this work for worker containers that scale 0→50→0?

Answer: Yes, perfectly suited for ephemeral workloads:
- Worker starts → heartbeat appears → visible on map
- Worker processes jobs → traces show activity
- Worker stops → heartbeat stops → disappears from map within 60s
- Automatic inventory of active workers
Action: Test with SALAD/vast.ai worker scaling.
Related Documentation
- OpenTelemetry Metrics Concepts
- Asynchronous Gauge Documentation
- Service Map Best Practices
- Dash0 OpenTelemetry Integration
- @emp/telemetry Package
Approval Checklist
Before accepting this ADR:
- [ ] Review technical design with team
- [ ] Validate metric schema with Dash0 documentation
- [ ] Confirm performance overhead is acceptable (< 0.1% CPU)
- [ ] Test heartbeat implementation in local development
- [ ] Verify service map appears correctly in Dash0
- [ ] Document rollout plan and timelines
- [ ] Get sign-off from operations team
Change Log
| Date | Change | Author |
|---|---|---|
| 2025-10-09 | Initial proposal | Claude Code |
