ADR-003: Service Heartbeat and Liveness Telemetry
- Date: 2025-10-09
- Status: Proposed
- Decision Makers: Engineering Team
- Approval Required: Before universal implementation across services
- Related ADRs: None
Executive Summary

This ADR proposes a standardized service heartbeat and liveness telemetry system using OpenTelemetry metrics to enable automatic service discovery, uptime monitoring, and service map visualization in Dash0.

Current Gap:
- ❌ No automatic service discovery - Services exist but aren't visible until they process requests
- ❌ No uptime tracking - Can't distinguish between "idle" and "down" services
- ❌ No service map - No visual topology of running services and their dependencies
- ❌ Delayed failure detection - Service crashes only noticed when users report issues

Proposed Solution:
- 🎯 Periodic heartbeat metrics (every 15 seconds) sent via OpenTelemetry
- 🗺️ Automatic service map population using `service.name` resource attributes
- ⏱️ Uptime tracking with asynchronous gauge metrics
- 📊 Dash0 visualization showing live/dead services in real-time
Impact:
- Before: Services are invisible until used, failures discovered reactively
- After: All services visible immediately on startup, proactive failure detection
Table of Contents
- Context
- Problem Statement
- Decision
- Technical Design
- Implementation Pattern
- Service Map Architecture
- Dash0 Integration
- Rollout Strategy
- Consequences
- Alternatives Considered
Context

Current State

Existing Telemetry Infrastructure:
- ✅ @emp/telemetry package with EmpTelemetryClient
- ✅ OTLP gRPC collector (port 4317)
- ✅ Collector → Dash0 export pipeline
- ✅ Trace-based instrumentation in API, webhook-service
- ✅ service.name resource attributes configured

Current Service Discovery Model:

```
Service starts
  ↓
No signal sent (invisible)
  ↓
First request arrives
  ↓
Trace created → Dash0
  ↓
Service appears on map (REACTIVE)
```

Problem:
- Services are "dark" until they receive traffic
- Idle services appear as "down" (indistinguishable from crashed services)
- No way to track service uptime or availability
- Service map incomplete without active traffic
OpenTelemetry Research Findings

Service Discovery via Traces: OpenTelemetry platforms automatically build service maps from trace data by:
- Analyzing parent-child span relationships
- Using `service.name` resource attributes to identify services
- Inferring service dependencies from request flows
- Building real-time topology graphs
Heartbeat Patterns: From OpenTelemetry best practices and industry research:
Asynchronous Gauge Metrics (Recommended for periodic values)
- Designed for values that aren't measured continuously
- Perfect for periodic health checks (every 5-60 seconds)
- Low overhead - callback-based, no active polling
Health Check Extension (Collector-level)
- Provides HTTP endpoint for liveness/readiness checks
- Used for monitoring the collector itself
- Not suitable for application-level service discovery
Custom Metrics with PeriodicExportingMetricReader
- Send metrics on fixed intervals (15-60 seconds recommended)
- Includes uptime, process stats, custom health indicators
- Standard OpenTelemetry pattern for service monitoring
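For concreteness, a minimal sketch of the recommended asynchronous-gauge pattern using the OpenTelemetry JS API is shown below; the meter name and reported value are illustrative, and the callback only runs when the configured metric reader collects.

```typescript
import { metrics } from '@opentelemetry/api';

// Minimal sketch of the asynchronous-gauge pattern: the callback is invoked
// only when the SDK's metric reader collects, so there is no polling loop.
const meter = metrics.getMeter('heartbeat-example');

const heartbeat = meter.createObservableGauge('service.heartbeat', {
  description: 'Service liveness indicator (1 = alive)',
  unit: '1',
});

heartbeat.addCallback((observableResult) => {
  observableResult.observe(1); // reported once per export interval
});
```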
Service Map in Dash0

Service maps in OpenTelemetry-compatible platforms (Dash0, Grafana Tempo, Jaeger) work by:
- Span Analysis: Parent-child relationships show request flows
- Service Identification: `service.name` attribute identifies each service
- Dependency Inference: Service A → Service B connections from spans
- Real-time Updates: Maps update as new telemetry arrives
Key Insight: Service maps are automatically generated from telemetry data. No manual registration required - just send signals with proper resource attributes.
Problem Statement

User Story

As a platform operator, I want to see all running services on a service map and know their health status, so I can detect failures proactively and understand system topology at a glance.

Requirements

Functional:
- ✅ Services appear on Dash0 service map within 15 seconds of startup
- ✅ Service uptime tracked and visible in metrics
- ✅ Service liveness status (alive/dead) determinable from recent signals
- ✅ Service dependencies inferred from actual request traces

Non-Functional:
- ✅ Minimal overhead (< 0.1% CPU, < 5MB memory per service)
- ✅ Works with existing OTLP collector infrastructure
- ✅ No changes to Dash0 configuration required
- ✅ Graceful degradation if collector is temporarily unavailable
Decision

We will implement a standardized service heartbeat system using OpenTelemetry Asynchronous Gauge metrics, sent every 15 seconds to the OTLP collector.

Core Components

1. Heartbeat Metric (`service.heartbeat`)
   - Type: Asynchronous Gauge
   - Value: Always `1` (alive indicator)
   - Attributes: `service.name`, `service.version`, `environment`
   - Frequency: Every 15 seconds

2. Uptime Metric (`service.uptime_seconds`)
   - Type: Asynchronous Gauge
   - Value: Seconds since service started
   - Attributes: Same as heartbeat
   - Frequency: Every 15 seconds (same callback)

3. Resource Attributes (already configured)
   - `service.name`: e.g., "emp-api", "emp-webhook-service"
   - `service.version`: From package.json or env var
   - `service.namespace`: Optional (e.g., "production", "staging")
   - `deployment.environment`: From NODE_ENV
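As a rough illustration (not the final implementation), these attributes map onto an OpenTelemetry Resource roughly as sketched below; the literal values and env-var fallbacks are placeholders, and the exact Resource construction API varies by SDK version.

```typescript
import { Resource } from '@opentelemetry/resources';

// Illustrative only: the attribute values below are placeholders.
const resource = new Resource({
  'service.name': 'emp-api',
  'service.version': process.env.SERVICE_VERSION ?? '1.0.0',
  'service.namespace': 'production',
  'deployment.environment': process.env.NODE_ENV ?? 'development',
});
```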
Why Asynchronous Gauge?

OpenTelemetry documentation states:

"Use Asynchronous Gauge to collect metrics that you can't (or don't want) to monitor continuously, for example, if you only want to check a server's CPU usage every five minutes."

This perfectly matches our use case:
- ✅ Periodic signal (not continuous monitoring)
- ✅ Callback-based (no polling overhead)
- ✅ Automatic export with configured interval
- ✅ Standard OpenTelemetry pattern
Technical Design

Architecture Diagram

```
┌─ Service (emp-api) ──────────────────────────────────────────
│  EmpTelemetryClient (initialized)
│    └─ PeriodicExportingMetricReader (15s interval)
│         Asynchronous Gauge Callback:
│           - service.heartbeat = 1
│           - service.uptime_seconds = (now - startTime)
└──────────────────────┬───────────────────────────────────────
                       │ gRPC (every 15s)
                       ▼
┌─ OTLP Collector (localhost:4317) ────────────────────────────
│  Metrics Pipeline
│    - Batch processor
│    - Attribute enrichment
└──────────────────────┬───────────────────────────────────────
                       │ OTLP/gRPC
                       ▼
┌─ Dash0 (SaaS) ───────────────────────────────────────────────
│  Service Map Visualization
│    [emp-api] ───── [webhook] ───── [machines]
│     (live)          (live)          (live)
│
│  Uptime Tracking:
│    - emp-api: 3h 24m (last heartbeat: 12s ago)
│    - webhook: 1h 15m (last heartbeat: 10s ago)
└──────────────────────────────────────────────────────────────
```

Service Lifecycle
```
Service Startup
  ↓
Initialize EmpTelemetryClient
  ↓
Register heartbeat gauge callback
  ↓
Start service (Express, Fastify, etc.)
  ↓
[Every 15 seconds]
  ↓
Callback invoked
  ↓
Metrics exported to collector
  ↓
Dash0 receives metrics
  ↓
Service appears on map (within 15s)
```

Implementation Pattern
Standard Service Initialization

```typescript
// apps/api/src/index.ts (EXAMPLE - already has telemetryClient)
import { createTelemetryClient } from '@emp/telemetry';

// (Optional) local start time - the enhanced telemetry client records its own
// start time internally, so this is only needed if the service wants its own
// uptime value for other purposes.
const serviceStartTime = Date.now();

// Initialize telemetry with heartbeat
const telemetryClient = createTelemetryClient({
  serviceName: 'emp-api',
  serviceVersion: '1.0.0',
  collectorEndpoint: process.env.OTEL_COLLECTOR_ENDPOINT!,
  environment: process.env.NODE_ENV!,
  enableHeartbeat: true,    // NEW: Enable automatic heartbeat
  heartbeatInterval: 15000  // NEW: 15 seconds (optional, default)
});

// The telemetry client will automatically:
// 1. Register an asynchronous gauge for the heartbeat
// 2. Register an asynchronous gauge for uptime
// 3. Export metrics every 15 seconds via PeriodicExportingMetricReader
// 4. Include service.name resource attributes automatically
```

Telemetry Client Enhancement
```typescript
// packages/telemetry/src/index.ts (CHANGES NEEDED)
import { Meter } from '@opentelemetry/api';

export interface TelemetryConfig {
  serviceName: string;
  serviceVersion: string;
  collectorEndpoint: string;
  environment: string;
  enableHeartbeat?: boolean;   // NEW: Default true
  heartbeatInterval?: number;  // NEW: Default 15000ms
}

export class EmpTelemetryClient {
  private serviceStartTime: number;
  private meter: Meter;

  constructor(config: TelemetryConfig) {
    this.serviceStartTime = Date.now();

    // ... existing initialization ...

    // NEW: Set up heartbeat if enabled (defaults to enabled)
    if (config.enableHeartbeat !== false) {
      this.setupHeartbeat(config.heartbeatInterval || 15000);
    }
  }

  private setupHeartbeat(interval: number) {
    const meter = this.meterProvider.getMeter('emp-heartbeat');

    // Heartbeat gauge - always 1 when alive
    const heartbeatGauge = meter.createObservableGauge('service.heartbeat', {
      description: 'Service liveness indicator (1 = alive)',
      unit: '1'
    });

    // Uptime gauge - seconds since start
    const uptimeGauge = meter.createObservableGauge('service.uptime_seconds', {
      description: 'Service uptime in seconds',
      unit: 's'
    });

    // Register a single batch callback that reports both metrics on each export
    meter.addBatchObservableCallback(
      async (observableResult) => {
        const uptime = (Date.now() - this.serviceStartTime) / 1000;

        observableResult.observe(heartbeatGauge, 1, {
          'service.status': 'running'
        });

        observableResult.observe(uptimeGauge, uptime, {
          'service.status': 'running'
        });
      },
      [heartbeatGauge, uptimeGauge]
    );

    logger.info(`Service heartbeat enabled (interval: ${interval}ms)`);
  }
}
```

Metric Reader Configuration
The `PeriodicExportingMetricReader` already configured in @emp/telemetry will handle:
- Automatic metric collection every `exportIntervalMillis` (15s recommended)
- Batching and export to the OTLP collector
- Resource attribute injection (`service.name`, etc.)
- Retry logic for failed exports

No additional configuration needed - just register the callbacks.
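For reference, the wiring inside @emp/telemetry presumably looks something like the sketch below; the endpoint and API shape are assumptions, and older SDK versions register the reader via `meterProvider.addMetricReader(...)` instead of the `readers` option.

```typescript
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { Resource } from '@opentelemetry/resources';

// Assumed sketch of the existing reader setup: collect registered observable
// callbacks every 15 seconds and export them to the local OTLP collector.
const meterProvider = new MeterProvider({
  resource: new Resource({ 'service.name': 'emp-api' }),
  readers: [
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: 'http://localhost:4317' }),
      exportIntervalMillis: 15_000,
    }),
  ],
});
```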
Service Map Architecture

How Service Maps Work

OpenTelemetry service maps are automatically generated from telemetry data:

1. Service Identification
   - Each service has a unique `service.name` resource attribute
   - Example: `emp-api`, `emp-webhook-service`, `emp-worker-gpu0`

2. Dependency Detection (from traces)
   - Parent span (service A) → child span (service B) = A calls B
   - HTTP client spans show outbound requests
   - Server spans show inbound requests
   - Example: `emp-api` → `emp-webhook-service` (HTTP POST /webhook)

3. Service Presence (from metrics/traces)
   - Service sends ANY telemetry → appears on map
   - Heartbeat metrics provide guaranteed presence even when idle
   - Last signal timestamp determines alive/dead status

4. Topology Visualization
   - Dash0 builds graph from service relationships
   - Nodes = services (sized by traffic volume)
   - Edges = dependencies (sized by request count)
   - Color = health status (green = healthy, red = errors, gray = idle)
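To make the dependency-detection step concrete, here is an illustrative sketch of the kind of CLIENT span that produces an emp-api → emp-webhook-service edge. In practice the HTTP auto-instrumentation usually creates these spans automatically; the function, URL, and span name below are hypothetical.

```typescript
import { trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('emp-api');

// Hypothetical outbound call: the CLIENT span created here becomes the parent
// of the receiving service's SERVER span once context propagation is enabled,
// which is the relationship Dash0 uses to draw the dependency edge.
export async function notifyWebhook(payload: unknown): Promise<void> {
  await tracer.startActiveSpan('POST /webhook', { kind: SpanKind.CLIENT }, async (span) => {
    try {
      await fetch('http://webhook-service/webhook', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      });
    } finally {
      span.end();
    }
  });
}
```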
Example Service Map Flow

```
Startup: emp-api sends heartbeat
  ↓
Dash0: Creates node "emp-api" (green, idle)
  ↓
User request: POST /jobs
  ↓
emp-api creates trace with parent span
  ↓
emp-api calls Redis (child span with service.name="redis")
  ↓
Dash0: Creates node "redis", edge "emp-api → redis"
  ↓
15s later: emp-api sends heartbeat again
  ↓
Dash0: Updates "emp-api" last-seen timestamp
```

Result: Service map shows emp-api and redis, even during idle periods.
Dash0 Integration

Metric Visualization

Dash0 will automatically receive and display:

1. Service Inventory
   - All services with `service.heartbeat = 1` in the last 60 seconds
   - Service name, version, environment visible in resource attributes
   - Uptime displayed from `service.uptime_seconds`

2. Service Map
   - Nodes auto-created from `service.name` attributes
   - Edges auto-created from trace parent-child relationships
   - Heartbeat ensures nodes persist even without traffic

3. Alerts (can be configured in Dash0)
   - Alert if `service.heartbeat` not received for > 45 seconds (3x interval)
   - Alert if `service.uptime_seconds` drops (service restart)
   - Alert if error rate > threshold (from trace spans)
Query Examples

```
# Check which services are alive (last 60s)
service.heartbeat{service.name=~"emp-.*"} > 0

# Get service uptime
service.uptime_seconds{service.name="emp-api"}

# Detect service restarts (uptime decreases)
decrease(service.uptime_seconds{service.name="emp-api"}[5m])
```
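A hedged sketch of the liveness alert described in the previous section, in the same query style as above; the exact function names and window syntax depend on Dash0's query dialect.

```
# Fire when no heartbeat has been received for 45 seconds (3x interval)
absent_over_time(service.heartbeat{service.name="emp-api"}[45s])
```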
Rollout Strategy

Phase 1: API Service (Week 1)
Goal: Prove heartbeat concept and validate Dash0 visualization
- ✅ Enhance `@emp/telemetry` with heartbeat support
- ✅ Enable heartbeat in `emp-api` service
- ✅ Verify service appears on Dash0 map within 15s of startup
- ✅ Confirm uptime metric accuracy
- ✅ Test failure detection (kill service, verify alert)
Success Criteria:
- emp-api visible on Dash0 map immediately on startup
- Uptime metric matches actual service runtime
- Service disappears from "live" list within 60s of crash
Phase 2: Core Services (Week 2)

Goal: Extend to all production services

Services to instrument:
- ✅ emp-webhook-service
- ✅ emp-telemetry-collector (self-monitoring)
- ✅ emp-worker (redis-direct-worker)
- ✅ emp-machines (basic-machine supervisor)
Success Criteria:
- Full service topology visible in Dash0
- Dependencies correctly inferred from traces
- All services show accurate uptime
Phase 3: Specialized Services (Week 3)

Goal: Complete coverage
- ✅ emp-monitor (internal dashboard)
- ✅ emp-emprops-api (legacy API)
- ✅ Any future services
Success Criteria:
- 100% service coverage
- Standardized heartbeat pattern across all services
- Documentation updated with implementation guide
Consequences

Positive

✅ Proactive Failure Detection
- Services visible immediately on startup
- Crashes detected within 45 seconds (3x heartbeat interval)
- No reliance on user reports
✅ Service Map Visualization
- Automatic topology mapping
- Real-time dependency graph
- Easier debugging of distributed system issues
✅ Uptime Tracking
- Historical uptime data in Dash0
- Service reliability metrics
- Restart detection and alerting
✅ Operational Visibility
- Know which services are running at any time
- Distinguish idle vs. crashed services
- Service inventory for capacity planning
✅ Standard OpenTelemetry Pattern
- Industry best practice
- Well-documented approach
- Compatible with any OTLP backend (not locked to Dash0)
✅ Minimal Overhead
- Callback-based (no polling threads)
- 15-second interval (low frequency)
- Estimated < 0.1% CPU, < 5MB memory per service
Negative

⚠️ Slightly Increased Telemetry Volume
- ~4 heartbeat signals per minute per service
- With 10 services: 40 metric data points/min = 57,600/day
- Mitigation: Metrics are small (just gauge value + attributes), negligible cost
⚠️ False Positives During Deploys
- Service restart = brief "dead" period (15-45 seconds)
- Mitigation: Configure Dash0 alerts with 60s grace period
- Mitigation: Blue-green deployment minimizes downtime
⚠️ Dependency on Collector
- Heartbeat requires collector availability
- Mitigation: Telemetry client has retry logic and queuing
- Mitigation: Service functions normally even if collector is down
Neutral

➡️ New Metric Type
- Adds asynchronous gauges to telemetry stack
- Developers must understand gauge vs. counter semantics
- Action: Update telemetry documentation
➡️ Dash0 Configuration
- May need to configure custom dashboards for heartbeat metrics
- Alert rules for service liveness
- Action: Provide Dash0 dashboard templates
Alternatives Considered

Alternative 1: Trace-Only Service Discovery

Approach: Rely solely on traces (spans) for service discovery, no heartbeat metrics.

Pros:
- ✅ No additional implementation needed
- ✅ Zero metric overhead
- ✅ Services appear on map from actual traffic

Cons:
- ❌ Services invisible until first request
- ❌ Idle services appear "dead"
- ❌ No uptime tracking
- ❌ Slow failure detection (only noticed when requests fail)

Verdict: ❌ Rejected - Reactive discovery insufficient for operational needs.
Alternative 2: HTTP Health Check Endpoints

Approach: Each service exposes a /health endpoint, and an external monitoring system polls it.

Pros:
- ✅ Simple implementation (Express middleware)
- ✅ Standard pattern (Kubernetes liveness probes)
- ✅ Works without telemetry infrastructure

Cons:
- ❌ Separate infrastructure needed (Prometheus, Datadog agent, etc.)
- ❌ Duplicate observability stack (OTLP + HTTP polling)
- ❌ No automatic service map integration
- ❌ Additional network overhead (HTTP requests)
- ❌ Doesn't integrate with Dash0

Verdict: ❌ Rejected - Adds complexity, doesn't leverage existing OTLP pipeline.
Alternative 3: Log-Based Liveness

Approach: Services write "heartbeat" log lines every 15s, and a log aggregator detects liveness.

Pros:
- ✅ Uses existing logging infrastructure
- ✅ Easy to implement (setInterval + logger.info)

Cons:
- ❌ Logs != metrics (wrong abstraction)
- ❌ Expensive to query at scale (grep through logs)
- ❌ No structured data for dashboards
- ❌ Doesn't populate service map
- ❌ Log volume increases significantly

Verdict: ❌ Rejected - Logs are for diagnostics, not operational metrics.
Alternative 4: Synchronous Counter (Anti-Pattern)

Approach: Use a synchronous counter incremented every 15s via setInterval.

Pros:
- ✅ Simple implementation

Cons:
- ❌ Wrong metric type - counters are for cumulative values, not periodic signals
- ❌ Requires active polling (setInterval timer)
- ❌ Not idiomatic OpenTelemetry
- ❌ Confusing semantics (a counter should always increase)

Verdict: ❌ Rejected - Violates OpenTelemetry best practices.
Alternative 5: Manual Service Registration

Approach: Services POST to a /register endpoint on startup, and a separate service maintains the registry.

Pros:
- ✅ Explicit service inventory
- ✅ Can include metadata (version, capabilities)

Cons:
- ❌ Separate service registry infrastructure
- ❌ Single point of failure
- ❌ Stale data if a service crashes without deregistering
- ❌ Doesn't integrate with Dash0
- ❌ Duplicates what OTLP already provides

Verdict: ❌ Rejected - Reinventing OTLP service discovery.
Open Questions

Q1: Should heartbeat interval be configurable per service?
Answer: Yes, but default to 15 seconds.
- API/webhook: 15s (user-facing, need fast detection)
- Workers: 30s (background processing, less critical)
- Machines: 60s (long-running, stable)
Action: Add heartbeatInterval to TelemetryConfig (optional).
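A quick sketch of what a per-service override might look like with the proposed config fields (the worker service name and values are illustrative):

```typescript
import { createTelemetryClient } from '@emp/telemetry';

// Hypothetical worker configuration: background workers use a 30s interval
// instead of the 15s default, per the guidance above.
const workerTelemetry = createTelemetryClient({
  serviceName: 'emp-worker-gpu0',
  serviceVersion: '1.0.0',
  collectorEndpoint: process.env.OTEL_COLLECTOR_ENDPOINT!,
  environment: process.env.NODE_ENV!,
  enableHeartbeat: true,
  heartbeatInterval: 30_000,
});
```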
Q2: What happens if the collector is down?
Answer: Graceful degradation:
- Telemetry client queues metrics locally (in-memory buffer)
- Retries export with exponential backoff
- Drops oldest metrics if buffer fills (prevents memory leak)
- Service continues functioning normally
Action: Document retry behavior, add monitoring for telemetry export failures.
Q3: Should we send additional health metrics?
Ideas:
- Memory usage
- CPU usage
- Request rate
- Error rate
Answer: Future enhancement, not in initial implementation.
- Heartbeat + uptime are sufficient for service discovery
- System metrics (CPU/memory) can be added via OpenTelemetry Host Metrics (see the sketch below)
- Request/error rates already captured in trace spans
Action: Create follow-up ADR for comprehensive service health metrics.
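If that follow-up is pursued, a hedged sketch of the host-metrics option could look like this; it assumes the globally registered meter provider is the one @emp/telemetry configures, and the instrumentation name is illustrative:

```typescript
import { metrics } from '@opentelemetry/api';
import { HostMetrics } from '@opentelemetry/host-metrics';

// Assumption: @emp/telemetry registers its MeterProvider globally, so the
// host-metrics package can reuse it to report CPU/memory gauges.
const hostMetrics = new HostMetrics({
  meterProvider: metrics.getMeterProvider(),
  name: 'emp-host-metrics',
});
hostMetrics.start();
```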
Q4: How do we handle service restarts?

Behavior:
- Service restarts → `service.uptime_seconds` resets to 0
- Old service instance stops sending heartbeats
- New instance starts sending heartbeats
- Dash0 shows brief gap (15-45 seconds)
Action: Configure Dash0 alerts with 60s threshold to avoid false positives during deploys.
Q5: Does this work for worker containers that scale 0→50→0?

Answer: Yes, perfectly suited for ephemeral workloads:
- Worker starts → heartbeat appears → visible on map
- Worker processes jobs → traces show activity
- Worker stops → heartbeat stops → disappears from map within 60s
- Automatic inventory of active workers
Action: Test with SALAD/vast.ai worker scaling.
Related Documentation
- OpenTelemetry Metrics Concepts
- Asynchronous Gauge Documentation
- Service Map Best Practices
- Dash0 OpenTelemetry Integration
- @emp/telemetry Package
Approval Checklist
Before accepting this ADR:
- [ ] Review technical design with team
- [ ] Validate metric schema with Dash0 documentation
- [ ] Confirm performance overhead is acceptable (< 0.1% CPU)
- [ ] Test heartbeat implementation in local development
- [ ] Verify service map appears correctly in Dash0
- [ ] Document rollout plan and timelines
- [ ] Get sign-off from operations team
Change Log
| Date | Change | Author |
|---|---|---|
| 2025-10-09 | Initial proposal | Claude Code |
