ADR-003: Service Heartbeat and Liveness Telemetry ​

Date: 2025-10-09
Status: πŸ€” Proposed
Decision Makers: Engineering Team
Approval Required: Before universal implementation across services
Related ADRs: None


Executive Summary ​

This ADR proposes a standardized service heartbeat and liveness telemetry system using OpenTelemetry metrics to enable automatic service discovery, uptime monitoring, and service map visualization in Dash0.

Current Gap:

  • ❌ No automatic service discovery - Services exist but aren't visible until they process requests
  • ❌ No uptime tracking - Can't distinguish between "idle" and "down" services
  • ❌ No service map - No visual topology of running services and their dependencies
  • ❌ Delayed failure detection - Service crashes only noticed when users report issues

Proposed Solution:

  • 🎯 Periodic heartbeat metrics (every 15 seconds) sent via OpenTelemetry
  • πŸ—ΊοΈ Automatic service map population using service.name resource attributes
  • ⏱️ Uptime tracking with asynchronous gauge metrics
  • πŸ“Š Dash0 visualization showing live/dead services in real-time

Impact:

  • Before: Services are invisible until used, failures discovered reactively
  • After: All services visible immediately on startup, proactive failure detection

Table of Contents ​

  1. Context
  2. Problem Statement
  3. Decision
  4. Technical Design
  5. Implementation Pattern
  6. Service Map Architecture
  7. Dash0 Integration
  8. Rollout Strategy
  9. Consequences
  10. Alternatives Considered

Context ​

Current State ​

Existing Telemetry Infrastructure:

βœ… @emp/telemetry package with EmpTelemetryClient
βœ… OTLP gRPC collector (port 4317)
βœ… Collector β†’ Dash0 export pipeline
βœ… Trace-based instrumentation in API, webhook-service
βœ… service.name resource attributes configured

Current Service Discovery Model:

Service starts
  ↓
No signal sent (invisible)
  ↓
First request arrives
  ↓
Trace created β†’ Dash0
  ↓
Service appears on map (REACTIVE)

Problem:

  • Services are "dark" until they receive traffic
  • Idle services appear as "down" (indistinguishable from crashed services)
  • No way to track service uptime or availability
  • Service map incomplete without active traffic

OpenTelemetry Research Findings ​

Service Discovery via Traces: OpenTelemetry platforms automatically build service maps from trace data by:

  1. Analyzing parent-child span relationships
  2. Using service.name resource attributes to identify services
  3. Inferring service dependencies from request flows
  4. Building real-time topology graphs

Heartbeat Patterns: From OpenTelemetry best practices and industry research:

  1. Asynchronous Gauge Metrics (Recommended for periodic values)

    • Designed for values that aren't measured continuously
    • Perfect for periodic health checks (every 5-60 seconds)
    • Low overhead - callback-based, no active polling
  2. Health Check Extension (Collector-level)

    • Provides HTTP endpoint for liveness/readiness checks
    • Used for monitoring the collector itself
    • Not suitable for application-level service discovery
  3. Custom Metrics with PeriodicExportingMetricReader

    • Send metrics on fixed intervals (15-60 seconds recommended)
    • Includes uptime, process stats, custom health indicators
    • Standard OpenTelemetry pattern for service monitoring

Service Map in Dash0 ​

Service maps in OpenTelemetry-compatible platforms (Dash0, Grafana Tempo, Jaeger) work by:

  • Span Analysis: Parent-child relationships show request flows
  • Service Identification: service.name attribute identifies each service
  • Dependency Inference: Service A β†’ Service B connections from spans
  • Real-time Updates: Maps update as new telemetry arrives

Key Insight: Service maps are automatically generated from telemetry data. No manual registration required - just send signals with proper resource attributes.


Problem Statement ​

User Story ​

As a platform operator, I want to see all running services on a service map and know their health status, so I can detect failures proactively and understand system topology at a glance.

Requirements ​

Functional:

  • βœ… Services appear on Dash0 service map within 15 seconds of startup
  • βœ… Service uptime tracked and visible in metrics
  • βœ… Service liveness status (alive/dead) determinable from recent signals
  • βœ… Service dependencies inferred from actual request traces

Non-Functional:

  • βœ… Minimal overhead (< 0.1% CPU, < 5MB memory per service)
  • βœ… Works with existing OTLP collector infrastructure
  • βœ… No changes to Dash0 configuration required
  • βœ… Graceful degradation if collector is temporarily unavailable

Decision ​

We will implement a standardized service heartbeat system using OpenTelemetry Asynchronous Gauge metrics, sent every 15 seconds to the OTLP collector.

Core Components ​

  1. Heartbeat Metric (service.heartbeat)

    • Type: Asynchronous Gauge
    • Value: Always 1 (alive indicator)
    • Attributes: service.name, service.version, environment
    • Frequency: Every 15 seconds
  2. Uptime Metric (service.uptime_seconds)

    • Type: Asynchronous Gauge
    • Value: Seconds since service started
    • Attributes: Same as heartbeat
    • Frequency: Every 15 seconds (same callback)
  3. Resource Attributes (Already configured)

    • service.name: e.g., "emp-api", "emp-webhook-service"
    • service.version: From package.json or env var
    • service.namespace: Optional (e.g., "production", "staging")
    • deployment.environment: From NODE_ENV

Why Asynchronous Gauge? ​

OpenTelemetry documentation states:

"Use Asynchronous Gauge to collect metrics that you can't (or don't want) to monitor continuously, for example, if you only want to check a server's CPU usage every five minutes."

This perfectly matches our use case:

  • βœ… Periodic signal (not continuous monitoring)
  • βœ… Callback-based (no polling overhead)
  • βœ… Automatic export with configured interval
  • βœ… Standard OpenTelemetry pattern
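
For illustration, here is a minimal sketch of the callback-based pattern using the @opentelemetry/api surface directly (names are illustrative; the actual wiring lives in @emp/telemetry and is shown in the Implementation Pattern section):

typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('heartbeat-example');

// Registered once at startup; the SDK invokes the callback on each export cycle,
// so nothing polls in between.
const heartbeat = meter.createObservableGauge('service.heartbeat', { unit: '1' });
heartbeat.addCallback((result) => {
  result.observe(1, { 'service.status': 'running' });
});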

Technical Design ​

Architecture Diagram ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Service (emp-api)                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚         EmpTelemetryClient (initialized)               β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚  β”‚  β”‚  PeriodicExportingMetricReader (15s interval)    β”‚  β”‚ β”‚
β”‚  β”‚  β”‚    ↓                                              β”‚  β”‚ β”‚
β”‚  β”‚  β”‚  Asynchronous Gauge Callback                     β”‚  β”‚ β”‚
β”‚  β”‚  β”‚    - service.heartbeat = 1                       β”‚  β”‚ β”‚
β”‚  β”‚  β”‚    - service.uptime_seconds = (now - startTime)  β”‚  β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                          ↓ gRPC (every 15s)                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            OTLP Collector (localhost:4317)                  β”‚
β”‚                          ↓                                   β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚              β”‚  Metrics Pipeline     β”‚                       β”‚
β”‚              β”‚  - Batch processor    β”‚                       β”‚
β”‚              β”‚  - Attribute enrichmentβ”‚                      β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           ↓ OTLP/gRPC
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Dash0 (SaaS)                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚              Service Map Visualization                  β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚ β”‚
β”‚  β”‚  β”‚ emp-api  │────→│ webhook  │────→│ machines β”‚       β”‚ β”‚
β”‚  β”‚  β”‚  (live)  β”‚     β”‚  (live)  β”‚     β”‚  (live)  β”‚       β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚ β”‚
β”‚  β”‚                                                          β”‚ β”‚
β”‚  β”‚  Uptime Tracking:                                       β”‚ β”‚
β”‚  β”‚  - emp-api: 3h 24m (last heartbeat: 12s ago)          β”‚ β”‚
β”‚  β”‚  - webhook: 1h 15m (last heartbeat: 10s ago)          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Service Lifecycle ​

Service Startup
  ↓
Initialize EmpTelemetryClient
  ↓
Register heartbeat gauge callback
  ↓
Start service (Express, Fastify, etc.)
  ↓
[Every 15 seconds]
  ↓
Callback invoked
  ↓
Metrics exported to collector
  ↓
Dash0 receives metrics
  ↓
Service appears on map (within 15s)

Implementation Pattern ​

Standard Service Initialization ​

typescript
// apps/api/src/index.ts (EXAMPLE - already has telemetryClient)

import { createTelemetryClient } from '@emp/telemetry';

// Service start time for uptime calculation
const serviceStartTime = Date.now();

// Initialize telemetry with heartbeat
const telemetryClient = createTelemetryClient({
  serviceName: 'emp-api',
  serviceVersion: '1.0.0',
  collectorEndpoint: process.env.OTEL_COLLECTOR_ENDPOINT!,
  environment: process.env.NODE_ENV!,
  enableHeartbeat: true,  // NEW: Enable automatic heartbeat
  heartbeatInterval: 15000 // NEW: 15 seconds (optional, default)
});

// The telemetry client will automatically:
// 1. Register asynchronous gauge for heartbeat
// 2. Register asynchronous gauge for uptime
// 3. Export metrics every 15 seconds via PeriodicExportingMetricReader
// 4. Include service.name resource attributes automatically

Telemetry Client Enhancement ​

typescript
// packages/telemetry/src/index.ts (CHANGES NEEDED)

export interface TelemetryConfig {
  serviceName: string;
  serviceVersion: string;
  collectorEndpoint: string;
  environment: string;
  enableHeartbeat?: boolean;      // NEW: Default true
  heartbeatInterval?: number;     // NEW: Default 15000ms
}

export class EmpTelemetryClient {
  private serviceStartTime: number;

  constructor(config: TelemetryConfig) {
    this.serviceStartTime = Date.now();

    // ... existing initialization ...

    // NEW: Setup heartbeat if enabled
    if (config.enableHeartbeat !== false) {
      this.setupHeartbeat(config.heartbeatInterval || 15000);
    }
  }

  private setupHeartbeat(interval: number) {
    // The export cadence itself is governed by the PeriodicExportingMetricReader's
    // exportIntervalMillis (see Metric Reader Configuration below).
    const meter = this.meterProvider.getMeter('emp-heartbeat');

    // Heartbeat gauge - always 1 when alive
    const heartbeatGauge = meter.createObservableGauge('service.heartbeat', {
      description: 'Service liveness indicator (1 = alive)',
      unit: '1'
    });

    // Uptime gauge - seconds since start
    const uptimeGauge = meter.createObservableGauge('service.uptime_seconds', {
      description: 'Service uptime in seconds',
      unit: 's'
    });

    // Register batch callback for both metrics
    meter.addBatchObservableCallback(
      async (observableResult) => {
        const uptime = (Date.now() - this.serviceStartTime) / 1000;

        observableResult.observe(heartbeatGauge, 1, {
          'service.status': 'running'
        });

        observableResult.observe(uptimeGauge, uptime, {
          'service.status': 'running'
        });
      },
      [heartbeatGauge, uptimeGauge]
    );

    logger.info(`Service heartbeat enabled (interval: ${interval}ms)`);
  }
}

Metric Reader Configuration ​

The PeriodicExportingMetricReader already configured in @emp/telemetry will handle:

  • Automatic metric collection every exportIntervalMillis (15s recommended)
  • Batching and export to OTLP collector
  • Resource attribute injection (service.name, etc.)
  • Retry logic for failed exports

No additional configuration needed - just register the callbacks.
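
For reference, a rough sketch of the kind of reader wiring assumed to already exist inside @emp/telemetry (constructor options differ between SDK versions; older sdk-metrics releases attach the reader via meterProvider.addMetricReader(reader) instead of the readers option):

typescript
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { Resource } from '@opentelemetry/resources';

const meterProvider = new MeterProvider({
  // Resource attributes identify the service on the Dash0 map
  resource: new Resource({
    'service.name': 'emp-api',
    'service.version': '1.0.0',
    'deployment.environment': process.env.NODE_ENV ?? 'development',
  }),
  readers: [
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: 'http://localhost:4317' }),
      exportIntervalMillis: 15_000, // matches the 15s heartbeat cadence
    }),
  ],
});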


Service Map Architecture ​

How Service Maps Work ​

OpenTelemetry service maps are automatically generated from telemetry data:

  1. Service Identification

    • Each service has unique service.name resource attribute
    • Example: emp-api, emp-webhook-service, emp-worker-gpu0
  2. Dependency Detection (from traces)

    • Parent span (service A) β†’ Child span (service B) = A calls B
    • HTTP client spans show outbound requests
    • Server spans show inbound requests
    • Example: emp-api β†’ emp-webhook-service (HTTP POST /webhook)
  3. Service Presence (from metrics/traces)

    • Service sends ANY telemetry β†’ appears on map
    • Heartbeat metrics provide guaranteed presence even when idle
    • Last signal timestamp determines alive/dead status
  4. Topology Visualization

    • Dash0 builds graph from service relationships
    • Nodes = services (sized by traffic volume)
    • Edges = dependencies (sized by request count)
    • Color = health status (green=healthy, red=errors, gray=idle)

Example Service Map Flow ​

Startup: emp-api sends heartbeat
  ↓
Dash0: Creates node "emp-api" (green, idle)
  ↓
User request: POST /jobs
  ↓
emp-api creates trace with parent span
  ↓
emp-api calls Redis (child span with service.name="redis")
  ↓
Dash0: Creates node "redis", edge "emp-api β†’ redis"
  ↓
15s later: emp-api sends heartbeat again
  ↓
Dash0: Updates "emp-api" last-seen timestamp

Result: Service map shows emp-api and redis, even during idle periods.
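
The "emp-api β†’ redis" edge in that flow comes from span relationships rather than from the heartbeat. As a hedged illustration (span name and structure are made up), the outbound Redis call might be wrapped in a CLIENT span like this:

typescript
import { trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('emp-api');

// CLIENT span around the outbound Redis call; Dash0 uses this span and its
// parent-child relationships to draw the "emp-api β†’ redis" edge.
await tracer.startActiveSpan('redis GET', { kind: SpanKind.CLIENT }, async (span) => {
  try {
    // ... issue the Redis command here ...
  } finally {
    span.end();
  }
});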


Dash0 Integration ​

Metric Visualization ​

Dash0 will automatically receive and display:

  1. Service Inventory

    • All services with service.heartbeat = 1 in last 60 seconds
    • Service name, version, environment visible in resource attributes
    • Uptime displayed from service.uptime_seconds
  2. Service Map

    • Nodes auto-created from service.name attributes
    • Edges auto-created from trace parent-child relationships
    • Heartbeat ensures nodes persist even without traffic
  3. Alerts (can be configured in Dash0)

    • Alert if service.heartbeat not received for > 45 seconds (3x interval)
    • Alert if service.uptime_seconds drops (service restart)
    • Alert if error rate > threshold (from trace spans)

Query Examples ​

promql
# Check which services are alive (last 60s)
service.heartbeat{service.name=~"emp-.*"} > 0

# Get service uptime
service.uptime_seconds{service.name="emp-api"}

# Detect service restarts (uptime drops back toward zero)
delta(service.uptime_seconds{service.name="emp-api"}[5m]) < 0

Rollout Strategy ​

Phase 1: API Service (Week 1) ​

Goal: Prove heartbeat concept and validate Dash0 visualization

  • βœ… Enhance @emp/telemetry with heartbeat support
  • βœ… Enable heartbeat in emp-api service
  • βœ… Verify service appears on Dash0 map within 15s of startup
  • βœ… Confirm uptime metric accuracy
  • βœ… Test failure detection (kill service, verify alert)

Success Criteria:

  • emp-api visible on Dash0 map immediately on startup
  • Uptime metric matches actual service runtime
  • Service disappears from "live" list within 60s of crash

Phase 2: Core Services (Week 2) ​

Goal: Extend to all production services

Services to instrument:

  • βœ… emp-webhook-service
  • βœ… emp-telemetry-collector (self-monitoring)
  • βœ… emp-worker (redis-direct-worker)
  • βœ… emp-machines (basic-machine supervisor)

Success Criteria:

  • Full service topology visible in Dash0
  • Dependencies correctly inferred from traces
  • All services show accurate uptime

Phase 3: Specialized Services (Week 3) ​

Goal: Complete coverage

  • βœ… emp-monitor (internal dashboard)
  • βœ… emp-emprops-api (legacy API)
  • βœ… Any future services

Success Criteria:

  • 100% service coverage
  • Standardized heartbeat pattern across all services
  • Documentation updated with implementation guide

Consequences ​

Positive ​

βœ… Proactive Failure Detection

  • Services visible immediately on startup
  • Crashes detected within 45 seconds (3x heartbeat interval)
  • No reliance on user reports

βœ… Service Map Visualization

  • Automatic topology mapping
  • Real-time dependency graph
  • Easier debugging of distributed system issues

βœ… Uptime Tracking

  • Historical uptime data in Dash0
  • Service reliability metrics
  • Restart detection and alerting

βœ… Operational Visibility

  • Know which services are running at any time
  • Distinguish idle vs. crashed services
  • Service inventory for capacity planning

βœ… Standard OpenTelemetry Pattern

  • Industry best practice
  • Well-documented approach
  • Compatible with any OTLP backend (not locked to Dash0)

βœ… Minimal Overhead

  • Callback-based (no polling threads)
  • 15-second interval (low frequency)
  • Estimated < 0.1% CPU, < 5MB memory per service

Negative ​

⚠️ Slightly Increased Telemetry Volume

  • ~4 heartbeat exports per minute per service (each export carries both the heartbeat and uptime gauges)
  • With 10 services: 40 exports/min = 57,600/day
  • Mitigation: Metrics are small (just gauge value + attributes), negligible cost

⚠️ False Positives During Deploys

  • Service restart = brief "dead" period (15-45 seconds)
  • Mitigation: Configure Dash0 alerts with 60s grace period
  • Mitigation: Blue-green deployment minimizes downtime

⚠️ Dependency on Collector

  • Heartbeat requires collector availability
  • Mitigation: Telemetry client has retry logic and queuing
  • Mitigation: Service functions normally even if collector is down

Neutral ​

➑️ New Metric Type

  • Adds asynchronous gauges to telemetry stack
  • Developers must understand gauge vs. counter semantics
  • Action: Update telemetry documentation

➑️ Dash0 Configuration

  • May need to configure custom dashboards for heartbeat metrics
  • Alert rules for service liveness
  • Action: Provide Dash0 dashboard templates

Alternatives Considered ​

Alternative 1: Trace-Only Service Discovery ​

Approach: Rely solely on traces (spans) for service discovery, no heartbeat metrics.

Pros:

  • βœ… No additional implementation needed
  • βœ… Zero metric overhead
  • βœ… Services appear on map from actual traffic

Cons:

  • ❌ Services invisible until first request
  • ❌ Idle services appear "dead"
  • ❌ No uptime tracking
  • ❌ Slow failure detection (only noticed when requests fail)

Verdict: ❌ Rejected - Reactive discovery insufficient for operational needs.


Alternative 2: HTTP Health Check Endpoints ​

Approach: Each service exposes /health endpoint, external monitoring system polls it.

Pros:

  • βœ… Simple implementation (Express middleware)
  • βœ… Standard pattern (Kubernetes liveness probes)
  • βœ… Works without telemetry infrastructure

Cons:

  • ❌ Separate infrastructure needed (Prometheus, Datadog agent, etc.)
  • ❌ Duplicate observability stack (OTLP + HTTP polling)
  • ❌ No automatic service map integration
  • ❌ Additional network overhead (HTTP requests)
  • ❌ Doesn't integrate with Dash0

Verdict: ❌ Rejected - Adds complexity, doesn't leverage existing OTLP pipeline.


Alternative 3: Log-Based Liveness ​

Approach: Services write "heartbeat" log lines every 15s, log aggregator detects liveness.

Pros:

  • βœ… Uses existing logging infrastructure
  • βœ… Easy to implement (setInterval + logger.info)

Cons:

  • ❌ Logs != metrics (wrong abstraction)
  • ❌ Expensive to query at scale (grep through logs)
  • ❌ No structured data for dashboards
  • ❌ Doesn't populate service map
  • ❌ Log volume increases significantly

Verdict: ❌ Rejected - Logs are for diagnostics, not operational metrics.


Alternative 4: Synchronous Counter (Anti-Pattern) ​

Approach: Use synchronous counter incremented every 15s via setInterval.

Pros:

  • βœ… Simple implementation

Cons:

  • ❌ Wrong metric type - Counters are for cumulative values, not periodic signals
  • ❌ Requires active polling (setInterval thread)
  • ❌ Not idiomatic OpenTelemetry
  • ❌ Confusing semantics (counter should always increase)

Verdict: ❌ Rejected - Violates OpenTelemetry best practices.


Alternative 5: Manual Service Registration ​

Approach: Services POST to /register endpoint on startup, separate service maintains registry.

Pros:

  • βœ… Explicit service inventory
  • βœ… Can include metadata (version, capabilities)

Cons:

  • ❌ Separate service registry infrastructure
  • ❌ Single point of failure
  • ❌ Stale data if service crashes without deregistering
  • ❌ Doesn't integrate with Dash0
  • ❌ Duplicate of what OTLP already provides

Verdict: ❌ Rejected - Reinventing OTLP service discovery.


Open Questions ​

Q1: Should heartbeat interval be configurable per service? ​

Answer: Yes, but default to 15 seconds.

  • API/webhook: 15s (user-facing, need fast detection)
  • Workers: 30s (background processing, less critical)
  • Machines: 60s (long-running, stable)

Action: Add heartbeatInterval to TelemetryConfig (optional).
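
For example, a background worker could opt into the slower cadence through the proposed config field (sketch reusing the heartbeatInterval option proposed in the Implementation Pattern section):

typescript
import { createTelemetryClient } from '@emp/telemetry';

const workerTelemetry = createTelemetryClient({
  serviceName: 'emp-worker',
  serviceVersion: '1.0.0',
  collectorEndpoint: process.env.OTEL_COLLECTOR_ENDPOINT!,
  environment: process.env.NODE_ENV!,
  enableHeartbeat: true,
  heartbeatInterval: 30_000 // workers: 30s, per the guidance above
});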


Q2: What happens if the collector is down? ​

Answer: Graceful degradation:

  1. Telemetry client queues metrics locally (in-memory buffer)
  2. Retries export with exponential backoff
  3. Drops oldest metrics if buffer fills (prevents memory leak)
  4. Service continues functioning normally

Action: Document retry behavior, add monitoring for telemetry export failures.


Q3: Should we send additional health metrics? ​

Ideas:

  • Memory usage
  • CPU usage
  • Request rate
  • Error rate

Answer: Future enhancement, not in initial implementation.

  • Heartbeat + uptime are sufficient for service discovery
  • System metrics (CPU/memory) can be added via OpenTelemetry Host Metrics
  • Request/error rates already captured in trace spans

Action: Create follow-up ADR for comprehensive service health metrics.


Q4: How do we handle service restarts? ​

Behavior:

  • Service restarts β†’ service.uptime_seconds resets to 0
  • Old service instance stops sending heartbeats
  • New instance starts sending heartbeats
  • Dash0 shows brief gap (15-45 seconds)

Action: Configure Dash0 alerts with 60s threshold to avoid false positives during deploys.


Q5: Does this work for worker containers that scale 0β†’50β†’0? ​

Answer: Yes, perfectly suited for ephemeral workloads:

  • Worker starts β†’ heartbeat appears β†’ visible on map
  • Worker processes jobs β†’ traces show activity
  • Worker stops β†’ heartbeat stops β†’ disappears from map within 60s
  • Automatic inventory of active workers

Action: Test with SALAD/vast.ai worker scaling.



Approval Checklist ​

Before accepting this ADR:

  • [ ] Review technical design with team
  • [ ] Validate metric schema with Dash0 documentation
  • [ ] Confirm performance overhead is acceptable (< 0.1% CPU)
  • [ ] Test heartbeat implementation in local development
  • [ ] Verify service map appears correctly in Dash0
  • [ ] Document rollout plan and timelines
  • [ ] Get sign-off from operations team

Change Log ​

Date         Change            Author
2025-10-09   Initial proposal  Claude Code
