Skip to content

ADR: OpenTelemetry Semantic Conventions Implementation

Date: 2025-11-13 Status: Proposed Authors: Architecture Team

Context

Our current OpenTelemetry implementation lacks semantic conventions, preventing automatic service map generation and reducing observability effectiveness.

Current Problems

  1. No Semantic Conventions: Using custom attribute names (emp.workflow.id, emp.job.id) instead of standard OpenTelemetry conventions
  2. Missing Network Peer Attribution: No net.peer.name attributes that enable automatic service topology mapping
  3. Inconsistent SpanKind Usage: Not using SpanKind.CLIENT, SERVER, PRODUCER, CONSUMER consistently
  4. Dual Telemetry Systems: Custom WorkflowTelemetryClient alongside official OpenTelemetry SDK
  5. No-op Instrumentation: JobInstrumentation and WorkflowInstrumentation are placeholder stubs

Why This Matters

Without semantic conventions, observability platforms cannot:

  • Automatically generate service topology maps
  • Group related traces across distributed services
  • Detect service-to-service dependencies
  • Correlate database/messaging operations with services

Example of missing conventions:

typescript
// Current (custom)
{
  'emp.job.id': '123',
  'emp.workflow.id': 'abc',
  'service.name': 'api'
}

// Needed for service maps
{
  [SEMATTRS_HTTP_METHOD]: 'POST',
  [SEMATTRS_HTTP_URL]: '/jobs',
  [SEMATTRS_NET_PEER_NAME]: 'redis-cluster',  // Creates edge in service map
  [SEMATTRS_DB_SYSTEM]: 'redis',
  [SEMATTRS_DB_OPERATION]: 'RPUSH'
}

Decision

Adopt OpenTelemetry Semantic Conventions throughout the system with phased implementation.

Service Naming Hierarchy

We use a physical/functional hierarchy that reflects the distributed infrastructure:

Top-Level Services (Flat under emerge namespace)

These are primary APIs and services:

typescript
service.namespace = "emerge"
service.name = "job-api"         // emerge.job-api
service.name = "emprops-api"     // emerge.emprops-api
service.name = "webhook-service" // emerge.webhook-service
service.name = "monitor"         // emerge.monitor

Physical Hierarchy (Machine → Worker → Services)

For ephemeral machines running distributed workers:

typescript
// Machine level
service.namespace = "emerge"
service.name = "machine.{machine-id}"  // emerge.machine.salad-xyz-123

// Worker running on machine
service.namespace = "emerge.machine.{machine-id}"
service.name = "worker"  // emerge.machine.salad-xyz-123.worker

// ComfyUI service running on machine
service.namespace = "emerge.machine.{machine-id}"
service.name = "comfyui"  // emerge.machine.salad-xyz-123.comfyui

External Services (No namespace)

typescript
service.name = "redis-cluster"  // No namespace
service.name = "postgres"       // No namespace

Rationale:

  • Flat top-level APIs are easy to discover and filter
  • Physical hierarchy shows which worker/ComfyUI belongs to which machine
  • Clear ownership and debugging (trace from API → specific machine → worker → ComfyUI)
  • Namespace filtering enables "show all services on machine X"

Implementation Plan

Phase 1: Foundation - Semantic Helpers Package

Create packages/otel-helpers/ with standard attribute builders:

typescript
// packages/otel-helpers/src/attributes.ts
import {
  SEMATTRS_HTTP_METHOD,
  SEMATTRS_HTTP_URL,
  SEMATTRS_NET_PEER_NAME,
  SEMATTRS_DB_SYSTEM,
  SEMATTRS_DB_OPERATION,
  SEMATTRS_MESSAGING_SYSTEM,
  SEMATTRS_MESSAGING_DESTINATION,
  SEMATTRS_MESSAGING_OPERATION,
} from '@opentelemetry/semantic-conventions';

/**
 * HTTP client attributes for service map generation
 */
export function httpClientAttributes(
  method: string,
  url: string,
  targetService: string,
  statusCode?: number
) {
  const attrs: Record<string, string | number> = {
    [SEMATTRS_HTTP_METHOD]: method,
    [SEMATTRS_HTTP_URL]: url,
    [SEMATTRS_NET_PEER_NAME]: targetService, // KEY for service map edges
  };
  if (statusCode) attrs[SEMATTRS_HTTP_STATUS_CODE] = statusCode;
  return attrs;
}

/**
 * Redis operation attributes
 */
export function redisAttributes(
  operation: string,
  key?: string,
  targetService: string = 'redis-cluster'
) {
  return {
    [SEMATTRS_DB_SYSTEM]: 'redis',
    [SEMATTRS_DB_OPERATION]: operation,
    [SEMATTRS_NET_PEER_NAME]: targetService, // Service map edge
    ...(key && { [SEMATTRS_DB_STATEMENT]: `${operation} ${key}` })
  };
}

/**
 * Messaging attributes for job queues
 */
export function messagingAttributes(
  operation: 'publish' | 'receive' | 'process',
  destination: string,
  system: string = 'redis'
) {
  return {
    [SEMATTRS_MESSAGING_SYSTEM]: system,
    [SEMATTRS_MESSAGING_DESTINATION]: destination,
    [SEMATTRS_MESSAGING_OPERATION]: operation,
    [SEMATTRS_NET_PEER_NAME]: `${system}-queue`, // Service map node
  };
}

/**
 * EMP domain attributes (supplement semantic conventions)
 */
export function empDomainAttributes(context: {
  jobId?: string;
  workflowId?: string;
  machineId?: string;
  workerId?: string;
  userId?: string;
}) {
  const attrs: Record<string, string> = {};
  if (context.jobId) attrs['job.id'] = context.jobId;
  if (context.workflowId) attrs['workflow.id'] = context.workflowId;
  if (context.machineId) attrs['machine.id'] = context.machineId;
  if (context.workerId) attrs['worker.id'] = context.workerId;
  if (context.userId) attrs['user.id'] = context.userId;
  return attrs;
}

Phase 2: Enhanced EmpTelemetryClient

Add semantic-aware methods to @emp/telemetry:

typescript
// packages/telemetry/src/index.ts (additions)
export class EmpTelemetryClient {
  /**
   * Execute HTTP client call with proper service attribution
   */
  async withHttpCall<T>(
    method: string,
    url: string,
    targetService: string,
    fn: (span: Span) => Promise<T>
  ): Promise<T> {
    return this.withSpan(
      `HTTP ${method}`,
      async (span) => {
        const result = await fn(span);
        const statusCode = (result as any)?.status || 200;
        span.setAttributes(httpClientAttributes(method, url, targetService, statusCode));
        return result;
      },
      {
        kind: SpanKind.CLIENT, // Critical for service map
        attributes: httpClientAttributes(method, url, targetService)
      }
    );
  }

  /**
   * Execute Redis operation with proper service attribution
   */
  async withRedisOperation<T>(
    operation: string,
    key: string,
    fn: (span: Span) => Promise<T>
  ): Promise<T> {
    return this.withSpan(
      `Redis ${operation}`,
      fn,
      {
        kind: SpanKind.CLIENT,
        attributes: redisAttributes(operation, key)
      }
    );
  }

  /**
   * Publish message to queue with proper messaging semantics
   */
  async withQueuePublish<T>(
    queueName: string,
    fn: (span: Span) => Promise<T>
  ): Promise<T> {
    return this.withSpan(
      `Publish to ${queueName}`,
      fn,
      {
        kind: SpanKind.PRODUCER,
        attributes: messagingAttributes('publish', queueName)
      }
    );
  }

  /**
   * Consume message from queue
   */
  async withQueueConsume<T>(
    queueName: string,
    fn: (span: Span) => Promise<T>
  ): Promise<T> {
    return this.withSpan(
      `Consume from ${queueName}`,
      fn,
      {
        kind: SpanKind.CONSUMER,
        attributes: messagingAttributes('receive', queueName)
      }
    );
  }
}

Phase 3: Migration Examples

Before (custom telemetry):

typescript
const workflow = workflowClient.startWorkflow('job.submission', {
  jobId: job.id,
  'emp.workflow.id': workflowId
});

After (semantic conventions):

typescript
await telemetryClient.withHttpCall(
  'POST',
  '/api/jobs',
  'job-api',
  async (span) => {
    span.setAttributes(empDomainAttributes({
      jobId: job.id,
      workflowId
    }));

    // Nested Redis operation
    await telemetryClient.withRedisOperation('RPUSH', 'job-queue', async () => {
      await redis.rpush('job-queue', JSON.stringify(job));
    });
  }
);

Expected Service Map

After implementation with hierarchical naming:

┌──────────────────────┐
│  emerge.monitor      │
└──────────┬───────────┘
           │ WebSocket (CLIENT)

┌──────────────────────┐
│  emerge.job-api      │
└──────────┬───────────┘
           │ Redis RPUSH (CLIENT)

┌──────────────────────┐
│  redis-cluster       │
└──────────┬───────────┘
           │ Worker Claim (CONSUMER)

┌──────────────────────────────────────────────┐
│  emerge.machine.salad-123.worker             │
└──────────┬───────────────────────────────────┘
           │ HTTP POST (CLIENT)

┌──────────────────────────────────────────────┐
│  emerge.machine.salad-123.comfyui            │
└──────────────────────────────────────────────┘

Benefits:

  • Service map shows exact machine topology
  • Can filter to see all services on a specific machine
  • Clear ownership: worker and ComfyUI belong to specific machine instance

Consequences

Positive

  1. Automatic Service Maps: No manual configuration needed
  2. Standard Dashboards: Works with any OpenTelemetry backend
  3. Better Debugging: Proper trace context propagation
  4. Industry Standards: Easier onboarding for new developers
  5. Reduced Code: Delete ~200 lines of custom telemetry

Negative

  1. Migration Effort: Requires refactoring across services
  2. Breaking Changes: Custom attributes need mapping
  3. Learning Curve: Team learns semantic conventions

Risks & Mitigation

  1. Incomplete Migration: Use feature flags, gradual rollout
  2. Lost Context: Preserve domain attributes alongside semantic ones
  3. Performance Impact: Monitor span volume, implement sampling

Implementation Timeline

  1. Week 1: Create packages/otel-helpers with attribute builders
  2. Week 2: Enhance EmpTelemetryClient with semantic methods
  3. Week 3: Migrate API server job submission
  4. Week 4: Migrate worker job processing
  5. Week 5: Remove custom telemetry code
  6. Week 6: Documentation and training

Success Criteria

  • [ ] Service map automatically shows: Monitor → API → Redis → Worker → ComfyUI
  • [ ] All HTTP calls use semantic net.peer.name attributes
  • [ ] All Redis operations use semantic db.system attributes
  • [ ] All queue operations use semantic messaging.* attributes
  • [ ] Custom telemetry code reduced by >80%

References

Released under the MIT License.