ADR: OpenTelemetry Semantic Conventions Implementation
Date: 2025-11-13 Status: Proposed Authors: Architecture Team
Context
Our current OpenTelemetry implementation lacks semantic conventions, preventing automatic service map generation and reducing observability effectiveness.
Current Problems
- No Semantic Conventions: Using custom attribute names (
emp.workflow.id,emp.job.id) instead of standard OpenTelemetry conventions - Missing Network Peer Attribution: No
net.peer.nameattributes that enable automatic service topology mapping - Inconsistent SpanKind Usage: Not using
SpanKind.CLIENT,SERVER,PRODUCER,CONSUMERconsistently - Dual Telemetry Systems: Custom
WorkflowTelemetryClientalongside official OpenTelemetry SDK - No-op Instrumentation:
JobInstrumentationandWorkflowInstrumentationare placeholder stubs
Why This Matters
Without semantic conventions, observability platforms cannot:
- Automatically generate service topology maps
- Group related traces across distributed services
- Detect service-to-service dependencies
- Correlate database/messaging operations with services
Example of missing conventions:
// Current (custom)
{
'emp.job.id': '123',
'emp.workflow.id': 'abc',
'service.name': 'api'
}
// Needed for service maps
{
[SEMATTRS_HTTP_METHOD]: 'POST',
[SEMATTRS_HTTP_URL]: '/jobs',
[SEMATTRS_NET_PEER_NAME]: 'redis-cluster', // Creates edge in service map
[SEMATTRS_DB_SYSTEM]: 'redis',
[SEMATTRS_DB_OPERATION]: 'RPUSH'
}Decision
Adopt OpenTelemetry Semantic Conventions throughout the system with phased implementation.
Service Naming Hierarchy
We use a physical/functional hierarchy that reflects the distributed infrastructure:
Top-Level Services (Flat under emerge namespace)
These are primary APIs and services:
service.namespace = "emerge"
service.name = "job-api" // emerge.job-api
service.name = "emprops-api" // emerge.emprops-api
service.name = "webhook-service" // emerge.webhook-service
service.name = "monitor" // emerge.monitorPhysical Hierarchy (Machine → Worker → Services)
For ephemeral machines running distributed workers:
// Machine level
service.namespace = "emerge"
service.name = "machine.{machine-id}" // emerge.machine.salad-xyz-123
// Worker running on machine
service.namespace = "emerge.machine.{machine-id}"
service.name = "worker" // emerge.machine.salad-xyz-123.worker
// ComfyUI service running on machine
service.namespace = "emerge.machine.{machine-id}"
service.name = "comfyui" // emerge.machine.salad-xyz-123.comfyuiExternal Services (No namespace)
service.name = "redis-cluster" // No namespace
service.name = "postgres" // No namespaceRationale:
- Flat top-level APIs are easy to discover and filter
- Physical hierarchy shows which worker/ComfyUI belongs to which machine
- Clear ownership and debugging (trace from API → specific machine → worker → ComfyUI)
- Namespace filtering enables "show all services on machine X"
Implementation Plan
Phase 1: Foundation - Semantic Helpers Package
Create packages/otel-helpers/ with standard attribute builders:
// packages/otel-helpers/src/attributes.ts
import {
SEMATTRS_HTTP_METHOD,
SEMATTRS_HTTP_URL,
SEMATTRS_NET_PEER_NAME,
SEMATTRS_DB_SYSTEM,
SEMATTRS_DB_OPERATION,
SEMATTRS_MESSAGING_SYSTEM,
SEMATTRS_MESSAGING_DESTINATION,
SEMATTRS_MESSAGING_OPERATION,
} from '@opentelemetry/semantic-conventions';
/**
* HTTP client attributes for service map generation
*/
export function httpClientAttributes(
method: string,
url: string,
targetService: string,
statusCode?: number
) {
const attrs: Record<string, string | number> = {
[SEMATTRS_HTTP_METHOD]: method,
[SEMATTRS_HTTP_URL]: url,
[SEMATTRS_NET_PEER_NAME]: targetService, // KEY for service map edges
};
if (statusCode) attrs[SEMATTRS_HTTP_STATUS_CODE] = statusCode;
return attrs;
}
/**
* Redis operation attributes
*/
export function redisAttributes(
operation: string,
key?: string,
targetService: string = 'redis-cluster'
) {
return {
[SEMATTRS_DB_SYSTEM]: 'redis',
[SEMATTRS_DB_OPERATION]: operation,
[SEMATTRS_NET_PEER_NAME]: targetService, // Service map edge
...(key && { [SEMATTRS_DB_STATEMENT]: `${operation} ${key}` })
};
}
/**
* Messaging attributes for job queues
*/
export function messagingAttributes(
operation: 'publish' | 'receive' | 'process',
destination: string,
system: string = 'redis'
) {
return {
[SEMATTRS_MESSAGING_SYSTEM]: system,
[SEMATTRS_MESSAGING_DESTINATION]: destination,
[SEMATTRS_MESSAGING_OPERATION]: operation,
[SEMATTRS_NET_PEER_NAME]: `${system}-queue`, // Service map node
};
}
/**
* EMP domain attributes (supplement semantic conventions)
*/
export function empDomainAttributes(context: {
jobId?: string;
workflowId?: string;
machineId?: string;
workerId?: string;
userId?: string;
}) {
const attrs: Record<string, string> = {};
if (context.jobId) attrs['job.id'] = context.jobId;
if (context.workflowId) attrs['workflow.id'] = context.workflowId;
if (context.machineId) attrs['machine.id'] = context.machineId;
if (context.workerId) attrs['worker.id'] = context.workerId;
if (context.userId) attrs['user.id'] = context.userId;
return attrs;
}Phase 2: Enhanced EmpTelemetryClient
Add semantic-aware methods to @emp/telemetry:
// packages/telemetry/src/index.ts (additions)
export class EmpTelemetryClient {
/**
* Execute HTTP client call with proper service attribution
*/
async withHttpCall<T>(
method: string,
url: string,
targetService: string,
fn: (span: Span) => Promise<T>
): Promise<T> {
return this.withSpan(
`HTTP ${method}`,
async (span) => {
const result = await fn(span);
const statusCode = (result as any)?.status || 200;
span.setAttributes(httpClientAttributes(method, url, targetService, statusCode));
return result;
},
{
kind: SpanKind.CLIENT, // Critical for service map
attributes: httpClientAttributes(method, url, targetService)
}
);
}
/**
* Execute Redis operation with proper service attribution
*/
async withRedisOperation<T>(
operation: string,
key: string,
fn: (span: Span) => Promise<T>
): Promise<T> {
return this.withSpan(
`Redis ${operation}`,
fn,
{
kind: SpanKind.CLIENT,
attributes: redisAttributes(operation, key)
}
);
}
/**
* Publish message to queue with proper messaging semantics
*/
async withQueuePublish<T>(
queueName: string,
fn: (span: Span) => Promise<T>
): Promise<T> {
return this.withSpan(
`Publish to ${queueName}`,
fn,
{
kind: SpanKind.PRODUCER,
attributes: messagingAttributes('publish', queueName)
}
);
}
/**
* Consume message from queue
*/
async withQueueConsume<T>(
queueName: string,
fn: (span: Span) => Promise<T>
): Promise<T> {
return this.withSpan(
`Consume from ${queueName}`,
fn,
{
kind: SpanKind.CONSUMER,
attributes: messagingAttributes('receive', queueName)
}
);
}
}Phase 3: Migration Examples
Before (custom telemetry):
const workflow = workflowClient.startWorkflow('job.submission', {
jobId: job.id,
'emp.workflow.id': workflowId
});After (semantic conventions):
await telemetryClient.withHttpCall(
'POST',
'/api/jobs',
'job-api',
async (span) => {
span.setAttributes(empDomainAttributes({
jobId: job.id,
workflowId
}));
// Nested Redis operation
await telemetryClient.withRedisOperation('RPUSH', 'job-queue', async () => {
await redis.rpush('job-queue', JSON.stringify(job));
});
}
);Expected Service Map
After implementation with hierarchical naming:
┌──────────────────────┐
│ emerge.monitor │
└──────────┬───────────┘
│ WebSocket (CLIENT)
▼
┌──────────────────────┐
│ emerge.job-api │
└──────────┬───────────┘
│ Redis RPUSH (CLIENT)
▼
┌──────────────────────┐
│ redis-cluster │
└──────────┬───────────┘
│ Worker Claim (CONSUMER)
▼
┌──────────────────────────────────────────────┐
│ emerge.machine.salad-123.worker │
└──────────┬───────────────────────────────────┘
│ HTTP POST (CLIENT)
▼
┌──────────────────────────────────────────────┐
│ emerge.machine.salad-123.comfyui │
└──────────────────────────────────────────────┘Benefits:
- Service map shows exact machine topology
- Can filter to see all services on a specific machine
- Clear ownership: worker and ComfyUI belong to specific machine instance
Consequences
Positive
- Automatic Service Maps: No manual configuration needed
- Standard Dashboards: Works with any OpenTelemetry backend
- Better Debugging: Proper trace context propagation
- Industry Standards: Easier onboarding for new developers
- Reduced Code: Delete ~200 lines of custom telemetry
Negative
- Migration Effort: Requires refactoring across services
- Breaking Changes: Custom attributes need mapping
- Learning Curve: Team learns semantic conventions
Risks & Mitigation
- Incomplete Migration: Use feature flags, gradual rollout
- Lost Context: Preserve domain attributes alongside semantic ones
- Performance Impact: Monitor span volume, implement sampling
Implementation Timeline
- Week 1: Create
packages/otel-helperswith attribute builders - Week 2: Enhance
EmpTelemetryClientwith semantic methods - Week 3: Migrate API server job submission
- Week 4: Migrate worker job processing
- Week 5: Remove custom telemetry code
- Week 6: Documentation and training
Success Criteria
- [ ] Service map automatically shows: Monitor → API → Redis → Worker → ComfyUI
- [ ] All HTTP calls use semantic
net.peer.nameattributes - [ ] All Redis operations use semantic
db.systemattributes - [ ] All queue operations use semantic
messaging.*attributes - [ ] Custom telemetry code reduced by >80%
