ADR-001: OpenTelemetry Collector Sidecar for Reliable Log Delivery

Status: Proposed
Date: 2025-11-12
Deciders: Architecture Team
Related: how-to-add-telemetry-to-apps.md

Context

Problem Statement

ComfyUI instances running on ephemeral distributed machines (SALAD, vast.ai) occasionally encounter transient network errors when sending logs directly to Dash0:

Transient error StatusCode.UNAVAILABLE encountered while exporting logs to ingress.us-west-2.aws.dash0.com:4317

Current Architecture:

ComfyUI → OTLP Direct → Dash0 Ingress (ingress.us-west-2.aws.dash0.com:4317)

Critical Issues:

  1. Log Loss: When Dash0 ingress is unreachable, logs are dropped permanently
  2. No Retry Buffer: OTLP exporter has limited built-in retry with no persistent queue
  3. Network Instability: Ephemeral machines have unpredictable network conditions
  4. Blocking Operations: Direct export can block ComfyUI if network is slow
  5. No Observability: No visibility into export failures or dropped logs

Impact on Production

  • Job Forensics: Users cannot debug failed jobs if logs are lost
  • Error Patterns: Missing logs prevent accurate error pattern detection
  • SLA Risk: Log loss violates observability requirements
  • User Trust: Incomplete logs undermine confidence in the platform

Why This Happens on Ephemeral Infrastructure

  1. Spot Instances: SALAD/vast.ai machines can have degraded network during preemption
  2. Geographic Distribution: Machines worldwide with varying latency to us-west-2
  3. Shared Network: Contention with other workloads on the same host
  4. DNS Issues: Temporary DNS resolution failures
  5. Firewall/NAT: Dynamic network topology changes

Decision

Deploy an OpenTelemetry Collector as a sidecar container alongside each ComfyUI instance to provide:

  1. Local buffering - Logs written to localhost (fast, reliable)
  2. Automatic retries - Collector handles retry logic with exponential backoff
  3. Persistent queue - Disk-backed queue survives temporary outages
  4. Decoupling - ComfyUI never blocks on slow network operations
  5. Observability - Collector exposes metrics on export success/failure

Target Architecture:

ComfyUI → OTLP Local → Collector (localhost:4317) → Dash0 Ingress
         (fast)         (buffered, retries)      (resilient)

Alternatives Considered

Alternative 1: Increase OTLP Retry Configuration

Rejected: Limited to in-memory buffering, no persistent queue, still blocks on network issues.

Alternative 2: Dual Logging (Redis + OTLP)

Rejected: Adds complexity, requires maintaining custom Redis→Dash0 bridge, wastes resources.

Alternative 3: Centralized Collector

Rejected: Adds network hop, doesn't solve availability issue, single point of failure.

Alternative 4: Accept Log Loss

Rejected: Violates observability requirements, breaks job forensics, poor user experience.

Implementation Plan

Phase 1: Collector Setup (Week 1)

1.1 Add Collector to Docker Compose

File: apps/machine/docker-compose.yml

yaml
services:
  # Existing ComfyUI services...

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-collector
    restart: unless-stopped
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
      - otel-collector-data:/var/lib/otelcol/file_storage
    ports:
      - "4317:4317"  # OTLP gRPC receiver
      - "4318:4318"  # OTLP HTTP receiver
      - "8888:8888"  # Prometheus metrics
      - "13133:13133"  # Health check
    environment:
      - DASH0_AUTH_TOKEN=${DASH0_AUTH_TOKEN}
      - DASH0_ENDPOINT=${DASH0_ENDPOINT:-ingress.us-west-2.aws.dash0.com:4317}
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
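    # NOTE: the upstream contrib image is minimal (no shell, no wget), so this
    # wget-based healthcheck assumes an image that bundles wget; if it fails,
    # replace it with a TCP-level check or drop the healthcheck.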
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:13133/"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  otel-collector-data:
    driver: local

1.2 Create Collector Configuration

File: apps/machine/otel-collector-config.yaml

yaml
# OpenTelemetry Collector Configuration
# Purpose: Buffer and reliably deliver logs from ComfyUI to Dash0

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Limit memory usage to prevent OOM on machines
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # Batch logs for efficiency
  batch:
    send_batch_size: 100
    timeout: 10s
    send_batch_max_size: 200

  # Add resource attributes for debugging
  resource:
    attributes:
      - key: collector.version
        value: 0.91.0
        action: insert
      - key: collector.hostname
        from_attribute: host.name
        action: insert

exporters:
  # Primary exporter: Dash0
  otlp/dash0:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"
    tls:
      insecure: false
    timeout: 30s

    # Retry configuration for transient failures
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 600s  # Retry for up to 10 minutes

    # Persistent queue for reliability
    sending_queue:
      enabled: true
      num_consumers: 5
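      # queue_size counts batches (export requests), not individual log records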
      queue_size: 1000
      storage: file_storage  # references the file_storage extension declared below

  # Logging exporter for debugging (disabled in production)
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

  # NOTE: collector self-metrics (the otelcol_* series used in monitoring below) are
  # already exposed in Prometheus format on 0.0.0.0:8888 via service.telemetry, so no
  # separate prometheus exporter is declared; one bound to the same port would likely conflict.

extensions:
  # Health check endpoint
  health_check:
    endpoint: 0.0.0.0:13133

  # Persistent file storage for queue
  file_storage:
    directory: /var/lib/otelcol/file_storage
    timeout: 10s
    compaction:
      directory: /var/lib/otelcol/file_storage
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10

service:
  extensions: [health_check, file_storage]

  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/dash0]  # Add 'logging' for debugging

  telemetry:
    logs:
      level: info
      initial_fields:
        service: otel-collector
    metrics:
      level: detailed
      address: 0.0.0.0:8888
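
Before pointing ComfyUI at the sidecar, both files can be sanity-checked locally. A hedged sketch: the collector's validate subcommand ships with recent otelcol builds, so fall back to a short-lived docker-compose up otel-collector if this particular image does not support it.

bash
cd apps/machine

# Validate docker-compose syntax without starting anything
docker-compose config --quiet

# Validate the collector config without starting pipelines
# (the 'validate' subcommand is assumed to be available in this build)
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro" \
  -e DASH0_AUTH_TOKEN=dummy \
  -e DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317 \
  otel/opentelemetry-collector-contrib:0.91.0 \
  validate --config=/etc/otelcol-contrib/config.yaml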

Phase 2: Update ComfyUI Configuration (Week 1)

2.1 Update Environment Configuration

File: config/environments/components/telemetry.env

bash
[default]
# BEFORE: Direct to Dash0
# DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317

# AFTER: Local collector (sidecar)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317  # Used by collector

2.2 Update OTEL Integration

File: packages/comfyui/app/otel_integration.py

Update endpoint configuration:

python
import logging
import os

from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter


def init_otel_logging() -> bool:
    # Get endpoint from environment (now points to the local collector sidecar)
    otel_endpoint = os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://otel-collector:4317')

    # Remove the 'http://' prefix; the gRPC exporter takes a host:port target
    if otel_endpoint.startswith('http://'):
        otel_endpoint = otel_endpoint.replace('http://', '', 1)

    logging.info(f"[OTEL] Connecting to collector at: {otel_endpoint}")

    # Configure the exporter to use the local collector
    exporter = OTLPLogExporter(
        endpoint=otel_endpoint,
        insecure=True,  # local sidecar, no TLS needed
        timeout=10,
    )

    # ... remainder of init_otel_logging() (LoggerProvider/processor wiring) unchanged ...
    return True
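
The elided wiring is unchanged by this ADR. For readers unfamiliar with the OpenTelemetry Python SDK, a minimal sketch of what that wiring typically looks like; the attach_exporter helper, service name, and resource attributes are illustrative, not the project's actual code:

python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource


def attach_exporter(exporter) -> bool:
    # Resource attributes are placeholders; real values come from the machine environment
    provider = LoggerProvider(resource=Resource.create({"service.name": "comfyui"}))
    set_logger_provider(provider)

    # BatchLogRecordProcessor buffers records in-process and exports asynchronously,
    # so ComfyUI threads never wait on the (local) network call
    provider.add_log_record_processor(BatchLogRecordProcessor(exporter))

    # Bridge stdlib logging into OpenTelemetry
    handler = LoggingHandler(level=logging.NOTSET, logger_provider=provider)
    logging.getLogger().addHandler(handler)
    return True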

Phase 3: Testing & Validation (Week 1)

3.1 Local Testing

bash
# Start services
cd apps/machine
docker-compose up -d otel-collector comfyui-gpu0

# Verify collector is running
docker logs otel-collector

# Check health
curl http://localhost:13133/

# Check metrics
curl http://localhost:8888/metrics | grep otelcol

# Trigger some logs from ComfyUI
# ... run a job ...

# Verify logs flowing through collector
docker logs otel-collector | grep "Logs received"
docker logs otel-collector | grep "Logs exported"
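
If running a full ComfyUI job is inconvenient for a first smoke test, a few records can be pushed through the same OTLP path with a short script. A sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed on the host; the service name is a placeholder:

python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

# The collector's gRPC receiver is published on the host as localhost:4317
provider = LoggerProvider(resource=Resource.create({"service.name": "otel-smoke-test"}))
set_logger_provider(provider)
provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="localhost:4317", insecure=True))
)

# Route stdlib logging through OpenTelemetry and emit one record
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))
logging.getLogger("smoke-test").error("smoke test: collector pipeline check")

# Flush pending records before exit
provider.shutdown()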

3.2 Failure Simulation

Test 1: Network outage

bash
# Block Dash0 endpoint
sudo iptables -A OUTPUT -d ingress.us-west-2.aws.dash0.com -j DROP

# Run ComfyUI job - logs should queue in collector
# ... run job ...

# Check queue metrics
curl http://localhost:8888/metrics | grep "queue_size"

# Restore network
sudo iptables -D OUTPUT -d ingress.us-west-2.aws.dash0.com -j DROP

# Verify logs delivered
docker logs otel-collector | grep "Logs exported"

Test 2: Collector restart

bash
# Stop collector
docker stop otel-collector

# Run ComfyUI job - should see connection errors
# ... run job ...

# Start collector
docker start otel-collector

# Verify logs delivered (if in retry window)

3.3 Success Criteria

  • [ ] Collector starts successfully with configuration
  • [ ] ComfyUI connects to the collector (otel-collector:4317 on the Docker network; localhost:4317 from the host)
  • [ ] Logs flow through collector to Dash0
  • [ ] Queue persists during Dash0 outage
  • [ ] Logs delivered after outage resolved
  • [ ] Prometheus metrics exposed on :8888
  • [ ] Health check responds on :13133
  • [ ] No performance degradation in ComfyUI
  • [ ] emerge.job_id and emerge.workflow_id preserved

Phase 4: Production Deployment (Week 2)

4.1 Deployment Checklist

Before Deployment:

  • [ ] Test on local-docker environment
  • [ ] Test on staging machine
  • [ ] Verify disk space for queue (1GB allocated)
  • [ ] Update machine deployment scripts
  • [ ] Document troubleshooting procedures
  • [ ] Set up alerting on collector failures

Deployment Steps:

  1. Deploy to 1 test machine (vast.ai or SALAD)
  2. Monitor for 24 hours
  3. Validate log delivery and queue behavior
  4. Deploy to 10% of fleet
  5. Monitor for 48 hours
  6. Full rollout if successful

Rollback Plan:

bash
# Revert to direct Dash0 connection
docker-compose stop otel-collector
# Point OTEL_EXPORTER_OTLP_ENDPOINT back at Dash0 in config/environments/components/telemetry.env
docker-compose up -d

4.2 Monitoring & Alerting

Key Metrics to Monitor:

# Collector health
otelcol_process_uptime
otelcol_process_memory_rss

# Log throughput
otelcol_receiver_accepted_log_records
otelcol_exporter_sent_log_records
otelcol_exporter_send_failed_log_records

# Queue behavior
otelcol_exporter_queue_size
otelcol_exporter_queue_capacity

# Retry behavior
otelcol_exporter_retry_count

Alerts to Configure (example rules are sketched after the list):

  • Collector down for >5 minutes
  • Queue >80% capacity for >15 minutes
  • Export failure rate >10% for >5 minutes
  • Memory usage >400MB
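
Once a Prometheus server scrapes the collector's :8888 endpoint, these thresholds could be expressed as alerting rules. A sketch only; the job label, rule group name, and exact expressions are assumptions to adapt to the actual monitoring stack:

yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorDown
        expr: up{job="otel-collector"} == 0
        for: 5m
      - alert: OtelCollectorQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 15m
      - alert: OtelCollectorExportFailures
        expr: >
          rate(otelcol_exporter_send_failed_log_records[5m])
          / (rate(otelcol_exporter_sent_log_records[5m]) + rate(otelcol_exporter_send_failed_log_records[5m]))
          > 0.10
        for: 5m
      - alert: OtelCollectorHighMemory
        expr: otelcol_process_memory_rss > 400 * 1024 * 1024  # bytes
        for: 10m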

Phase 5: Documentation & Cleanup (Week 2)

5.1 Update Documentation

Files to Update:

  • docs/how-to-add-telemetry-to-apps.md - Add collector setup
  • apps/machine/README.md - Document collector service
  • CLAUDE.md - Update architecture section

New Documentation:

  • docs/troubleshooting-otel-collector.md - Common issues and solutions

5.2 Remove Legacy Code

  • Remove Redis log streaming (if no longer needed)
  • Clean up direct Dash0 configuration
  • Update environment templates

Consequences

Positive

  1. Zero Log Loss: Persistent queue ensures no logs lost during transient failures
  2. Improved Performance: ComfyUI never blocks on slow network operations
  3. Better Observability: Collector metrics provide visibility into export health
  4. Industry Standard: OTLP collector is battle-tested and well-documented
  5. Flexibility: Easy to add additional exporters (e.g., S3 backup, Datadog)
  6. Reduced Complexity: Removes need for custom Redis→Dash0 bridge

Negative

  1. Resource Overhead: ~50-100MB RAM, minimal CPU per machine
  2. Disk Usage: ~1GB for persistent queue per machine
  3. Additional Service: One more service to monitor and maintain
  4. Startup Dependency: ComfyUI should gracefully handle collector not ready
  5. Configuration Complexity: YAML configuration adds complexity

Neutral

  1. No Code Changes: ComfyUI code unchanged, only environment config
  2. No Breaking Changes: Log format and attributes remain identical
  3. Incremental Rollout: Can deploy gradually to minimize risk

Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Collector OOM on high-volume machines | HIGH | LOW | Memory limiter processor, disk-backed queue |
| Queue fills up during extended outage | MEDIUM | MEDIUM | Monitor queue size, alert at 80%, increase capacity if needed |
| Collector crash loses in-flight logs | LOW | LOW | Persistent queue, health checks, auto-restart |
| Performance impact on ComfyUI | MEDIUM | LOW | Localhost communication is fast, load testing |
| Configuration errors break logging | HIGH | MEDIUM | Validation in CI, gradual rollout, rollback plan |

Success Metrics

Week 1 (Post-Deployment):

  • 0 log loss incidents reported
  • <5ms p99 latency for OTLP export to collector
  • <2% export failure rate to Dash0

Month 1:

  • 99.9% log delivery success rate
  • <1% of logs delayed >5 minutes
  • 0 critical incidents due to log loss

Month 3:

  • Complete deprecation of Redis log streaming
  • Collector deployment on 100% of machines
  • Automated monitoring and alerting operational

References

  • Future: ADR-002: Centralized log aggregation for long-term storage
  • Future: ADR-003: Telemetry for API and Worker services
