ADR-001: OpenTelemetry Collector Sidecar for Reliable Log Delivery

Status: Proposed
Date: 2025-11-12
Deciders: Architecture Team
Related: how-to-add-telemetry-to-apps.md

Context

Problem Statement

ComfyUI instances running on ephemeral distributed machines (SALAD, vast.ai) occasionally encounter transient network errors when sending logs directly to Dash0:

Transient error StatusCode.UNAVAILABLE encountered while exporting logs to ingress.us-west-2.aws.dash0.com:4317

Current Architecture:

ComfyUI → OTLP Direct → Dash0 Ingress (ingress.us-west-2.aws.dash0.com:4317)

Critical Issues:

  1. Log Loss: When Dash0 ingress is unreachable, logs are dropped permanently
  2. No Retry Buffer: OTLP exporter has limited built-in retry with no persistent queue
  3. Network Instability: Ephemeral machines have unpredictable network conditions
  4. Blocking Operations: Direct export can block ComfyUI if network is slow
  5. No Observability: No visibility into export failures or dropped logs

Impact on Production

  • Job Forensics: Users cannot debug failed jobs if logs are lost
  • Error Patterns: Missing logs prevent accurate error pattern detection
  • SLA Risk: Log loss violates observability requirements
  • User Trust: Incomplete logs undermine confidence in the platform

Why This Happens on Ephemeral Infrastructure

  1. Spot Instances: SALAD/vast.ai machines can have degraded network during preemption
  2. Geographic Distribution: Machines worldwide with varying latency to us-west-2
  3. Shared Network: Contention with other workloads on the same host
  4. DNS Issues: Temporary DNS resolution failures
  5. Firewall/NAT: Dynamic network topology changes

Decision

Deploy an OpenTelemetry Collector as a sidecar container alongside each ComfyUI instance to provide:

  1. Local buffering - Logs written to localhost (fast, reliable)
  2. Automatic retries - Collector handles retry logic with exponential backoff
  3. Persistent queue - Disk-backed queue survives temporary outages
  4. Decoupling - ComfyUI never blocks on slow network operations
  5. Observability - Collector exposes metrics on export success/failure

Target Architecture:

ComfyUI → OTLP Local → Collector (localhost:4317) → Dash0 Ingress
         (fast)         (buffered, retries)      (resilient)

Alternatives Considered

Alternative 1: Increase OTLP Retry Configuration

Rejected: Limited to in-memory buffering, no persistent queue, still blocks on network issues.

Alternative 2: Dual Logging (Redis + OTLP)

Rejected: Adds complexity, requires maintaining custom Redis→Dash0 bridge, wastes resources.

Alternative 3: Centralized Collector

Rejected: Adds network hop, doesn't solve availability issue, single point of failure.

Alternative 4: Accept Log Loss

Rejected: Violates observability requirements, breaks job forensics, poor user experience.

Implementation Plan

Phase 1: Collector Setup (Week 1)

1.1 Add Collector to Docker Compose

File: apps/machine/docker-compose.yml

yaml
services:
  # Existing ComfyUI services...

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-collector
    restart: unless-stopped
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
      - otel-collector-data:/var/lib/otelcol/file_storage
    ports:
      - "4317:4317"  # OTLP gRPC receiver
      - "4318:4318"  # OTLP HTTP receiver
      - "8888:8888"  # Prometheus metrics
      - "13133:13133"  # Health check
    environment:
      - DASH0_AUTH_TOKEN=${DASH0_AUTH_TOKEN}
      - DASH0_ENDPOINT=${DASH0_ENDPOINT:-ingress.us-west-2.aws.dash0.com:4317}
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
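    # NOTE: the upstream contrib image is minimal (no shell, no wget), so this
    # wget-based healthcheck assumes an image that bundles wget; if it fails,
    # replace it with a TCP-level check or drop the healthcheck.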
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:13133/"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  otel-collector-data:
    driver: local

1.2 Create Collector Configuration

File: apps/machine/otel-collector-config.yaml

yaml
# OpenTelemetry Collector Configuration
# Purpose: Buffer and reliably deliver logs from ComfyUI to Dash0

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Limit memory usage to prevent OOM on machines
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # Batch logs for efficiency
  batch:
    send_batch_size: 100
    timeout: 10s
    send_batch_max_size: 200

  # Add resource attributes for debugging
  resource:
    attributes:
      - key: collector.version
        value: 0.91.0
        action: insert
      - key: collector.hostname
        from_attribute: host.name
        action: insert

exporters:
  # Primary exporter: Dash0
  otlp/dash0:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"
    tls:
      insecure: false
    timeout: 30s

    # Retry configuration for transient failures
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 600s  # Retry for up to 10 minutes

    # Persistent queue for reliability
    sending_queue:
      enabled: true
      num_consumers: 5
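      # queue_size counts batches (export requests), not individual log records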
      queue_size: 1000
      storage: file_storage  # references the file_storage extension declared below

  # Logging exporter for debugging (disabled in production)
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

  # NOTE: collector self-metrics (the otelcol_* series used in monitoring below) are
  # already exposed in Prometheus format on 0.0.0.0:8888 via service.telemetry, so no
  # separate prometheus exporter is declared; one bound to the same port would likely conflict.

extensions:
  # Health check endpoint
  health_check:
    endpoint: 0.0.0.0:13133

  # Persistent file storage for queue
  file_storage:
    directory: /var/lib/otelcol/file_storage
    timeout: 10s
    compaction:
      directory: /var/lib/otelcol/file_storage
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10

service:
  extensions: [health_check, file_storage]

  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/dash0]  # Add 'logging' for debugging

  telemetry:
    logs:
      level: info
      initial_fields:
        service: otel-collector
    metrics:
      level: detailed
      address: 0.0.0.0:8888
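
Before pointing ComfyUI at the sidecar, both files can be sanity-checked locally. A hedged sketch: the collector's validate subcommand ships with recent otelcol builds, so fall back to a short-lived docker-compose up otel-collector if this particular image does not support it.

bash
cd apps/machine

# Validate docker-compose syntax without starting anything
docker-compose config --quiet

# Validate the collector config without starting pipelines
# (the 'validate' subcommand is assumed to be available in this build)
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro" \
  -e DASH0_AUTH_TOKEN=dummy \
  -e DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317 \
  otel/opentelemetry-collector-contrib:0.91.0 \
  validate --config=/etc/otelcol-contrib/config.yaml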

Phase 2: Update ComfyUI Configuration (Week 1)

2.1 Update Environment Configuration

File: config/environments/components/telemetry.env

bash
[default]
# BEFORE: Direct to Dash0
# DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317

# AFTER: Local collector (sidecar)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317  # Used by collector

2.2 Update OTEL Integration

File: packages/comfyui/app/otel_integration.py

Update endpoint configuration:

python
import logging
import os

from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter


def init_otel_logging() -> bool:
    # Get endpoint from environment (now points to the local collector sidecar)
    otel_endpoint = os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://otel-collector:4317')

    # Remove the 'http://' prefix; the gRPC exporter takes a host:port target
    if otel_endpoint.startswith('http://'):
        otel_endpoint = otel_endpoint.replace('http://', '', 1)

    logging.info(f"[OTEL] Connecting to collector at: {otel_endpoint}")

    # Configure the exporter to use the local collector
    exporter = OTLPLogExporter(
        endpoint=otel_endpoint,
        insecure=True,  # local sidecar, no TLS needed
        timeout=10,
    )

    # ... remainder of init_otel_logging() (LoggerProvider/processor wiring) unchanged ...
    return True
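
The elided wiring is unchanged by this ADR. For readers unfamiliar with the OpenTelemetry Python SDK, a minimal sketch of what that wiring typically looks like; the attach_exporter helper, service name, and resource attributes are illustrative, not the project's actual code:

python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource


def attach_exporter(exporter) -> bool:
    # Resource attributes are placeholders; real values come from the machine environment
    provider = LoggerProvider(resource=Resource.create({"service.name": "comfyui"}))
    set_logger_provider(provider)

    # BatchLogRecordProcessor buffers records in-process and exports asynchronously,
    # so ComfyUI threads never wait on the (local) network call
    provider.add_log_record_processor(BatchLogRecordProcessor(exporter))

    # Bridge stdlib logging into OpenTelemetry
    handler = LoggingHandler(level=logging.NOTSET, logger_provider=provider)
    logging.getLogger().addHandler(handler)
    return True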

Phase 3: Testing & Validation (Week 1)

3.1 Local Testing

bash
# Start services
cd apps/machine
docker-compose up -d otel-collector comfyui-gpu0

# Verify collector is running
docker logs otel-collector

# Check health
curl http://localhost:13133/

# Check metrics
curl http://localhost:8888/metrics | grep otelcol

# Trigger some logs from ComfyUI
# ... run a job ...

# Verify logs flowing through collector
docker logs otel-collector | grep "Logs received"
docker logs otel-collector | grep "Logs exported"
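
If running a full ComfyUI job is inconvenient for a first smoke test, a few records can be pushed through the same OTLP path with a short script. A sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed on the host; the service name is a placeholder:

python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

# The collector's gRPC receiver is published on the host as localhost:4317
provider = LoggerProvider(resource=Resource.create({"service.name": "otel-smoke-test"}))
set_logger_provider(provider)
provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="localhost:4317", insecure=True))
)

# Route stdlib logging through OpenTelemetry and emit one record
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))
logging.getLogger("smoke-test").error("smoke test: collector pipeline check")

# Flush pending records before exit
provider.shutdown()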

3.2 Failure Simulation

Test 1: Network outage

bash
# Block Dash0 endpoint
sudo iptables -A OUTPUT -d ingress.us-west-2.aws.dash0.com -j DROP

# Run ComfyUI job - logs should queue in collector
# ... run job ...

# Check queue metrics
curl http://localhost:8888/metrics | grep "queue_size"

# Restore network
sudo iptables -D OUTPUT -d ingress.us-west-2.aws.dash0.com -j DROP

# Verify logs delivered
docker logs otel-collector | grep "Logs exported"

Test 2: Collector restart

bash
# Stop collector
docker stop otel-collector

# Run ComfyUI job - should see connection errors
# ... run job ...

# Start collector
docker start otel-collector

# Verify logs delivered (if in retry window)

3.3 Success Criteria

  • [ ] Collector starts successfully with configuration
  • [ ] ComfyUI connects to the collector (otel-collector:4317 on the Docker network; localhost:4317 from the host)
  • [ ] Logs flow through collector to Dash0
  • [ ] Queue persists during Dash0 outage
  • [ ] Logs delivered after outage resolved
  • [ ] Prometheus metrics exposed on :8888
  • [ ] Health check responds on :13133
  • [ ] No performance degradation in ComfyUI
  • [ ] emerge.job_id and emerge.workflow_id preserved

Phase 4: Production Deployment (Week 2)

4.1 Deployment Checklist

Before Deployment:

  • [ ] Test on local-docker environment
  • [ ] Test on staging machine
  • [ ] Verify disk space for queue (1GB allocated)
  • [ ] Update machine deployment scripts
  • [ ] Document troubleshooting procedures
  • [ ] Set up alerting on collector failures

Deployment Steps:

  1. Deploy to 1 test machine (vast.ai or SALAD)
  2. Monitor for 24 hours
  3. Validate log delivery and queue behavior
  4. Deploy to 10% of fleet
  5. Monitor for 48 hours
  6. Full rollout if successful

Rollback Plan:

bash
# Revert to direct Dash0 connection
docker-compose stop otel-collector
# Point OTEL_EXPORTER_OTLP_ENDPOINT back at Dash0 in config/environments/components/telemetry.env
docker-compose up -d

4.2 Monitoring & Alerting

Key Metrics to Monitor:

# Collector health
otelcol_process_uptime
otelcol_process_memory_rss

# Log throughput
otelcol_receiver_accepted_log_records
otelcol_exporter_sent_log_records
otelcol_exporter_send_failed_log_records

# Queue behavior
otelcol_exporter_queue_size
otelcol_exporter_queue_capacity

# Retry behavior
otelcol_exporter_retry_count

Alerts to Configure (example rules are sketched after the list):

  • Collector down for >5 minutes
  • Queue >80% capacity for >15 minutes
  • Export failure rate >10% for >5 minutes
  • Memory usage >400MB
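
Once a Prometheus server scrapes the collector's :8888 endpoint, these thresholds could be expressed as alerting rules. A sketch only; the job label, rule group name, and exact expressions are assumptions to adapt to the actual monitoring stack:

yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorDown
        expr: up{job="otel-collector"} == 0
        for: 5m
      - alert: OtelCollectorQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 15m
      - alert: OtelCollectorExportFailures
        expr: >
          rate(otelcol_exporter_send_failed_log_records[5m])
          / (rate(otelcol_exporter_sent_log_records[5m]) + rate(otelcol_exporter_send_failed_log_records[5m]))
          > 0.10
        for: 5m
      - alert: OtelCollectorHighMemory
        expr: otelcol_process_memory_rss > 400 * 1024 * 1024  # bytes
        for: 10m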

Phase 5: Documentation & Cleanup (Week 2)

5.1 Update Documentation

Files to Update:

  • docs/how-to-add-telemetry-to-apps.md - Add collector setup
  • apps/machine/README.md - Document collector service
  • CLAUDE.md - Update architecture section

New Documentation:

  • docs/troubleshooting-otel-collector.md - Common issues and solutions

5.2 Remove Legacy Code

  • Remove Redis log streaming (if no longer needed)
  • Clean up direct Dash0 configuration
  • Update environment templates

Consequences

Positive

  1. Zero Log Loss: Persistent queue ensures no logs lost during transient failures
  2. Improved Performance: ComfyUI never blocks on slow network operations
  3. Better Observability: Collector metrics provide visibility into export health
  4. Industry Standard: OTLP collector is battle-tested and well-documented
  5. Flexibility: Easy to add additional exporters (e.g., S3 backup, Datadog)
  6. Reduced Complexity: Removes need for custom Redis→Dash0 bridge

Negative

  1. Resource Overhead: ~50-100MB RAM, minimal CPU per machine
  2. Disk Usage: ~1GB for persistent queue per machine
  3. Additional Service: One more service to monitor and maintain
  4. Startup Dependency: ComfyUI should gracefully handle collector not ready
  5. Configuration Complexity: YAML configuration adds complexity

Neutral

  1. No Code Changes: ComfyUI code unchanged, only environment config
  2. No Breaking Changes: Log format and attributes remain identical
  3. Incremental Rollout: Can deploy gradually to minimize risk

Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Collector OOM on high-volume machines | HIGH | LOW | Memory limiter processor, disk-backed queue |
| Queue fills up during extended outage | MEDIUM | MEDIUM | Monitor queue size, alert at 80%, increase capacity if needed |
| Collector crash loses in-flight logs | LOW | LOW | Persistent queue, health checks, auto-restart |
| Performance impact on ComfyUI | MEDIUM | LOW | Localhost communication is fast, load testing |
| Configuration errors break logging | HIGH | MEDIUM | Validation in CI, gradual rollout, rollback plan |

Success Metrics

Week 1 (Post-Deployment):

  • 0 log loss incidents reported
  • <5ms p99 latency for OTLP export to collector
  • <2% export failure rate to Dash0

Month 1:

  • 99.9% log delivery success rate
  • <1% of logs delayed >5 minutes
  • 0 critical incidents due to log loss

Month 3:

  • Complete deprecation of Redis log streaming
  • Collector deployment on 100% of machines
  • Automated monitoring and alerting operational

References

  • Future: ADR-002: Centralized log aggregation for long-term storage
  • Future: ADR-003: Telemetry for API and Worker services
