ADR-001: OpenTelemetry Collector Sidecar for Reliable Log Delivery
Status: Proposed
Date: 2025-11-12
Deciders: Architecture Team
Related: how-to-add-telemetry-to-apps.md
Context
Problem Statement
ComfyUI instances running on ephemeral distributed machines (SALAD, vast.ai) occasionally encounter transient network errors when sending logs directly to Dash0:
```
Transient error StatusCode.UNAVAILABLE encountered while exporting logs to ingress.us-west-2.aws.dash0.com:4317
```
Current Architecture:
```
ComfyUI → OTLP Direct → Dash0 Ingress (ingress.us-west-2.aws.dash0.com:4317)
```
Critical Issues:
- Log Loss: When Dash0 ingress is unreachable, logs are dropped permanently
- No Retry Buffer: OTLP exporter has limited built-in retry with no persistent queue
- Network Instability: Ephemeral machines have unpredictable network conditions
- Blocking Operations: Direct export can block ComfyUI if network is slow
- No Observability: No visibility into export failures or dropped logs
Impact on Production
- Job Forensics: Users cannot debug failed jobs if logs are lost
- Error Patterns: Missing logs prevent accurate error pattern detection
- SLA Risk: Log loss violates observability requirements
- User Trust: Incomplete logs undermine confidence in the platform
Why This Happens on Ephemeral Infrastructure
- Spot Instances: SALAD/vast.ai machines can have degraded network during preemption
- Geographic Distribution: Machines worldwide with varying latency to us-west-2
- Shared Network: Contention with other workloads on the same host
- DNS Issues: Temporary DNS resolution failures
- Firewall/NAT: Dynamic network topology changes
Decision
Deploy an OpenTelemetry Collector as a sidecar container alongside each ComfyUI instance to provide:
- Local buffering - Logs written to localhost (fast, reliable)
- Automatic retries - Collector handles retry logic with exponential backoff
- Persistent queue - Disk-backed queue survives temporary outages
- Decoupling - ComfyUI never blocks on slow network operations
- Observability - Collector exposes metrics on export success/failure
Target Architecture:
```
ComfyUI → OTLP Local → Collector (localhost:4317) → Dash0 Ingress
           (fast)       (buffered, retries)          (resilient)
```
Alternatives Considered
Alternative 1: Increase OTLP Retry Configuration
Rejected: Limited to in-memory buffering, no persistent queue, still blocks on network issues.
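For reference, this alternative roughly amounts to tuning the SDK-side batch processor, which keeps the entire buffer in process memory. A minimal sketch, assuming the standard opentelemetry-python APIs; the queue sizes are illustrative and auth headers are omitted:

```python
# Alternative 1 in practice: a larger in-memory queue and longer timeouts on the
# SDK's batch processor. The buffer still dies with the process (or spot preemption)
# and exports still go over a flaky WAN link, which is why this option was rejected.
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor

processor = BatchLogRecordProcessor(
    OTLPLogExporter(
        endpoint="ingress.us-west-2.aws.dash0.com:4317",  # still a direct WAN hop
        timeout=30,                                       # auth headers omitted for brevity
    ),
    max_queue_size=8192,            # in-memory only; lost on crash or preemption
    schedule_delay_millis=5000,
    max_export_batch_size=512,
    export_timeout_millis=30000,
)
```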
Alternative 2: Dual Logging (Redis + OTLP)
Rejected: Adds complexity, requires maintaining custom Redis→Dash0 bridge, wastes resources.
Alternative 3: Centralized Collector
Rejected: Adds network hop, doesn't solve availability issue, single point of failure.
Alternative 4: Accept Log Loss
Rejected: Violates observability requirements, breaks job forensics, poor user experience.
Implementation Plan
Phase 1: Collector Setup (Week 1)
1.1 Add Collector to Docker Compose
File: apps/machine/docker-compose.yml
```yaml
services:
  # Existing ComfyUI services...

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-collector
    restart: unless-stopped
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
      - otel-collector-data:/var/lib/otelcol/file_storage
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8888:8888"   # Prometheus metrics
      - "13133:13133" # Health check
    environment:
      - DASH0_AUTH_TOKEN=${DASH0_AUTH_TOKEN}
      - DASH0_ENDPOINT=${DASH0_ENDPOINT:-ingress.us-west-2.aws.dash0.com:4317}
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:13133/"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  otel-collector-data:
    driver: local
```
1.2 Create Collector Configuration
File: apps/machine/otel-collector-config.yaml
```yaml
# OpenTelemetry Collector Configuration
# Purpose: Buffer and reliably deliver logs from ComfyUI to Dash0

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Limit memory usage to prevent OOM on machines
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # Batch logs for efficiency
  batch:
    send_batch_size: 100
    timeout: 10s
    send_batch_max_size: 200

  # Add resource attributes for debugging
  resource:
    attributes:
      - key: collector.version
        value: 0.91.0
        action: insert
      - key: collector.hostname
        from_attribute: host.name
        action: insert

exporters:
  # Primary exporter: Dash0
  otlp/dash0:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"
    tls:
      insecure: false
    timeout: 30s
    # Retry configuration for transient failures
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 600s # Retry for up to 10 minutes
    # Persistent queue for reliability
    sending_queue:
      enabled: true
      num_consumers: 5
      queue_size: 1000
      storage: file_storage # Must match the file_storage extension ID declared below

  # Logging exporter for debugging (disabled in production)
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

  # Prometheus metrics exporter
  prometheus:
    endpoint: 0.0.0.0:8888
    namespace: otelcol

extensions:
  # Health check endpoint
  health_check:
    endpoint: 0.0.0.0:13133

  # Persistent file storage for queue
  file_storage:
    directory: /var/lib/otelcol/file_storage
    timeout: 10s
    compaction:
      directory: /var/lib/otelcol/file_storage
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 100
      rebound_trigger_threshold_mib: 10

service:
  extensions: [health_check, file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/dash0] # Add 'logging' for debugging
  telemetry:
    logs:
      level: info
      initial_fields:
        service: otel-collector
    metrics:
      level: detailed
      address: 0.0.0.0:8888
```
Phase 2: Update ComfyUI Configuration (Week 1)
2.1 Update Environment Configuration
File: config/environments/components/telemetry.env
```ini
[default]
# BEFORE: Direct to Dash0
# DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317

# AFTER: Local collector (sidecar)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317  # Used by the collector
```
2.2 Update OTEL Integration
File: packages/comfyui/app/otel_integration.py
Update endpoint configuration:
```python
def init_otel_logging() -> bool:
    # Get endpoint from environment (now points to local collector)
    otel_endpoint = os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://otel-collector:4317')

    # Remove 'http://' prefix for the gRPC endpoint
    if otel_endpoint.startswith('http://'):
        otel_endpoint = otel_endpoint.replace('http://', '')

    logging.info(f"[OTEL] Connecting to collector at: {otel_endpoint}")

    # Configure exporter to use local collector
    exporter = OTLPLogExporter(
        endpoint=otel_endpoint,
        insecure=True,  # Local collector doesn't need TLS
        timeout=10,
    )
```
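For context, the rest of the function typically wires this exporter into the SDK's logger provider and the standard logging module. A sketch assuming the stock opentelemetry-python APIs; the real otel_integration.py may differ in details such as the service name:

```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

def _wire_exporter(exporter: OTLPLogExporter) -> None:
    # Register a logger provider that batches records and ships them to the
    # local collector; "comfyui" is a placeholder service name.
    provider = LoggerProvider(resource=Resource.create({"service.name": "comfyui"}))
    provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
    set_logger_provider(provider)

    # Route stdlib logging through OTLP; extra={} fields become log attributes.
    logging.getLogger().addHandler(LoggingHandler(level=logging.INFO, logger_provider=provider))
```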
Phase 3: Testing & Validation (Week 1)
3.1 Local Testing
```bash
# Start services
cd apps/machine
docker-compose up -d otel-collector comfyui-gpu0

# Verify collector is running
docker logs otel-collector

# Check health
curl http://localhost:13133/

# Check metrics
curl http://localhost:8888/metrics | grep otelcol

# Trigger some logs from ComfyUI
# ... run a job ...

# Verify logs flowing through collector
docker logs otel-collector | grep "Logs received"
docker logs otel-collector | grep "Logs exported"
```
3.2 Failure Simulation
Test 1: Network outage
```bash
# Block Dash0 endpoint
sudo iptables -A OUTPUT -d ingress.us-west-2.aws.dash0.com -j DROP

# Run ComfyUI job - logs should queue in collector
# ... run job ...

# Check queue metrics
curl http://localhost:8888/metrics | grep "queue_size"

# Restore network
sudo iptables -D OUTPUT -d ingress.us-west-2.aws.dash0.com -j DROP

# Verify logs delivered
docker logs otel-collector | grep "Logs exported"
```
Test 2: Collector restart
```bash
# Stop collector
docker stop otel-collector

# Run ComfyUI job - should see connection errors
# ... run job ...

# Start collector
docker start otel-collector

# Verify logs delivered (if in retry window)
```
3.3 Success Criteria
- [ ] Collector starts successfully with configuration
- [ ] ComfyUI connects to collector (localhost:4317)
- [ ] Logs flow through collector to Dash0
- [ ] Queue persists during Dash0 outage
- [ ] Logs delivered after outage resolved
- [ ] Prometheus metrics exposed on :8888
- [ ] Health check responds on :13133
- [ ] No performance degradation in ComfyUI
- [ ] emerge.job_id and emerge.workflow_id preserved (spot-check sketch below)
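One way to spot-check the last criterion is to emit a single log record that carries the emerge.* attributes through the standard logging integration and confirm they arrive in Dash0 (or in the collector's logging exporter output when it is enabled). A sketch, assuming the LoggingHandler from 2.2 is installed; the IDs are placeholders:

```python
import logging

# The OTel LoggingHandler maps extra={} fields to OTLP log attributes, so these
# should appear on the exported record unchanged.
logging.getLogger("emerge.smoke_test").info(
    "otel collector smoke test",
    extra={"emerge.job_id": "job-smoke-001", "emerge.workflow_id": "wf-smoke-001"},
)
```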
Phase 4: Production Deployment (Week 2)
4.1 Deployment Checklist
Before Deployment:
- [ ] Test on local-docker environment
- [ ] Test on staging machine
- [ ] Verify disk space for queue (1GB allocated)
- [ ] Update machine deployment scripts
- [ ] Document troubleshooting procedures
- [ ] Set up alerting on collector failures
Deployment Steps:
- Deploy to 1 test machine (vast.ai or SALAD)
- Monitor for 24 hours
- Validate log delivery and queue behavior
- Deploy to 10% of fleet
- Monitor for 48 hours
- Full rollout if successful
Rollback Plan:
```bash
# Revert to direct Dash0 connection
docker-compose down otel-collector

# Update OTEL_EXPORTER_OTLP_ENDPOINT back to Dash0
docker-compose up -d
```
4.2 Monitoring & Alerting
Key Metrics to Monitor:
```
# Collector health
otelcol_process_uptime
otelcol_process_memory_rss

# Log throughput
otelcol_receiver_accepted_log_records
otelcol_exporter_sent_log_records
otelcol_exporter_send_failed_log_records

# Queue behavior
otelcol_exporter_queue_size
otelcol_exporter_queue_capacity

# Retry behavior
otelcol_exporter_retry_count
```
Alerts to Configure (example rules sketched after this list):
- Collector down for >5 minutes
- Queue >80% capacity for >15 minutes
- Export failure rate >10% for >5 minutes
- Memory usage >400MB
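A sketch of matching Prometheus alerting rules, assuming the otelcol_* metric names listed in 4.2 and a scrape job named otel-collector; alert names, labels, and exact expressions should be adapted to the actual monitoring stack:

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorDown
        expr: up{job="otel-collector"} == 0
        for: 5m
      - alert: OtelCollectorQueueNearFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 15m
      - alert: OtelCollectorExportFailures
        expr: >
          sum(rate(otelcol_exporter_send_failed_log_records[5m]))
          / (sum(rate(otelcol_exporter_sent_log_records[5m]))
             + sum(rate(otelcol_exporter_send_failed_log_records[5m]))) > 0.10
        for: 5m
      - alert: OtelCollectorHighMemory
        expr: otelcol_process_memory_rss > 400 * 1024 * 1024
        for: 5m
```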
Phase 5: Documentation & Cleanup (Week 2)
5.1 Update Documentation
Files to Update:
- docs/how-to-add-telemetry-to-apps.md - Add collector setup
- apps/machine/README.md - Document collector service
- CLAUDE.md - Update architecture section
New Documentation:
- docs/troubleshooting-otel-collector.md - Common issues and solutions
5.2 Remove Legacy Code
- Remove Redis log streaming (if no longer needed)
- Clean up direct Dash0 configuration
- Update environment templates
Consequences
Positive
- Zero Log Loss: Persistent queue ensures no logs are lost during transient failures (up to queue capacity)
- Improved Performance: ComfyUI never blocks on slow network operations
- Better Observability: Collector metrics provide visibility into export health
- Industry Standard: OTLP collector is battle-tested and well-documented
- Flexibility: Easy to add additional exporters (e.g., S3 backup, Datadog)
- Reduced Complexity: Removes need for custom Redis→Dash0 bridge
Negative
- Resource Overhead: ~50-100MB RAM, minimal CPU per machine
- Disk Usage: ~1GB for persistent queue per machine
- Additional Service: One more service to monitor and maintain
- Startup Dependency: ComfyUI must gracefully handle the collector not being ready at startup (see the sketch after this list)
- Configuration Complexity: YAML configuration adds complexity
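For the startup dependency noted above, a minimal approach is to treat telemetry initialisation as best-effort: the gRPC exporter connects lazily and retries at export time, so ComfyUI only needs to avoid crashing if setup fails outright. A sketch, assuming init_otel_logging() from Phase 2.2; the import path and fallback behaviour are illustrative:

```python
import logging

from otel_integration import init_otel_logging  # import path assumed; function shown in Phase 2.2

def init_telemetry_best_effort() -> None:
    # Never let telemetry problems block ComfyUI startup: if the collector is not
    # ready yet, the batch exporter retries in the background; if initialisation
    # itself fails, fall back to plain stderr logging.
    try:
        if not init_otel_logging():
            raise RuntimeError("init_otel_logging() returned False")
    except Exception as exc:
        logging.basicConfig(level=logging.INFO)
        logging.getLogger(__name__).warning("OTEL logging disabled, using stderr: %s", exc)
```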
Neutral
- Minimal Code Changes: ComfyUI changes are limited to the endpoint default in otel_integration.py; everything else is environment config
- No Breaking Changes: Log format and attributes remain identical
- Incremental Rollout: Can deploy gradually to minimize risk
Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Collector OOM on high-volume machines | HIGH | LOW | Memory limiter processor, disk-backed queue |
| Queue fills up during extended outage | MEDIUM | MEDIUM | Monitor queue size, alert at 80%, increase capacity if needed |
| Collector crash loses in-flight logs | LOW | LOW | Persistent queue, health checks, auto-restart |
| Performance impact on ComfyUI | MEDIUM | LOW | Localhost communication is fast, load testing |
| Configuration errors break logging | HIGH | MEDIUM | Config validation in CI (see sketch below), gradual rollout, rollback plan |
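For the configuration-error risk, the collector image itself can validate the config in CI before any machine picks it up. A sketch, assuming the pinned contrib image provides the validate subcommand and that dummy environment values are acceptable for validation:

```bash
# CI step: fail the pipeline if the collector config does not load.
docker run --rm \
  -v "$(pwd)/apps/machine/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro" \
  -e DASH0_AUTH_TOKEN=dummy \
  -e DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317 \
  otel/opentelemetry-collector-contrib:0.91.0 \
  validate --config=/etc/otelcol-contrib/config.yaml
```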
Success Metrics
Week 1 (Post-Deployment):
- 0 log loss incidents reported
- <5ms p99 latency for OTLP export to collector
- <2% export failure rate to Dash0
Month 1:
- 99.9% log delivery success rate
- <1% of logs delayed >5 minutes
- 0 critical incidents due to log loss
Month 3:
- Complete deprecation of Redis log streaming
- Collector deployment on 100% of machines
- Automated monitoring and alerting operational
References
- OpenTelemetry Collector Documentation
- OTLP Exporter Configuration
- Dash0 OTLP Ingestion
- File Storage Extension
Related Decisions
- Future: ADR-002: Centralized log aggregation for long-term storage
- Future: ADR-003: Telemetry for API and Worker services
