
2025-11-12-otel-collector-sidecar Implementation Checklist

Status: Ready to Execute | Owner: TBD | Timeline: 2 weeks

Pre-Implementation

  • [ ] Review 2025-11-12-otel-collector-sidecar with team
  • [ ] Approve 2025-11-12-otel-collector-sidecar
  • [ ] Assign implementation owner
  • [ ] Schedule deployment window

Week 1: Setup & Testing

Day 1-2: Collector Configuration

  • [ ] Create apps/machine/otel-collector-config.yaml
  • [ ] Add validation script for YAML syntax
  • [ ] Test config with otelcol validate --config=...
  • [ ] Document configuration options

Files to Create:

  • apps/machine/otel-collector-config.yaml

Commands:

bash
# Validate configuration
docker run --rm -v $(pwd):/config otel/opentelemetry-collector-contrib:0.91.0 \
  validate --config=/config/otel-collector-config.yaml
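
A minimal sketch of the configuration being validated, assuming an OTLP gRPC receiver, a file-backed sending queue, and an OTLP exporter to Dash0. The Authorization header and the DASH0_AUTH_TOKEN variable name are assumptions to be aligned with telemetry.env, and the docker validate command above may need -e DASH0_ENDPOINT -e DASH0_AUTH_TOKEN so the ${env:...} references resolve.

bash
# Sketch only: a starting point for apps/machine/otel-collector-config.yaml
cat > apps/machine/otel-collector-config.yaml <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400           # matches the >400MB alert threshold used later
  batch: {}

exporters:
  otlp:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"  # assumption: Dash0 token auth
    sending_queue:
      enabled: true
      storage: file_storage  # persist the queue across collector restarts

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  file_storage:
    directory: /var/lib/otelcol/queue

service:
  extensions: [health_check, file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
EOF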

Day 2-3: Docker Setup

  • [ ] Update apps/machine/docker-compose.yml with an otel-collector service (see the service sketch below)
  • [ ] Add volume for persistent queue
  • [ ] Configure health check
  • [ ] Add environment variables for Dash0
  • [ ] Test docker-compose up locally

Files to Modify:

  • apps/machine/docker-compose.yml
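
A sketch of the service block to merge into apps/machine/docker-compose.yml, written to a scratch file for reference only; the image tag matches the validation command above, while the volume name, published ports, and DASH0_AUTH_TOKEN passthrough are assumptions:

bash
# Reference snippet only; merge it into the existing compose file by hand
cat > /tmp/otel-collector.compose-snippet.yml <<'EOF'
# add under the existing services: key
  otel-collector:
    container_name: otel-collector    # so "docker logs otel-collector" works as in Testing below
    image: otel/opentelemetry-collector-contrib:0.91.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
      - otel-queue:/var/lib/otelcol/queue   # persistent queue storage
    environment:
      - DASH0_ENDPOINT
      - DASH0_AUTH_TOKEN                    # assumption: align with telemetry.env
    ports:
      - "4317:4317"    # OTLP gRPC ingest
      - "13133:13133"  # health_check extension
      - "8888:8888"    # collector self-metrics
    restart: unless-stopped
    # NOTE: the upstream image has no shell and runs as a non-root user, so a
    # compose healthcheck needs an external probe (e.g. curl :13133 from the
    # host) and the queue volume may need an ownership/user override.

# add alongside any existing top-level volumes: key
volumes:
  otel-queue:
EOF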

Testing:

bash
cd apps/machine
docker-compose up -d otel-collector
docker logs otel-collector
curl http://localhost:13133/  # Health check
curl http://localhost:8888/metrics  # Metrics

Day 3-4: Environment Configuration

  • [ ] Update config/environments/components/telemetry.env
    • [ ] Add OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    • [ ] Keep DASH0_ENDPOINT for collector use
  • [ ] Update machine environment interfaces if needed
  • [ ] Rebuild environment with pnpm build:env
  • [ ] Verify environment variables propagate correctly

Files to Modify:

  • config/environments/components/telemetry.env
  • Possibly: config/environments/services/comfyui.interface.ts
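
A sketch of the relevant lines in config/environments/components/telemetry.env after the change; OTEL_EXPORTER_OTLP_PROTOCOL is an assumption and can be dropped if the SDK default already matches:

bash
# ComfyUI (and other apps) now export to the local collector
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# Keep the direct Dash0 endpoint; only the collector consumes it from here on
DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317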

Testing:

bash
pnpm build:env
# Check generated .env files contain OTEL_EXPORTER_OTLP_ENDPOINT

Day 4-5: ComfyUI Integration

  • [ ] Update packages/comfyui/app/otel_integration.py
    • [ ] Read OTEL_EXPORTER_OTLP_ENDPOINT environment variable
    • [ ] Remove TLS requirement for local collector
    • [ ] Add logging for endpoint being used
  • [ ] Test with local collector running
  • [ ] Verify logs flow through collector to Dash0
  • [ ] Verify emerge.job_id and emerge.workflow_id preserved

Files to Modify:

  • packages/comfyui/app/otel_integration.py

Testing:

bash
# Start collector
docker-compose up -d otel-collector

# Start ComfyUI with job
# ... run test job ...

# Check collector received logs
docker logs otel-collector | grep "Logs received"

# Check Dash0 for logs
# ... query Dash0 API ...
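
A quick sanity check that the new endpoint actually reaches the ComfyUI containers; the service name comes from the rollback plan below, and python3 inside the container is assumed (ComfyUI is a Python app):

bash
# The container should see the collector endpoint, not the direct Dash0 one
docker-compose exec comfyui-gpu0 env | grep OTEL_EXPORTER_OTLP_ENDPOINT

# The collector should be reachable on the compose network
docker-compose exec comfyui-gpu0 python3 -c \
  "import socket; socket.create_connection(('otel-collector', 4317), timeout=3); print('collector reachable')"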

Day 5-7: Comprehensive Testing

Test 1: Normal Operation

  • [ ] Start collector + ComfyUI
  • [ ] Run 10 test jobs
  • [ ] Verify all logs in Dash0
  • [ ] Check collector metrics
  • [ ] Verify no performance degradation

Test 2: Network Failure

  • [ ] Block the Dash0 endpoint with iptables (see the sketch after this list)
  • [ ] Run 5 test jobs
  • [ ] Verify logs queue in collector
  • [ ] Check queue size metrics
  • [ ] Unblock Dash0
  • [ ] Verify queued logs delivered
  • [ ] Confirm all 5 jobs' logs in Dash0
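
A sketch of the outage simulation, assuming default Docker bridge networking (container egress is filtered via the DOCKER-USER chain) and the Dash0 hostname from the rollback plan; DNS may return several addresses, so repeat the rule per IP if needed:

bash
# Resolve Dash0 and drop outbound OTLP traffic to it (simulated outage)
DASH0_IP=$(getent hosts ingress.us-west-2.aws.dash0.com | awk '{print $1; exit}')
sudo iptables -I DOCKER-USER -d "$DASH0_IP" -p tcp --dport 4317 -j DROP

# ... run the 5 test jobs ...

# Watch the persistent queue grow via the collector's self-metrics
curl -s http://localhost:8888/metrics | grep -i queue

# Restore connectivity and confirm the queue drains
sudo iptables -D DOCKER-USER -d "$DASH0_IP" -p tcp --dport 4317 -j DROP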

Test 3: Collector Restart

  • [ ] Start collector + ComfyUI
  • [ ] Run job (job1)
  • [ ] Stop collector mid-execution
  • [ ] Continue job execution
  • [ ] Check that ComfyUI handles the collector outage gracefully
  • [ ] Restart collector
  • [ ] Run another job (job2)
  • [ ] Verify job2 logs in Dash0
  • [ ] Check whether any of job1's logs were recovered

Test 4: Resource Limits

  • [ ] Monitor collector memory usage during heavy load
  • [ ] Verify memory limiter prevents OOM
  • [ ] Check queue doesn't grow unbounded
  • [ ] Verify disk usage stays under 1GB
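
Spot checks for memory and disk, assuming the container and volume names from the compose sketch above (compose may prefix the volume with the project name):

bash
# Live memory/CPU for the collector container
docker stats --no-stream otel-collector

# On-disk size of the persistent queue volume (should stay well under 1GB)
docker system df -v | grep otel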

Test 5: High Volume

  • [ ] Run 50 concurrent jobs
  • [ ] Monitor collector throughput
  • [ ] Check export latency
  • [ ] Verify no log loss
  • [ ] Check CPU/memory usage
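
Throughput and failure counts come from the collector's self-metrics endpoint; the otelcol_* metric names below follow the standard conventions and may vary slightly between collector versions:

bash
# Exported vs. failed log records, plus current queue depth
curl -s http://localhost:8888/metrics | \
  grep -E 'otelcol_exporter_(sent_log_records|send_failed_log_records|queue_size)'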

Success Criteria:

  • [ ] 100% log delivery in normal operation
  • [ ] 100% log delivery after network recovery
  • [ ] Graceful degradation when collector unavailable
  • [ ] Memory usage <100MB under normal load
  • [ ] Export latency <100ms p99

Week 2: Production Deployment

Day 8-9: Staging Deployment

  • [ ] Deploy to staging environment
  • [ ] Run full test suite
  • [ ] Monitor for 24 hours
  • [ ] Check for any issues
  • [ ] Validate with QA team

Day 10: Canary Deployment

  • [ ] Select 1 production machine (low traffic)
  • [ ] Deploy collector sidecar
  • [ ] Monitor for 24 hours
  • [ ] Check metrics:
    • [ ] Log delivery rate
    • [ ] Queue behavior
    • [ ] Export failures
    • [ ] Resource usage
  • [ ] Validate logs in Dash0
  • [ ] Check job forensics UI

Day 11-12: 10% Rollout

  • [ ] Deploy to 10% of fleet
  • [ ] Monitor for 48 hours
  • [ ] Set up alerting:
    • [ ] Collector down >5min
    • [ ] Queue >80% capacity
    • [ ] Export failure rate >10%
    • [ ] Memory usage >400MB
  • [ ] Analyze metrics:
    • [ ] Log delivery success rate
    • [ ] Average queue size
    • [ ] Export latency
  • [ ] Address any issues

Day 13: Full Rollout (if successful)

  • [ ] Deploy to remaining 90% of fleet
  • [ ] Monitor closely for first 6 hours
  • [ ] Check all alerting channels
  • [ ] Validate random sample of jobs
  • [ ] Document any issues encountered

Day 14: Verification & Cleanup

  • [ ] Verify 100% of machines running collector
  • [ ] Check overall metrics across fleet:
    • [ ] Log delivery success rate >99.9%
    • [ ] Export failure rate <1%
    • [ ] Queue size stable
  • [ ] Update documentation
  • [ ] Clean up old configuration
  • [ ] Consider deprecating Redis log streaming

Post-Deployment

Documentation Updates

  • [ ] Update docs/how-to-add-telemetry-to-apps.md
  • [ ] Update apps/machine/README.md
  • [ ] Update CLAUDE.md architecture section
  • [ ] Create docs/troubleshooting-otel-collector.md
  • [ ] Document rollback procedure
  • [ ] Add runbook for common issues

Monitoring Setup

  • [ ] Create Dash0 dashboard for collector metrics
  • [ ] Set up alerting rules
  • [ ] Document metric thresholds
  • [ ] Test alert notifications
  • [ ] Create incident response runbook

Team Training

  • [ ] Present architecture changes to team
  • [ ] Demo troubleshooting procedures
  • [ ] Review metrics and alerting
  • [ ] Share lessons learned
  • [ ] Update onboarding docs

Rollback Plan

If issues are encountered during deployment:

Immediate Rollback (if critical):

bash
# Stop collector
docker-compose stop otel-collector

# Update the environment (e.g. telemetry.env) to point ComfyUI back at Dash0 directly
export OTEL_EXPORTER_OTLP_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317

# Recreate ComfyUI so the new endpoint takes effect
# (a plain restart reuses the old container environment)
docker-compose up -d --force-recreate comfyui-gpu0 comfyui-gpu1

Gradual Rollback (if partial deployment):

  1. Stop deploying to additional machines
  2. Investigate issues on affected machines
  3. Roll back specific machines to previous config
  4. Continue monitoring remaining collector deployments

Success Criteria

Technical:

  • [ ] 100% of machines running collector sidecar
  • [ ] 99.9% log delivery success rate
  • [ ] <1% export failure rate
  • [ ] Queue size stable and bounded
  • [ ] No performance degradation in ComfyUI
  • [ ] Alerting operational

Operational:

  • [ ] 0 critical incidents due to log loss
  • [ ] Team trained on new architecture
  • [ ] Documentation complete
  • [ ] Runbooks created and tested

Business:

  • [ ] Job forensics UI shows 100% log coverage
  • [ ] User-reported "missing logs" incidents eliminated
  • [ ] Improved debugging experience for users

Risk Register

| Risk | Mitigation | Owner | Status |
| --- | --- | --- | --- |
| Collector OOM kills container | Memory limiter, monitoring, auto-restart | TBD | Open |
| Queue fills during extended outage | Alert at 80%, increase capacity | TBD | Open |
| Deployment breaks logging entirely | Gradual rollout, health checks, rollback plan | TBD | Open |
| Performance impact on ComfyUI | Load testing, monitoring, resource limits | TBD | Open |
| Team unfamiliar with new architecture | Training, documentation, runbooks | TBD | Open |

Sign-off

  • [ ] Architecture Lead: _____________________ Date: _______
  • [ ] DevOps Lead: _____________________ Date: _______
  • [ ] QA Lead: _____________________ Date: _______
  • [ ] Product Owner: _____________________ Date: _______

Notes

Use this section to track issues, learnings, and updates during implementation.


Last Updated: 2025-11-12 | Next Review: After deployment completion
