2025-11-12-otel-collector-sidecar Implementation Checklist
Status: Ready to Execute | Owner: TBD | Timeline: 2 weeks
Pre-Implementation
- [ ] Review 2025-11-12-otel-collector-sidecar with team
- [ ] Approve 2025-11-12-otel-collector-sidecar
- [ ] Assign implementation owner
- [ ] Schedule deployment window
Week 1: Setup & Testing
Day 1-2: Collector Configuration
- [ ] Create `apps/machine/otel-collector-config.yaml`
- [ ] Add validation script for YAML syntax
- [ ] Test config with `otelcol validate --config=...`
- [ ] Document configuration options
Files to Create:
apps/machine/otel-collector-config.yaml
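As a starting point for the file above, here is a minimal sketch of a logs pipeline with a memory limiter, a disk-backed sending queue, and the health-check extension. The helper name `write_collector_config`, the queue directory, the `limit_mib` value, and the `DASH0_AUTH_TOKEN` variable are assumptions made for illustration, not settings taken from the proposal. `DASH0_ENDPOINT` is the variable this checklist keeps for collector use.

```bash
# Write a starter collector config to the given path (hypothetical helper).
# The heredoc is quoted so the shell leaves ${env:...} for the collector.
write_collector_config() {
  cat > "$1" <<'EOF'
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  file_storage:
    directory: /var/lib/otelcol/queue   # assumed path; must be writable

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400                      # assumed value
  batch: {}

exporters:
  otlp:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"   # assumed variable name
    sending_queue:
      enabled: true
      storage: file_storage             # persist queued batches across restarts
    retry_on_failure:
      enabled: true

service:
  extensions: [health_check, file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
EOF
}
```

Calling `write_collector_config apps/machine/otel-collector-config.yaml` produces a file that can then be checked with the collector's `validate` subcommand.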
Commands:
```bash
# Validate configuration (run from apps/machine, where the config file lives)
docker run --rm -v "$(pwd)":/config otel/opentelemetry-collector-contrib:0.91.0 \
  validate --config=/config/otel-collector-config.yaml
```
Day 2-3: Docker Setup
- [ ] Update `apps/machine/docker-compose.yml` with otel-collector service
- [ ] Add volume for persistent queue
- [ ] Configure health check
- [ ] Add environment variables for Dash0
- [ ] Test docker-compose up locally
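When testing `docker-compose up` locally, the collector may need a moment before its health endpoint answers. The helper below is a small sketch of a retry loop that could gate later checks. The name `wait_until` and its argument order are made up for this example.

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only pause when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to about 30 seconds for the collector health endpoint.
# wait_until 30 1 curl -fsS http://localhost:13133/
```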
Files to Modify:
apps/machine/docker-compose.yml
Testing:
```bash
cd apps/machine
docker-compose up -d otel-collector
docker logs otel-collector
curl http://localhost:13133/         # Health check
curl http://localhost:8888/metrics   # Metrics
```
Day 3-4: Environment Configuration
- [ ] Update `config/environments/components/telemetry.env`
  - [ ] Add `OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317`
  - [ ] Keep `DASH0_ENDPOINT` for collector use
- [ ] Update machine environment interfaces if needed
- [ ] Rebuild environment with `pnpm build:env`
- [ ] Verify environment variables propagate correctly
Files to Modify:
- `config/environments/components/telemetry.env`
- Possibly: `config/environments/services/comfyui.interface.ts`
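The "verify environment variables propagate correctly" step can be scripted rather than checked by eye. This is a sketch under the assumption that the generated files are plain `NAME=value` dotenv files with unquoted values. The helper names are hypothetical, and the location of the generated files is not stated here, so the path is left to the caller.

```bash
# Print the value assigned to a variable in a dotenv-style file.
# If the variable is set more than once, the last assignment wins.
env_value() {
  file=$1
  name=$2
  sed -n "s/^${name}=//p" "$file" | tail -n 1
}

# Succeed only if the variable is present with the expected value.
env_matches() {
  [ "$(env_value "$1" "$2")" = "$3" ]
}

# Example (replace the path with a generated .env file):
# env_matches path/to/generated.env OTEL_EXPORTER_OTLP_ENDPOINT http://otel-collector:4317
```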
Testing:
```bash
pnpm build:env
# Check generated .env files contain OTEL_EXPORTER_OTLP_ENDPOINT
```
Day 4-5: ComfyUI Integration
- [ ] Update `packages/comfyui/app/otel_integration.py`
  - [ ] Read `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable
  - [ ] Remove TLS requirement for local collector
  - [ ] Add logging for endpoint being used
- [ ] Test with local collector running
- [ ] Verify logs flow through collector to Dash0
- [ ] Verify `emerge.job_id` and `emerge.workflow_id` preserved
Files to Modify:
packages/comfyui/app/otel_integration.py
Testing:
```bash
# Start collector
docker-compose up -d otel-collector
# Start ComfyUI with job
# ... run test job ...
# Check collector received logs (the collector logs to stderr, so merge it into the pipe)
docker logs otel-collector 2>&1 | grep "Logs received"
# Check Dash0 for logs
# ... query Dash0 API ...
```
Day 5-7: Comprehensive Testing
Test 1: Normal Operation
- [ ] Start collector + ComfyUI
- [ ] Run 10 test jobs
- [ ] Verify all logs in Dash0
- [ ] Check collector metrics
- [ ] Verify no performance degradation
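For the "check collector metrics" step, the Prometheus text served on port 8888 can be reduced to single numbers with a short filter. The sketch below sums all samples of one metric across label sets. The function name is hypothetical, and the metric name in the example is an assumption: names differ between collector versions, so use whatever the `/metrics` output actually shows.

```bash
# Sum every sample of one metric from Prometheus text exposition on stdin.
# Prints 0 when the metric is absent.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      split($1, parts, "{")              # drop the label set, if any
      if (parts[1] == name) sum += $NF   # the sample value is the last field
    }
    END { print sum + 0 }
  '
}

# Example (metric name is an assumption):
# curl -s http://localhost:8888/metrics | metric_sum otelcol_exporter_queue_size
```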
Test 2: Network Failure
- [ ] Block Dash0 endpoint with iptables
- [ ] Run 5 test jobs
- [ ] Verify logs queue in collector
- [ ] Check queue size metrics
- [ ] Unblock Dash0
- [ ] Verify queued logs delivered
- [ ] Confirm all 5 jobs' logs in Dash0
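The block and unblock steps in Test 2 could be scripted roughly as follows. This sketch assumes the collector runs on a Docker bridge network, in which case its traffic is forwarded by the host and a rule in Docker's `DOCKER-USER` chain is what takes effect; with host networking the `OUTPUT` chain would be the place instead. The helper names and the `DRY_RUN` switch are hypothetical, and the real commands need root.

```bash
# Run a command, or just print it when DRY_RUN=1 (useful for review).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Drop forwarded container traffic to the given host and TCP port.
# -I inserts at the top of the chain, ahead of its default RETURN rule.
block_egress() {
  run iptables -I DOCKER-USER -p tcp -d "$1" --dport "$2" -j DROP
}

# Remove the rule added by block_egress (arguments must match exactly).
unblock_egress() {
  run iptables -D DOCKER-USER -p tcp -d "$1" --dport "$2" -j DROP
}

# Example:
# block_egress ingress.us-west-2.aws.dash0.com 4317
# ... run the 5 test jobs ...
# unblock_egress ingress.us-west-2.aws.dash0.com 4317
```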
Test 3: Collector Restart
- [ ] Start collector + ComfyUI
- [ ] Run job (job1)
- [ ] Stop collector mid-execution
- [ ] Continue job execution
- [ ] Check ComfyUI handles gracefully
- [ ] Restart collector
- [ ] Run another job (job2)
- [ ] Verify job2 logs in Dash0
- [ ] Check if any job1 logs recovered
Test 4: Resource Limits
- [ ] Monitor collector memory usage during heavy load
- [ ] Verify memory limiter prevents OOM
- [ ] Check queue doesn't grow unbounded
- [ ] Verify disk usage stays under 1GB
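The 1GB disk check can be automated with `du`. The following is a minimal sketch; the helper name is hypothetical, and the queue directory in the example is an assumption that has to match wherever the persistent queue volume is mounted.

```bash
# Succeed if the directory's disk usage is at or below the limit (in KiB).
dir_within_limit() {
  used_kb=$(du -sk "$1" | awk '{ print $1 }')
  [ "$used_kb" -le "$2" ]
}

# Example: 1 GiB = 1048576 KiB; the queue path is an assumption.
# dir_within_limit /var/lib/otelcol/queue 1048576 || echo "queue directory over 1GB"
```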
Test 5: High Volume
- [ ] Run 50 concurrent jobs
- [ ] Monitor collector throughput
- [ ] Check export latency
- [ ] Verify no log loss
- [ ] Check CPU/memory usage
Success Criteria:
- [ ] 100% log delivery in normal operation
- [ ] 100% log delivery after network recovery
- [ ] Graceful degradation when collector unavailable
- [ ] Memory usage <100MB under normal load
- [ ] Export latency <100ms p99
Week 2: Production Deployment
Day 8-9: Staging Deployment
- [ ] Deploy to staging environment
- [ ] Run full test suite
- [ ] Monitor for 24 hours
- [ ] Check for any issues
- [ ] Validate with QA team
Day 10: Canary Deployment
- [ ] Select 1 production machine (low traffic)
- [ ] Deploy collector sidecar
- [ ] Monitor for 24 hours
- [ ] Check metrics:
  - [ ] Log delivery rate
  - [ ] Queue behavior
  - [ ] Export failures
  - [ ] Resource usage
- [ ] Validate logs in Dash0
- [ ] Check job forensics UI
Day 11-12: 10% Rollout
- [ ] Deploy to 10% of fleet
- [ ] Monitor for 48 hours
- [ ] Set up alerting:
  - [ ] Collector down >5min
  - [ ] Queue >80% capacity
  - [ ] Export failure rate >10%
  - [ ] Memory usage >400MB
- [ ] Analyze metrics:
  - [ ] Log delivery success rate
  - [ ] Average queue size
  - [ ] Export latency
- [ ] Address any issues
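This checklist does not say how the 10% of machines are chosen. One possible approach, shown here only as a sketch, is to hash each hostname into a stable bucket from 0 to 99 so that the same machines stay selected when the percentage is later raised. The helper names and the example hostname are hypothetical.

```bash
# Map a hostname to a stable bucket in the range 0-99 using POSIX cksum.
rollout_bucket() {
  printf '%s' "$1" | cksum | awk '{ print $1 % 100 }'
}

# Succeed if the host falls inside the given rollout percentage.
in_rollout() {
  [ "$(rollout_bucket "$1")" -lt "$2" ]
}

# Example: deploy only where the host lands in the first 10 buckets.
# in_rollout "$(hostname)" 10 && echo "deploy collector sidecar here"
```

On a small fleet the selected share only approximates the requested percentage, so it is worth counting the selected hosts before deploying.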
Day 13: Full Rollout (if successful)
- [ ] Deploy to remaining 90% of fleet
- [ ] Monitor closely for first 6 hours
- [ ] Check all alerting channels
- [ ] Validate random sample of jobs
- [ ] Document any issues encountered
Day 14: Verification & Cleanup
- [ ] Verify 100% of machines running collector
- [ ] Check overall metrics across fleet:
  - [ ] Log delivery success rate >99.9%
  - [ ] Export failure rate <1%
  - [ ] Queue size stable
- [ ] Update documentation
- [ ] Clean up old configuration
- [ ] Consider deprecating Redis log streaming
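The fleet-wide delivery target can be checked with simple arithmetic once sent and failed record counts are available, for example from the collector's exporter metrics. The sketch below defines success rate as sent / (sent + failed), which is an assumption about how the 99.9% figure is meant to be measured. The function name is hypothetical.

```bash
# Print the delivery rate as a percentage and succeed if it meets the minimum.
# Usage: delivery_rate_ok <sent> <failed> <min_percent>
# Treats "no records at all" as a failure so an idle collector is noticed.
delivery_rate_ok() {
  awk -v sent="$1" -v failed="$2" -v min="$3" 'BEGIN {
    total = sent + failed
    if (total == 0) exit 1
    rate = 100 * sent / total
    printf "%.3f\n", rate
    exit (rate >= min ? 0 : 1)
  }'
}

# Example:
# delivery_rate_ok 9995 5 99.9 && echo "delivery target met"
```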
Post-Deployment
Documentation Updates
- [ ] Update `docs/how-to-add-telemetry-to-apps.md`
- [ ] Update `apps/machine/README.md`
- [ ] Update `CLAUDE.md` architecture section
- [ ] Create `docs/troubleshooting-otel-collector.md`
- [ ] Document rollback procedure
- [ ] Add runbook for common issues
Monitoring Setup
- [ ] Create Dash0 dashboard for collector metrics
- [ ] Set up alerting rules
- [ ] Document metric thresholds
- [ ] Test alert notifications
- [ ] Create incident response runbook
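Before the alert rules exist in Dash0, a threshold such as "queue >80% capacity" can be prototyped locally to confirm the arithmetic and the boundary behaviour. This is a sketch with hypothetical function names; the queue size and capacity are passed in as plain numbers.

```bash
# Print queue utilisation as a whole percentage (rounded down).
queue_pct() {
  awk -v size="$1" -v cap="$2" 'BEGIN {
    if (cap <= 0) exit 1
    printf "%d\n", 100 * size / cap
  }'
}

# Succeed (meaning "alert") when utilisation is strictly above the threshold.
# Returns 2 when the capacity is not a positive number.
queue_alert() {
  pct=$(queue_pct "$1" "$2") || return 2
  [ "$pct" -gt "$3" ]
}

# Example:
# queue_alert 850 1000 80 && echo "queue above 80% capacity"
```

Writing the comparison as strictly greater than means a queue at exactly 80% does not fire, which is worth recording when the metric thresholds are documented.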
Team Training
- [ ] Present architecture changes to team
- [ ] Demo troubleshooting procedures
- [ ] Review metrics and alerting
- [ ] Share lessons learned
- [ ] Update onboarding docs
Rollback Plan
If issues are encountered during deployment:
Immediate Rollback (if critical):
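```bash
# Stop collector
docker-compose stop otel-collector
# Update environment to point back to Dash0 direct
export OTEL_EXPORTER_OTLP_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317
# Recreate ComfyUI so the containers pick up the changed environment
# (a plain restart keeps each container's existing environment);
# --no-deps avoids starting the collector again as a dependency
docker-compose up -d --no-deps --force-recreate comfyui-gpu0 comfyui-gpu1
```
Gradual Rollback (if partial deployment):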
- Stop deploying to additional machines
- Investigate issues on affected machines
- Roll back specific machines to previous config
- Continue monitoring remaining collector deployments
Success Criteria
Technical:
- [ ] 100% of machines running collector sidecar
- [ ] 99.9% log delivery success rate
- [ ] <1% export failure rate
- [ ] Queue size stable and bounded
- [ ] No performance degradation in ComfyUI
- [ ] Alerting operational
Operational:
- [ ] 0 critical incidents due to log loss
- [ ] Team trained on new architecture
- [ ] Documentation complete
- [ ] Runbooks created and tested
Business:
- [ ] Job forensics UI shows 100% log coverage
- [ ] User-reported "missing logs" incidents eliminated
- [ ] Improved debugging experience for users
Risk Register
| Risk | Mitigation | Owner | Status |
|---|---|---|---|
| Collector OOM kills container | Memory limiter, monitoring, auto-restart | TBD | Open |
| Queue fills during extended outage | Alert at 80%, increase capacity | TBD | Open |
| Deployment breaks logging entirely | Gradual rollout, health checks, rollback plan | TBD | Open |
| Performance impact on ComfyUI | Load testing, monitoring, resource limits | TBD | Open |
| Team unfamiliar with new architecture | Training, documentation, runbooks | TBD | Open |
Sign-off
- [ ] Architecture Lead: _____________________ Date: _______
- [ ] DevOps Lead: _____________________ Date: _______
- [ ] QA Lead: _____________________ Date: _______
- [ ] Product Owner: _____________________ Date: _______
Notes
Use this section to track issues, learnings, and updates during implementation.
Last Updated: 2025-11-12 | Next Review: After deployment completion
