
2025-11-12-otel-collector-sidecar Implementation Checklist

Status: Ready to Execute | Owner: TBD | Timeline: 2 weeks

Pre-Implementation

  • [ ] Review 2025-11-12-otel-collector-sidecar with team
  • [ ] Approve 2025-11-12-otel-collector-sidecar
  • [ ] Assign implementation owner
  • [ ] Schedule deployment window

Week 1: Setup & Testing

Day 1-2: Collector Configuration

  • [ ] Create apps/machine/otel-collector-config.yaml
  • [ ] Add validation script for YAML syntax
  • [ ] Test config with otelcol validate --config=...
  • [ ] Document configuration options

Files to Create:

  • apps/machine/otel-collector-config.yaml

Commands:

bash
# Validate configuration
docker run --rm -v $(pwd):/config otel/opentelemetry-collector-contrib:0.91.0 \
  validate --config=/config/otel-collector-config.yaml
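
A minimal sketch of the configuration being validated, assuming an OTLP gRPC receiver, a file-backed sending queue, and an OTLP exporter to Dash0. The Authorization header and the DASH0_AUTH_TOKEN variable name are assumptions to be aligned with telemetry.env, and the docker validate command above may need -e DASH0_ENDPOINT -e DASH0_AUTH_TOKEN so the ${env:...} references resolve.

bash
# Sketch only: a starting point for apps/machine/otel-collector-config.yaml
cat > apps/machine/otel-collector-config.yaml <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400           # matches the >400MB alert threshold used later
  batch: {}

exporters:
  otlp:
    endpoint: ${env:DASH0_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}"  # assumption: Dash0 token auth
    sending_queue:
      enabled: true
      storage: file_storage  # persist the queue across collector restarts

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  file_storage:
    directory: /var/lib/otelcol/queue

service:
  extensions: [health_check, file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
EOF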

Day 2-3: Docker Setup

  • [ ] Update apps/machine/docker-compose.yml with an otel-collector service (see the service sketch below)
  • [ ] Add volume for persistent queue
  • [ ] Configure health check
  • [ ] Add environment variables for Dash0
  • [ ] Test docker-compose up locally

Files to Modify:

  • apps/machine/docker-compose.yml
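
A sketch of the service block to merge into apps/machine/docker-compose.yml, written to a scratch file for reference only; the image tag matches the validation command above, while the volume name, published ports, and DASH0_AUTH_TOKEN passthrough are assumptions:

bash
# Reference snippet only; merge it into the existing compose file by hand
cat > /tmp/otel-collector.compose-snippet.yml <<'EOF'
# add under the existing services: key
  otel-collector:
    container_name: otel-collector    # so "docker logs otel-collector" works as in Testing below
    image: otel/opentelemetry-collector-contrib:0.91.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
      - otel-queue:/var/lib/otelcol/queue   # persistent queue storage
    environment:
      - DASH0_ENDPOINT
      - DASH0_AUTH_TOKEN                    # assumption: align with telemetry.env
    ports:
      - "4317:4317"    # OTLP gRPC ingest
      - "13133:13133"  # health_check extension
      - "8888:8888"    # collector self-metrics
    restart: unless-stopped
    # NOTE: the upstream image has no shell and runs as a non-root user, so a
    # compose healthcheck needs an external probe (e.g. curl :13133 from the
    # host) and the queue volume may need an ownership/user override.

# add alongside any existing top-level volumes: key
volumes:
  otel-queue:
EOF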

Testing:

bash
cd apps/machine
docker-compose up -d otel-collector
docker logs otel-collector
curl http://localhost:13133/  # Health check
curl http://localhost:8888/metrics  # Metrics

Day 3-4: Environment Configuration

  • [ ] Update config/environments/components/telemetry.env
    • [ ] Add OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    • [ ] Keep DASH0_ENDPOINT for collector use
  • [ ] Update machine environment interfaces if needed
  • [ ] Rebuild environment with pnpm build:env
  • [ ] Verify environment variables propagate correctly

Files to Modify:

  • config/environments/components/telemetry.env
  • Possibly: config/environments/services/comfyui.interface.ts
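
A sketch of the relevant lines in config/environments/components/telemetry.env after the change; OTEL_EXPORTER_OTLP_PROTOCOL is an assumption and can be dropped if the SDK default already matches:

bash
# ComfyUI (and other apps) now export to the local collector
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# Keep the direct Dash0 endpoint; only the collector consumes it from here on
DASH0_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317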

Testing:

bash
pnpm build:env
# Check generated .env files contain OTEL_EXPORTER_OTLP_ENDPOINT

Day 4-5: ComfyUI Integration

  • [ ] Update packages/comfyui/app/otel_integration.py
    • [ ] Read OTEL_EXPORTER_OTLP_ENDPOINT environment variable
    • [ ] Remove TLS requirement for local collector
    • [ ] Add logging for endpoint being used
  • [ ] Test with local collector running
  • [ ] Verify logs flow through collector to Dash0
  • [ ] Verify emerge.job_id and emerge.workflow_id preserved

Files to Modify:

  • packages/comfyui/app/otel_integration.py

Testing:

bash
# Start collector
docker-compose up -d otel-collector

# Start ComfyUI with job
# ... run test job ...

# Check collector received logs
docker logs otel-collector | grep "Logs received"

# Check Dash0 for logs
# ... query Dash0 API ...
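
A quick sanity check that the new endpoint actually reaches the ComfyUI containers; the service name comes from the rollback plan below, and python3 inside the container is assumed (ComfyUI is a Python app):

bash
# The container should see the collector endpoint, not the direct Dash0 one
docker-compose exec comfyui-gpu0 env | grep OTEL_EXPORTER_OTLP_ENDPOINT

# The collector should be reachable on the compose network
docker-compose exec comfyui-gpu0 python3 -c \
  "import socket; socket.create_connection(('otel-collector', 4317), timeout=3); print('collector reachable')"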

Day 5-7: Comprehensive Testing

Test 1: Normal Operation

  • [ ] Start collector + ComfyUI
  • [ ] Run 10 test jobs
  • [ ] Verify all logs in Dash0
  • [ ] Check collector metrics
  • [ ] Verify no performance degradation

Test 2: Network Failure

  • [ ] Block the Dash0 endpoint with iptables (see the sketch after this list)
  • [ ] Run 5 test jobs
  • [ ] Verify logs queue in collector
  • [ ] Check queue size metrics
  • [ ] Unblock Dash0
  • [ ] Verify queued logs delivered
  • [ ] Confirm all 5 jobs' logs in Dash0
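
A sketch of the outage simulation, assuming default Docker bridge networking (container egress is filtered via the DOCKER-USER chain) and the Dash0 hostname from the rollback plan; DNS may return several addresses, so repeat the rule per IP if needed:

bash
# Resolve Dash0 and drop outbound OTLP traffic to it (simulated outage)
DASH0_IP=$(getent hosts ingress.us-west-2.aws.dash0.com | awk '{print $1; exit}')
sudo iptables -I DOCKER-USER -d "$DASH0_IP" -p tcp --dport 4317 -j DROP

# ... run the 5 test jobs ...

# Watch the persistent queue grow via the collector's self-metrics
curl -s http://localhost:8888/metrics | grep -i queue

# Restore connectivity and confirm the queue drains
sudo iptables -D DOCKER-USER -d "$DASH0_IP" -p tcp --dport 4317 -j DROP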

Test 3: Collector Restart

  • [ ] Start collector + ComfyUI
  • [ ] Run job (job1)
  • [ ] Stop collector mid-execution
  • [ ] Continue job execution
  • [ ] Check that ComfyUI handles the collector outage gracefully
  • [ ] Restart collector
  • [ ] Run another job (job2)
  • [ ] Verify job2 logs in Dash0
  • [ ] Check whether any of job1's logs were recovered

Test 4: Resource Limits

  • [ ] Monitor collector memory usage during heavy load
  • [ ] Verify memory limiter prevents OOM
  • [ ] Check queue doesn't grow unbounded
  • [ ] Verify disk usage stays under 1GB
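
Spot checks for memory and disk, assuming the container and volume names from the compose sketch above (compose may prefix the volume with the project name):

bash
# Live memory/CPU for the collector container
docker stats --no-stream otel-collector

# On-disk size of the persistent queue volume (should stay well under 1GB)
docker system df -v | grep otel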

Test 5: High Volume

  • [ ] Run 50 concurrent jobs
  • [ ] Monitor collector throughput
  • [ ] Check export latency
  • [ ] Verify no log loss
  • [ ] Check CPU/memory usage
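
Throughput and failure counts come from the collector's self-metrics endpoint; the otelcol_* metric names below follow the standard conventions and may vary slightly between collector versions:

bash
# Exported vs. failed log records, plus current queue depth
curl -s http://localhost:8888/metrics | \
  grep -E 'otelcol_exporter_(sent_log_records|send_failed_log_records|queue_size)'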

Success Criteria:

  • [ ] 100% log delivery in normal operation
  • [ ] 100% log delivery after network recovery
  • [ ] Graceful degradation when collector unavailable
  • [ ] Memory usage <100MB under normal load
  • [ ] Export latency <100ms p99

Week 2: Production Deployment

Day 8-9: Staging Deployment

  • [ ] Deploy to staging environment
  • [ ] Run full test suite
  • [ ] Monitor for 24 hours
  • [ ] Check for any issues
  • [ ] Validate with QA team

Day 10: Canary Deployment

  • [ ] Select 1 production machine (low traffic)
  • [ ] Deploy collector sidecar
  • [ ] Monitor for 24 hours
  • [ ] Check metrics:
    • [ ] Log delivery rate
    • [ ] Queue behavior
    • [ ] Export failures
    • [ ] Resource usage
  • [ ] Validate logs in Dash0
  • [ ] Check job forensics UI

Day 11-12: 10% Rollout

  • [ ] Deploy to 10% of fleet
  • [ ] Monitor for 48 hours
  • [ ] Set up alerting:
    • [ ] Collector down >5min
    • [ ] Queue >80% capacity
    • [ ] Export failure rate >10%
    • [ ] Memory usage >400MB
  • [ ] Analyze metrics:
    • [ ] Log delivery success rate
    • [ ] Average queue size
    • [ ] Export latency
  • [ ] Address any issues

Day 13: Full Rollout (if successful)

  • [ ] Deploy to remaining 90% of fleet
  • [ ] Monitor closely for first 6 hours
  • [ ] Check all alerting channels
  • [ ] Validate random sample of jobs
  • [ ] Document any issues encountered

Day 14: Verification & Cleanup

  • [ ] Verify 100% of machines running collector
  • [ ] Check overall metrics across fleet:
    • [ ] Log delivery success rate >99.9%
    • [ ] Export failure rate <1%
    • [ ] Queue size stable
  • [ ] Update documentation
  • [ ] Clean up old configuration
  • [ ] Consider deprecating Redis log streaming

Post-Deployment

Documentation Updates

  • [ ] Update docs/how-to-add-telemetry-to-apps.md
  • [ ] Update apps/machine/README.md
  • [ ] Update CLAUDE.md architecture section
  • [ ] Create docs/troubleshooting-otel-collector.md
  • [ ] Document rollback procedure
  • [ ] Add runbook for common issues

Monitoring Setup

  • [ ] Create Dash0 dashboard for collector metrics
  • [ ] Set up alerting rules
  • [ ] Document metric thresholds
  • [ ] Test alert notifications
  • [ ] Create incident response runbook

Team Training

  • [ ] Present architecture changes to team
  • [ ] Demo troubleshooting procedures
  • [ ] Review metrics and alerting
  • [ ] Share lessons learned
  • [ ] Update onboarding docs

Rollback Plan

If issues are encountered during deployment:

Immediate Rollback (if critical):

bash
# Stop collector
docker-compose stop otel-collector

# Update the environment (e.g. telemetry.env) to point ComfyUI back at Dash0 directly
export OTEL_EXPORTER_OTLP_ENDPOINT=ingress.us-west-2.aws.dash0.com:4317

# Recreate ComfyUI so the new endpoint takes effect
# (a plain restart reuses the old container environment)
docker-compose up -d --force-recreate comfyui-gpu0 comfyui-gpu1

Gradual Rollback (if partial deployment):

  1. Stop deploying to additional machines
  2. Investigate issues on affected machines
  3. Roll back specific machines to previous config
  4. Continue monitoring remaining collector deployments

Success Criteria

Technical:

  • [ ] 100% of machines running collector sidecar
  • [ ] 99.9% log delivery success rate
  • [ ] <1% export failure rate
  • [ ] Queue size stable and bounded
  • [ ] No performance degradation in ComfyUI
  • [ ] Alerting operational

Operational:

  • [ ] 0 critical incidents due to log loss
  • [ ] Team trained on new architecture
  • [ ] Documentation complete
  • [ ] Runbooks created and tested

Business:

  • [ ] Job forensics UI shows 100% log coverage
  • [ ] User-reported "missing logs" incidents eliminated
  • [ ] Improved debugging experience for users

Risk Register

| Risk | Mitigation | Owner | Status |
| --- | --- | --- | --- |
| Collector OOM kills container | Memory limiter, monitoring, auto-restart | TBD | Open |
| Queue fills during extended outage | Alert at 80%, increase capacity | TBD | Open |
| Deployment breaks logging entirely | Gradual rollout, health checks, rollback plan | TBD | Open |
| Performance impact on ComfyUI | Load testing, monitoring, resource limits | TBD | Open |
| Team unfamiliar with new architecture | Training, documentation, runbooks | TBD | Open |

Sign-off

  • [ ] Architecture Lead: _____________________ Date: _______
  • [ ] DevOps Lead: _____________________ Date: _______
  • [ ] QA Lead: _____________________ Date: _______
  • [ ] Product Owner: _____________________ Date: _______

Notes

Use this section to track issues, learnings, and updates during implementation.


Last Updated: 2025-11-12 | Next Review: After deployment completion
