Non-Critical Error Handling
How to identify and handle infrastructure/telemetry errors that should be logged but should not fail jobs.
The Problem
Sometimes errors appear in the event stream that look like job failures, but they're actually infrastructure errors that don't impact the AI workflow itself.
Example:
```
Transient error StatusCode.UNAVAILABLE encountered while exporting logs to ingress.us-west-2.aws.dash0.com:4317, retrying in 0.90s.
```
This triggered a job retry even though the actual AI generation succeeded. The telemetry export failure is non-critical - it doesn't affect the job outcome.
Critical vs Non-Critical
The system now distinguishes between:
- Critical Errors → Fail the job (GPU OOM, missing models, workflow errors)
- Non-Critical Errors → Log but don't fail (telemetry exports, metrics, infrastructure)
Error Filtering Pipeline
Filter 1: Log Level
Only critical and error level logs can fail jobs:
```typescript
// Filter 1: Only fail on critical/error level logs
const isCriticalLevel = logLevel === 'critical' || logLevel === 'error';

if (!isCriticalLevel) {
  logger.debug(`Skipping non-critical log (${logLevel})`);
  return { type: 'timeout' }; // Not a critical error
}
```

Filter 2: Infrastructure Error
Even if logged as "error", infrastructure failures don't fail jobs:
```typescript
// Filter 2: Skip infrastructure/telemetry errors
const isInfrastructureError = this.isInfrastructureError(logMessage);

if (isInfrastructureError) {
  logger.warn(`⚙️ Infrastructure error detected (not failing job)`);
  return { type: 'timeout' }; // Log but don't fail
}
```

THE ONE PLACE for Infrastructure Patterns
When you find a new false positive (error that shouldn't fail jobs), add it here:
📁 apps/worker/src/connectors/comfyui-rest-stream-connector.ts
→ isInfrastructureError() method

Step-by-Step: Adding a Pattern
Example: New False Positive
You see this in production:
```
ERROR: Failed to connect to Prometheus endpoint, retrying...
```
This caused a job to fail, but it's just a metrics collection issue.
Step 1: Identify the Pattern
Key patterns (see the quick check below):
- prometheus + endpoint + failed/retry
- Or: prometheus + metrics + error
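As an optional quick check (not part of the connector), you can verify the candidate indicators against the captured message in a scratch TypeScript snippet before editing anything:

```typescript
// Throwaway check: do the proposed indicators actually match the real production message?
const sample = 'ERROR: Failed to connect to Prometheus endpoint, retrying...'.toLowerCase();

const matchesCandidatePattern =
  sample.includes('prometheus') &&
  (sample.includes('endpoint') || sample.includes('metrics'));

console.log(matchesCandidatePattern); // expected: true
```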
Step 2: Open THE ONE PLACE
```bash
code apps/worker/src/connectors/comfyui-rest-stream-connector.ts
```

Step 3: Add Pattern
Find the isInfrastructureError() method and add your pattern:
```typescript
private isInfrastructureError(message: string): boolean {
  const messageLower = message.toLowerCase();

  // OpenTelemetry/OTLP export failures
  if (
    (messageLower.includes('statuscode.unavailable') ||
      messageLower.includes('grpc') ||
      messageLower.includes('otlp')) &&
    (messageLower.includes('export') || messageLower.includes('dash0'))
  ) {
    return true;
  }

  // Prometheus metrics export failures (YOUR NEW PATTERN)
  if (
    messageLower.includes('prometheus') &&
    (messageLower.includes('endpoint') || messageLower.includes('metrics'))
  ) {
    return true;
  }

  // ... rest of patterns ...

  return false; // Not an infrastructure error - this is critical
}
```

Step 4: Build and Test
```bash
# Build worker package
pnpm --filter=@emp/worker build

# Deploy and monitor
# Check logs for: "⚙️ Infrastructure error detected (not failing job)"
```

Step 5: Done! ✅
The error will now be logged but won't terminate jobs.
Current Infrastructure Patterns
These patterns are already handled as non-critical:
OpenTelemetry/OTLP Exports
// Example: "StatusCode.UNAVAILABLE encountered while exporting logs to dash0.com"
if (
(messageLower.includes('statuscode.unavailable') ||
messageLower.includes('grpc') ||
messageLower.includes('otlp')) &&
(messageLower.includes('export') || messageLower.includes('dash0'))
) {
return true;
}Telemetry Endpoints (Dash0, Jaeger, Zipkin)
// Example: "Failed to export traces to jaeger endpoint"
if (
(messageLower.includes('dash0.com') ||
messageLower.includes('jaeger') ||
messageLower.includes('zipkin')) &&
(messageLower.includes('error') || messageLower.includes('failed'))
) {
return true;
}OpenTelemetry SDK
// Example: "OpenTelemetry exporter connection refused"
if (
messageLower.includes('opentelemetry') &&
(messageLower.includes('exporter') || messageLower.includes('export'))
) {
return true;
}Metrics/Traces/Spans Exports
// Example: "Failed to export metrics to collector"
if (
(messageLower.includes('metric') ||
messageLower.includes('trace') ||
messageLower.includes('span')) &&
(messageLower.includes('export') || messageLower.includes('exporter'))
) {
return true;
}How to Identify Non-Critical Errors
Ask these questions (a small triage sketch follows the lists below):
1. Does this error prevent the AI workflow from running?
YES → Critical (fail job)
- GPU out of memory
- Model not found
- Node execution failed
- Invalid workflow
NO → Non-critical (log only)
- Telemetry export failed
- Metrics collection failed
- Health check timeout (non-essential)
2. Is this error from infrastructure/observability?
Infrastructure/Observability (Non-Critical):
- Telemetry exports (Dash0, Jaeger, OTLP)
- Metrics collection (Prometheus, StatsD)
- Health checks (non-essential monitoring)
- Log shipping failures (logs to external systems)
- APM/Tracing exports
AI Service (Critical):
- ComfyUI node execution
- Model loading/inference
- GPU operations
- Workflow validation
- Image processing
3. Did the job complete successfully despite the error?
If you see the error in logs BUT the job produced correct output → Non-critical
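For illustration only, here is a minimal sketch that encodes these three questions as a triage helper; the triageError function and its flags are hypothetical and not part of the connector:

```typescript
// Hypothetical triage helper encoding the three questions above (not in the codebase).
type ErrorSeverity = 'critical' | 'non-critical';

interface ErrorTriage {
  blocksWorkflow: boolean;     // Q1: does the error prevent the AI workflow from running?
  fromObservability: boolean;  // Q2: is it from telemetry/metrics/log-shipping infrastructure?
  jobSucceededAnyway: boolean; // Q3: did the job still produce correct output?
}

function triageError(t: ErrorTriage): ErrorSeverity {
  if (t.blocksWorkflow) return 'critical'; // GPU OOM, missing model, workflow error
  if (t.fromObservability || t.jobSucceededAnyway) return 'non-critical'; // log only
  return 'critical'; // when unsure, fail the job rather than hide a real error
}

// Example: a Dash0 export failure on a job that still produced its output
console.log(
  triageError({ blocksWorkflow: false, fromObservability: true, jobSucceededAnyway: true })
); // → 'non-critical'
```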
Pattern Matching Best Practices
DO
- ✅ Use broad pattern matching for infrastructure errors
- ✅ Check for multiple indicators (service name + error type)
- ✅ Test patterns with real production logs
- ✅ Document new patterns with examples
DON'T
- ❌ Don't be too specific (may miss variations)
- ❌ Don't match workflow-critical errors
- ❌ Don't skip testing before deploying
Example Patterns
Good Pattern (Broad)
```typescript
// Catches all OpenTelemetry export failures
if (
  messageLower.includes('opentelemetry') &&
  messageLower.includes('export')
) {
  return true;
}
```

Bad Pattern (Too Specific)
```typescript
// ❌ Only catches exact message - will miss variations
if (message === 'OpenTelemetry exporter failed to connect') {
  return true;
}
```

Good Pattern (Multiple Indicators)
```typescript
// Checks for telemetry service AND error type
if (
  (messageLower.includes('dash0.com') || messageLower.includes('jaeger')) &&
  (messageLower.includes('export') || messageLower.includes('failed'))
) {
  return true;
}
```

Testing
Manual Test
```bash
# 1. Simulate infrastructure error in logs
docker exec <comfyui-container> python -c "
import logging
logging.error('StatusCode.UNAVAILABLE while exporting to dash0.com')
"

# 2. Check worker logs
docker logs <worker-container> -f | grep "Infrastructure error"

# 3. Verify job didn't fail
# Check monitor UI - job should still show as running/completed
```

Integration Test
```typescript
// apps/worker/src/__tests__/non-critical-errors.test.ts
import { describe, it, expect } from 'vitest';
import { ComfyUIRestStreamConnector } from '../connectors/comfyui-rest-stream-connector';

describe('Infrastructure Error Filtering', () => {
  it('should identify OpenTelemetry export errors as non-critical', () => {
    const connector = new ComfyUIRestStreamConnector(/* config */);
    const message = 'StatusCode.UNAVAILABLE while exporting logs to dash0.com';

    const result = connector['isInfrastructureError'](message);

    expect(result).toBe(true); // Non-critical
  });

  it('should identify GPU OOM as critical', () => {
    const connector = new ComfyUIRestStreamConnector(/* config */);
    const message = 'CUDA out of memory. Tried to allocate 2.00 GiB';

    const result = connector['isInfrastructureError'](message);

    expect(result).toBe(false); // Critical
  });
});
```

Debugging
Check Log Filtering
```bash
# Watch worker logs for filtering decisions
docker logs <worker-container> -f | grep -E "Skipping|Infrastructure error|Critical error"

# Expected output:
# ✅ "⚙️ Infrastructure error detected (not failing job)" - Non-critical
# ✅ "📝 Skipping non-critical log (warning)" - Non-critical log level
# ✅ "🚨 Critical error detected from ComfyUI (error)" - Critical
```

Verify Job Status
```bash
# Check if job failed or succeeded
curl http://localhost:3100/jobs/<job-id>

# If infrastructure error was logged but job succeeded:
# → Pattern correctly identified as non-critical ✅

# If infrastructure error was logged and job failed:
# → Pattern NOT identified - needs to be added ❌
```

Common Infrastructure Errors
Telemetry/APM
- Dash0 export failures
- Jaeger/Zipkin trace exports
- OpenTelemetry SDK errors
- OTLP gRPC connection issues
- APM agent connection failures
Metrics
- Prometheus scrape failures
- StatsD connection issues
- Metric exporter errors
- Time series database connection failures
Logging
- Log shipper failures (Fluentd, Logstash)
- Log aggregation errors
- External log storage issues
Health Checks
- Non-essential service health checks
- Readiness/liveness probe timeouts (non-critical)
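Not all of these categories are covered by the patterns shipped today. As a sketch of how a few of them could be expressed before porting them into isInfrastructureError(), assuming indicator strings that should be verified against real log lines first:

```typescript
// Sketch only: candidate checks for some categories above, written as a standalone helper so
// they can be tried in isolation. Indicator strings are assumptions - confirm against
// production logs before moving them into isInfrastructureError().
function looksLikeObservabilityFailure(message: string): boolean {
  const messageLower = message.toLowerCase();

  // Metrics backends (Prometheus scrape / StatsD push failures)
  if (
    (messageLower.includes('prometheus') || messageLower.includes('statsd')) &&
    (messageLower.includes('scrape') || messageLower.includes('connection') || messageLower.includes('failed'))
  ) {
    return true;
  }

  // Log shipping (Fluentd / Logstash forwarders)
  if (
    (messageLower.includes('fluentd') || messageLower.includes('logstash')) &&
    (messageLower.includes('failed') || messageLower.includes('timeout'))
  ) {
    return true;
  }

  return false;
}

console.log(looksLikeObservabilityFailure('Prometheus scrape failed: connection refused')); // true
```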
What to Watch For
Don't Make Critical Errors Non-Critical!
```typescript
// ❌ BAD - This will hide real workflow errors
if (messageLower.includes('error')) {
  return true; // Way too broad!
}

// ✅ GOOD - Specific to infrastructure
if (
  messageLower.includes('error') &&
  (messageLower.includes('dash0') || messageLower.includes('telemetry'))
) {
  return true;
}
```

Be Specific About Service
// ❌ BAD - "export" could mean workflow export
if (messageLower.includes('export')) {
return true;
}
// ✅ GOOD - Clear it's telemetry export
if (
messageLower.includes('export') &&
(messageLower.includes('otlp') || messageLower.includes('telemetry'))
) {
return true;
}Related Documentation
Quick Reference
```typescript
// Template for new infrastructure error pattern
// Location: apps/worker/src/connectors/comfyui-rest-stream-connector.ts

private isInfrastructureError(message: string): boolean {
  const messageLower = message.toLowerCase();

  // {Description of what this catches}
  // Example: "{Actual error message from logs}"
  if (
    messageLower.includes('{service-indicator}') &&
    (messageLower.includes('{error-type-1}') || messageLower.includes('{error-type-2}'))
  ) {
    return true; // Don't fail job
  }

  return false; // Critical - fail job
}
```