Non-Critical Error Handling

How to identify and handle infrastructure/telemetry errors that should be logged but should not fail jobs.

The Problem

Sometimes errors appear in the event stream that look like job failures, but they're actually infrastructure errors that don't impact the AI workflow itself.

Example:

Transient error StatusCode.UNAVAILABLE encountered while exporting logs to ingress.us-west-2.aws.dash0.com:4317, retrying in 0.90s.

This log line triggered a job retry even though the actual AI generation succeeded. The telemetry export failure is non-critical: it doesn't affect the job outcome.

Critical vs Non-Critical

The system now distinguishes between:

  • Critical Errors → Fail the job (GPU OOM, missing models, workflow errors)
  • Non-Critical Errors → Log but don't fail (telemetry exports, metrics, infrastructure)

Error Filtering Pipeline

Filter 1: Log Level

Only critical and error level logs can fail jobs:

typescript
// Filter 1: Only fail on critical/error level logs
const isCriticalLevel = logLevel === 'critical' || logLevel === 'error';
if (!isCriticalLevel) {
  logger.debug(`Skipping non-critical log (${logLevel})`);
  return { type: 'timeout' }; // Not a critical error
}

Filter 2: Infrastructure Error

Even if logged as "error", infrastructure failures don't fail jobs:

typescript
// Filter 2: Skip infrastructure/telemetry errors
const isInfrastructureError = this.isInfrastructureError(logMessage);
if (isInfrastructureError) {
  logger.warn(`⚙️ Infrastructure error detected (not failing job)`);
  return { type: 'timeout' }; // Log but don't fail
}
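
To see how the two filters combine, here is a minimal standalone sketch. The names `LogDecision`, `classifyLog`, and `looksLikeInfrastructure` are hypothetical and do not come from the connector; the stand-in predicate carries only one pattern, and the real method returns `{ type: 'timeout' }` rather than a string label.

```typescript
// Hypothetical names for illustration only; not the connector's actual API.
type LogDecision = 'ignore' | 'log-only' | 'fail-job';

// Stand-in for isInfrastructureError(); includes just one pattern.
function looksLikeInfrastructure(message: string): boolean {
  const lower = message.toLowerCase();
  return lower.includes('statuscode.unavailable') && lower.includes('export');
}

function classifyLog(logLevel: string, message: string): LogDecision {
  // Filter 1: anything below error level can never fail a job.
  if (logLevel !== 'critical' && logLevel !== 'error') {
    return 'ignore';
  }
  // Filter 2: error-level telemetry noise is logged but does not fail the job.
  if (looksLikeInfrastructure(message)) {
    return 'log-only';
  }
  // Everything that survives both filters is treated as a real failure.
  return 'fail-job';
}
```

The order matters in this sketch: the cheap level check runs first, so the string matching only happens for logs that could fail a job in the first place.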

THE ONE PLACE for Infrastructure Patterns

When you find a new false positive (error that shouldn't fail jobs), add it here:

📁 apps/worker/src/connectors/comfyui-rest-stream-connector.ts
   → isInfrastructureError() method

Step-by-Step: Adding a Pattern

Example: New False Positive

You see this in production:

ERROR: Failed to connect to Prometheus endpoint, retrying...

This caused a job to fail, but it's just a metrics collection issue.

Step 1: Identify the Pattern

Key patterns:

  • prometheus + endpoint + failed/retry
  • Or: prometheus + metrics + error
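
One way to pick keywords is to check which candidates actually occur in the captured message. The helper below is a hypothetical sketch (`keywordsPresent` is not part of the codebase); it lowercases the message the same way `isInfrastructureError()` does.

```typescript
// The message observed in production, copied verbatim.
const observed = 'ERROR: Failed to connect to Prometheus endpoint, retrying...';

// Hypothetical helper: returns the candidate keywords found in the message.
function keywordsPresent(message: string, candidates: string[]): string[] {
  const lower = message.toLowerCase();
  return candidates.filter((keyword) => lower.includes(keyword));
}

const hits = keywordsPresent(observed, [
  'prometheus',
  'endpoint',
  'metrics',
  'failed',
  'retry',
  'error',
]);
// hits contains every candidate except 'metrics'
```

Here `metrics` does not appear in this particular message, which suggests keeping it only as an alternative to `endpoint` rather than requiring it.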

Step 2: Open THE ONE PLACE

bash
code apps/worker/src/connectors/comfyui-rest-stream-connector.ts

Step 3: Add Pattern

Find the isInfrastructureError() method and add your pattern:

typescript
private isInfrastructureError(message: string): boolean {
  const messageLower = message.toLowerCase();

  // OpenTelemetry/OTLP export failures
  if (
    (messageLower.includes('statuscode.unavailable') ||
      messageLower.includes('grpc') ||
      messageLower.includes('otlp')) &&
    (messageLower.includes('export') || messageLower.includes('dash0'))
  ) {
    return true;
  }

  // Prometheus metrics export failures (YOUR NEW PATTERN)
  if (
    messageLower.includes('prometheus') &&
    (messageLower.includes('endpoint') || messageLower.includes('metrics'))
  ) {
    return true;
  }

  // ... rest of patterns ...

  return false; // Not an infrastructure error - this is critical
}

Step 4: Build and Test

bash
# Build worker package
pnpm --filter=@emp/worker build

# Deploy and monitor
# Check logs for: "⚙️ Infrastructure error detected (not failing job)"

Step 5: Done! ✅

The error will now be logged but won't terminate jobs.

Current Infrastructure Patterns

These patterns are already handled as non-critical:

OpenTelemetry/OTLP Exports

typescript
// Example: "StatusCode.UNAVAILABLE encountered while exporting logs to dash0.com"
if (
  (messageLower.includes('statuscode.unavailable') ||
    messageLower.includes('grpc') ||
    messageLower.includes('otlp')) &&
  (messageLower.includes('export') || messageLower.includes('dash0'))
) {
  return true;
}

Telemetry Endpoints (Dash0, Jaeger, Zipkin)

typescript
// Example: "Failed to export traces to jaeger endpoint"
if (
  (messageLower.includes('dash0.com') ||
    messageLower.includes('jaeger') ||
    messageLower.includes('zipkin')) &&
  (messageLower.includes('error') || messageLower.includes('failed'))
) {
  return true;
}

OpenTelemetry SDK

typescript
// Example: "OpenTelemetry exporter connection refused"
if (
  messageLower.includes('opentelemetry') &&
  (messageLower.includes('exporter') || messageLower.includes('export'))
) {
  return true;
}

Metrics/Traces/Spans Exports

typescript
// Example: "Failed to export metrics to collector"
if (
  (messageLower.includes('metric') ||
    messageLower.includes('trace') ||
    messageLower.includes('span')) &&
  (messageLower.includes('export') || messageLower.includes('exporter'))
) {
  return true;
}
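
For comparison, the four patterns above can also be written as data. This is an alternative layout sketched for illustration, not how the connector is implemented; `InfraRule`, `INFRA_RULES`, `findInfraRule`, and the rule names are all hypothetical.

```typescript
// A rule matches when every group has at least one keyword in the message.
interface InfraRule {
  name: string;
  groups: string[][];
}

// Keyword groups copied from the four patterns listed above.
const INFRA_RULES: InfraRule[] = [
  { name: 'otlp-export', groups: [['statuscode.unavailable', 'grpc', 'otlp'], ['export', 'dash0']] },
  { name: 'telemetry-endpoint', groups: [['dash0.com', 'jaeger', 'zipkin'], ['error', 'failed']] },
  { name: 'otel-sdk', groups: [['opentelemetry'], ['exporter', 'export']] },
  { name: 'signal-export', groups: [['metric', 'trace', 'span'], ['export', 'exporter']] },
];

// Returns the name of the first matching rule, or undefined if none match.
function findInfraRule(message: string): string | undefined {
  const lower = message.toLowerCase();
  return INFRA_RULES.find((rule) =>
    rule.groups.every((group) => group.some((keyword) => lower.includes(keyword)))
  )?.name;
}
```

Returning the rule name instead of a boolean makes it possible to log which pattern suppressed an error, which can help when auditing false negatives.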

How to Identify Non-Critical Errors

Ask these questions:

1. Does this error prevent the AI workflow from running?

  • YES → Critical (fail job)

    • GPU out of memory
    • Model not found
    • Node execution failed
    • Invalid workflow
  • NO → Non-critical (log only)

    • Telemetry export failed
    • Metrics collection failed
    • Health check timeout (non-essential)

2. Is this error from infrastructure/observability?

Infrastructure/Observability (Non-Critical):

  • Telemetry exports (Dash0, Jaeger, OTLP)
  • Metrics collection (Prometheus, StatsD)
  • Health checks (non-essential monitoring)
  • Log shipping failures (logs to external systems)
  • APM/Tracing exports

AI Service (Critical):

  • ComfyUI node execution
  • Model loading/inference
  • GPU operations
  • Workflow validation
  • Image processing

3. Did the job complete successfully despite the error?

If you see the error in logs BUT the job produced correct output → Non-critical

Pattern Matching Best Practices

DO

  • ✅ Use broad pattern matching for infrastructure errors
  • ✅ Check for multiple indicators (service name + error type)
  • ✅ Test patterns with real production logs
  • ✅ Document new patterns with examples

DON'T

  • ❌ Don't be too specific (may miss variations)
  • ❌ Don't match workflow-critical errors
  • ❌ Don't skip testing before deploying

Example Patterns

Good Pattern (Broad)

typescript
// Catches all OpenTelemetry export failures
if (
  messageLower.includes('opentelemetry') &&
  messageLower.includes('export')
) {
  return true;
}

Bad Pattern (Too Specific)

typescript
// ❌ Only catches exact message - will miss variations
if (message === 'OpenTelemetry exporter failed to connect') {
  return true;
}

Good Pattern (Multiple Indicators)

typescript
// Checks for telemetry service AND error type
if (
  (messageLower.includes('dash0.com') || messageLower.includes('jaeger')) &&
  (messageLower.includes('export') || messageLower.includes('failed'))
) {
  return true;
}
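
The difference between the broad and the exact pattern shows up as soon as the message wording varies. In this small sketch the second and third messages are made-up variations, not lines taken from production logs.

```typescript
// First entry is the message from the "too specific" example; the rest are hypothetical variants.
const variants = [
  'OpenTelemetry exporter failed to connect',
  'OpenTelemetry export timed out after 10s',
  'opentelemetry: exporter connection refused',
];

// Exact comparison, as in the bad pattern.
const exactMatch = (message: string): boolean =>
  message === 'OpenTelemetry exporter failed to connect';

// Case-insensitive substring checks, as in the good pattern.
const broadMatch = (message: string): boolean => {
  const lower = message.toLowerCase();
  return lower.includes('opentelemetry') && lower.includes('export');
};

const exactHits = variants.filter(exactMatch).length; // 1 of 3
const broadHits = variants.filter(broadMatch).length; // 3 of 3
```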

Testing

Manual Test

bash
# 1. Simulate infrastructure error in logs
docker exec <comfyui-container> python -c "
import logging
logging.error('StatusCode.UNAVAILABLE while exporting to dash0.com')
"

# 2. Check worker logs
docker logs <worker-container> -f | grep "Infrastructure error"

# 3. Verify job didn't fail
# Check monitor UI - job should still show as running/completed

Integration Test

typescript
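// apps/worker/src/__tests__/non-critical-errors.test.ts
import { describe, it, expect } from 'vitest';
// Assumes the connector is a named export; adjust if the module exports it differently.
import { ComfyUIRestStreamConnector } from '../connectors/comfyui-rest-stream-connector';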

describe('Infrastructure Error Filtering', () => {
  it('should identify OpenTelemetry export errors as non-critical', () => {
    const connector = new ComfyUIRestStreamConnector(/* config */);

    const message = 'StatusCode.UNAVAILABLE while exporting logs to dash0.com';
    const result = connector['isInfrastructureError'](message);

    expect(result).toBe(true); // Non-critical
  });

  it('should identify GPU OOM as critical', () => {
    const connector = new ComfyUIRestStreamConnector(/* config */);

    const message = 'CUDA out of memory. Tried to allocate 2.00 GiB';
    const result = connector['isInfrastructureError'](message);

    expect(result).toBe(false); // Critical
  });
});

Debugging

Check Log Filtering

bash
# Watch worker logs for filtering decisions
docker logs <worker-container> -f | grep -E "Skipping|Infrastructure error|Critical error"

# Expected output:
# ✅ "⚙️ Infrastructure error detected (not failing job)" - Non-critical
# ✅ "📝 Skipping non-critical log (warning)" - Non-critical log level
# ✅ "🚨 Critical error detected from ComfyUI (error)" - Critical

Verify Job Status

bash
# Check if job failed or succeeded
curl http://localhost:3100/jobs/<job-id>

# If infrastructure error was logged but job succeeded:
# → Pattern correctly identified as non-critical ✅

# If infrastructure error was logged and job failed:
# → Pattern NOT identified - needs to be added ❌

Common Infrastructure Errors

Telemetry/APM

  • Dash0 export failures
  • Jaeger/Zipkin trace exports
  • OpenTelemetry SDK errors
  • OTLP gRPC connection issues
  • APM agent connection failures

Metrics

  • Prometheus scrape failures
  • StatsD connection issues
  • Metric exporter errors
  • Time series database connection failures

Logging

  • Log shipper failures (Fluentd, Logstash)
  • Log aggregation errors
  • External log storage issues

Health Checks

  • Non-essential service health checks
  • Readiness/liveness probe timeouts (non-critical)

What to Watch For

Don't Make Critical Errors Non-Critical!

typescript
// ❌ BAD - This will hide real workflow errors
if (messageLower.includes('error')) {
  return true; // Way too broad!
}

// ✅ GOOD - Specific to infrastructure
if (
  messageLower.includes('error') &&
  (messageLower.includes('dash0') || messageLower.includes('telemetry'))
) {
  return true;
}

Be Specific About Service

typescript
// ❌ BAD - "export" could mean workflow export
if (messageLower.includes('export')) {
  return true;
}

// ✅ GOOD - Clear it's telemetry export
if (
  messageLower.includes('export') &&
  (messageLower.includes('otlp') || messageLower.includes('telemetry'))
) {
  return true;
}
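
A quick guard against over-broad patterns is to keep a short list of messages that must always stay critical and check new patterns against it. The sketch below assumes hypothetical names (`mustStayCritical`, `wronglySilenced`); only the CUDA message comes from this guide, and the other two are invented wordings.

```typescript
// Messages that should never be classified as infrastructure errors.
const mustStayCritical = [
  'CUDA out of memory. Tried to allocate 2.00 GiB',
  'Error: model not found', // hypothetical wording
  'Failed to export workflow output', // hypothetical wording
];

// A pattern receives the lowercased message.
type Pattern = (lower: string) => boolean;

// Combines the two bad patterns shown above.
const tooBroad: Pattern = (m) => m.includes('error') || m.includes('export');

// Combines the two good patterns shown above.
const specific: Pattern = (m) =>
  (m.includes('error') && (m.includes('dash0') || m.includes('telemetry'))) ||
  (m.includes('export') && (m.includes('otlp') || m.includes('telemetry')));

// Returns the critical messages a pattern would incorrectly suppress.
function wronglySilenced(pattern: Pattern): string[] {
  return mustStayCritical.filter((message) => pattern(message.toLowerCase()));
}
```

With these samples, `wronglySilenced(tooBroad)` returns the two invented messages, while `wronglySilenced(specific)` returns an empty list.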

Quick Reference

typescript
// Template for new infrastructure error pattern
// Location: apps/worker/src/connectors/comfyui-rest-stream-connector.ts

private isInfrastructureError(message: string): boolean {
  const messageLower = message.toLowerCase();

  // {Description of what this catches}
  // Example: "{Actual error message from logs}"
  if (
    messageLower.includes('{service-indicator}') &&
    (messageLower.includes('{error-type-1}') || messageLower.includes('{error-type-2}'))
  ) {
    return true; // Don't fail job
  }

  return false; // Critical - fail job
}

Released under the MIT License.