Non-Critical Error Handling
How to identify and handle infrastructure/telemetry errors that should be logged but should not fail jobs.
The Problem
Sometimes errors appear in the event stream that look like job failures, but they're actually infrastructure errors that don't impact the AI workflow itself.
Example:
```
Transient error StatusCode.UNAVAILABLE encountered while exporting logs to ingress.us-west-2.aws.dash0.com:4317, retrying in 0.90s.
```
This triggered a job retry even though the actual AI generation succeeded. The telemetry export failure is non-critical - it doesn't affect the job outcome.
Critical vs Non-Critical
The system now distinguishes between:
- Critical Errors → Fail the job (GPU OOM, missing models, workflow errors)
- Non-Critical Errors → Log but don't fail (telemetry exports, metrics, infrastructure)
Error Filtering Pipeline
Filter 1: Log Level
Only critical and error level logs can fail jobs:
```typescript
// Filter 1: Only fail on critical/error level logs
const isCriticalLevel = logLevel === 'critical' || logLevel === 'error';

if (!isCriticalLevel) {
  logger.debug(`Skipping non-critical log (${logLevel})`);
  return { type: 'timeout' }; // Not a critical error
}
```

Filter 2: Infrastructure Error
Even if logged as "error", infrastructure failures don't fail jobs:
```typescript
// Filter 2: Skip infrastructure/telemetry errors
const isInfrastructureError = this.isInfrastructureError(logMessage);

if (isInfrastructureError) {
  logger.warn(`⚙️ Infrastructure error detected (not failing job)`);
  return { type: 'timeout' }; // Log but don't fail
}
```

THE ONE PLACE for Infrastructure Patterns
When you find a new false positive (error that shouldn't fail jobs), add it here:
📁 apps/worker/src/connectors/comfyui-rest-stream-connector.ts
→ isInfrastructureError() method

Step-by-Step: Adding a Pattern
Example: New False Positive
You see this in production:
```
ERROR: Failed to connect to Prometheus endpoint, retrying...
```
This caused a job to fail, but it's just a metrics collection issue.
Step 1: Identify the Pattern
Key patterns (see the quick check below):
- prometheus + endpoint + failed/retry
- Or: prometheus + metrics + error
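As an optional quick check (not part of the connector), you can verify the candidate indicators against the captured message in a scratch TypeScript snippet before editing anything:

```typescript
// Throwaway check: do the proposed indicators actually match the real production message?
const sample = 'ERROR: Failed to connect to Prometheus endpoint, retrying...'.toLowerCase();

const matchesCandidatePattern =
  sample.includes('prometheus') &&
  (sample.includes('endpoint') || sample.includes('metrics'));

console.log(matchesCandidatePattern); // expected: true
```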
Step 2: Open THE ONE PLACE
```bash
code apps/worker/src/connectors/comfyui-rest-stream-connector.ts
```

Step 3: Add Pattern
Find the isInfrastructureError() method and add your pattern:
```typescript
private isInfrastructureError(message: string): boolean {
  const messageLower = message.toLowerCase();

  // OpenTelemetry/OTLP export failures
  if (
    (messageLower.includes('statuscode.unavailable') ||
      messageLower.includes('grpc') ||
      messageLower.includes('otlp')) &&
    (messageLower.includes('export') || messageLower.includes('dash0'))
  ) {
    return true;
  }

  // Prometheus metrics export failures (YOUR NEW PATTERN)
  if (
    messageLower.includes('prometheus') &&
    (messageLower.includes('endpoint') || messageLower.includes('metrics'))
  ) {
    return true;
  }

  // ... rest of patterns ...

  return false; // Not an infrastructure error - this is critical
}
```

Step 4: Build and Test
```bash
# Build worker package
pnpm --filter=@emp/worker build

# Deploy and monitor
# Check logs for: "⚙️ Infrastructure error detected (not failing job)"
```

Step 5: Done! ✅
The error will now be logged but won't terminate jobs.
Current Infrastructure Patterns
These patterns are already handled as non-critical:
OpenTelemetry/OTLP Exports
// Example: "StatusCode.UNAVAILABLE encountered while exporting logs to dash0.com"
if (
(messageLower.includes('statuscode.unavailable') ||
messageLower.includes('grpc') ||
messageLower.includes('otlp')) &&
(messageLower.includes('export') || messageLower.includes('dash0'))
) {
return true;
}Telemetry Endpoints (Dash0, Jaeger, Zipkin)
// Example: "Failed to export traces to jaeger endpoint"
if (
(messageLower.includes('dash0.com') ||
messageLower.includes('jaeger') ||
messageLower.includes('zipkin')) &&
(messageLower.includes('error') || messageLower.includes('failed'))
) {
return true;
}OpenTelemetry SDK
// Example: "OpenTelemetry exporter connection refused"
if (
messageLower.includes('opentelemetry') &&
(messageLower.includes('exporter') || messageLower.includes('export'))
) {
return true;
}Metrics/Traces/Spans Exports
// Example: "Failed to export metrics to collector"
if (
(messageLower.includes('metric') ||
messageLower.includes('trace') ||
messageLower.includes('span')) &&
(messageLower.includes('export') || messageLower.includes('exporter'))
) {
return true;
}How to Identify Non-Critical Errors
Ask these questions (a small triage sketch follows the lists below):
1. Does this error prevent the AI workflow from running?
YES → Critical (fail job)
- GPU out of memory
- Model not found
- Node execution failed
- Invalid workflow
NO → Non-critical (log only)
- Telemetry export failed
- Metrics collection failed
- Health check timeout (non-essential)
2. Is this error from infrastructure/observability?
Infrastructure/Observability (Non-Critical):
- Telemetry exports (Dash0, Jaeger, OTLP)
- Metrics collection (Prometheus, StatsD)
- Health checks (non-essential monitoring)
- Log shipping failures (logs to external systems)
- APM/Tracing exports
AI Service (Critical):
- ComfyUI node execution
- Model loading/inference
- GPU operations
- Workflow validation
- Image processing
3. Did the job complete successfully despite the error?
If you see the error in logs BUT the job produced correct output → Non-critical
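For illustration only, here is a minimal sketch that encodes these three questions as a triage helper; the triageError function and its flags are hypothetical and not part of the connector:

```typescript
// Hypothetical triage helper encoding the three questions above (not in the codebase).
type ErrorSeverity = 'critical' | 'non-critical';

interface ErrorTriage {
  blocksWorkflow: boolean;     // Q1: does the error prevent the AI workflow from running?
  fromObservability: boolean;  // Q2: is it from telemetry/metrics/log-shipping infrastructure?
  jobSucceededAnyway: boolean; // Q3: did the job still produce correct output?
}

function triageError(t: ErrorTriage): ErrorSeverity {
  if (t.blocksWorkflow) return 'critical'; // GPU OOM, missing model, workflow error
  if (t.fromObservability || t.jobSucceededAnyway) return 'non-critical'; // log only
  return 'critical'; // when unsure, fail the job rather than hide a real error
}

// Example: a Dash0 export failure on a job that still produced its output
console.log(
  triageError({ blocksWorkflow: false, fromObservability: true, jobSucceededAnyway: true })
); // → 'non-critical'
```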
Pattern Matching Best Practices
DO
- ✅ Use broad pattern matching for infrastructure errors
- ✅ Check for multiple indicators (service name + error type)
- ✅ Test patterns with real production logs
- ✅ Document new patterns with examples
DON'T
- ❌ Don't be too specific (may miss variations)
- ❌ Don't match workflow-critical errors
- ❌ Don't skip testing before deploying
Example Patterns
Good Pattern (Broad)
```typescript
// Catches all OpenTelemetry export failures
if (
  messageLower.includes('opentelemetry') &&
  messageLower.includes('export')
) {
  return true;
}
```

Bad Pattern (Too Specific)
```typescript
// ❌ Only catches exact message - will miss variations
if (message === 'OpenTelemetry exporter failed to connect') {
  return true;
}
```

Good Pattern (Multiple Indicators)
```typescript
// Checks for telemetry service AND error type
if (
  (messageLower.includes('dash0.com') || messageLower.includes('jaeger')) &&
  (messageLower.includes('export') || messageLower.includes('failed'))
) {
  return true;
}
```

Testing
Manual Test
```bash
# 1. Simulate infrastructure error in logs
docker exec <comfyui-container> python -c "
import logging
logging.error('StatusCode.UNAVAILABLE while exporting to dash0.com')
"

# 2. Check worker logs
docker logs <worker-container> -f | grep "Infrastructure error"

# 3. Verify job didn't fail
# Check monitor UI - job should still show as running/completed
```

Integration Test
```typescript
// apps/worker/src/__tests__/non-critical-errors.test.ts
import { describe, it, expect } from 'vitest';
import { ComfyUIRestStreamConnector } from '../connectors/comfyui-rest-stream-connector';

describe('Infrastructure Error Filtering', () => {
  it('should identify OpenTelemetry export errors as non-critical', () => {
    const connector = new ComfyUIRestStreamConnector(/* config */);
    const message = 'StatusCode.UNAVAILABLE while exporting logs to dash0.com';

    const result = connector['isInfrastructureError'](message);

    expect(result).toBe(true); // Non-critical
  });

  it('should identify GPU OOM as critical', () => {
    const connector = new ComfyUIRestStreamConnector(/* config */);
    const message = 'CUDA out of memory. Tried to allocate 2.00 GiB';

    const result = connector['isInfrastructureError'](message);

    expect(result).toBe(false); // Critical
  });
});
```

Debugging
Check Log Filtering
```bash
# Watch worker logs for filtering decisions
docker logs <worker-container> -f | grep -E "Skipping|Infrastructure error|Critical error"

# Expected output:
# ✅ "⚙️ Infrastructure error detected (not failing job)" - Non-critical
# ✅ "📝 Skipping non-critical log (warning)" - Non-critical log level
# ✅ "🚨 Critical error detected from ComfyUI (error)" - Critical
```

Verify Job Status
```bash
# Check if job failed or succeeded
curl http://localhost:3100/jobs/<job-id>

# If infrastructure error was logged but job succeeded:
# → Pattern correctly identified as non-critical ✅

# If infrastructure error was logged and job failed:
# → Pattern NOT identified - needs to be added ❌
```

Common Infrastructure Errors
Telemetry/APM
- Dash0 export failures
- Jaeger/Zipkin trace exports
- OpenTelemetry SDK errors
- OTLP gRPC connection issues
- APM agent connection failures
Metrics
- Prometheus scrape failures
- StatsD connection issues
- Metric exporter errors
- Time series database connection failures
Logging
- Log shipper failures (Fluentd, Logstash)
- Log aggregation errors
- External log storage issues
Health Checks
- Non-essential service health checks
- Readiness/liveness probe timeouts (non-critical)
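Not all of these categories are covered by the patterns shipped today. As a sketch of how a few of them could be expressed before porting them into isInfrastructureError(), assuming indicator strings that should be verified against real log lines first:

```typescript
// Sketch only: candidate checks for some categories above, written as a standalone helper so
// they can be tried in isolation. Indicator strings are assumptions - confirm against
// production logs before moving them into isInfrastructureError().
function looksLikeObservabilityFailure(message: string): boolean {
  const messageLower = message.toLowerCase();

  // Metrics backends (Prometheus scrape / StatsD push failures)
  if (
    (messageLower.includes('prometheus') || messageLower.includes('statsd')) &&
    (messageLower.includes('scrape') || messageLower.includes('connection') || messageLower.includes('failed'))
  ) {
    return true;
  }

  // Log shipping (Fluentd / Logstash forwarders)
  if (
    (messageLower.includes('fluentd') || messageLower.includes('logstash')) &&
    (messageLower.includes('failed') || messageLower.includes('timeout'))
  ) {
    return true;
  }

  return false;
}

console.log(looksLikeObservabilityFailure('Prometheus scrape failed: connection refused')); // true
```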
What to Watch For
Don't Make Critical Errors Non-Critical!
```typescript
// ❌ BAD - This will hide real workflow errors
if (messageLower.includes('error')) {
  return true; // Way too broad!
}

// ✅ GOOD - Specific to infrastructure
if (
  messageLower.includes('error') &&
  (messageLower.includes('dash0') || messageLower.includes('telemetry'))
) {
  return true;
}
```

Be Specific About Service
// ❌ BAD - "export" could mean workflow export
if (messageLower.includes('export')) {
return true;
}
// ✅ GOOD - Clear it's telemetry export
if (
messageLower.includes('export') &&
(messageLower.includes('otlp') || messageLower.includes('telemetry'))
) {
return true;
}Related Documentation
Quick Reference
```typescript
// Template for new infrastructure error pattern
// Location: apps/worker/src/connectors/comfyui-rest-stream-connector.ts

private isInfrastructureError(message: string): boolean {
  const messageLower = message.toLowerCase();

  // {Description of what this catches}
  // Example: "{Actual error message from logs}"
  if (
    messageLower.includes('{service-indicator}') &&
    (messageLower.includes('{error-type-1}') || messageLower.includes('{error-type-2}'))
  ) {
    return true; // Don't fail job
  }

  return false; // Critical - fail job
}
```