Error Handling Architecture

Technical overview of the error handling system architecture.

Design Principles

1. TypeScript as Source of Truth

All error classification happens in TypeScript, not in Python or any other language.

Why?

Unified error handling across all services (ComfyUI, OpenAI, Gemini, etc.)
Minimal changes to upstream dependencies (ComfyUI fork)
Faster iteration (no Docker rebuilds for ComfyUI)
Single language/ecosystem for maintenance

2. Fail with Clarity, Not Fallbacks

From CLAUDE.md:

ALWAYS favor descriptive errors over fallbacks... ALWAYS

No silent failures
Explicit error messages with actionable suggestions
Fail fast with clear root cause visibility

3. Two-Tier Classification

FailureType (High-Level)  →  FailureReason (Specific)
          ↓                            ↓
VALIDATION_ERROR          →  MISSING_REQUIRED_FIELD
RESOURCE_LIMIT            →  GPU_MEMORY_FULL
MODEL_ERROR               →  MODEL_NOT_FOUND
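
As a rough illustration of how the two tiers relate, the sketch below assumes each specific reason rolls up to exactly one high-level type. The string values and the `REASON_TO_TYPE` lookup are hypothetical stand-ins, not the definitions in `failure-classification.ts`.

```typescript
// Stand-ins for the real enums; the string values are assumed for illustration.
type FailureType = 'validation_error' | 'resource_limit' | 'model_error';
type FailureReason = 'missing_required_field' | 'gpu_memory_full' | 'model_not_found';

// Hypothetical lookup: every specific reason belongs to one high-level type.
// Record<FailureReason, ...> makes the compiler reject a missing reason.
const REASON_TO_TYPE: Record<FailureReason, FailureType> = {
  missing_required_field: 'validation_error',
  gpu_memory_full: 'resource_limit',
  model_not_found: 'model_error',
};

// Resolve the high-level tier from the specific reason.
function failureTypeFor(reason: FailureReason): FailureType {
  return REASON_TO_TYPE[reason];
}

console.log(failureTypeFor('gpu_memory_full'));
```

Keeping the mapping in one table means a new reason cannot be added without also deciding which high-level type it belongs to.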

4. Critical vs Non-Critical Filtering

Not all errors should fail jobs:

Critical: Workflow execution errors → Fail job
Non-Critical: Infrastructure/telemetry errors → Log only

System Components

Python Layer (ComfyUI)

Location: packages/comfyui/execution.py

Responsibility: Catch exceptions and send raw data only

python

# Send RAW exception info only (no classification)
error_details = {
    "node_id": real_node_id,
    "node_type": class_type,
    "exception_message": str(ex),
    "exception_type": exception_type,
    "traceback": traceback.format_tb(tb),
    "current_inputs": input_data_formatted,
}

# Publish to Redis Stream
redis.xadd('comfyui:unified:events', {
    'event_type': 'log',
    'level': 'error',
    'message': json.dumps(error_details)
})

What it does NOT do:

❌ Classify errors
❌ Determine retryability
❌ Generate user messages
❌ Match patterns
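
On the consuming side, the `message` field of the stream entry is a JSON string holding the raw exception data. The sketch below shows one way the TypeScript side might parse and validate it. `RawComfyUIError` and `parseRawError` are hypothetical names, and the field list simply mirrors the `error_details` dictionary above.

```typescript
// Shape of the raw payload published by the Python layer (mirrors error_details).
interface RawComfyUIError {
  node_id: string;
  node_type: string;
  exception_message: string;
  exception_type: string;
  traceback: string[];
  current_inputs: unknown;
}

// Hypothetical helper: parse the stream entry's `message` field.
// Returns null for unusable payloads so the caller can log or fail explicitly.
function parseRawError(message: string): RawComfyUIError | null {
  let data: unknown;
  try {
    data = JSON.parse(message);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== 'object' || data === null) return null;

  const d = data as Record<string, unknown>;
  const { exception_message, exception_type } = d;
  if (typeof exception_message !== 'string' || typeof exception_type !== 'string') {
    return null; // the two fields classification depends on are missing
  }

  return {
    node_id: String(d.node_id ?? ''),
    node_type: String(d.node_type ?? ''),
    exception_message,
    exception_type,
    traceback: Array.isArray(d.traceback) ? d.traceback.map(String) : [],
    current_inputs: d.current_inputs ?? {},
  };
}
```

A `null` result should still be surfaced by the caller (for example as a logged warning), in keeping with the no-silent-failures principle.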

TypeScript Worker (Connector)

Location: apps/worker/src/connectors/comfyui-rest-stream-connector.ts

Responsibility: Filter and route errors

typescript

// Filter 1: Log Level
const isCriticalLevel = logLevel === 'critical' || logLevel === 'error';
if (!isCriticalLevel) {
  return { type: 'timeout' }; // Skip warning/info/debug
}

// Filter 2: Infrastructure Errors
const isInfrastructureError = this.isInfrastructureError(logMessage);
if (isInfrastructureError) {
  logger.warn(`⚙️ Infrastructure error (not failing job)`);
  return { type: 'timeout' }; // Log but don't fail
}

// Critical error → Send to enhancer
const connectorError = ComfyUIErrorEnhancer.enhance(error, context);
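
The body of `isInfrastructureError` is not shown in this document. The sketch below is one plausible shape for it, together with the two-filter decision expressed as a pure function. The patterns, the `Decision` type, and `decide` are assumptions for illustration only.

```typescript
// Hypothetical patterns; the real list in the connector is not shown here.
const INFRASTRUCTURE_PATTERNS: RegExp[] = [
  /opentelemetry/i,
  /failed to export (spans|metrics|traces)/i,
];

// True when the log line looks like telemetry/infrastructure noise.
function isInfrastructureError(logMessage: string): boolean {
  return INFRASTRUCTURE_PATTERNS.some((pattern) => pattern.test(logMessage));
}

type Decision = 'skip' | 'log_only' | 'fail_job';

// Apply Filter 1 (log level) and then Filter 2 (infrastructure errors).
function decide(logLevel: string, logMessage: string): Decision {
  const isCriticalLevel = logLevel === 'critical' || logLevel === 'error';
  if (!isCriticalLevel) return 'skip'; // warning/info/debug
  if (isInfrastructureError(logMessage)) return 'log_only';
  return 'fail_job'; // hand off to the enhancer
}

console.log(decide('error', "KeyError: 'url'"));
```

Expressing the filters as a pure function makes the critical versus non-critical rule straightforward to unit test without a Redis stream.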

TypeScript Core (Enhancer)

Location: packages/core/src/errors/comfyui-error-enhancer.ts

Responsibility: Pattern matching and classification

typescript

static enhance(error: ComfyUIError, context?: ComfyUIErrorContext): ConnectorError {
  const message = error.exception_message || '';
  const exceptionType = error.exception_type || '';

  // Pattern matching in order (specific → generic)
  if (message.match(/node .+ does not exist/i)) {
    // Missing custom node
    return new ConnectorError(
      FailureType.VALIDATION_ERROR,
      FailureReason.UNSUPPORTED_OPERATION,
      'Custom node not installed',
      false,
      { suggestion: 'Install the custom node' }
    );
  }

  if (exceptionType.toLowerCase() === 'keyerror') {
    // Missing required field: pull the quoted field name out of the message
    const fieldName = message.match(/['"]([^'"]+)['"]/)?.[1] ?? 'unknown';
    return new ConnectorError(
      FailureType.VALIDATION_ERROR,
      FailureReason.MISSING_REQUIRED_FIELD,
      `Missing required field '${fieldName}'`,
      false,
      { suggestion: 'Provide the required parameter' }
    );
  }

  // Fallback to FailureClassifier
  return FailureClassifier.classify(message);
}

Fallback Classifier

Location: packages/core/src/types/failure-classification.ts

Responsibility: Generic pattern matching for all services

typescript

export class FailureClassifier {
  static classify(errorMessage: string, context?: ClassificationContext) {
    const error = errorMessage.toLowerCase();

    // GPU OOM
    if (error.includes('cuda out of memory')) {
      return {
        failure_type: FailureType.RESOURCE_LIMIT,
        failure_reason: FailureReason.GPU_MEMORY_FULL,
        failure_description: 'GPU ran out of memory',
        retryable: true
      };
    }

    // ... more patterns ...

    // Unknown
    return {
      failure_type: FailureType.SYSTEM_ERROR,
      failure_reason: FailureReason.UNKNOWN_ERROR,
      failure_description: errorMessage,
      retryable: false
    };
  }
}

ConnectorError

Location: packages/core/src/types/connector-errors.ts

Responsibility: Structured error object

typescript

class ConnectorError extends Error {
  constructor(
    public failureType: FailureType,
    public failureReason: FailureReason,
    public message: string,
    public retryable: boolean,
    public context: Record<string, any>
  ) {
    super(message);
  }
}
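
`FailureClassifier.classify` returns a plain object, while the enhancer is declared to return a `ConnectorError`, so some conversion between the two is presumably needed. The sketch below shows one way this could look. The `Classification` interface, the simplified `ConnectorError` stand-in (plain string fields instead of the enums), and `toConnectorError` are all assumptions, not the actual code in `packages/core`.

```typescript
// Simplified stand-in for the object returned by the fallback classifier.
interface Classification {
  failure_type: string;
  failure_reason: string;
  failure_description: string;
  retryable: boolean;
}

// Simplified stand-in for ConnectorError (strings instead of enums).
class ConnectorError extends Error {
  failureType: string;
  failureReason: string;
  retryable: boolean;
  context: Record<string, unknown>;

  constructor(
    failureType: string,
    failureReason: string,
    message: string,
    retryable: boolean,
    context: Record<string, unknown> = {},
  ) {
    super(message);
    this.name = 'ConnectorError';
    this.failureType = failureType;
    this.failureReason = failureReason;
    this.retryable = retryable;
    this.context = context;
  }
}

// Hypothetical adapter: wrap a fallback classification in a ConnectorError.
function toConnectorError(
  c: Classification,
  context: Record<string, unknown> = {},
): ConnectorError {
  return new ConnectorError(
    c.failure_type,
    c.failure_reason,
    c.failure_description,
    c.retryable,
    context,
  );
}
```

With an adapter like this, both the specific patterns and the fallback path hand the worker the same structured error type.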

Data Flow

1. Exception Occurs

2. Worker Processing

3. UI Display

Error Flow Example

Scenario: Missing URL Parameter

Step 1: Python Exception

python

# packages/comfyui/custom_nodes/emprops_comfy_nodes/nodes/emprops_image_loader.py
def load_image(self, **kwargs):
    url = kwargs['url']  # KeyError: 'url'

Step 2: Exception Handler

python

# packages/comfyui/execution.py
except Exception as ex:
    error_details = {
        "node_id": "66",
        "node_type": "emprops_image_loader",
        "exception_message": "KeyError: 'url'",
        "exception_type": "KeyError",
        "traceback": [...],
        "current_inputs": {}
    }
    # Publish to stream

Step 3: Worker Receives

typescript

// apps/worker/src/connectors/comfyui-rest-stream-connector.ts
const logResult = await this.readLogStream(...);

// Filter 1: logLevel === 'error' ✅ (passes)
// Filter 2: not infrastructure ✅ (passes)

// Send to enhancer
const connectorError = ComfyUIErrorEnhancer.enhance(error);

Step 4: Pattern Matching

typescript

// packages/core/src/errors/comfyui-error-enhancer.ts
if (exceptionType.toLowerCase() === 'keyerror') {
  const fieldMatch = message.match(/KeyError:\s*['"]([^'"]+)['"]/i);
  const fieldName = fieldMatch?.[1] ?? 'unknown'; // 'url' here; guards against a message that does not match

  return new ConnectorError(
    FailureType.VALIDATION_ERROR,
    FailureReason.MISSING_REQUIRED_FIELD,
    `Missing required field 'url' in node emprops_image_loader`,
    false,
    {
      suggestion: "Provide the required 'url' parameter in your workflow configuration",
      missingField: 'url'
    }
  );
}
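
One detail to watch: Python's `str(KeyError('url'))` evaluates to `'url'` (the quoted key only, without a `KeyError:` prefix), so a payload built with `str(ex)` may not contain the prefix that the regex above expects. The sketch below accepts both forms. `extractKeyErrorField` is a hypothetical helper, not part of the enhancer as documented.

```typescript
// Hypothetical helper: extract the missing key from a KeyError message.
// Accepts both "KeyError: 'url'" and the bare "'url'" produced by str(ex).
function extractKeyErrorField(message: string): string | null {
  const match = message.trim().match(/^(?:KeyError:\s*)?['"]([^'"]+)['"]$/i);
  return match ? match[1] : null;
}

console.log(extractKeyErrorField("KeyError: 'url'"));
console.log(extractKeyErrorField("'url'"));
```

Returning `null` for a non-matching message lets the caller fall back to a generic "missing required field" description instead of throwing inside the error handler.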

Step 5: Job Fails

typescript

// Job marked as failed
// User sees in Monitor UI:
// ❌ Missing required field 'url' in node emprops_image_loader
// Suggestion: Provide the required 'url' parameter in your workflow configuration

Extension Points

Adding New Service Error Handlers

typescript

// Create service-specific enhancer
export class OpenAIErrorEnhancer {
  static enhance(error: OpenAIError): ConnectorError {
    // Service-specific pattern matching
  }
}

// Use in connector
class OpenAIConnector extends BaseConnector {
  protected handleServiceError(error: any): ConnectorError {
    return OpenAIErrorEnhancer.enhance(error);
  }
}

Adding New FailureReason

typescript

// packages/core/src/types/failure-classification.ts
export enum FailureReason {
  // Existing...
  GPU_MEMORY_FULL = 'gpu_memory_full',

  // New
  VRAM_EXCEEDED = 'vram_exceeded',  // Add this
}

// packages/core/src/types/connector-errors.ts
export const ErrorDescriptions: Record<FailureReason, string> = {
  // Existing...
  [FailureReason.GPU_MEMORY_FULL]: 'GPU ran out of memory',

  // New
  [FailureReason.VRAM_EXCEEDED]: 'VRAM allocation limit exceeded',
};

Performance Considerations

Pattern Matching Order

typescript

// ✅ Specific patterns first (fast path)
if (message.match(/node .+ does not exist/i)) { ... }

// ✅ Common patterns second
if (exceptionType.toLowerCase() === 'keyerror') { ... }

// ✅ Expensive regex last
if (message.match(/complex.*regex.*pattern/i)) { ... }

// ✅ Fallback (slowest)
return FailureClassifier.classify(message);
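
The ordering above can also be made explicit as data. The following sketch assumes a simple ordered rule table where the first matching rule wins. `Rule`, `RULES`, `firstMatch`, and the rule names are illustrative and not part of the enhancer as documented.

```typescript
interface RawError {
  exception_type: string;
  exception_message: string;
}

// A rule is a name plus a predicate; array order is evaluation order.
interface Rule {
  name: string;
  matches: (e: RawError) => boolean;
}

// Specific patterns first, broader checks later.
const RULES: Rule[] = [
  {
    name: 'missing_custom_node',
    matches: (e) => /node .+ does not exist/i.test(e.exception_message),
  },
  {
    name: 'key_error',
    matches: (e) => e.exception_type.toLowerCase() === 'keyerror',
  },
];

// Return the first matching rule's name, or 'fallback' when none match.
function firstMatch(e: RawError): string {
  for (const rule of RULES) {
    if (rule.matches(e)) return rule.name;
  }
  return 'fallback';
}

console.log(firstMatch({ exception_type: 'KeyError', exception_message: "'url'" }));
```

A table like this keeps precedence visible in one place, at the cost of slightly more indirection than a chain of `if` statements.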

Caching

There is currently no caching: each error is pattern-matched from scratch.

A possible future optimization is to cache the mapping from error signature (exception type plus message) to ConnectorError.

typescript

// Potential future optimization
const errorCache = new LRU<string, ConnectorError>(100);

static enhance(error: ComfyUIError): ConnectorError {
  const cacheKey = `${error.exception_type}:${error.exception_message}`;

  // Single lookup; avoids returning undefined if the entry was evicted
  const cached = errorCache.get(cacheKey);
  if (cached) {
    return cached;
  }

  const result = /* pattern matching */;
  errorCache.set(cacheKey, result);
  return result;
}
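
The `LRU` class in the snippet above is not defined in this document. As a dependency-free illustration, a small LRU cache can be built on `Map`, which iterates keys in insertion order. `LruCache` is a hypothetical name and the sketch assumes `maxSize` is at least 1.

```typescript
// Minimal LRU cache built on Map's insertion-order iteration.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxSize) {
      // Evict the least recently used entry: the first key in iteration order.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new LruCache<string, string>(2);
cache.set('KeyError:url', 'missing_required_field');
console.log(cache.get('KeyError:url'));
```

Whether caching pays off depends on how often identical messages recur. Messages that embed unique IDs or file paths would rarely hit the cache.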

Testing Strategy

Unit Tests

typescript

// Test individual pattern matchers
describe('ComfyUIErrorEnhancer', () => {
  it('should classify KeyError', () => {
    const error = { exception_type: 'KeyError', exception_message: "KeyError: 'url'" };
    const result = ComfyUIErrorEnhancer.enhance(error);
    expect(result.failureReason).toBe('missing_required_field');
  });
});

Integration Tests

Test the full error flow (Python → TypeScript → UI) manually:

1. Start the local development environment
2. Trigger an error in a ComfyUI workflow
3. Verify the error classification in the worker logs
4. Verify the error display in the Monitor UI

E2E Tests

typescript

// Test error handling across services
it('should handle ComfyUI KeyError end-to-end', async () => {
  const jobId = await submitJob({ /* missing url */ });

  await waitForJobCompletion(jobId);

  const job = await getJob(jobId);
  expect(job.status).toBe('failed');
  expect(job.error.failureReason).toBe('missing_required_field');
  expect(job.error.context.missingField).toBe('url');
});

Monitoring

Metrics to Track

Error classification accuracy (% matched vs fallback)
Pattern matching performance (avg time)
Infrastructure error filter rate (% skipped)
Most common error patterns
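
A minimal in-memory sketch of how some of these metrics might be collected is shown below. `ClassificationMetrics` and its method names are hypothetical, and a real deployment would more likely forward the counts to an existing metrics backend.

```typescript
// Hypothetical in-memory counters for classification outcomes.
class ClassificationMetrics {
  private matched = 0;
  private fallback = 0;
  private infrastructureSkipped = 0;
  private readonly patternCounts = new Map<string, number>();

  // A specific pattern classified the error.
  recordMatch(pattern: string): void {
    this.matched += 1;
    this.patternCounts.set(pattern, (this.patternCounts.get(pattern) ?? 0) + 1);
  }

  // The error fell through to the generic classifier.
  recordFallback(): void {
    this.fallback += 1;
  }

  // The error was filtered out as infrastructure noise.
  recordInfrastructureSkip(): void {
    this.infrastructureSkipped += 1;
  }

  // Share of classified errors handled by a specific pattern (0 when empty).
  matchRate(): number {
    const total = this.matched + this.fallback;
    return total === 0 ? 0 : this.matched / total;
  }

  // Share of error-level events that were skipped as infrastructure errors.
  infrastructureSkipRate(): number {
    const total = this.matched + this.fallback + this.infrastructureSkipped;
    return total === 0 ? 0 : this.infrastructureSkipped / total;
  }

  // Most frequently matched patterns, highest count first.
  topPatterns(n: number): Array<[string, number]> {
    return Array.from(this.patternCounts.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, n);
  }
}
```

A low match rate suggests that many errors reach the fallback classifier and that new specific patterns may be worth adding.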

Logging

typescript

// Debug logging for pattern matching
logger.debug('Classifying error:', {
  exceptionType: error.exception_type,
  messagePreview: error.exception_message.substring(0, 100),
  matchedPattern: 'KeyError',
  failureReason: 'missing_required_field'
});

// Info logging for infrastructure errors
logger.info('Infrastructure error detected (not failing job):', {
  message: logMessage.substring(0, 200),
  category: 'OpenTelemetry Export'
});

Future Improvements

Machine Learning Classification

typescript

// Potential ML-based classification
class MLErrorClassifier {
  async classify(error: string): Promise<ConnectorError> {
    const embedding = await getEmbedding(error);
    const prediction = await model.predict(embedding);

    return new ConnectorError(
      prediction.failureType,
      prediction.failureReason,
      prediction.message,
      prediction.retryable,
      { confidence: prediction.confidence }
    );
  }
}

Error Pattern Analytics

typescript

// Track pattern effectiveness
interface PatternStats {
  pattern: string;
  matchCount: number;
  lastMatched: Date;
  avgConfidence: number;
}

// Identify gaps in coverage
const unmatchedErrors = errors.filter(e =>
  e.failureReason === FailureReason.UNKNOWN_ERROR
);

Auto-Retry Strategy

typescript

// Smart retry based on classification
if (error.failureType === FailureType.RESOURCE_LIMIT && error.retryable) {
  // Retry with reduced settings
  await retryWithReducedBatchSize(job);
} else if (error.failureType === FailureType.TIMEOUT) {
  // Retry with increased timeout
  await retryWithIncreasedTimeout(job);
}
