Error Handling Architecture
Technical overview of the error handling system architecture.
Design Principles
1. TypeScript as Source of Truth
All error classification happens in TypeScript, not in Python or any other language.
Why?
- Unified error handling across all services (ComfyUI, OpenAI, Gemini, etc.)
- Minimal changes to upstream dependencies (ComfyUI fork)
- Faster iteration (no Docker rebuilds for ComfyUI)
- Single language/ecosystem for maintenance
2. Fail with Clarity, Not Fallbacks
From CLAUDE.md:
ALWAYS favor descriptive errors over fallbacks... ALWAYS
- No silent failures
- Explicit error messages with actionable suggestions
- Fail fast with clear root cause visibility
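As a minimal sketch of this principle, the helper below throws a descriptive error instead of substituting a default when a required input is missing. `requireField` and `MissingFieldError` are hypothetical names used for illustration and are not part of the codebase.

```typescript
// Hypothetical error type: carries the missing field and an actionable suggestion.
class MissingFieldError extends Error {
  constructor(
    public readonly field: string,
    public readonly suggestion: string
  ) {
    super(`Missing required field '${field}'`);
    this.name = 'MissingFieldError';
  }
}

// Hypothetical helper: fail fast rather than fall back to a default value.
function requireField<T>(inputs: Record<string, T | undefined>, field: string): T {
  const value = inputs[field];
  if (value === undefined) {
    throw new MissingFieldError(field, `Provide the required '${field}' parameter`);
  }
  return value;
}

// Anti-pattern for comparison: a silent fallback hides the root cause.
// const url = inputs.url ?? 'https://example.com/placeholder.png';
```

Callers either get a usable value or an error that names the field and says what to do about it.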
3. Two-Tier Classification
FailureType (High-Level)  →  FailureReason (Specific)
VALIDATION_ERROR          →  MISSING_REQUIRED_FIELD
RESOURCE_LIMIT            →  GPU_MEMORY_FULL
MODEL_ERROR               →  MODEL_NOT_FOUND
4. Critical vs Non-Critical Filtering
Not all errors should fail jobs:
- Critical: Workflow execution errors → Fail job
- Non-Critical: Infrastructure/telemetry errors → Log only
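The critical/non-critical split can be sketched as a single pure decision function. The pattern list is illustrative only: OpenTelemetry export failures are the one category this document names, and the other pattern, along with the names `shouldFailJob` and `INFRASTRUCTURE_PATTERNS`, is an assumption rather than the worker's actual implementation.

```typescript
type LogLevel = 'debug' | 'info' | 'warning' | 'error' | 'critical';

// Illustrative patterns only (assumption); the real list lives in the connector.
const INFRASTRUCTURE_PATTERNS: RegExp[] = [
  /opentelemetry/i,
  /failed to export (spans|metrics)/i,
];

function isInfrastructureError(message: string): boolean {
  return INFRASTRUCTURE_PATTERNS.some((pattern) => pattern.test(message));
}

// Returns true only for errors that should fail the job.
function shouldFailJob(level: LogLevel, message: string): boolean {
  const isCriticalLevel = level === 'critical' || level === 'error';
  if (!isCriticalLevel) return false; // warning/info/debug: skip
  if (isInfrastructureError(message)) return false; // telemetry noise: log only
  return true; // workflow execution error: fail the job
}
```

Keeping the decision free of side effects makes it easy to unit-test each filter independently of the stream reader.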
System Components
Python Layer (ComfyUI)
Location: packages/comfyui/execution.py
Responsibility: Catch exceptions and send raw data only
# Send RAW exception info only (no classification)
error_details = {
"node_id": real_node_id,
"node_type": class_type,
"exception_message": str(ex),
"exception_type": exception_type,
"traceback": traceback.format_tb(tb),
"current_inputs": input_data_formatted,
}
# Publish to Redis Stream
redis.xadd('comfyui:unified:events', {
'event_type': 'log',
'level': 'error',
'message': json.dumps(error_details)
})
What it does NOT do:
- ❌ Classify errors
- ❌ Determine retryability
- ❌ Generate user messages
- ❌ Pattern matching
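On the receiving side, the raw payload has to be decoded before any classification can happen. The following is a minimal sketch, assuming the stream entry's `message` field holds the JSON produced by the Python handler above. `parseRawError` is a hypothetical helper, and the `ComfyUIError` shape is inferred from the fields shown rather than taken from the codebase.

```typescript
// Shape inferred from the fields the Python layer publishes (an assumption).
interface ComfyUIError {
  node_id: string;
  node_type: string;
  exception_message: string;
  exception_type: string;
  traceback: string[];
  current_inputs: unknown;
}

// Hypothetical helper: decode the JSON payload, failing loudly on bad input.
function parseRawError(rawMessage: string): ComfyUIError {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawMessage);
  } catch {
    throw new Error(`Error event is not valid JSON: ${rawMessage.slice(0, 100)}`);
  }
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Error event payload must be a JSON object');
  }
  const record = parsed as Record<string, unknown>;
  // The classifier depends on these two fields, so they are mandatory here.
  for (const key of ['exception_message', 'exception_type']) {
    if (typeof record[key] !== 'string') {
      throw new Error(`Error event is missing string field '${key}'`);
    }
  }
  return {
    node_id: String(record.node_id ?? ''),
    node_type: String(record.node_type ?? ''),
    exception_message: record.exception_message as string,
    exception_type: record.exception_type as string,
    traceback: Array.isArray(record.traceback) ? record.traceback.map(String) : [],
    current_inputs: record.current_inputs,
  };
}
```

Validating at this boundary keeps malformed events from reaching the pattern matchers, where they would otherwise surface as confusing downstream failures.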
TypeScript Worker (Connector)
Location: apps/worker/src/connectors/comfyui-rest-stream-connector.ts
Responsibility: Filter and route errors
// Filter 1: Log Level
const isCriticalLevel = logLevel === 'critical' || logLevel === 'error';
if (!isCriticalLevel) {
return { type: 'timeout' }; // Skip warning/info/debug
}
// Filter 2: Infrastructure Errors
const isInfrastructureError = this.isInfrastructureError(logMessage);
if (isInfrastructureError) {
logger.warn(`⚙️ Infrastructure error (not failing job)`);
return { type: 'timeout' }; // Log but don't fail
}
// Critical error → Send to enhancer
const connectorError = ComfyUIErrorEnhancer.enhance(error, context);
TypeScript Core (Enhancer)
Location: packages/core/src/errors/comfyui-error-enhancer.ts
Responsibility: Pattern matching and classification
static enhance(error: ComfyUIError, context?: ComfyUIErrorContext): ConnectorError {
const message = error.exception_message || '';
const exceptionType = error.exception_type || '';
// Pattern matching in order (specific → generic)
if (message.match(/node .+ does not exist/i)) {
// Missing custom node
return new ConnectorError(
FailureType.VALIDATION_ERROR,
FailureReason.UNSUPPORTED_OPERATION,
'Custom node not installed',
false,
{ suggestion: 'Install the custom node' }
);
}
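if (exceptionType.toLowerCase() === 'keyerror') {
// Missing required field: extract the field name from the message
const fieldMatch = message.match(/KeyError:\s*['"]([^'"]+)['"]/i);
const fieldName = fieldMatch ? fieldMatch[1] : 'unknown';
return new ConnectorError(
FailureType.VALIDATION_ERROR,
FailureReason.MISSING_REQUIRED_FIELD,
`Missing required field '${fieldName}'`,
false,
{ suggestion: 'Provide the required parameter' }
);
}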
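// Fallback to FailureClassifier, wrapped so the return type stays ConnectorError
const fallback = FailureClassifier.classify(message);
return new ConnectorError(
fallback.failure_type,
fallback.failure_reason,
fallback.failure_description,
fallback.retryable,
{}
);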
}
Fallback Classifier
Location: packages/core/src/types/failure-classification.ts
Responsibility: Generic pattern matching for all services
export class FailureClassifier {
static classify(errorMessage: string, context?: ClassificationContext) {
const error = errorMessage.toLowerCase();
// GPU OOM
if (error.includes('cuda out of memory')) {
return {
failure_type: FailureType.RESOURCE_LIMIT,
failure_reason: FailureReason.GPU_MEMORY_FULL,
failure_description: 'GPU ran out of memory',
retryable: true
};
}
// ... more patterns ...
// Unknown
return {
failure_type: FailureType.SYSTEM_ERROR,
failure_reason: FailureReason.UNKNOWN_ERROR,
failure_description: errorMessage,
retryable: false
};
}
}
ConnectorError
Location: packages/core/src/types/connector-errors.ts
Responsibility: Structured error object
class ConnectorError extends Error {
constructor(
public failureType: FailureType,
public failureReason: FailureReason,
public message: string,
public retryable: boolean,
public context: Record<string, any>
) {
super(message);
}
}
Data Flow
1. Exception Occurs
2. Worker Processing
3. UI Display
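The final stage of this flow can be sketched as a small formatter that turns a classified error into the lines shown to the user. `formatForDisplay` and the `DisplayableError` shape are assumptions for illustration. The output layout mirrors the Monitor UI example in Step 5 of the error flow below.

```typescript
// Minimal view of a classified error (an assumed subset of ConnectorError).
interface DisplayableError {
  message: string;
  context: { suggestion?: string };
}

// Hypothetical formatter: one line for the failure, one optional line for the fix.
function formatForDisplay(error: DisplayableError): string {
  const lines = [`❌ ${error.message}`];
  if (error.context.suggestion) {
    lines.push(`Suggestion: ${error.context.suggestion}`);
  }
  return lines.join('\n');
}
```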
Error Flow Example
Scenario: Missing URL Parameter
Step 1: Python Exception
# packages/comfyui/custom_nodes/emprops_comfy_nodes/nodes/emprops_image_loader.py
def load_image(self, **kwargs):
url = kwargs['url'] # KeyError: 'url'
Step 2: Exception Handler
# packages/comfyui/execution.py
except Exception as ex:
error_details = {
"node_id": "66",
"node_type": "emprops_image_loader",
"exception_message": "KeyError: 'url'",
"exception_type": "KeyError",
"traceback": [...],
"current_inputs": {}
}
# Publish to stream
Step 3: Worker Receives
// apps/worker/src/connectors/comfyui-rest-stream-connector.ts
const logResult = await this.readLogStream(...);
// Filter 1: logLevel === 'error' ✅ (passes)
// Filter 2: not infrastructure ✅ (passes)
// Send to enhancer
const connectorError = ComfyUIErrorEnhancer.enhance(error);
Step 4: Pattern Matching
// packages/core/src/errors/comfyui-error-enhancer.ts
if (exceptionType.toLowerCase() === 'keyerror') {
const fieldMatch = message.match(/KeyError:\s*['"]([^'"]+)['"]/i);
const fieldName = fieldMatch ? fieldMatch[1] : 'unknown'; // 'url' (guard against a non-matching message)
return new ConnectorError(
FailureType.VALIDATION_ERROR,
FailureReason.MISSING_REQUIRED_FIELD,
`Missing required field 'url' in node emprops_image_loader`,
false,
{
suggestion: "Provide the required 'url' parameter in your workflow configuration",
missingField: 'url'
}
);
}
Step 5: Job Fails
// Job marked as failed
// User sees in Monitor UI:
// ❌ Missing required field 'url' in node emprops_image_loader
// Suggestion: Provide the required 'url' parameter in your workflow configuration
Extension Points
Adding New Service Error Handlers
// Create service-specific enhancer
export class OpenAIErrorEnhancer {
static enhance(error: OpenAIError): ConnectorError {
// Service-specific pattern matching
}
}
// Use in connector
class OpenAIConnector extends BaseConnector {
protected handleServiceError(error: any): ConnectorError {
return OpenAIErrorEnhancer.enhance(error);
}
}
Adding New FailureReason
// packages/core/src/types/failure-classification.ts
export enum FailureReason {
// Existing...
GPU_MEMORY_FULL = 'gpu_memory_full',
// New
VRAM_EXCEEDED = 'vram_exceeded', // Add this
}
// packages/core/src/types/connector-errors.ts
export const ErrorDescriptions: Record<FailureReason, string> = {
// Existing...
[FailureReason.GPU_MEMORY_FULL]: 'GPU ran out of memory',
// New
[FailureReason.VRAM_EXCEEDED]: 'VRAM allocation limit exceeded',
};
Performance Considerations
Pattern Matching Order
// ✅ Specific patterns first (fast path)
if (message.match(/node .+ does not exist/i)) { ... }
// ✅ Common patterns second
if (exceptionType.toLowerCase() === 'keyerror') { ... }
// ✅ Expensive regex last
if (message.match(/complex.*regex.*pattern/i)) { ... }
// ✅ Fallback (slowest)
return FailureClassifier.classify(message);
Caching
Currently no caching - each error is pattern-matched fresh.
Future optimization: Cache pattern → ConnectorError mapping
// Potential future optimization
const errorCache = new LRU<string, ConnectorError>(100);
static enhance(error: ComfyUIError): ConnectorError {
const cacheKey = `${error.exception_type}:${error.exception_message}`;
// Single lookup, so a cache miss never returns undefined
const cached = errorCache.get(cacheKey);
if (cached) {
return cached;
}
const result = /* pattern matching */;
errorCache.set(cacheKey, result);
return result;
}
Testing Strategy
Unit Tests
// Test individual pattern matchers
describe('ComfyUIErrorEnhancer', () => {
it('should classify KeyError', () => {
const error = { exception_type: 'KeyError', exception_message: "KeyError: 'url'" };
const result = ComfyUIErrorEnhancer.enhance(error);
expect(result.failureReason).toBe('missing_required_field');
});
});
Integration Tests
# Test full error flow (Python → TypeScript → UI)
1. Start local development environment
2. Trigger error in ComfyUI workflow
3. Verify error classification in worker logs
4. Verify error display in Monitor UI
E2E Tests
// Test error handling across services
it('should handle ComfyUI KeyError end-to-end', async () => {
const jobId = await submitJob({ /* missing url */ });
await waitForJobCompletion(jobId);
const job = await getJob(jobId);
expect(job.status).toBe('failed');
expect(job.error.failureReason).toBe('missing_required_field');
expect(job.error.context.missingField).toBe('url');
});
Monitoring
Metrics to Track
- Error classification accuracy (% matched vs fallback)
- Pattern matching performance (avg time)
- Infrastructure error filter rate (% skipped)
- Most common error patterns
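A minimal in-memory sketch of the first three metrics follows. The class and method names are hypothetical, and a real deployment would presumably export these values through whatever metrics system the worker already uses.

```typescript
type Outcome = 'matched' | 'fallback' | 'infrastructure_skipped';

// Hypothetical collector for classification outcomes and timings.
class ClassificationMetrics {
  private counts: Record<Outcome, number> = {
    matched: 0,
    fallback: 0,
    infrastructure_skipped: 0,
  };
  private totalMs = 0;
  private timedSamples = 0;

  record(outcome: Outcome, durationMs?: number): void {
    this.counts[outcome] += 1;
    if (durationMs !== undefined) {
      this.totalMs += durationMs;
      this.timedSamples += 1;
    }
  }

  // Share of classified errors handled by a specific pattern rather than the fallback.
  matchRate(): number {
    const classified = this.counts.matched + this.counts.fallback;
    return classified === 0 ? 0 : this.counts.matched / classified;
  }

  // Share of all observed errors skipped as infrastructure noise.
  skipRate(): number {
    const total =
      this.counts.matched + this.counts.fallback + this.counts.infrastructure_skipped;
    return total === 0 ? 0 : this.counts.infrastructure_skipped / total;
  }

  // Average pattern-matching time over the samples that reported a duration.
  averageMs(): number {
    return this.timedSamples === 0 ? 0 : this.totalMs / this.timedSamples;
  }
}
```

A falling `matchRate()` is a signal that new error messages are reaching the generic fallback and may deserve dedicated patterns.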
Logging
// Debug logging for pattern matching
logger.debug('Classifying error:', {
exceptionType: error.exception_type,
messagePreview: error.exception_message.substring(0, 100),
matchedPattern: 'KeyError',
failureReason: 'missing_required_field'
});
// Info logging for infrastructure errors
logger.info('Infrastructure error detected (not failing job):', {
message: logMessage.substring(0, 200),
category: 'OpenTelemetry Export'
});
Future Improvements
Machine Learning Classification
// Potential ML-based classification
class MLErrorClassifier {
async classify(error: string): Promise<ConnectorError> {
const embedding = await getEmbedding(error);
const prediction = await model.predict(embedding);
return new ConnectorError(
prediction.failureType,
prediction.failureReason,
prediction.message,
prediction.retryable,
{ confidence: prediction.confidence }
);
}
}
Error Pattern Analytics
// Track pattern effectiveness
interface PatternStats {
pattern: string;
matchCount: number;
lastMatched: Date;
avgConfidence: number;
}
// Identify gaps in coverage
const unmatchedErrors = errors.filter(e =>
e.failureReason === FailureReason.UNKNOWN_ERROR
);
Auto-Retry Strategy
// Smart retry based on classification
if (error.failureType === FailureType.RESOURCE_LIMIT && error.retryable) {
// Retry with reduced settings
await retryWithReducedBatchSize(job);
} else if (error.failureType === FailureType.TIMEOUT) {
// Retry with increased timeout
await retryWithIncreasedTimeout(job);
}
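The snippet above places no upper bound on retries. One way to bound them is to separate the decision from the action, as in the sketch below. `planRetry`, the `RetryAction` type, the default of three attempts, and the lowercase failure-type strings are all assumptions made for illustration.

```typescript
type RetryAction =
  | { kind: 'reduce_batch_size' }
  | { kind: 'increase_timeout' }
  | { kind: 'give_up'; reason: string };

// Assumed minimal view of a classified error; string values are illustrative.
interface ClassifiedError {
  failureType: string;
  retryable: boolean;
}

// Hypothetical policy: decide what to do next without performing the retry itself.
function planRetry(error: ClassifiedError, attempt: number, maxAttempts = 3): RetryAction {
  if (attempt >= maxAttempts) {
    return { kind: 'give_up', reason: `Exceeded ${maxAttempts} attempts` };
  }
  if (error.failureType === 'resource_limit' && error.retryable) {
    return { kind: 'reduce_batch_size' };
  }
  if (error.failureType === 'timeout') {
    return { kind: 'increase_timeout' };
  }
  return { kind: 'give_up', reason: `Failure type '${error.failureType}' is not retried` };
}
```

The caller would map `reduce_batch_size` and `increase_timeout` onto `retryWithReducedBatchSize` and `retryWithIncreasedTimeout`, and surface the `give_up` reason to the user instead of retrying silently.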