Error Handling Modernization - ADR (Connector-Agnostic)
Status: Proposed
Date: 2025-11-07
Author: System Architecture
Supersedes: None
Related: LOGGING_ARCHITECTURE.md, connector-error-handling-standard.md
Context
The emp-job-queue system integrates with multiple external services (ComfyUI, OpenAI, Gemini, Glif, etc.), and currently suffers from inconsistent error handling:
User-Facing Problems:
- "Error in component 'Group Image': [object Object]" (ComfyUI serialization)
- "Unknown error type. Simple retry may help." (Generic fallback)
- "Rate limit exceeded" (Lost component context)
- Inconsistent error messages across different connectors
Root Causes:
- Service-Specific Issues: Each external service returns errors differently
- Context Loss: Component/workflow info lost during error propagation
- Inconsistent Connector Handling: Each connector handles errors differently
- Poor Descriptions: FailureClassifier fallbacks are too generic
Critical Principle:
ComfyUI is just one external service among many. TypeScript must be the source of truth for error classification - NOT Python, NOT any external service.
Current State
Multi-Connector Architecture
┌─────────────────────────────────────────────────────────────┐
│ External Services (We Don't Control) │
│ │
│ ComfyUI OpenAI Gemini Glif │
│ Python REST API REST API REST API │
│ WebSocket JSON errors JSON errors JSON errors │
│ execution_error {"error": {... {"error": {... {"message":│
│ str(x) → [obj] "code": 429} "code": 400} "fail"} │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ TypeScript Connector Layer (WE CONTROL) │
│ │
│ ComfyUIRestStreamConnector OpenAITextConnector │
│ ComfyUIWebSocketConnector GeminiImageConnector │
│ ComfyUIRemoteConnector GlifConnector │
│ │
│ ⚠️ PROBLEM: Each connector handles errors differently │
│ ⚠️ PROBLEM: Pattern matching is fragile and inconsistent │
│ ⚠️ PROBLEM: Context (component, workflow) often lost │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ FailureClassifier (TypeScript - Our Code) │
│ ✅ Already connector-agnostic │
│ ✅ Already works across all services │
│ ⚠️ BUT: Fallback messages too generic │
│ ⚠️ BUT: Doesn't preserve context │
└─────────────────────────────────────────────────────────────┘
Specific Problems by Connector
| Connector | Service | Current Issue | User Impact |
|---|---|---|---|
| ComfyUI REST Stream | ComfyUI | Pattern matching only, generic errors | "Unknown error" for uncommon issues |
| ComfyUI WebSocket | ComfyUI | 15+ service-specific patterns | Works but fragile (breaks if ComfyUI changes wording) |
| OpenAI Text/Image | OpenAI | Uses HTTP error parser | Lost OpenAI error codes, generic descriptions |
| Gemini Image | Gemini | Uses HTTP error parser | Lost Google error details |
| Glif | Glif API | Uses HTTP error parser | Lost Glif-specific context |
Key Insight: We already have ConnectorError.fromHTTPError() but it's too generic. Each connector needs service-specific enhancement while using common TypeScript infrastructure.
Decision
Implement connector-agnostic error handling with TypeScript as source of truth:
Principle 1: TypeScript Owns Error Classification
// Source of Truth: @emp/core/src/types/failure-classification.ts
export enum FailureType { ... }
export enum FailureReason { ... }
export class FailureClassifier { ... }

NOT:
- ❌ Python error codes (duplicates logic, ComfyUI-specific)
- ❌ Service-specific enums (doesn't scale to N services)
- ❌ External service error codes (we don't control them)
Principle 2: Service-Specific Parsers Enhance Generic Errors
Each service gets a lightweight parser that maps service errors → FailureClassifier:
// Service-specific knowledge
class OpenAIErrorEnhancer {
static enhance(httpError: any): ConnectorError {
// Map OpenAI error codes to FailureReason
if (httpError.response?.data?.error?.code === 'context_length_exceeded') {
return new ConnectorError(
FailureType.VALIDATION_ERROR,
FailureReason.INVALID_PAYLOAD,
'Input exceeds OpenAI token limit',
false,
{
serviceType: 'openai',
maxTokens: httpError.response.data.error.param,
suggestion: 'Reduce input text length or use a larger model'
}
);
}
// ... 10 more OpenAI-specific mappings
// Fallback to generic HTTP handler
return ConnectorError.fromHTTPError(httpError, 'openai');
}
}

Principle 3: Always Preserve Context
Every ConnectorError must capture:
- serviceType (comfyui, openai, gemini, etc.)
- component (if from Studio V2 workflow)
- workflow (workflow name/ID)
- suggestion (actionable next step)
- rawError (full error for debugging)
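Taken together, these fields suggest a minimum context shape along the following lines. This is a sketch only: the interface name and optionality markers are assumptions, not the actual ConnectorErrorContext definition in @emp/core.

```typescript
// Sketch of the minimum fields Principle 3 requires on every error context.
// This mirrors, but is not copied from, the real ConnectorErrorContext in
// @emp/core - treat names and optionality here as illustrative.
interface RequiredErrorContext {
  serviceType: string;   // 'comfyui', 'openai', 'gemini', ...
  component?: string;    // Studio V2 component name, when the job has one
  workflow?: string;     // workflow name/ID
  suggestion?: string;   // actionable next step for the user
  rawError?: unknown;    // full original error, kept for debugging
}

// A context is "complete" for reporting purposes when the service is named
// and the user has a next step; component/workflow are attached when known.
function isReportable(ctx: RequiredErrorContext): boolean {
  return ctx.serviceType.length > 0 && typeof ctx.suggestion === 'string';
}

const ctx: RequiredErrorContext = {
  serviceType: 'openai',
  component: 'Text Generator',
  workflow: 'summarize-v1',
  suggestion: 'Wait 60 seconds and try again',
  rawError: { code: 'rate_limit_exceeded' },
};
```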
Proposed Architecture (Connector-Agnostic)
Component Overview
┌────────────────────────────────────────────────────────────────┐
│ External Services (Unchanged) │
│ - ComfyUI, OpenAI, Gemini, Glif, etc. │
│ - Each returns errors in different format │
└────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────┐
│ Service-Specific Error Enhancers (NEW) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ OpenAIErrorEnhancer │ │
│ │ - Maps OpenAI error codes → FailureReason │ │
│ │ - Adds OpenAI-specific suggestions │ │
│ │ │ │
│ │ GeminiErrorEnhancer │ │
│ │ - Maps Google error codes → FailureReason │ │
│ │ - Adds Gemini-specific suggestions │ │
│ │ │ │
│ │ ComfyUIErrorEnhancer │ │
│ │ - Parses ComfyUI log messages → FailureReason │ │
│ │ - Adds node/workflow context │ │
│ │ - Fallback to FailureClassifier pattern matching │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────┐
│ FailureClassifier (Enhanced - Core Package) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ EXISTING (Keep): │ │
│ │ - classify(message: string) → FailureClassification │ │
│ │ - Pattern matching for generic errors │ │
│ │ │ │
│ │ NEW (Add): │ │
│ │ - Better fallback descriptions │ │
│ │ - Suggestion generation │ │
│ │ - Context preservation helpers │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────┐
│ ConnectorError (Enhanced - Core Package) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ EXISTING (Keep): │ │
│ │ - failureType, failureReason, retryable │ │
│ │ - context: ConnectorErrorContext │ │
│ │ - fromError(), fromHTTPError() │ │
│ │ │ │
│ │ NEW (Add): │ │
│ │ - getUserMessage(): hierarchical display │ │
│ │ - getSuggestion(): actionable next step │ │
│ │ - preserveContext(component, workflow) │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────┐
│ Connector Base Pattern (Standard) │
│ All connectors follow standard error handling pattern: │
│ │
│ 1. Catch service error │
│ 2. Use service-specific enhancer (if available) │
│ 3. Preserve context (component, workflow) │
│ 4. Return ConnectorError │
└────────────────────────────────────────────────────────────────┘
Implementation Phases (Quick Wins First)
Phase 1: Fix ComfyUI Object Serialization (1 Hour) ✅ QUICK WIN
Problem: "model=[object Object]" in ComfyUI errors
Solution: Fix Python serialization (ComfyUI-specific, one-time fix)
File: packages/comfyui/execution.py:393
Replace format_value() with safe JSON serialization:
import json
def format_value_safe(x, max_depth=3, current_depth=0):
"""Serialize value for error reporting (prevents [object Object])."""
if current_depth >= max_depth:
return f"<{type(x).__name__}>"
if isinstance(x, (int, float, bool, str, type(None))):
return x
elif isinstance(x, dict):
return {k: format_value_safe(v, max_depth, current_depth + 1)
for k, v in list(x.items())[:10]}
elif isinstance(x, (list, tuple)):
return [format_value_safe(v, max_depth, current_depth + 1)
for v in list(x)[:10]]
else:
try:
json.dumps(x)
return x
except (TypeError, ValueError):
            return f"<{type(x).__name__}>"

Impact: Zero code changes in TypeScript, fixes one service's serialization issue.
Phase 2: Enhance FailureClassifier Descriptions (2 Hours) ✅ QUICK WIN
Problem: Generic fallback messages like "Unclassified error"
Solution: Better descriptions in existing ErrorDescriptions map
File: packages/core/src/types/connector-errors.ts:311
Enhance existing descriptions with context variables:
export const ErrorDescriptions: Record<FailureReason, string> = {
// OLD: Generic and unhelpful
[FailureReason.UNKNOWN_ERROR]:
'An unknown error occurred. Please try again or contact support.',
// NEW: Contextual and actionable
[FailureReason.UNKNOWN_ERROR]:
'An unexpected error occurred in {serviceType}: {message}. ' +
'This may be a temporary issue - try again or contact support if it persists.',
// OLD: Technical jargon
[FailureReason.INVALID_PAYLOAD]:
'Request contains invalid data. Please check your input parameters.',
// NEW: Service-aware guidance
[FailureReason.INVALID_PAYLOAD]:
'{serviceType} rejected the request due to invalid data. ' +
'Common causes: incorrect parameter format, missing required fields, or unsupported values. ' +
'Check your workflow configuration and try again.',
// ... enhance all 40+ error descriptions
};

Impact: All connectors get better error messages immediately (no connector changes needed).
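The {serviceType}/{message} placeholders above imply a small interpolation step before display. A minimal sketch of such a helper follows; the function name is an assumption, nothing like it is confirmed to exist in @emp/core.

```typescript
// Fills {placeholder} slots in an enhanced ErrorDescriptions template.
// Unknown placeholders are left intact rather than rendered as 'undefined'.
function renderDescription(
  template: string,
  vars: Record<string, string | undefined>
): string {
  return template.replace(/\{(\w+)\}/g, (match, key: string) =>
    vars[key] !== undefined ? (vars[key] as string) : match
  );
}

const template =
  'An unexpected error occurred in {serviceType}: {message}. ' +
  'This may be a temporary issue - try again or contact support if it persists.';

const rendered = renderDescription(template, {
  serviceType: 'openai',
  message: 'socket hang up',
});
```

Leaving unknown placeholders untouched keeps a half-filled template debuggable instead of silently printing `undefined`.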
Phase 3: Standard Connector Error Pattern (4 Hours) ✅ CRITICAL
Problem: Each connector handles errors differently
Solution: Base class method for standardized error handling
File: apps/worker/src/connectors/base-connector.ts
Add standard error handling method:
export abstract class BaseConnector {
/**
* Standard error handling pattern for all connectors
*
* Usage in child connectors:
* catch (error) {
* throw this.createConnectorError(error, jobData, {
* suggestion: 'Try reducing image size'
* });
* }
*/
protected createConnectorError(
error: any,
jobData: JobData,
options?: {
suggestion?: string;
context?: Record<string, any>;
retryable?: boolean;
}
): ConnectorError {
// Extract context from job data
const component = this.extractComponentFromJobData(jobData);
const workflow = this.extractWorkflowFromJobData(jobData);
// Use service-specific enhancer if available
let connectorError: ConnectorError;
if (this.service_type === 'openai' && error.response) {
connectorError = OpenAIErrorEnhancer.enhance(error);
} else if (this.service_type === 'gemini' && error.response) {
connectorError = GeminiErrorEnhancer.enhance(error);
} else if (error instanceof Error) {
connectorError = ConnectorError.fromError(error, this.service_type);
} else {
connectorError = ConnectorError.fromHTTPError(error, this.service_type);
}
// Preserve context
connectorError.context = {
...connectorError.context,
serviceType: this.service_type,
component,
workflow,
jobId: jobData.id,
...options?.context
};
// Override suggestion if provided
if (options?.suggestion) {
connectorError.context.suggestion = options.suggestion;
}
// Override retryability if specified
if (options?.retryable !== undefined) {
(connectorError as any).retryable = options.retryable;
}
return connectorError;
}
/** Extract component name from job data (Studio V2 workflows) */
private extractComponentFromJobData(jobData: JobData): string | undefined {
return jobData.metadata?.component_name ||
jobData.ctx?.component_name ||
undefined;
}
/** Extract workflow name from job data */
private extractWorkflowFromJobData(jobData: JobData): string | undefined {
return jobData.payload?.workflow_name ||
jobData.metadata?.workflow_name ||
undefined;
}
}

Impact: All connectors now have standard error handling (consistent, context-preserving).
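With context now preserved, the getUserMessage() addition described in the component overview can format it hierarchically (service, failure type, failure reason), as in the appendix examples. A self-contained sketch follows; every name and the exact layout here are assumptions, not the real ConnectorError API.

```typescript
// Minimal stand-in for the pieces getUserMessage() would read off a
// ConnectorError; the real method would use failureType/failureReason enums.
interface DisplayParts {
  serviceLabel: string;    // e.g. 'OpenAI API Error'
  failureType: string;     // e.g. 'Rate Limit'
  failureReason: string;   // e.g. 'Requests Per Minute'
  message: string;
  suggestion?: string;
}

function getUserMessage(parts: DisplayParts): string {
  // Hierarchical header: service → type → reason, matching the appendix style.
  const header = [parts.serviceLabel, parts.failureType, parts.failureReason].join(' → ');
  const lines = [header, parts.message];
  if (parts.suggestion) {
    lines.push(`💡 Suggestion: ${parts.suggestion}`);
  }
  return lines.join('\n');
}

const msg = getUserMessage({
  serviceLabel: 'OpenAI API Error',
  failureType: 'Rate Limit',
  failureReason: 'Requests Per Minute',
  message: 'OpenAI rate limit exceeded.',
  suggestion: 'Wait 60 seconds and try again.',
});
```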
Phase 4: Service-Specific Enhancers (1 Day - Optional)
Problem: HTTP errors lose service-specific details
Solution: Create lightweight enhancers for each major service
OpenAI Error Enhancer
File: packages/core/src/error-enhancers/openai-error-enhancer.ts (NEW)
export class OpenAIErrorEnhancer {
static enhance(httpError: any): ConnectorError {
const errorData = httpError.response?.data?.error;
if (!errorData) {
return ConnectorError.fromHTTPError(httpError, 'openai');
}
// Map OpenAI error codes to our FailureReason
const codeMap: Record<string, [FailureType, FailureReason, string]> = {
'context_length_exceeded': [
FailureType.VALIDATION_ERROR,
FailureReason.INVALID_PAYLOAD,
'Input exceeds OpenAI token limit. Reduce input text or use a larger model (gpt-4-turbo has 128k context).'
],
'rate_limit_exceeded': [
FailureType.RATE_LIMIT,
FailureReason.REQUESTS_PER_MINUTE,
'OpenAI rate limit exceeded. Wait a moment before retrying.'
],
'invalid_api_key': [
FailureType.AUTH_ERROR,
FailureReason.INVALID_API_KEY,
'OpenAI API key is invalid or expired. Check credentials configuration.'
],
'insufficient_quota': [
FailureType.RATE_LIMIT,
FailureReason.DAILY_QUOTA_EXCEEDED,
'OpenAI account quota exceeded. Add credits or upgrade plan.'
],
'model_not_found': [
FailureType.VALIDATION_ERROR,
FailureReason.MODEL_NOT_FOUND,
'OpenAI model not found. Check model name or your account access.'
]
};
const mapping = codeMap[errorData.code];
if (mapping) {
const [failureType, failureReason, suggestion] = mapping;
return new ConnectorError(
failureType,
failureReason,
errorData.message || 'OpenAI API error',
failureType === FailureType.RATE_LIMIT, // Rate limits are retryable
{
serviceType: 'openai',
httpStatus: httpError.response?.status,
openaiErrorCode: errorData.code,
openaiErrorType: errorData.type,
suggestion,
rawRequest: httpError.config?.data,
rawResponse: errorData
}
);
}
// Fallback to HTTP handler
return ConnectorError.fromHTTPError(httpError, 'openai');
}
}

Gemini Error Enhancer
File: packages/core/src/error-enhancers/gemini-error-enhancer.ts (NEW)
export class GeminiErrorEnhancer {
static enhance(httpError: any): ConnectorError {
const errorData = httpError.response?.data?.error;
const status = httpError.response?.status;
// Google Cloud error format
if (errorData?.status) {
const statusMap: Record<string, [FailureType, FailureReason, string]> = {
'PERMISSION_DENIED': [
FailureType.AUTH_ERROR,
FailureReason.INSUFFICIENT_PERMISSIONS,
'Google Cloud API access denied. Check service account permissions.'
],
'RESOURCE_EXHAUSTED': [
FailureType.RATE_LIMIT,
FailureReason.REQUESTS_PER_MINUTE,
'Gemini API quota exceeded. Wait before retrying.'
],
'INVALID_ARGUMENT': [
FailureType.VALIDATION_ERROR,
FailureReason.INVALID_PAYLOAD,
'Gemini rejected request parameters. Check input format and requirements.'
],
'FAILED_PRECONDITION': [
FailureType.VALIDATION_ERROR,
FailureReason.UNSUPPORTED_OPERATION,
'Gemini model doesn\'t support this operation. Check model capabilities.'
]
};
const mapping = statusMap[errorData.status];
if (mapping) {
const [failureType, failureReason, suggestion] = mapping;
return new ConnectorError(
failureType,
failureReason,
errorData.message || 'Gemini API error',
failureType === FailureType.RATE_LIMIT,
{
serviceType: 'gemini',
httpStatus: status,
geminiErrorStatus: errorData.status,
suggestion,
rawResponse: errorData
}
);
}
}
return ConnectorError.fromHTTPError(httpError, 'gemini');
}
}

ComfyUI Error Enhancer
File: packages/core/src/error-enhancers/comfyui-error-enhancer.ts (NEW)
export class ComfyUIErrorEnhancer {
/**
* Enhance ComfyUI errors (both log stream and WebSocket)
*/
static enhance(error: any, context?: { component?: string; workflow?: string }): ConnectorError {
const message = error.message || error.exception_message || '';
// Check for specific ComfyUI error patterns
// GPU OOM
if (message.includes('CUDA out of memory') || message.includes('GPU memory')) {
return new ConnectorError(
FailureType.RESOURCE_LIMIT,
FailureReason.GPU_MEMORY_FULL,
'GPU ran out of memory during workflow execution',
true, // Retryable
{
serviceType: 'comfyui',
node: error.node_id ? { id: error.node_id, type: error.node_type } : undefined,
component: context?.component,
workflow: context?.workflow,
suggestion: 'Try reducing image size, batch size, or use a lower resolution. ' +
'Current settings exceeded GPU capacity.',
rawError: error
}
);
}
// Missing custom node
if (message.match(/node .+ does not exist/i)) {
const nodeMatch = message.match(/node (\S+) does not exist/i);
const nodeName = nodeMatch ? nodeMatch[1] : 'unknown';
return new ConnectorError(
FailureType.VALIDATION_ERROR,
FailureReason.UNSUPPORTED_OPERATION,
`Custom node '${nodeName}' is not installed`,
false, // Not retryable
{
serviceType: 'comfyui',
node: { id: error.node_id, type: nodeName },
component: context?.component,
workflow: context?.workflow,
suggestion: `Install the '${nodeName}' custom node from ComfyUI Manager before running this workflow.`,
rawError: error
}
);
}
// Model not found
if (message.match(/checkpoint|model.+not found|does not exist/i)) {
const modelMatch = message.match(/'([^']+)'/);
const modelName = modelMatch ? modelMatch[1] : 'unknown';
return new ConnectorError(
FailureType.VALIDATION_ERROR,
FailureReason.MODEL_NOT_FOUND,
`Model '${modelName}' not found in ComfyUI`,
false,
{
serviceType: 'comfyui',
node: error.node_id ? { id: error.node_id, type: error.node_type } : undefined,
component: context?.component,
workflow: context?.workflow,
suggestion: `Install the model '${modelName}' or select a different model from the available list.`,
rawError: error
}
);
}
// Fallback to FailureClassifier pattern matching
const classification = FailureClassifier.classify(message, { serviceType: 'comfyui' });
return new ConnectorError(
classification.failure_type,
classification.failure_reason,
message,
classification.retryable ?? true,
{
serviceType: 'comfyui',
node: error.node_id ? { id: error.node_id, type: error.node_type } : undefined,
component: context?.component,
workflow: context?.workflow,
rawError: error
}
);
}
}

Export in Core:
// packages/core/src/index.ts
export * from './error-enhancers/openai-error-enhancer.js';
export * from './error-enhancers/gemini-error-enhancer.js';
export * from './error-enhancers/comfyui-error-enhancer.js';

Impact: Service-specific errors get detailed context while maintaining connector-agnostic architecture.
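All three enhancers follow the same shape: a table from service-specific codes to classifications, with ConnectorError.fromHTTPError() as the fallback. Stripped of the real classes, the lookup pattern reduces to this sketch, where plain strings stand in for the FailureType/FailureReason enums.

```typescript
// Service code → [failureType, failureReason, suggestion], with a generic
// bucket for anything the table doesn't know - mirroring the fallback to
// ConnectorError.fromHTTPError() in the enhancers above.
type Mapping = [failureType: string, failureReason: string, suggestion: string];

const codeMap: Record<string, Mapping> = {
  rate_limit_exceeded: [
    'RATE_LIMIT', 'REQUESTS_PER_MINUTE', 'Wait a moment before retrying.',
  ],
  invalid_api_key: [
    'AUTH_ERROR', 'INVALID_API_KEY', 'Check credentials configuration.',
  ],
};

function classifyCode(code: string): Mapping {
  // Unknown codes degrade gracefully instead of throwing.
  return codeMap[code] ?? ['UNKNOWN', 'UNKNOWN_ERROR', 'Try again or contact support.'];
}
```

Because the table is data, supporting a new service error is one added entry, not new branching logic.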
Phase 5: Update Connectors to Use Pattern (4 Hours)
Problem: Existing connectors don't use standard pattern
Solution: Refactor all connectors to use createConnectorError()
Example: OpenAI Text Connector
Before:
// apps/worker/src/connectors/openai-text-connector.ts
try {
const response = await this.client.chat.completions.create(params);
// ...
} catch (error: any) {
// Generic HTTP error handling - loses OpenAI context
throw ConnectorError.fromHTTPError(error, 'openai');
}

After:
try {
const response = await this.client.chat.completions.create(params);
// ...
} catch (error: any) {
// Use standard pattern with service-specific enhancement
throw this.createConnectorError(error, jobData);
}

Example: ComfyUI REST Stream Connector
Before:
// Inline error classification
const classification = FailureClassifier.classify(message.data.message);
return new ConnectorError(
classification.failure_type,
classification.failure_reason,
message.data.message,
classification.retryable,
{ serviceType: 'comfyui_rest_stream' }
);

After:
// Use standard pattern + ComfyUI enhancer
throw this.createConnectorError(message.data, jobData);

Impact: All connectors now have consistent error handling, context preservation, and service-specific enhancements.
Consequences
Benefits
1. Connector-Agnostic Architecture ✅
- Works for ANY external service (ComfyUI, OpenAI, Gemini, future services)
- TypeScript is single source of truth
- No duplication between languages
2. Better User Experience ✅
- Context preserved: "Error in component 'Image Generator' → OpenAI rate limit exceeded"
- Actionable suggestions: "Wait 60 seconds and try again"
- Service-aware messages: "OpenAI token limit (4096) exceeded. Use gpt-4-turbo for larger inputs."
3. Maintainability ✅
- Standard pattern for all connectors
- Service-specific logic isolated in enhancers
- Easy to add new services (just add new enhancer)
4. Quick Wins ✅
- Phase 1: Fixes ComfyUI serialization (1 hour, zero TypeScript changes)
- Phase 2: Better error descriptions (2 hours, all connectors benefit)
- Phase 3: Standard pattern (4 hours, foundation for everything)
Drawbacks
1. Service-Specific Knowledge Required
Each new service needs an enhancer with service-specific error code mappings.
Mitigation: Enhancers are optional - fallback to generic HTTP/FailureClassifier always works.
2. Migration Effort
All existing connectors need refactoring to use standard pattern.
Mitigation: Can be done incrementally (connector by connector). Old code still works.
Success Metrics
Quantitative
- Zero "[object Object]" in production error logs (Phase 1)
- 100% context preservation - all errors include component/workflow (Phase 3)
- < 5% "Unknown error" fallback - most errors get specific classification (Phase 4)
- 90%+ errors have suggestions (Phase 2-4)
Qualitative
- Support ticket reduction - fewer "what does this error mean?" tickets
- Developer productivity - faster debugging with preserved context
- User satisfaction - clear, actionable error messages
Migration Strategy
Incremental Rollout
Week 1: Quick wins (Phases 1-2)
- Fix ComfyUI serialization
- Enhance error descriptions
- Deploy to production (low risk, high impact)
Week 2: Foundation (Phase 3)
- Add standard pattern to BaseConnector
- Refactor 1-2 connectors as proof of concept
- A/B test with old error handling
Week 3: Service Enhancers (Phase 4 - Optional)
- Create OpenAI, Gemini, ComfyUI enhancers
- Test with real errors from production
Week 4: Full Migration (Phase 5)
- Refactor all remaining connectors
- Remove old error handling code
- Monitor error rates
Dependencies
TypeScript
- Existing: @emp/core (FailureClassifier, ConnectorError)
- New: Service-specific enhancers (lightweight, ~100 LOC each)
Python (ComfyUI only)
- One-time fix: format_value_safe() in execution.py
Alternative Approaches Considered
❌ Alternative 1: Python Error Codes (Original ADR)
Why Rejected:
- ComfyUI-specific, doesn't help OpenAI/Gemini/etc.
- Duplicates logic between Python and TypeScript
- Makes Python the source of truth (wrong direction)
❌ Alternative 2: Each Connector Handles Errors Independently
Why Rejected:
- Inconsistent error messages
- Duplicated logic across connectors
- No context preservation standard
✅ Alternative 3: TypeScript-First with Service Enhancers (THIS ADR)
Why Chosen:
- Connector-agnostic
- TypeScript is source of truth
- Service-specific knowledge isolated and optional
- Incremental implementation path
Related ADRs
- connector-error-handling-standard.md - ConnectorError class design
- LOGGING_ARCHITECTURE.md - Redis log streaming (Phase 2 implemented)
Appendix: Error Message Examples
Before Modernization
ComfyUI:
Error: model=[object Object], vae=[object Object]
OpenAI:
HTTP 429: Rate limit exceeded
Generic:
Unknown error type. Simple retry may help.
After Modernization
ComfyUI:
ComfyUI Workflow Error → Resource Limit → GPU Memory Exceeded
GPU ran out of memory while processing node 'KSampler' (ID: 3)
in workflow 'Image Generation'.
💡 Suggestion: Try reducing image size from 2048x2048 to 1024x1024,
or reduce batch size from 4 to 2.
Component: Image Generator
Workflow: txt2img-basic
Node: KSampler (ID: 3)
OpenAI:
OpenAI API Error → Validation Error → Token Limit Exceeded
Input exceeds OpenAI token limit (4096 tokens).
💡 Suggestion: Reduce input text length or use a larger model like
gpt-4-turbo which supports 128,000 tokens.
Component: Text Generator
Model: gpt-3.5-turbo
Input Tokens: 5200
Max Tokens: 4096
Gemini:
Gemini API Error → Rate Limit → Request Quota Exceeded
Gemini API quota exceeded for your project.
💡 Suggestion: Wait a moment before retrying, or upgrade your
Google Cloud quota limit.
Component: Image Analysis
Model: gemini-pro-vision
Quota: 60 requests/minute (exceeded)
End of ADR
