ADR: Error Management System V2 - Declarative Response Rules
Status: Proposed
Date: 2025-11-07
Authors: Claude Code
Context: Current error system provides structured errors but lacks intelligent error recovery and user-friendly messaging
Problem Statement
Current error system (V1) has fundamental limitations:
- Static User Messages: Generic messages like "Service returned unexpected content type" don't help users fix their prompts
- No Error Recovery: System can't automatically retry with modified prompts or adjusted parameters
- No Pattern Learning: Same errors repeat without system learning optimal responses
- Developer Burden: Every new error pattern requires code changes across multiple files
- Lost Intelligence: Raw error data is captured but not analyzed for actionable insights
Example Current Flow:
User: "turn my pfp into the locked in gamer meme"
↓
OpenAI returns text instead of image
↓
Error: "MIME type mismatch: requested image/png but received text"
↓
User sees: "Service returned unexpected content type. Please try again."
↓
User tries again... same error... gives up
What We Need:
User: "turn my pfp into the locked in gamer meme"
↓
OpenAI returns text instead of image
↓
System recognizes: Pattern #47 - "Image request returned JSON"
↓
System applies rule: "Add explicit image generation instruction"
↓
Auto-retry with: "Generate an image of... (DO NOT return JSON instructions)"
↓
Success OR User sees: "The AI tried to explain how to create the image instead of
generating it. Try being more explicit: 'Create an actual image of...'"
V2 Architecture: Three-Layer Error Intelligence
Layer 1: Comprehensive Internal Capture (Existing - Enhanced)
Purpose: Capture EVERYTHING for forensic analysis
interface ErrorAttestation {
// Core Classification
failure_type: FailureType;
failure_reason: FailureReason;
retryable: boolean;
// Full Context (unlimited detail for internal use)
raw_request: object; // Complete request payload
raw_response: object; // Complete response (with smart truncation for base64)
requested_mime_type: string;
actual_mime_type: string;
http_status?: number;
// Service Context
service_type: string; // 'openai_responses', 'comfyui', etc.
model_used?: string; // 'gpt-4.1', 'dall-e-3', etc.
component_name?: string; // Which part of pipeline failed
// Job Context
job_id: string;
workflow_id?: string;
retry_count: number;
// Pattern Matching Keys (for rule engine)
error_signature: string; // Hash of error pattern for matching
similar_errors_count: number; // How many times we've seen this before
// Recovery Attempts
recovery_attempts: RecoveryAttempt[];
}
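// "Smart truncation for base64" above is not specified further; a possible sketch
// (an assumption, not existing code) that shortens base64-looking strings before storage:
function truncateBase64Fields(value: unknown, maxLength = 256): unknown {
  if (typeof value === 'string') {
    const looksLikeBase64 = value.length > maxLength && /^[A-Za-z0-9+/=\s]+$/.test(value);
    return looksLikeBase64 ? `${value.slice(0, 64)}…[truncated ${value.length} chars]` : value;
  }
  if (Array.isArray(value)) return value.map((v) => truncateBase64Fields(v, maxLength));
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, truncateBase64Fields(v, maxLength)])
    );
  }
  return value;
}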
interface RecoveryAttempt {
rule_id: string;
rule_name: string;
modifications: object; // What we changed
result: 'success' | 'failed' | 'skipped' | 'attempted';
new_error?: ErrorAttestation; // If retry failed, what error?
}
Key Enhancement: Add error_signature generation
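The hash(...) call used below is not defined in this document; a minimal sketch of what it could be, assuming Node's built-in crypto module and a fixed field order:
import { createHash } from 'crypto';

function hash(pattern: Record<string, unknown>): string {
  // Digest the JSON form of the pattern; the caller keeps field order fixed,
  // so identical error shapes produce identical signatures
  return createHash('sha256').update(JSON.stringify(pattern)).digest('hex').slice(0, 16);
}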
function generateErrorSignature(error: ConnectorError): string {
// Create pattern hash for matching
return hash({
failure_type: error.failureType,
failure_reason: error.failureReason,
service_type: error.context.serviceType,
model: error.context.modelUsed,
requested_mime: error.context.requestedMimeType,
actual_mime: error.context.actualMimeType,
http_status: error.context.httpStatus
});
}
Layer 2: Declarative Error Response Rules (NEW)
Purpose: Define error recovery strategies and user messaging as data, not code
Rule File: /config/error-rules/openai-responses.yaml
# Error Response Rules for OpenAI Responses Service
version: 2
service: openai_responses
rules:
# Rule 1: Image request returned JSON/text explanation
- id: openai-image-to-text-001
name: "Image Request Returned Text Explanation"
priority: 100
# Pattern Matching
match:
failure_type: response_error
failure_reason: unexpected_content_type
conditions:
- field: context.requestedMimeType
operator: equals
value: "image/png"
- field: context.actualMimeType
operator: in
values: ["text", "application/json"]
- field: context.modelUsed
operator: startsWith
value: "gpt-4"
# User-Facing Message (generic, helpful)
user_message: |
The AI returned instructions on how to create the image instead of generating it.
Try being more explicit: "Generate an actual image of [your request]"
# Internal Note (for our debugging)
internal_note: |
GPT-4.1 with image_generation tool sometimes returns JSON prompt instead of
calling the tool. This happens when the user request is ambiguous about whether
they want an image or image instructions.
# Automatic Recovery Strategy
recovery:
enabled: true
max_attempts: 1
# Modify the request before retry
modifications:
- type: prepend_to_prompt
value: "IMPORTANT: Generate an actual image file, do not return text instructions or JSON. "
- type: add_system_message
value: "You must use the image_generation tool to create an actual image. Never return text descriptions or JSON prompts."
- type: set_parameter
path: "tools[0].strict"
value: true
# Only retry if conditions met
retry_conditions:
- field: retry_count
operator: less_than
value: 2
- field: workflow_id
operator: exists
value: true # Only auto-retry in workflows
# Rule 2: Rate limit with retry-after
- id: openai-rate-limit-001
name: "OpenAI Rate Limit with Retry-After"
priority: 90
match:
failure_type: rate_limit
failure_reason: requests_per_minute
conditions:
- field: context.retryAfterSeconds
operator: exists
value: true
user_message: |
Rate limit reached. Your request will automatically retry in {{context.retryAfterSeconds}} seconds.
internal_note: "Standard OpenAI rate limit - should auto-retry"
recovery:
enabled: true
max_attempts: 3
# Wait before retry
delay:
type: from_header
header: retry-after
fallback_seconds: 60
modifications: [] # No changes needed, just retry as-is
retry_conditions:
- field: retry_count
operator: less_than
value: 3
# Rule 3: Model not found - suggest alternatives
- id: openai-model-404-001
name: "Model Not Found"
priority: 80
match:
failure_type: validation_error
failure_reason: model_not_found
user_message: |
The requested AI model is not available. Common alternatives:
- For text: gpt-4-turbo, gpt-3.5-turbo
- For images: dall-e-3
Please contact support if you need a specific model.
internal_note: "Model name typo or deprecated model"
recovery:
enabled: false # Don't auto-retry - needs user decision
# Rule 4: MIME mismatch - generic fallback
- id: generic-mime-mismatch-001
name: "Generic MIME Type Mismatch"
priority: 10 # Low priority - catches anything not matched above
match:
failure_type: response_error
failure_reason: unexpected_content_type
user_message: |
The AI returned {{context.actualMimeType}} instead of the expected {{context.requestedMimeType}}.
This usually means the AI didn't understand the format you wanted.
Try rephrasing your request more explicitly.
internal_note: "Unhandled MIME mismatch - investigate pattern"
recovery:
enabled: false
# Rule 5: Content Policy Violation - Clear guidance
- id: openai-content-policy-001
name: "Content Policy Violation"
priority: 95
match:
failure_type: generation_refusal
failure_reason: safety_filter
conditions:
- field: context.rawResponse.error.code
operator: equals
value: "content_policy_violation"
user_message: |
Your request was blocked by content safety filters. This can happen if:
- The prompt contains potentially harmful content
- The request could generate inappropriate material
- Certain words triggered automated filters
Try rephrasing your request in a different way.
internal_note: "OpenAI content filter triggered"
recovery:
enabled: false # Don't retry - will fail again
Layer 3: Rule Engine Execution (NEW)
Purpose: Match errors to rules and execute recovery strategies
// /packages/core/src/services/error-rule-engine.ts
// Assumed imports for this sketch: a glob matcher, a YAML parser, and fs promises
// (a shared logger is also assumed to be available in this package)
import { glob } from 'glob';
import * as YAML from 'yaml';
import { readFile } from 'fs/promises';
interface ErrorRule {
id: string;
name: string;
priority: number;
match: RuleMatch;
user_message: string;
internal_note: string;
recovery: RecoveryStrategy;
}
interface RuleMatch {
failure_type: FailureType;
failure_reason: FailureReason;
conditions?: MatchCondition[];
}
interface MatchCondition {
field: string; // JSONPath to field in error context
operator: 'equals' | 'in' | 'exists' | 'startsWith' | 'matches' | 'less_than' | 'greater_than';
value: any;
values?: any[];
}
interface RecoveryStrategy {
enabled: boolean;
max_attempts: number;
delay?: {
type: 'fixed' | 'exponential' | 'from_header';
seconds?: number;
header?: string;
fallback_seconds?: number;
};
modifications: RequestModification[];
retry_conditions?: MatchCondition[];
}
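// The worker-flow example later in this document calls this.calculateDelay(...) without
// defining it. A minimal sketch (an assumption, not existing code) of a helper that turns
// a RecoveryStrategy delay spec into milliseconds, preferring the parsed retry-after value:
function calculateDelay(
  delay: NonNullable<RecoveryStrategy['delay']>,
  error: { context?: { retryAfterSeconds?: number }; retryCount?: number }
): number {
  switch (delay.type) {
    case 'fixed':
      return (delay.seconds ?? 0) * 1000;
    case 'exponential':
      // Base delay doubled for each prior retry
      return (delay.seconds ?? 1) * 1000 * Math.pow(2, error.retryCount ?? 0);
    case 'from_header':
      // Use the retry-after value captured in the error context when available
      return (error.context?.retryAfterSeconds ?? delay.fallback_seconds ?? 60) * 1000;
    default:
      return (delay.fallback_seconds ?? 60) * 1000;
  }
}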
interface RequestModification {
type: 'prepend_to_prompt' | 'append_to_prompt' | 'add_system_message' | 'set_parameter' | 'remove_parameter';
value?: any;
path?: string; // JSONPath for set/remove operations
}
class ErrorRuleEngine {
private rules: Map<string, ErrorRule[]> = new Map(); // service_type -> rules
constructor() {
this.loadRules(); // NOTE: loadRules is async; callers may need to await an explicit init() before matching rules
}
/**
* Load all error rules from YAML configs
*/
private async loadRules(): Promise<void> {
const ruleFiles = await glob('/config/error-rules/*.yaml');
for (const file of ruleFiles) {
const config = YAML.parse(await readFile(file, 'utf-8')); // parse the file contents, not the path
this.rules.set(config.service, config.rules);
}
}
/**
* Find matching rule for an error
*/
findMatchingRule(error: ConnectorError): ErrorRule | null {
const serviceRules = this.rules.get(error.context.serviceType) || [];
// Sort by priority (highest first)
const sortedRules = serviceRules.sort((a, b) => b.priority - a.priority);
for (const rule of sortedRules) {
if (this.matchesRule(error, rule)) {
logger.info(`📋 Matched error to rule: ${rule.id} - ${rule.name}`);
return rule;
}
}
logger.warn(`⚠️ No matching rule found for error: ${error.failureReason}`);
return null;
}
/**
* Check if error matches rule conditions
*/
private matchesRule(error: ConnectorError, rule: ErrorRule): boolean {
// Check basic match
if (error.failureType !== rule.match.failure_type) return false;
if (error.failureReason !== rule.match.failure_reason) return false;
// Check additional conditions
if (!rule.match.conditions) return true;
for (const condition of rule.match.conditions) {
if (!this.evaluateCondition(error, condition)) {
return false;
}
}
return true;
}
/**
* Evaluate a single condition
*/
private evaluateCondition(error: ConnectorError, condition: MatchCondition): boolean {
const actualValue = this.getFieldValue(error, condition.field);
switch (condition.operator) {
case 'equals':
return actualValue === condition.value;
case 'in':
return condition.values?.includes(actualValue) ?? false;
case 'exists':
return (actualValue !== undefined && actualValue !== null) === condition.value;
case 'startsWith':
return typeof actualValue === 'string' && actualValue.startsWith(condition.value);
case 'matches':
return typeof actualValue === 'string' && new RegExp(condition.value).test(actualValue);
case 'less_than':
return typeof actualValue === 'number' && actualValue < condition.value;
case 'greater_than':
return typeof actualValue === 'number' && actualValue > condition.value;
default:
return false;
}
}
/**
* Get field value from error using JSONPath
*/
private getFieldValue(error: ConnectorError, path: string): any {
// Simple JSONPath implementation
const parts = path.split('.');
let value: any = error;
for (const part of parts) {
if (value && typeof value === 'object') {
value = value[part];
} else {
return undefined;
}
}
return value;
}
/**
* Apply recovery strategy and return modified request
*/
async applyRecoveryStrategy(
originalRequest: any,
rule: ErrorRule,
error: ConnectorError
): Promise<{ shouldRetry: boolean; modifiedRequest?: any; reason?: string }> {
if (!rule.recovery.enabled) {
return { shouldRetry: false, reason: 'Recovery disabled for this rule' };
}
// Check retry conditions
if (rule.recovery.retry_conditions) {
for (const condition of rule.recovery.retry_conditions) {
if (!this.evaluateCondition(error, condition)) {
return { shouldRetry: false, reason: `Retry condition failed: ${condition.field}` };
}
}
}
// Apply modifications
let modifiedRequest = JSON.parse(JSON.stringify(originalRequest));
for (const modification of rule.recovery.modifications) {
modifiedRequest = this.applyModification(modifiedRequest, modification);
}
logger.info(`✨ Applied recovery modifications from rule ${rule.id}`, {
modifications: rule.recovery.modifications.map(m => m.type)
});
return { shouldRetry: true, modifiedRequest };
}
/**
* Apply a single modification to request
*/
private applyModification(request: any, modification: RequestModification): any {
switch (modification.type) {
case 'prepend_to_prompt':
// Find the prompt in request structure and prepend
if (request.input && Array.isArray(request.input)) {
const userMessage = request.input.find((msg: any) => msg.role === 'user');
if (userMessage && userMessage.content) {
if (typeof userMessage.content === 'string') {
userMessage.content = modification.value + userMessage.content;
} else if (Array.isArray(userMessage.content)) {
const textContent = userMessage.content.find((c: any) => c.type === 'input_text');
if (textContent) {
textContent.text = modification.value + textContent.text;
}
}
}
}
break;
case 'append_to_prompt':
// Similar to prepend but at end
if (request.input && Array.isArray(request.input)) {
const userMessage = request.input.find((msg: any) => msg.role === 'user');
if (userMessage && userMessage.content) {
if (typeof userMessage.content === 'string') {
userMessage.content = userMessage.content + modification.value;
} else if (Array.isArray(userMessage.content)) {
const textContent = userMessage.content.find((c: any) => c.type === 'input_text');
if (textContent) {
textContent.text = textContent.text + modification.value;
}
}
}
}
break;
case 'add_system_message':
if (request.input && Array.isArray(request.input)) {
request.input.unshift({
role: 'system',
content: modification.value
});
}
break;
case 'set_parameter':
if (modification.path) {
this.setValueAtPath(request, modification.path, modification.value);
}
break;
case 'remove_parameter':
if (modification.path) {
this.deleteValueAtPath(request, modification.path);
}
break;
}
return request;
}
/**
* Set value at JSONPath
*/
private setValueAtPath(obj: any, path: string, value: any): void {
const parts = path.split('.');
let current = obj;
for (let i = 0; i < parts.length - 1; i++) {
const part = parts[i];
const arrayMatch = part.match(/(.+)\[(\d+)\]/);
if (arrayMatch) {
const [, key, index] = arrayMatch;
if (!current[key]) current[key] = [];
if (!current[key][parseInt(index)]) current[key][parseInt(index)] = {};
current = current[key][parseInt(index)];
} else {
if (!current[part]) current[part] = {};
current = current[part];
}
}
const lastPart = parts[parts.length - 1];
current[lastPart] = value;
}
/**
* Delete value at JSONPath
*/
private deleteValueAtPath(obj: any, path: string): void {
const parts = path.split('.');
let current = obj;
for (let i = 0; i < parts.length - 1; i++) {
if (!current[parts[i]]) return;
current = current[parts[i]];
}
delete current[parts[parts.length - 1]];
}
/**
* Get user-friendly error message with template substitution
*/
getUserMessage(rule: ErrorRule, error: ConnectorError): string {
let message = rule.user_message;
// Replace {{context.field}} with actual values
message = message.replace(/\{\{([^}]+)\}\}/g, (match, path) => {
const value = this.getFieldValue(error, path);
return value !== undefined ? String(value) : match;
});
return message;
}
}
Integration with Existing Error Flow
Modified Worker Flow:
// In async-rest-connector.ts or openai-base-connector.ts
async processJob(jobData: Job): Promise<JobResult> {
const ruleEngine = new ErrorRuleEngine();
let currentRequest = this.buildRequest(jobData);
let attemptCount = 0;
const maxAttempts = 3;
const recoveryHistory: RecoveryAttempt[] = [];
while (attemptCount < maxAttempts) {
try {
const result = await this.executeRequest(currentRequest);
return result;
} catch (error) {
attemptCount++;
if (!(error instanceof ConnectorError)) {
throw error; // Re-throw non-connector errors
}
// Find matching rule
const rule = ruleEngine.findMatchingRule(error);
if (!rule) {
// No rule found - use default behavior
logger.warn(`No error rule matched - using default error handling`);
throw error;
}
// Log internal note for debugging
logger.info(`📋 Error Rule Matched: ${rule.name}`, {
rule_id: rule.id,
internal_note: rule.internal_note
});
// Try recovery
const recovery = await ruleEngine.applyRecoveryStrategy(
currentRequest,
rule,
error
);
recoveryHistory.push({
rule_id: rule.id,
rule_name: rule.name,
modifications: rule.recovery.modifications,
result: recovery.shouldRetry ? 'attempted' : 'skipped'
});
if (!recovery.shouldRetry) {
// Can't recover - enhance error with user message from rule
error.context.user_message = ruleEngine.getUserMessage(rule, error);
error.context.recovery_attempts = recoveryHistory;
throw error;
}
// Apply delay if specified
if (rule.recovery.delay) {
const delayMs = this.calculateDelay(rule.recovery.delay, error);
logger.info(`⏳ Waiting ${delayMs}ms before retry (${rule.name})`);
await new Promise(resolve => setTimeout(resolve, delayMs));
}
// Update request for retry
currentRequest = recovery.modifiedRequest;
logger.info(`🔄 Retrying with modifications from rule ${rule.id} (attempt ${attemptCount}/${maxAttempts})`);
// Continue loop to retry
}
}
// If we get here, all retries failed
throw new ConnectorError(
FailureType.SYSTEM_ERROR,
FailureReason.UNKNOWN_ERROR,
`Failed after ${maxAttempts} attempts with rule-based recovery`,
false,
{ recovery_attempts: recoveryHistory }
);
}
Benefits of V2 System
1. Separation of Concerns
- Engineers: Build error capture and rule engine (once)
- Operations: Write YAML rules (no code changes needed)
- Users: Get helpful, context-aware error messages
2. Continuous Improvement
# Add new rule without touching code
- id: openai-dall-e-nsfw-002
name: "DALL-E NSFW Filter"
match:
failure_type: generation_refusal
failure_reason: nsfw_content
conditions:
- field: context.modelUsed
operator: equals
value: "dall-e-3"
user_message: |
DALL-E detected potentially inappropriate content in your request.
Try rephrasing without descriptive terms for people's appearance.
recovery:
enabled: false
3. Data-Driven Learning
Track which rules are matching most often:
// Analytics
interface RuleAnalytics {
rule_id: string;
match_count: number;
recovery_success_rate: number;
avg_retries_to_success: number;
}
4. A/B Testing Recovery Strategies
- id: openai-image-to-text-001-variant-a
enabled: true
traffic_percentage: 50 # Send 50% of matches here
modifications:
- type: prepend_to_prompt
value: "Generate an actual image. "
- id: openai-image-to-text-001-variant-b
enabled: true
traffic_percentage: 50
modifications:
- type: add_system_message
value: "Always use image_generation tool."5. User Customization
Allow customers to override rules per workspace:
# /config/error-rules/overrides/customer-abc123.yaml
overrides:
- rule_id: openai-rate-limit-001
user_message: "Your workspace has reached its API limit. Contact billing@company.com"Implementation Phases
Phase 1: Foundation (Week 1-2)
- [x] V1 already captures comprehensive error data ✅
- [ ] Build ErrorRuleEngine class
- [ ] YAML rule loader
- [ ] Basic rule matching (no recovery yet)
- [ ] Template substitution for user messages
Phase 2: Recovery Engine (Week 3-4)
- [ ] Request modification system
- [ ] Retry logic with delays
- [ ] Recovery attempt tracking
- [ ] Integration with worker connectors
Phase 3: Rule Library (Week 5-6)
- [ ] Create initial rule set for OpenAI
- [ ] Create rules for ComfyUI
- [ ] Create rules for other services
- [ ] Documentation for writing rules
Phase 4: Analytics & Optimization (Week 7-8)
- [ ] Rule match tracking
- [ ] Success rate analytics
- [ ] A/B testing framework
- [ ] Auto-suggest new rules based on patterns
Example Rule Files
OpenAI Responses
/config/error-rules/openai-responses.yaml - as shown above
ComfyUI
version: 2
service: comfyui
rules:
- id: comfyui-node-missing-001
name: "Missing Custom Node"
priority: 100
match:
failure_type: validation_error
failure_reason: component_error
conditions:
- field: context.componentError
operator: matches
value: "Cannot find node class"
user_message: |
The ComfyUI workflow requires a custom node that isn't installed: {{context.componentName}}
This workflow may have been created with different extensions than available on this system.
internal_note: "Custom node missing - need to track which nodes are available on which machines"
recovery:
enabled: false # Can't auto-fix missing nodes
Migration Path from V1
- No Breaking Changes: V1 continues to work
- Opt-in per Service: Add rule files service by service
- Gradual Enhancement: Start with user messages, add recovery later
- Analytics on Both: Compare V1 vs V2 error rates
Success Metrics
- User Retry Rate: Should decrease as auto-recovery improves
- Support Tickets: Fewer "what does this error mean" tickets
- Error Message Clarity: User surveys on message helpfulness
- Recovery Success Rate: % of errors that auto-recover
- Time to Add New Rule: Should be < 10 minutes
Future Enhancements
ML-Based Rule Suggestions
// Analyze error patterns and suggest new rules
interface RuleSuggestion {
pattern: ErrorPattern;
occurrences: number;
suggested_rule: Partial<ErrorRule>;
confidence: number;
}
User Feedback Loop
// Let users rate error messages
interface ErrorFeedback {
error_id: string;
helpful: boolean;
comment?: string;
}
Dynamic Rule Updates
// Hot-reload rules without restart
ruleEngine.watchRuleFiles();
Decision
Recommended: Implement V2 Error System with Declarative Rules
Rationale:
- Separates error intelligence from code
- Enables rapid iteration on error handling
- Provides better user experience
- Captures comprehensive data for learning
- Supports automatic error recovery
- Scales to new services easily
Next Steps:
- Review and approve this ADR
- Create /config/error-rules/ directory structure
- Implement ErrorRuleEngine class
- Create initial OpenAI rule set
- Integrate with one connector (OpenAI) as proof of concept
- Measure impact and iterate
Appendix: Current Implementation Analysis (2025-01-07)
Overview of Existing Error Handling ADRs
Three related ADRs exist:
ERROR_HANDLING_MODERNIZATION.md - Python → TypeScript error flow improvements
- Fix Python object serialization ([object Object] → proper JSON)
- Add structured error codes in Python (ErrorCode enum)
- Create ComfyUIErrorParser in TypeScript
- Status: Proposed, partially implemented
connector-error-handling-standard.md - Standardized connector error handling
- ConnectorError class with structured classification
- BaseConnector enforcement wrapper
- Protocol layers (HTTPConnector, WebSocketConnector)
- Status: Accepted, partially implemented
error-management-v2-declarative-rules.md (this document) - Declarative error recovery
- YAML-based error rules
- Automatic retry with modifications
- Pattern learning and A/B testing
- Status: Proposed, NOT implemented
Current Connector Error Handling State
✅ What's Working Well
BaseConnector (base-connector.ts:487-595):
- processJob() wrapper catches ALL errors automatically
- Converts errors to ConnectorError via ConnectorError.fromError()
- Validates connectors don't return {success: false} (throws if they do)
- Logs structured error data with telemetry
- Reports errors to Redis for monitoring
AsyncHTTPConnector (async-http-connector.ts:242-245):
- Protocol layer for HTTP-based connectors
- handleServiceError() hook for service-specific error handling
- Generic HTTP status mapping:
- 401/403 →
AUTH_ERROR/INVALID_API_KEY - 429 →
RATE_LIMIT/REQUESTS_PER_MINUTE - 4xx →
VALIDATION_ERROR - 5xx →
SERVICE_ERROR
- 401/403 →
- MIME type validation
- Fetch-based with timeout handling
OpenAIResponsesConnector (openai-responses-connector.ts:156-381):
- Semantic validation for content refusals
- Rich error context (rawServiceOutput, rawServiceRequest)
- MIME mismatch detection
- Refusal pattern detection:
const refusalPatterns = [
  /I[''']m sorry,?\s+but I can[''']t assist with that/i,
  /against (OpenAI[''']s? )?content policy/i,
  // ...
];
❌ Issues & Inconsistencies
1. Inconsistent Error Throwing
- Some connectors: throw new ConnectorError(...) ✅
- Some connectors: throw new Error(...) ⚠️ (gets converted by BaseConnector)
- Some connectors: return error in result object ⚠️ (gets converted by AsyncHTTPConnector)
Example from OpenAIBaseConnector (lines 768, 804, 823):
// ❌ Throws generic Error instead of ConnectorError
throw new Error(`OpenAI job ${openaiJobId} ${currentStatus}: ${errorMessage}`);
2. Service Error Hook Underutilized
- AsyncHTTPConnector provides handleServiceError() hook
- OpenAI connectors don't override it (missing OpenAI-specific error patterns)
- ComfyUI connectors likely similar
- Opportunity for service-specific classification before generic fallback
3. Missing Rich Context
- Not all errors include rawRequest / rawResponse
- Some errors lack serviceJobId, retryAfterSeconds, etc.
- Context helps with debugging and forensic analysis
4. No Declarative Rules (V2)
- This ADR proposed but not implemented
- No automatic retry with modifications
- No pattern learning
- No YAML rule files
- No user-friendly message templates
5. Python Error Codes Not Implemented
- ERROR_HANDLING_MODERNIZATION.md ADR not implemented
- Still getting [object Object] serialization issues from ComfyUI
- No structured error codes from Python layer
Connector Error Handling Patterns Summary
| Connector | Extends | Error Pattern | Service Hook | Context | Rating |
|---|---|---|---|---|---|
| BaseConnector | - | Catches all, converts to ConnectorError | ✅ | ✅ Full | ⭐⭐⭐⭐⭐ |
| AsyncHTTPConnector | BaseConnector | Generic HTTP mapping + hook | ✅ Provides | ✅ Good | ⭐⭐⭐⭐ |
| OpenAIBaseConnector | BaseConnector | Throws generic Error | ❌ None | ⚠️ Partial | ⭐⭐⭐ |
| OpenAIResponsesConnector | AsyncHTTPConnector | Returns error in result | ❌ None | ✅ Good | ⭐⭐⭐⭐ |
| ComfyUIRestStreamConnector | AsyncHTTPConnector | Unknown | Unknown | Unknown | ❓ |
Recommendations for V2 Implementation
1. Implement Service Error Hooks First
Before implementing declarative rules, standardize service-specific error handling:
// In OpenAIBaseConnector
protected handleServiceError(error: any, jobData: JobData): ConnectorError | null {
// Handle OpenAI-specific error codes
if (error.response?.data?.error?.code) {
const errorCode = error.response.data.error.code;
const errorMessage = error.response.data.error.message;
switch (errorCode) {
case 'rate_limit_exceeded':
return new ConnectorError(
FailureType.RATE_LIMIT,
FailureReason.REQUESTS_PER_MINUTE,
errorMessage,
true,
{
serviceType: 'openai',
httpStatus: error.response.status,
rawResponse: error.response.data,
retryAfterSeconds: error.response.headers?.['retry-after']
}
);
case 'invalid_api_key':
return new ConnectorError(
FailureType.AUTH_ERROR,
FailureReason.INVALID_API_KEY,
errorMessage,
false,
{ serviceType: 'openai', httpStatus: 401 }
);
case 'content_policy_violation':
return new ConnectorError(
FailureType.GENERATION_REFUSAL,
FailureReason.SAFETY_FILTER,
errorMessage,
false,
{
serviceType: 'openai',
rawRequest: jobData.payload,
rawResponse: error.response.data
}
);
}
}
return null; // Fall through to generic handling
}
2. Standardize on ConnectorError
- Update OpenAIBaseConnector to throw ConnectorError instead of Error (see the sketch after this list)
- Add rich context to all errors (rawRequest, rawResponse, serviceJobId)
- Ensure all connectors follow the same pattern
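For illustration, the generic throw shown earlier could be rewritten along these lines (a sketch, not a prescribed implementation; statusResponse is a hypothetical variable for the polled status payload, and the context field names follow the interfaces earlier in this ADR):
throw new ConnectorError(
  FailureType.SERVICE_ERROR,
  FailureReason.UNKNOWN_ERROR,
  `OpenAI job ${openaiJobId} ${currentStatus}: ${errorMessage}`,
  true, // retryable: a failed poll may succeed on a later attempt
  {
    serviceType: 'openai',
    serviceJobId: openaiJobId,
    rawResponse: statusResponse // hypothetical: whatever payload the status poll returned
  }
);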
3. Semantic Validation Framework
Generalize the refusal pattern detection from OpenAIResponsesConnector:
// packages/core/src/services/semantic-validator.ts
export class SemanticValidator {
private static refusalPatterns = [
/I[''']m sorry,?\s+but I can[''']t assist with that/i,
/cannot generate/i,
/unable to create/i,
/policy violation/i,
/content policy/i,
// ... more patterns
];
static detectRefusal(text: string): {
isRefusal: boolean;
confidence: number;
matchedPattern?: string;
} {
// Pattern matching logic
}
static detectMimeTypeMismatch(
requestedType: string,
actualContent: any
): boolean {
// MIME validation logic
}
}
4. Phased V2 Implementation
Given the current state, recommend this approach:
Phase 1: Foundation (Week 1-2) - Focus on standardization first
- [ ] Implement service error hooks in all connectors
- [ ] Standardize ConnectorError usage
- [ ] Add semantic validation framework
- [ ] Ensure rich context in all errors
Phase 2: Rule Engine Core (Week 3-4) - Build the infrastructure
- [ ] Implement ErrorRuleEngine class
- [ ] YAML rule loader
- [ ] Rule matching logic
- [ ] Template substitution for user messages
Phase 3: Recovery Strategies (Week 5-6) - Add intelligence
- [ ] Request modification system
- [ ] Retry logic with delays
- [ ] Recovery attempt tracking
- [ ] Integration with connectors
Phase 4: Rule Library (Week 7-8) - Build the knowledge base
- [ ] OpenAI rule set (based on observed patterns)
- [ ] ComfyUI rule set
- [ ] Documentation for writing rules
- [ ] Rule testing framework
Phase 5: Analytics & Learning (Week 9-10) - Close the loop
- [ ] Rule match tracking
- [ ] Success rate analytics
- [ ] A/B testing framework
- [ ] Auto-suggest new rules based on patterns
5. Immediate Quick Wins
Before implementing V2, these changes provide immediate value:
- Add service error hooks to OpenAI connectors (1-2 days)
- Create semantic validator for LLM responses (1 day)
- Standardize error throwing in OpenAIBaseConnector (1 day)
- Add rich context to all errors (1 day)
Total: 1 week of work for significant improvement
Key Insights for V2 Design
1. Service Hook Pattern Works Well
- AsyncHTTPConnector's handleServiceError() hook is a clean pattern
- Allows service-specific logic without bloating base classes
- Should be the integration point for declarative rules (see the sketch below)
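A hypothetical illustration of that integration point (this.ruleEngine, this.serviceType, and the fromError signature are assumptions, not existing code): a connector override classifies the error, then lets the rule engine attach its templated user message before the error propagates.
protected handleServiceError(error: any, jobData: JobData): ConnectorError | null {
  // Normalize to a ConnectorError first (fromError is the BaseConnector helper mentioned above)
  const connectorError = ConnectorError.fromError(error, { serviceType: this.serviceType });

  // Consult the declarative rules; if one matches, surface its user-facing message
  const rule = this.ruleEngine.findMatchingRule(connectorError);
  if (rule) {
    connectorError.context.user_message = this.ruleEngine.getUserMessage(rule, connectorError);
  }
  return connectorError;
}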
2. Semantic Validation is Critical
- OpenAIResponsesConnector's refusal detection prevents silent failures
- Image-vs-text mismatch detection catches model misbehavior
- This intelligence should be in rule engine, not individual connectors
3. Rich Context Enables Learning
- Errors with rawRequest and rawResponse enable pattern analysis
- Missing context makes it impossible to write good rules
- Context must be standard before V2 can succeed
4. Connector Compliance is Key
- V2 only works if connectors use ConnectorError consistently
- BaseConnector validation catches the {success: false} pattern
- Need similar validation for generic Error throws
5. Start Simple, Evolve
- Don't implement all V2 features at once
- Start with static rules (no recovery)
- Add recovery strategies incrementally
- Add A/B testing and learning last
Migration Path to V2
Step 1: Current State Audit ✅ (Complete - this document)
- [x] Document existing error handling patterns
- [x] Identify gaps and inconsistencies
- [x] Analyze connector implementations
Step 2: Standardization (Before V2)
- [ ] Implement service error hooks in all connectors
- [ ] Standardize ConnectorError usage
- [ ] Add semantic validation framework
- [ ] Ensure rich context in all errors
Step 3: V2 Foundation (Weeks 1-4)
- [ ] Build ErrorRuleEngine core
- [ ] Create YAML rule format
- [ ] Implement rule matching
- [ ] Add user message templates
Step 4: V2 Intelligence (Weeks 5-8)
- [ ] Request modification system
- [ ] Recovery strategies
- [ ] Rule library for OpenAI and ComfyUI
- [ ] Integration testing
Step 5: V2 Learning (Weeks 9-10)
- [ ] Analytics and tracking
- [ ] A/B testing
- [ ] Auto-suggest rules
- [ ] User feedback loop
Step 6: Python Integration (Optional - if ComfyUI priority)
- [ ] Implement Python ErrorCode enum
- [ ] Fix object serialization
- [ ] Create ComfyUIErrorParser
- [ ] Integrate with existing connectors
Success Metrics Tracking
To measure V2 effectiveness, track these metrics:
Before V2 (Baseline - capture now):
- [ ] Error classification accuracy (manual audit of 100 errors)
- [ ] % of errors with helpful user messages
- [ ] Average retry attempts per error type
- [ ] Support tickets related to error messages
- [ ] Time to resolve production errors (MTTR)
After V2 Implementation:
- [ ] Error classification accuracy improvement
- [ ] User message helpfulness rating (survey)
- [ ] Reduction in wasted retries (non-retryable errors)
- [ ] Support ticket reduction
- [ ] MTTR improvement
- [ ] Auto-recovery success rate
Risk Assessment
Low Risk:
- Adding service error hooks (backwards compatible)
- Creating semantic validator (pure utility)
- Standardizing ConnectorError usage (caught by BaseConnector)
Medium Risk:
- Rule engine implementation (new system, needs testing)
- Request modification (could make problems worse if buggy)
- YAML parsing and validation (security considerations)
High Risk:
- Automatic retries with modifications (could cause cascading failures)
- A/B testing in production (could confuse users)
- Python error code integration (requires ComfyUI changes)
Mitigation:
- Feature flags for rule engine (enable per service)
- Dry-run mode (log rule matches without taking action; sketched after this list)
- Gradual rollout (one service at a time)
- Comprehensive testing (unit + integration + E2E)
- Monitoring and alerting (track rule effectiveness)
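The feature-flag and dry-run mitigations could be a few lines in the worker flow (a hypothetical sketch; the environment variable names and this.serviceType are assumptions):
function ruleEngineActiveFor(serviceType: string): boolean {
  // Enable the rule engine per service via env, with a global dry-run switch
  const enabledServices = (process.env.ERROR_RULES_ENABLED_SERVICES ?? '').split(',');
  const dryRun = process.env.ERROR_RULES_DRY_RUN === 'true';
  return enabledServices.includes(serviceType) && !dryRun;
}

// In the worker flow, before applying a recovery strategy:
if (rule && !ruleEngineActiveFor(this.serviceType)) {
  logger.info(`🧪 [dry-run] Rule ${rule.id} matched but recovery was not applied`, {
    rule_name: rule.name,
  });
  throw error; // fall back to existing V1 behavior
}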
Open Questions for V2
Rule Priority: How do we handle overlapping rules?
- Answer: Priority field (higher number = higher priority)
- First matching rule wins
Rule Updates: Hot-reload or restart required?
- Answer: Start with restart, add hot-reload later
- Use file watcher for development
Rule Testing: How do we test rules without production traffic?
- Answer: Dry-run mode + replay production errors
- Create rule test framework
Rule Versioning: How do we track rule changes over time?
- Answer: Git version control for YAML files
- Add version field to rule format
Rule Conflicts: What if multiple services use same error pattern?
- Answer: Service-specific rule files
- Rules are scoped to service_type
Recovery Limits: How many times should we retry with modifications?
- Answer: Configurable per rule (max_attempts)
- Default: 1 retry to avoid cascading
User Override: Should users be able to customize rules?
- Answer: Yes, via override files
- Workspace-specific customization
Telemetry: What data should we collect about rule effectiveness?
- Answer: Match count, success rate, average retries
- Store in Redis for analytics
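As an illustration of that telemetry (a sketch, not an existing API; it assumes an ioredis-style client), per-rule counters could be incremented whenever findMatchingRule returns a rule:
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL); // hypothetical connection string

async function recordRuleMatch(ruleId: string, recovered: boolean): Promise<void> {
  // One hash per rule: total matches plus successful recoveries
  await redis.hincrby(`error-rules:stats:${ruleId}`, 'match_count', 1);
  if (recovered) {
    await redis.hincrby(`error-rules:stats:${ruleId}`, 'recovery_success_count', 1);
  }
}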
Last Updated: 2025-01-07 Analyst: Claude Code Status: Analysis Complete - Ready for V2 Planning
