
ADR: Error Management System V2 - Declarative Response Rules

Status: Proposed
Date: 2025-11-07
Authors: Claude Code
Context: Current error system provides structured errors but lacks intelligent error recovery and user-friendly messaging


Problem Statement

Current error system (V1) has fundamental limitations:

  1. Static User Messages: Generic messages like "Service returned unexpected content type" don't help users fix their prompts
  2. No Error Recovery: System can't automatically retry with modified prompts or adjusted parameters
  3. No Pattern Learning: Same errors repeat without system learning optimal responses
  4. Developer Burden: Every new error pattern requires code changes across multiple files
  5. Lost Intelligence: Raw error data is captured but not analyzed for actionable insights

Example Current Flow:

User: "turn my pfp into the locked in gamer meme"

OpenAI returns text instead of image

Error: "MIME type mismatch: requested image/png but received text"

User sees: "Service returned unexpected content type. Please try again."

User tries again... same error... gives up

What We Need:

User: "turn my pfp into the locked in gamer meme"

OpenAI returns text instead of image

System recognizes: Pattern #47 - "Image request returned JSON"

System applies rule: "Add explicit image generation instruction"

Auto-retry with: "Generate an image of... (DO NOT return JSON instructions)"

Success OR User sees: "The AI tried to explain how to create the image instead of
generating it. Try being more explicit: 'Create an actual image of...'"

V2 Architecture: Three-Layer Error Intelligence

Layer 1: Comprehensive Internal Capture (Existing - Enhanced)

Purpose: Capture EVERYTHING for forensic analysis

typescript
interface ErrorAttestation {
  // Core Classification
  failure_type: FailureType;
  failure_reason: FailureReason;
  retryable: boolean;

  // Full Context (unlimited detail for internal use)
  raw_request: object;           // Complete request payload
  raw_response: object;          // Complete response (with smart truncation for base64)
  requested_mime_type: string;
  actual_mime_type: string;
  http_status?: number;

  // Service Context
  service_type: string;          // 'openai_responses', 'comfyui', etc.
  model_used?: string;           // 'gpt-4.1', 'dall-e-3', etc.
  component_name?: string;       // Which part of pipeline failed

  // Job Context
  job_id: string;
  workflow_id?: string;
  retry_count: number;

  // Pattern Matching Keys (for rule engine)
  error_signature: string;       // Hash of error pattern for matching
  similar_errors_count: number;  // How many times we've seen this before

  // Recovery Attempts
  recovery_attempts: RecoveryAttempt[];
}

interface RecoveryAttempt {
  rule_id: string;
  rule_name: string;
  modifications: object;         // What we changed
  result: 'success' | 'failed' | 'skipped' | 'attempted';  // 'attempted' = retry issued, final outcome recorded separately
  new_error?: ErrorAttestation;  // If retry failed, what error?
}

Key Enhancement: Add error_signature generation

typescript
import { createHash } from 'crypto';

function generateErrorSignature(error: ConnectorError): string {
  // Build a stable pattern object, then hash it so identical error shapes share one signature
  const pattern = {
    failure_type: error.failureType,
    failure_reason: error.failureReason,
    service_type: error.context.serviceType,
    model: error.context.modelUsed,
    requested_mime: error.context.requestedMimeType,
    actual_mime: error.context.actualMimeType,
    http_status: error.context.httpStatus
  };
  return createHash('sha256').update(JSON.stringify(pattern)).digest('hex');
}
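
The similar_errors_count field above can be derived from this signature. A minimal sketch, assuming a Redis client is available to the worker; the recordErrorSignature helper, key prefix, and 30-day TTL are illustrative, not existing code:

typescript
// Hypothetical helper: count how often an error signature has been seen.
async function recordErrorSignature(
  redis: { incr(key: string): Promise<number>; expire(key: string, seconds: number): Promise<number> },
  signature: string
): Promise<number> {
  const key = `error-signature:${signature}`;
  const count = await redis.incr(key);         // one counter per error pattern
  await redis.expire(key, 60 * 60 * 24 * 30);  // keep counts for a rolling 30-day window
  return count;                                // feeds similar_errors_count in ErrorAttestation
}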

Layer 2: Declarative Error Response Rules (NEW)

Purpose: Define error recovery strategies and user messaging as data, not code

Rule File: /config/error-rules/openai-responses.yaml

yaml
# Error Response Rules for OpenAI Responses Service
version: 2
service: openai_responses

rules:
  # Rule 1: Image request returned JSON/text explanation
  - id: openai-image-to-text-001
    name: "Image Request Returned Text Explanation"
    priority: 100

    # Pattern Matching
    match:
      failure_type: response_error
      failure_reason: unexpected_content_type
      conditions:
        - field: context.requestedMimeType
          operator: equals
          value: "image/png"
        - field: context.actualMimeType
          operator: in
          values: ["text", "application/json"]
        - field: context.modelUsed
          operator: startsWith
          value: "gpt-4"

    # User-Facing Message (generic, helpful)
    user_message: |
      The AI returned instructions on how to create the image instead of generating it.
      Try being more explicit: "Generate an actual image of [your request]"

    # Internal Note (for our debugging)
    internal_note: |
      GPT-4.1 with image_generation tool sometimes returns JSON prompt instead of
      calling the tool. This happens when the user request is ambiguous about whether
      they want an image or image instructions.

    # Automatic Recovery Strategy
    recovery:
      enabled: true
      max_attempts: 1

      # Modify the request before retry
      modifications:
        - type: prepend_to_prompt
          value: "IMPORTANT: Generate an actual image file, do not return text instructions or JSON. "

        - type: add_system_message
          value: "You must use the image_generation tool to create an actual image. Never return text descriptions or JSON prompts."

        - type: set_parameter
          path: "tools[0].strict"
          value: true

      # Only retry if conditions met
      retry_conditions:
        - field: retry_count
          operator: less_than
          value: 2
        - field: workflow_id
          operator: exists
          value: true  # Only auto-retry in workflows

  # Rule 2: Rate limit with retry-after
  - id: openai-rate-limit-001
    name: "OpenAI Rate Limit with Retry-After"
    priority: 90

    match:
      failure_type: rate_limit
      failure_reason: requests_per_minute
      conditions:
        - field: context.retryAfterSeconds
          operator: exists
          value: true

    user_message: |
      Rate limit reached. Your request will automatically retry in {{context.retryAfterSeconds}} seconds.

    internal_note: "Standard OpenAI rate limit - should auto-retry"

    recovery:
      enabled: true
      max_attempts: 3

      # Wait before retry
      delay:
        type: from_header
        header: retry-after
        fallback_seconds: 60

      modifications: []  # No changes needed, just retry as-is

      retry_conditions:
        - field: retry_count
          operator: less_than
          value: 3

  # Rule 3: Model not found - suggest alternatives
  - id: openai-model-404-001
    name: "Model Not Found"
    priority: 80

    match:
      failure_type: validation_error
      failure_reason: model_not_found

    user_message: |
      The requested AI model is not available. Common alternatives:
      - For text: gpt-4-turbo, gpt-3.5-turbo
      - For images: dall-e-3
      Please contact support if you need a specific model.

    internal_note: "Model name typo or deprecated model"

    recovery:
      enabled: false  # Don't auto-retry - needs user decision

  # Rule 4: MIME mismatch - generic fallback
  - id: generic-mime-mismatch-001
    name: "Generic MIME Type Mismatch"
    priority: 10  # Low priority - catches anything not matched above

    match:
      failure_type: response_error
      failure_reason: unexpected_content_type

    user_message: |
      The AI returned {{context.actualMimeType}} instead of the expected {{context.requestedMimeType}}.
      This usually means the AI didn't understand the format you wanted.
      Try rephrasing your request more explicitly.

    internal_note: "Unhandled MIME mismatch - investigate pattern"

    recovery:
      enabled: false

  # Rule 5: Content Policy Violation - Clear guidance
  - id: openai-content-policy-001
    name: "Content Policy Violation"
    priority: 95

    match:
      failure_type: generation_refusal
      failure_reason: safety_filter
      conditions:
        - field: context.rawResponse.error.code
          operator: equals
          value: "content_policy_violation"

    user_message: |
      Your request was blocked by content safety filters. This can happen if:
      - The prompt contains potentially harmful content
      - The request could generate inappropriate material
      - Certain words triggered automated filters

      Try rephrasing your request in a different way.

    internal_note: "OpenAI content filter triggered"

    recovery:
      enabled: false  # Don't retry - will fail again

Layer 3: Rule Engine Execution (NEW)

Purpose: Match errors to rules and execute recovery strategies

typescript
// /packages/core/src/services/error-rule-engine.ts

// Assumed dependencies: the 'glob' and 'yaml' packages plus Node's fs/promises
import { promises as fs } from 'fs';
import { glob } from 'glob';
import * as YAML from 'yaml';

interface ErrorRule {
  id: string;
  name: string;
  priority: number;
  match: RuleMatch;
  user_message: string;
  internal_note: string;
  recovery: RecoveryStrategy;
}

interface RuleMatch {
  failure_type: FailureType;
  failure_reason: FailureReason;
  conditions?: MatchCondition[];
}

interface MatchCondition {
  field: string;           // JSONPath to field in error context
  operator: 'equals' | 'in' | 'exists' | 'startsWith' | 'matches' | 'less_than' | 'greater_than';
  value: any;
  values?: any[];
}

interface RecoveryStrategy {
  enabled: boolean;
  max_attempts: number;
  delay?: {
    type: 'fixed' | 'exponential' | 'from_header';
    seconds?: number;
    header?: string;
    fallback_seconds?: number;
  };
  modifications: RequestModification[];
  retry_conditions?: MatchCondition[];
}

interface RequestModification {
  type: 'prepend_to_prompt' | 'append_to_prompt' | 'add_system_message' | 'set_parameter' | 'remove_parameter';
  value?: any;
  path?: string;  // JSONPath for set/remove operations
}

class ErrorRuleEngine {
  private rules: Map<string, ErrorRule[]> = new Map(); // service_type -> rules
  private ready: Promise<void>;

  constructor() {
    // loadRules() is async; callers should await `ready` before matching errors
    this.ready = this.loadRules();
  }

  /**
   * Load all error rules from YAML configs
   */
  private async loadRules(): Promise<void> {
    const ruleFiles = await glob('/config/error-rules/*.yaml');

    for (const file of ruleFiles) {
      // YAML.parse expects file contents, not a path - read the file first
      const content = await fs.readFile(file, 'utf8');
      const config = YAML.parse(content);
      this.rules.set(config.service, config.rules);
    }
  }

  /**
   * Find matching rule for an error
   */
  findMatchingRule(error: ConnectorError): ErrorRule | null {
    const serviceRules = this.rules.get(error.context.serviceType) || [];

    // Sort by priority (highest first)
    const sortedRules = serviceRules.sort((a, b) => b.priority - a.priority);

    for (const rule of sortedRules) {
      if (this.matchesRule(error, rule)) {
        logger.info(`📋 Matched error to rule: ${rule.id} - ${rule.name}`);
        return rule;
      }
    }

    logger.warn(`⚠️ No matching rule found for error: ${error.failureReason}`);
    return null;
  }

  /**
   * Check if error matches rule conditions
   */
  private matchesRule(error: ConnectorError, rule: ErrorRule): boolean {
    // Check basic match
    if (error.failureType !== rule.match.failure_type) return false;
    if (error.failureReason !== rule.match.failure_reason) return false;

    // Check additional conditions
    if (!rule.match.conditions) return true;

    for (const condition of rule.match.conditions) {
      if (!this.evaluateCondition(error, condition)) {
        return false;
      }
    }

    return true;
  }

  /**
   * Evaluate a single condition
   */
  private evaluateCondition(error: ConnectorError, condition: MatchCondition): boolean {
    const actualValue = this.getFieldValue(error, condition.field);

    switch (condition.operator) {
      case 'equals':
        return actualValue === condition.value;
      case 'in':
        return condition.values?.includes(actualValue) ?? false;
      case 'exists':
        return (actualValue !== undefined && actualValue !== null) === condition.value;
      case 'startsWith':
        return typeof actualValue === 'string' && actualValue.startsWith(condition.value);
      case 'matches':
        return typeof actualValue === 'string' && new RegExp(condition.value).test(actualValue);
      case 'less_than':
        return typeof actualValue === 'number' && actualValue < condition.value;
      case 'greater_than':
        return typeof actualValue === 'number' && actualValue > condition.value;
      default:
        return false;
    }
  }

  /**
   * Get field value from error using JSONPath
   */
  private getFieldValue(error: ConnectorError, path: string): any {
    // Simple JSONPath implementation
    const parts = path.split('.');
    let value: any = error;

    for (const part of parts) {
      if (value && typeof value === 'object') {
        value = value[part];
      } else {
        return undefined;
      }
    }

    return value;
  }

  /**
   * Apply recovery strategy and return modified request
   */
  async applyRecoveryStrategy(
    originalRequest: any,
    rule: ErrorRule,
    error: ConnectorError
  ): Promise<{ shouldRetry: boolean; modifiedRequest?: any; reason?: string }> {

    if (!rule.recovery.enabled) {
      return { shouldRetry: false, reason: 'Recovery disabled for this rule' };
    }

    // Check retry conditions
    if (rule.recovery.retry_conditions) {
      for (const condition of rule.recovery.retry_conditions) {
        if (!this.evaluateCondition(error, condition)) {
          return { shouldRetry: false, reason: `Retry condition failed: ${condition.field}` };
        }
      }
    }

    // Apply modifications
    let modifiedRequest = JSON.parse(JSON.stringify(originalRequest));

    for (const modification of rule.recovery.modifications) {
      modifiedRequest = this.applyModification(modifiedRequest, modification);
    }

    logger.info(`✨ Applied recovery modifications from rule ${rule.id}`, {
      modifications: rule.recovery.modifications.map(m => m.type)
    });

    return { shouldRetry: true, modifiedRequest };
  }

  /**
   * Apply a single modification to request
   */
  private applyModification(request: any, modification: RequestModification): any {
    switch (modification.type) {
      case 'prepend_to_prompt':
        // Find the prompt in request structure and prepend
        if (request.input && Array.isArray(request.input)) {
          const userMessage = request.input.find((msg: any) => msg.role === 'user');
          if (userMessage && userMessage.content) {
            if (typeof userMessage.content === 'string') {
              userMessage.content = modification.value + userMessage.content;
            } else if (Array.isArray(userMessage.content)) {
              const textContent = userMessage.content.find((c: any) => c.type === 'input_text');
              if (textContent) {
                textContent.text = modification.value + textContent.text;
              }
            }
          }
        }
        break;

      case 'append_to_prompt':
        // Similar to prepend but at end
        if (request.input && Array.isArray(request.input)) {
          const userMessage = request.input.find((msg: any) => msg.role === 'user');
          if (userMessage && userMessage.content) {
            if (typeof userMessage.content === 'string') {
              userMessage.content = userMessage.content + modification.value;
            } else if (Array.isArray(userMessage.content)) {
              const textContent = userMessage.content.find((c: any) => c.type === 'input_text');
              if (textContent) {
                textContent.text = textContent.text + modification.value;
              }
            }
          }
        }
        break;

      case 'add_system_message':
        if (request.input && Array.isArray(request.input)) {
          request.input.unshift({
            role: 'system',
            content: modification.value
          });
        }
        break;

      case 'set_parameter':
        if (modification.path) {
          this.setValueAtPath(request, modification.path, modification.value);
        }
        break;

      case 'remove_parameter':
        if (modification.path) {
          this.deleteValueAtPath(request, modification.path);
        }
        break;
    }

    return request;
  }

  /**
   * Set value at JSONPath
   */
  private setValueAtPath(obj: any, path: string, value: any): void {
    const parts = path.split('.');
    let current = obj;

    for (let i = 0; i < parts.length - 1; i++) {
      const part = parts[i];
      const arrayMatch = part.match(/(.+)\[(\d+)\]/);

      if (arrayMatch) {
        const [, key, index] = arrayMatch;
        if (!current[key]) current[key] = [];
        if (!current[key][parseInt(index)]) current[key][parseInt(index)] = {};
        current = current[key][parseInt(index)];
      } else {
        if (!current[part]) current[part] = {};
        current = current[part];
      }
    }

    const lastPart = parts[parts.length - 1];
    current[lastPart] = value;
  }

  /**
   * Delete value at JSONPath
   */
  private deleteValueAtPath(obj: any, path: string): void {
    const parts = path.split('.');
    let current = obj;

    for (let i = 0; i < parts.length - 1; i++) {
      if (!current[parts[i]]) return;
      current = current[parts[i]];
    }

    delete current[parts[parts.length - 1]];
  }

  /**
   * Get user-friendly error message with template substitution
   */
  getUserMessage(rule: ErrorRule, error: ConnectorError): string {
    let message = rule.user_message;

    // Replace {{context.field}} with actual values
    message = message.replace(/\{\{([^}]+)\}\}/g, (match, path) => {
      const value = this.getFieldValue(error, path);
      return value !== undefined ? String(value) : match;
    });

    return message;
  }
}

Integration with Existing Error Flow

Modified Worker Flow:

typescript
// In async-rest-connector.ts or openai-base-connector.ts

async processJob(jobData: Job): Promise<JobResult> {
  const ruleEngine = new ErrorRuleEngine();
  let currentRequest = this.buildRequest(jobData);
  let attemptCount = 0;
  const maxAttempts = 3;
  const recoveryHistory: RecoveryAttempt[] = [];

  while (attemptCount < maxAttempts) {
    try {
      const result = await this.executeRequest(currentRequest);
      return result;

    } catch (error) {
      attemptCount++;

      if (!(error instanceof ConnectorError)) {
        throw error; // Re-throw non-connector errors
      }

      // Find matching rule
      const rule = ruleEngine.findMatchingRule(error);

      if (!rule) {
        // No rule found - use default behavior
        logger.warn(`No error rule matched - using default error handling`);
        throw error;
      }

      // Log internal note for debugging
      logger.info(`📋 Error Rule Matched: ${rule.name}`, {
        rule_id: rule.id,
        internal_note: rule.internal_note
      });

      // Try recovery
      const recovery = await ruleEngine.applyRecoveryStrategy(
        currentRequest,
        rule,
        error
      );

      recoveryHistory.push({
        rule_id: rule.id,
        rule_name: rule.name,
        modifications: rule.recovery.modifications,
        result: recovery.shouldRetry ? 'attempted' : 'skipped'
      });

      if (!recovery.shouldRetry) {
        // Can't recover - enhance error with user message from rule
        error.context.user_message = ruleEngine.getUserMessage(rule, error);
        error.context.recovery_attempts = recoveryHistory;
        throw error;
      }

      // Apply delay if specified
      if (rule.recovery.delay) {
        const delayMs = this.calculateDelay(rule.recovery.delay, error);
        logger.info(`⏳ Waiting ${delayMs}ms before retry (${rule.name})`);
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }

      // Update request for retry
      currentRequest = recovery.modifiedRequest;

      logger.info(`🔄 Retrying with modifications from rule ${rule.id} (attempt ${attemptCount}/${maxAttempts})`);

      // Continue loop to retry
    }
  }

  // If we get here, all retries failed
  throw new ConnectorError(
    FailureType.SYSTEM_ERROR,
    FailureReason.UNKNOWN_ERROR,
    `Failed after ${maxAttempts} attempts with rule-based recovery`,
    false,
    { recovery_attempts: recoveryHistory }
  );
}
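
The loop above calls a calculateDelay helper that this document does not define. A minimal sketch, assuming the delay shape from RecoveryStrategy and that retryAfterSeconds has already been parsed onto the error context (in the connector this would be a method rather than a standalone function):

typescript
// Hypothetical sketch of the calculateDelay helper referenced above.
function calculateDelay(
  delay: RecoveryStrategy['delay'],
  error: ConnectorError,
  attempt = 1
): number {
  if (!delay) return 0;

  switch (delay.type) {
    case 'fixed':
      return (delay.seconds ?? 0) * 1000;
    case 'exponential':
      // seconds * 2^(attempt-1), e.g. 5s, 10s, 20s...
      return (delay.seconds ?? 1) * 1000 * Math.pow(2, attempt - 1);
    case 'from_header': {
      // Assumes the connector copied the retry-after header into error.context.retryAfterSeconds
      const fromHeader = error.context?.retryAfterSeconds;
      return (fromHeader ?? delay.fallback_seconds ?? 60) * 1000;
    }
    default:
      return 0;
  }
}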

Benefits of V2 System

1. Separation of Concerns

  • Engineers: Build error capture and rule engine (once)
  • Operations: Write YAML rules (no code changes needed)
  • Users: Get helpful, context-aware error messages

2. Continuous Improvement

yaml
# Add new rule without touching code
- id: openai-dall-e-nsfw-002
  name: "DALL-E NSFW Filter"
  match:
    failure_type: generation_refusal
    failure_reason: nsfw_content
    conditions:
      - field: context.modelUsed
        operator: equals
        value: "dall-e-3"

  user_message: |
    DALL-E detected potentially inappropriate content in your request.
    Try rephrasing without descriptive terms for people's appearance.

  recovery:
    enabled: false

3. Data-Driven Learning

Track which rules are matching most often:

typescript
// Analytics
interface RuleAnalytics {
  rule_id: string;
  match_count: number;
  recovery_success_rate: number;
  avg_retries_to_success: number;
}
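
One possible way to keep these counters current is to bump them whenever a rule matches. A sketch only; the Redis hash layout and the recordRuleMatch helper are assumptions, not existing code:

typescript
// Hypothetical analytics recorder: increments per-rule counters after each match.
async function recordRuleMatch(
  redis: { hincrby(key: string, field: string, by: number): Promise<number> },
  ruleId: string,
  recovered: boolean
): Promise<void> {
  const key = `error-rule-analytics:${ruleId}`;
  await redis.hincrby(key, 'match_count', 1);
  if (recovered) {
    await redis.hincrby(key, 'recovery_success_count', 1);
  }
  // recovery_success_rate = recovery_success_count / match_count, computed at read time
}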

4. A/B Testing Recovery Strategies

yaml
- id: openai-image-to-text-001-variant-a
  enabled: true
  traffic_percentage: 50  # Send 50% of matches here
  modifications:
    - type: prepend_to_prompt
      value: "Generate an actual image. "

- id: openai-image-to-text-001-variant-b
  enabled: true
  traffic_percentage: 50
  modifications:
    - type: add_system_message
      value: "Always use image_generation tool."

5. User Customization

Allow customers to override rules per workspace:

yaml
# /config/error-rules/overrides/customer-abc123.yaml
overrides:
  - rule_id: openai-rate-limit-001
    user_message: "Your workspace has reached its API limit. Contact billing@company.com"

Implementation Phases

Phase 1: Foundation (Week 1-2)

  • [x] V1 already captures comprehensive error data ✅
  • [ ] Build ErrorRuleEngine class
  • [ ] YAML rule loader
  • [ ] Basic rule matching (no recovery yet)
  • [ ] Template substitution for user messages

Phase 2: Recovery Engine (Week 3-4)

  • [ ] Request modification system
  • [ ] Retry logic with delays
  • [ ] Recovery attempt tracking
  • [ ] Integration with worker connectors

Phase 3: Rule Library (Week 5-6)

  • [ ] Create initial rule set for OpenAI
  • [ ] Create rules for ComfyUI
  • [ ] Create rules for other services
  • [ ] Documentation for writing rules

Phase 4: Analytics & Optimization (Week 7-8)

  • [ ] Rule match tracking
  • [ ] Success rate analytics
  • [ ] A/B testing framework
  • [ ] Auto-suggest new rules based on patterns

Example Rule Files

OpenAI Responses

/config/error-rules/openai-responses.yaml - as shown above

ComfyUI

yaml
version: 2
service: comfyui

rules:
  - id: comfyui-node-missing-001
    name: "Missing Custom Node"
    priority: 100

    match:
      failure_type: validation_error
      failure_reason: component_error
      conditions:
        - field: context.componentError
          operator: matches
          value: "Cannot find node class"

    user_message: |
      The ComfyUI workflow requires a custom node that isn't installed: {{context.componentName}}
      This workflow may have been created with extensions that are not available on this system.

    internal_note: "Custom node missing - need to track which nodes are available on which machines"

    recovery:
      enabled: false  # Can't auto-fix missing nodes

Migration Path from V1

  1. No Breaking Changes: V1 continues to work
  2. Opt-in per Service: Add rule files service by service
  3. Gradual Enhancement: Start with user messages, add recovery later
  4. Analytics on Both: Compare V1 vs V2 error rates

Success Metrics

  • User Retry Rate: Should decrease as auto-recovery improves
  • Support Tickets: Fewer "what does this error mean" tickets
  • Error Message Clarity: User surveys on message helpfulness
  • Recovery Success Rate: % of errors that auto-recover
  • Time to Add New Rule: Should be < 10 minutes

Future Enhancements

ML-Based Rule Suggestions

typescript
// Analyze error patterns and suggest new rules
interface RuleSuggestion {
  pattern: ErrorPattern;
  occurrences: number;
  suggested_rule: Partial<ErrorRule>;
  confidence: number;
}

User Feedback Loop

typescript
// Let users rate error messages
interface ErrorFeedback {
  error_id: string;
  helpful: boolean;
  comment?: string;
}

Dynamic Rule Updates

typescript
// Hot-reload rules without restart
ruleEngine.watchRuleFiles();
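
A minimal sketch of what watchRuleFiles() could look like, using Node's fs.watch; debouncing and error handling are omitted, and the method itself is an assumption layered onto the engine shown earlier:

typescript
// Sketch of a watchRuleFiles() method inside ErrorRuleEngine (requires: import { watch } from 'fs')
watchRuleFiles(dir = '/config/error-rules'): void {
  watch(dir, (_event, filename) => {
    if (filename?.endsWith('.yaml')) {
      void this.loadRules(); // reload all rules from disk; last write wins
    }
  });
}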

Decision

Recommended: Implement V2 Error System with Declarative Rules

Rationale:

  1. Separates error intelligence from code
  2. Enables rapid iteration on error handling
  3. Provides better user experience
  4. Captures comprehensive data for learning
  5. Supports automatic error recovery
  6. Scales to new services easily

Next Steps:

  1. Review and approve this ADR
  2. Create /config/error-rules/ directory structure
  3. Implement ErrorRuleEngine class
  4. Create initial OpenAI rule set
  5. Integrate with one connector (OpenAI) as proof of concept
  6. Measure impact and iterate

Appendix: Current Implementation Analysis (2025-01-07)

Overview of Existing Error Handling ADRs

Three related ADRs exist:

  1. ERROR_HANDLING_MODERNIZATION.md - Python → TypeScript error flow improvements

    • Fix Python object serialization ([object Object] → proper JSON)
    • Add structured error codes in Python (ErrorCode enum)
    • Create ComfyUIErrorParser in TypeScript
    • Status: Proposed, partially implemented
  2. connector-error-handling-standard.md - Standardized connector error handling

    • ConnectorError class with structured classification
    • BaseConnector enforcement wrapper
    • Protocol layers (HTTPConnector, WebSocketConnector)
    • Status: Accepted, partially implemented
  3. error-management-v2-declarative-rules.md (this document) - Declarative error recovery

    • YAML-based error rules
    • Automatic retry with modifications
    • Pattern learning and A/B testing
    • Status: Proposed, NOT implemented

Current Connector Error Handling State

✅ What's Working Well

BaseConnector (base-connector.ts:487-595):

  • processJob() wrapper catches ALL errors automatically
  • Converts errors to ConnectorError via ConnectorError.fromError()
  • Validates connectors don't return {success: false} (throws if they do)
  • Logs structured error data with telemetry
  • Reports errors to Redis for monitoring

AsyncHTTPConnector (async-http-connector.ts:242-245):

  • Protocol layer for HTTP-based connectors
  • handleServiceError() hook for service-specific error handling
  • Generic HTTP status mapping:
    • 401/403 → AUTH_ERROR / INVALID_API_KEY
    • 429 → RATE_LIMIT / REQUESTS_PER_MINUTE
    • 4xx → VALIDATION_ERROR
    • 5xx → SERVICE_ERROR
  • MIME type validation
  • Fetch-based with timeout handling

OpenAIResponsesConnector (openai-responses-connector.ts:156-381):

  • Semantic validation for content refusals
  • Rich error context (rawServiceOutput, rawServiceRequest)
  • MIME mismatch detection
  • Refusal pattern detection:
    typescript
    const refusalPatterns = [
      /I[''']m sorry,?\s+but I can[''']t assist with that/i,
      /against (OpenAI[''']s? )?content policy/i,
      // ...
    ];

❌ Issues & Inconsistencies

1. Inconsistent Error Throwing

  • Some connectors: throw new ConnectorError(...)
  • Some connectors: throw new Error(...) ⚠️ (gets converted by BaseConnector)
  • Some connectors: return error in result object ⚠️ (gets converted by AsyncHTTPConnector)

Example from OpenAIBaseConnector (lines 768, 804, 823):

typescript
// ❌ Throws generic Error instead of ConnectorError
throw new Error(`OpenAI job ${openaiJobId} ${currentStatus}: ${errorMessage}`);

2. Service Error Hook Underutilized

  • AsyncHTTPConnector provides handleServiceError() hook
  • OpenAI connectors don't override it (missing OpenAI-specific error patterns)
  • ComfyUI connectors likely similar
  • Opportunity for service-specific classification before generic fallback

3. Missing Rich Context

  • Not all errors include rawRequest / rawResponse
  • Some errors lack serviceJobId, retryAfterSeconds, etc.
  • Context helps with debugging and forensic analysis

4. No Declarative Rules (V2)

  • This ADR proposed but not implemented
  • No automatic retry with modifications
  • No pattern learning
  • No YAML rule files
  • No user-friendly message templates

5. Python Error Codes Not Implemented

  • ERROR_HANDLING_MODERNIZATION.md ADR not implemented
  • Still getting [object Object] serialization issues from ComfyUI
  • No structured error codes from Python layer

Connector Error Handling Patterns Summary

| Connector | Extends | Error Pattern | Service Hook | Context | Rating |
|---|---|---|---|---|---|
| BaseConnector | – | Catches all, converts to ConnectorError | – | ✅ Full | ⭐⭐⭐⭐⭐ |
| AsyncHTTPConnector | BaseConnector | Generic HTTP mapping + hook | ✅ Provides | ✅ Good | ⭐⭐⭐⭐ |
| OpenAIBaseConnector | BaseConnector | Throws generic Error | ❌ None | ⚠️ Partial | ⭐⭐⭐ |
| OpenAIResponsesConnector | AsyncHTTPConnector | Returns error in result | ❌ None | ✅ Good | ⭐⭐⭐⭐ |
| ComfyUIRestStreamConnector | AsyncHTTPConnector | Unknown | Unknown | Unknown | – |

Recommendations for V2 Implementation

1. Implement Service Error Hooks First

Before implementing declarative rules, standardize service-specific error handling:

typescript
// In OpenAIBaseConnector
protected handleServiceError(error: any, jobData: JobData): ConnectorError | null {
  // Handle OpenAI-specific error codes
  if (error.response?.data?.error?.code) {
    const errorCode = error.response.data.error.code;
    const errorMessage = error.response.data.error.message;

    switch (errorCode) {
      case 'rate_limit_exceeded':
        return new ConnectorError(
          FailureType.RATE_LIMIT,
          FailureReason.REQUESTS_PER_MINUTE,
          errorMessage,
          true,
          {
            serviceType: 'openai',
            httpStatus: error.response.status,
            rawResponse: error.response.data,
            // retry-after arrives as a string header; convert it to a number of seconds
            retryAfterSeconds: error.response.headers?.['retry-after']
              ? parseInt(error.response.headers['retry-after'], 10)
              : undefined
          }
        );

      case 'invalid_api_key':
        return new ConnectorError(
          FailureType.AUTH_ERROR,
          FailureReason.INVALID_API_KEY,
          errorMessage,
          false,
          { serviceType: 'openai', httpStatus: 401 }
        );

      case 'content_policy_violation':
        return new ConnectorError(
          FailureType.GENERATION_REFUSAL,
          FailureReason.SAFETY_FILTER,
          errorMessage,
          false,
          {
            serviceType: 'openai',
            rawRequest: jobData.payload,
            rawResponse: error.response.data
          }
        );
    }
  }

  return null; // Fall through to generic handling
}
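
For context, the intended call order in the protocol layer is: try the service-specific hook first, then fall back to generic HTTP status mapping. A sketch of that pattern; handleServiceError exists per the notes above, but the surrounding method and the mapHttpStatusToConnectorError name are assumptions:

typescript
// Hypothetical classification flow inside AsyncHTTPConnector.
protected classifyError(error: any, jobData: JobData): ConnectorError {
  // 1. Give the concrete connector a chance to classify service-specific errors
  const serviceError = this.handleServiceError(error, jobData);
  if (serviceError) return serviceError;

  // 2. Fall back to the generic HTTP status mapping (401/403, 429, 4xx, 5xx)
  return this.mapHttpStatusToConnectorError(error);
}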

2. Standardize on ConnectorError

  • Update OpenAIBaseConnector to throw ConnectorError instead of Error
  • Add rich context to all errors (rawRequest, rawResponse, serviceJobId)
  • Ensure all connectors follow the same pattern

3. Semantic Validation Framework

Generalize the refusal pattern detection from OpenAIResponsesConnector:

typescript
// packages/core/src/services/semantic-validator.ts
export class SemanticValidator {
  private static refusalPatterns = [
    /I[''']m sorry,?\s+but I can[''']t assist with that/i,
    /cannot generate/i,
    /unable to create/i,
    /policy violation/i,
    /content policy/i,
    // ... more patterns
  ];

  static detectRefusal(text: string): {
    isRefusal: boolean;
    confidence: number;
    matchedPattern?: string;
  } {
    // Return the first refusal pattern that matches the response text
    for (const pattern of this.refusalPatterns) {
      if (pattern.test(text)) {
        return { isRefusal: true, confidence: 0.9, matchedPattern: pattern.source }; // confidence value is illustrative
      }
    }
    return { isRefusal: false, confidence: 0 };
  }

  static detectMimeTypeMismatch(
    requestedType: string,
    actualContent: any
  ): boolean {
    // Simple illustrative heuristic: an image/* was requested but the service returned plain text
    if (requestedType.startsWith('image/')) {
      return typeof actualContent === 'string' && !actualContent.startsWith('data:image');
    }
    return false;
  }
}
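
Connectors could then consult the validator before accepting a response. A sketch of a possible call site; the variable names and context fields are illustrative:

typescript
// Hypothetical call site inside a connector's response handling.
const refusal = SemanticValidator.detectRefusal(responseText);
if (refusal.isRefusal) {
  throw new ConnectorError(
    FailureType.GENERATION_REFUSAL,
    FailureReason.SAFETY_FILTER,
    `Model refused the request (matched: ${refusal.matchedPattern})`,
    false,
    { rawResponse: responseText }
  );
}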

4. Phased V2 Implementation

Given the current state, we recommend the following approach:

Phase 1: Foundation (Week 1-2) - Focus on standardization first

  • [ ] Implement service error hooks in all connectors
  • [ ] Standardize ConnectorError usage
  • [ ] Add semantic validation framework
  • [ ] Ensure rich context in all errors

Phase 2: Rule Engine Core (Week 3-4) - Build the infrastructure

  • [ ] Implement ErrorRuleEngine class
  • [ ] YAML rule loader
  • [ ] Rule matching logic
  • [ ] Template substitution for user messages

Phase 3: Recovery Strategies (Week 5-6) - Add intelligence

  • [ ] Request modification system
  • [ ] Retry logic with delays
  • [ ] Recovery attempt tracking
  • [ ] Integration with connectors

Phase 4: Rule Library (Week 7-8) - Build the knowledge base

  • [ ] OpenAI rule set (based on observed patterns)
  • [ ] ComfyUI rule set
  • [ ] Documentation for writing rules
  • [ ] Rule testing framework

Phase 5: Analytics & Learning (Week 9-10) - Close the loop

  • [ ] Rule match tracking
  • [ ] Success rate analytics
  • [ ] A/B testing framework
  • [ ] Auto-suggest new rules based on patterns

5. Immediate Quick Wins

Before implementing V2, these changes provide immediate value:

  1. Add service error hooks to OpenAI connectors (1-2 days)
  2. Create semantic validator for LLM responses (1 day)
  3. Standardize error throwing in OpenAIBaseConnector (1 day)
  4. Add rich context to all errors (1 day)

Total: 1 week of work for significant improvement


Key Insights for V2 Design

1. Service Hook Pattern Works Well

  • AsyncHTTPConnector's handleServiceError() hook is a clean pattern
  • Allows service-specific logic without bloating base classes
  • Should be the integration point for declarative rules

2. Semantic Validation is Critical

  • OpenAIResponsesConnector's refusal detection prevents silent failures
  • Image-vs-text mismatch detection catches model misbehavior
  • This intelligence should live in the rule engine, not in individual connectors

3. Rich Context Enables Learning

  • Errors with rawRequest and rawResponse enable pattern analysis
  • Missing context makes it impossible to write good rules
  • Context must be standard before V2 can succeed

4. Connector Compliance is Key

  • V2 only works if connectors use ConnectorError consistently
  • BaseConnector validation catches {success: false} pattern
  • Need similar validation for generic Error throws

5. Start Simple, Evolve

  • Don't implement all V2 features at once
  • Start with static rules (no recovery)
  • Add recovery strategies incrementally
  • Add A/B testing and learning last

Migration Path to V2

Step 1: Current State Audit ✅ (Complete - this document)

  • [x] Document existing error handling patterns
  • [x] Identify gaps and inconsistencies
  • [x] Analyze connector implementations

Step 2: Standardization (Before V2)

  • [ ] Implement service error hooks in all connectors
  • [ ] Standardize ConnectorError usage
  • [ ] Add semantic validation framework
  • [ ] Ensure rich context in all errors

Step 3: V2 Foundation (Weeks 1-4)

  • [ ] Build ErrorRuleEngine core
  • [ ] Create YAML rule format
  • [ ] Implement rule matching
  • [ ] Add user message templates

Step 4: V2 Intelligence (Weeks 5-8)

  • [ ] Request modification system
  • [ ] Recovery strategies
  • [ ] Rule library for OpenAI and ComfyUI
  • [ ] Integration testing

Step 5: V2 Learning (Weeks 9-10)

  • [ ] Analytics and tracking
  • [ ] A/B testing
  • [ ] Auto-suggest rules
  • [ ] User feedback loop

Step 6: Python Integration (Optional - if ComfyUI priority)

  • [ ] Implement Python ErrorCode enum
  • [ ] Fix object serialization
  • [ ] Create ComfyUIErrorParser
  • [ ] Integrate with existing connectors

Success Metrics Tracking

To measure V2 effectiveness, track these metrics:

Before V2 (Baseline - capture now):

  • [ ] Error classification accuracy (manual audit of 100 errors)
  • [ ] % of errors with helpful user messages
  • [ ] Average retry attempts per error type
  • [ ] Support tickets related to error messages
  • [ ] Time to resolve production errors (MTTR)

After V2 Implementation:

  • [ ] Error classification accuracy improvement
  • [ ] User message helpfulness rating (survey)
  • [ ] Reduction in wasted retries (non-retryable errors)
  • [ ] Support ticket reduction
  • [ ] MTTR improvement
  • [ ] Auto-recovery success rate

Risk Assessment

Low Risk:

  • Adding service error hooks (backwards compatible)
  • Creating semantic validator (pure utility)
  • Standardizing ConnectorError usage (caught by BaseConnector)

Medium Risk:

  • Rule engine implementation (new system, needs testing)
  • Request modification (could make problems worse if buggy)
  • YAML parsing and validation (security considerations)

High Risk:

  • Automatic retries with modifications (could cause cascading failures)
  • A/B testing in production (could confuse users)
  • Python error code integration (requires ComfyUI changes)

Mitigation:

  • Feature flags for rule engine (enable per service)
  • Dry-run mode (log rule matches without taking action; see the sketch after this list)
  • Gradual rollout (one service at a time)
  • Comprehensive testing (unit + integration + E2E)
  • Monitoring and alerting (track rule effectiveness)
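
A dry-run mode could wrap the rule engine so matches are logged but recovery is never executed. A minimal sketch; the evaluateDryRun helper and log fields are assumptions:

typescript
// Hypothetical dry-run wrapper: evaluate rules and log what *would* happen, without acting.
async function evaluateDryRun(
  engine: ErrorRuleEngine,
  error: ConnectorError,
  originalRequest: any
): Promise<void> {
  const rule = engine.findMatchingRule(error);
  if (!rule) {
    logger.info('🧪 Dry-run: no rule matched', { failure_reason: error.failureReason });
    return;
  }
  const recovery = await engine.applyRecoveryStrategy(originalRequest, rule, error);
  logger.info('🧪 Dry-run: rule matched, recovery not executed', {
    rule_id: rule.id,
    would_retry: recovery.shouldRetry,
    user_message: engine.getUserMessage(rule, error),
  });
}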

Open Questions for V2

  1. Rule Priority: How do we handle overlapping rules?

    • Answer: Priority field (higher number = higher priority)
    • First matching rule wins
  2. Rule Updates: Hot-reload or restart required?

    • Answer: Start with restart, add hot-reload later
    • Use file watcher for development
  3. Rule Testing: How do we test rules without production traffic?

    • Answer: Dry-run mode + replay production errors
    • Create rule test framework
  4. Rule Versioning: How do we track rule changes over time?

    • Answer: Git version control for YAML files
    • Add version field to rule format
  5. Rule Conflicts: What if multiple services use same error pattern?

    • Answer: Service-specific rule files
    • Rules are scoped to service_type
  6. Recovery Limits: How many times should we retry with modifications?

    • Answer: Configurable per rule (max_attempts)
    • Default: 1 retry to avoid cascading
  7. User Override: Should users be able to customize rules?

    • Answer: Yes, via override files
    • Workspace-specific customization
  8. Telemetry: What data should we collect about rule effectiveness?

    • Answer: Match count, success rate, average retries
    • Store in Redis for analytics

Last Updated: 2025-01-07
Analyst: Claude Code
Status: Analysis Complete - Ready for V2 Planning
