
ADR: Error Management System V2 - Declarative Response Rules

Status: Proposed
Date: 2025-11-07
Authors: Claude Code
Context: Current error system provides structured errors but lacks intelligent error recovery and user-friendly messaging


Problem Statement

Current error system (V1) has fundamental limitations:

  1. Static User Messages: Generic messages like "Service returned unexpected content type" don't help users fix their prompts
  2. No Error Recovery: System can't automatically retry with modified prompts or adjusted parameters
  3. No Pattern Learning: Same errors repeat without system learning optimal responses
  4. Developer Burden: Every new error pattern requires code changes across multiple files
  5. Lost Intelligence: Raw error data is captured but not analyzed for actionable insights

Example Current Flow:

User: "turn my pfp into the locked in gamer meme"

OpenAI returns text instead of image

Error: "MIME type mismatch: requested image/png but received text"

User sees: "Service returned unexpected content type. Please try again."

User tries again... same error... gives up

What We Need:

User: "turn my pfp into the locked in gamer meme"

OpenAI returns text instead of image

System recognizes: Pattern #47 - "Image request returned JSON"

System applies rule: "Add explicit image generation instruction"

Auto-retry with: "Generate an image of... (DO NOT return JSON instructions)"

Success OR User sees: "The AI tried to explain how to create the image instead of
generating it. Try being more explicit: 'Create an actual image of...'"

V2 Architecture: Three-Layer Error Intelligence

Layer 1: Comprehensive Internal Capture (Existing - Enhanced)

Purpose: Capture EVERYTHING for forensic analysis

typescript
interface ErrorAttestation {
  // Core Classification
  failure_type: FailureType;
  failure_reason: FailureReason;
  retryable: boolean;

  // Full Context (unlimited detail for internal use)
  raw_request: object;           // Complete request payload
  raw_response: object;          // Complete response (with smart truncation for base64)
  requested_mime_type: string;
  actual_mime_type: string;
  http_status?: number;

  // Service Context
  service_type: string;          // 'openai_responses', 'comfyui', etc.
  model_used?: string;           // 'gpt-4.1', 'dall-e-3', etc.
  component_name?: string;       // Which part of pipeline failed

  // Job Context
  job_id: string;
  workflow_id?: string;
  retry_count: number;

  // Pattern Matching Keys (for rule engine)
  error_signature: string;       // Hash of error pattern for matching
  similar_errors_count: number;  // How many times we've seen this before

  // Recovery Attempts
  recovery_attempts: RecoveryAttempt[];
}

interface RecoveryAttempt {
  rule_id: string;
  rule_name: string;
  modifications: object;         // What we changed
  result: 'success' | 'failed' | 'skipped' | 'attempted';  // 'attempted' = retry issued, final outcome recorded separately
  new_error?: ErrorAttestation;  // If retry failed, what error?
}

Key Enhancement: Add error_signature generation

typescript
import { createHash } from 'crypto';

function generateErrorSignature(error: ConnectorError): string {
  // Build a stable pattern object, then hash it so identical error shapes share one signature
  const pattern = {
    failure_type: error.failureType,
    failure_reason: error.failureReason,
    service_type: error.context.serviceType,
    model: error.context.modelUsed,
    requested_mime: error.context.requestedMimeType,
    actual_mime: error.context.actualMimeType,
    http_status: error.context.httpStatus
  };
  return createHash('sha256').update(JSON.stringify(pattern)).digest('hex');
}
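
The similar_errors_count field above can be derived from this signature. A minimal sketch, assuming a Redis client is available to the worker; the recordErrorSignature helper, key prefix, and 30-day TTL are illustrative, not existing code:

typescript
// Hypothetical helper: count how often an error signature has been seen.
async function recordErrorSignature(
  redis: { incr(key: string): Promise<number>; expire(key: string, seconds: number): Promise<number> },
  signature: string
): Promise<number> {
  const key = `error-signature:${signature}`;
  const count = await redis.incr(key);         // one counter per error pattern
  await redis.expire(key, 60 * 60 * 24 * 30);  // keep counts for a rolling 30-day window
  return count;                                // feeds similar_errors_count in ErrorAttestation
}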

Layer 2: Declarative Error Response Rules (NEW)

Purpose: Define error recovery strategies and user messaging as data, not code

Rule File: /config/error-rules/openai-responses.yaml

yaml
# Error Response Rules for OpenAI Responses Service
version: 2
service: openai_responses

rules:
  # Rule 1: Image request returned JSON/text explanation
  - id: openai-image-to-text-001
    name: "Image Request Returned Text Explanation"
    priority: 100

    # Pattern Matching
    match:
      failure_type: response_error
      failure_reason: unexpected_content_type
      conditions:
        - field: context.requestedMimeType
          operator: equals
          value: "image/png"
        - field: context.actualMimeType
          operator: in
          values: ["text", "application/json"]
        - field: context.modelUsed
          operator: startsWith
          value: "gpt-4"

    # User-Facing Message (generic, helpful)
    user_message: |
      The AI returned instructions on how to create the image instead of generating it.
      Try being more explicit: "Generate an actual image of [your request]"

    # Internal Note (for our debugging)
    internal_note: |
      GPT-4.1 with image_generation tool sometimes returns JSON prompt instead of
      calling the tool. This happens when the user request is ambiguous about whether
      they want an image or image instructions.

    # Automatic Recovery Strategy
    recovery:
      enabled: true
      max_attempts: 1

      # Modify the request before retry
      modifications:
        - type: prepend_to_prompt
          value: "IMPORTANT: Generate an actual image file, do not return text instructions or JSON. "

        - type: add_system_message
          value: "You must use the image_generation tool to create an actual image. Never return text descriptions or JSON prompts."

        - type: set_parameter
          path: "tools[0].strict"
          value: true

      # Only retry if conditions met
      retry_conditions:
        - field: retry_count
          operator: less_than
          value: 2
        - field: workflow_id
          operator: exists
          value: true  # Only auto-retry in workflows

  # Rule 2: Rate limit with retry-after
  - id: openai-rate-limit-001
    name: "OpenAI Rate Limit with Retry-After"
    priority: 90

    match:
      failure_type: rate_limit
      failure_reason: requests_per_minute
      conditions:
        - field: context.retryAfterSeconds
          operator: exists
          value: true

    user_message: |
      Rate limit reached. Your request will automatically retry in {{context.retryAfterSeconds}} seconds.

    internal_note: "Standard OpenAI rate limit - should auto-retry"

    recovery:
      enabled: true
      max_attempts: 3

      # Wait before retry
      delay:
        type: from_header
        header: retry-after
        fallback_seconds: 60

      modifications: []  # No changes needed, just retry as-is

      retry_conditions:
        - field: retry_count
          operator: less_than
          value: 3

  # Rule 3: Model not found - suggest alternatives
  - id: openai-model-404-001
    name: "Model Not Found"
    priority: 80

    match:
      failure_type: validation_error
      failure_reason: model_not_found

    user_message: |
      The requested AI model is not available. Common alternatives:
      - For text: gpt-4-turbo, gpt-3.5-turbo
      - For images: dall-e-3
      Please contact support if you need a specific model.

    internal_note: "Model name typo or deprecated model"

    recovery:
      enabled: false  # Don't auto-retry - needs user decision

  # Rule 4: MIME mismatch - generic fallback
  - id: generic-mime-mismatch-001
    name: "Generic MIME Type Mismatch"
    priority: 10  # Low priority - catches anything not matched above

    match:
      failure_type: response_error
      failure_reason: unexpected_content_type

    user_message: |
      The AI returned {{context.actualMimeType}} instead of the expected {{context.requestedMimeType}}.
      This usually means the AI didn't understand the format you wanted.
      Try rephrasing your request more explicitly.

    internal_note: "Unhandled MIME mismatch - investigate pattern"

    recovery:
      enabled: false

  # Rule 5: Content Policy Violation - Clear guidance
  - id: openai-content-policy-001
    name: "Content Policy Violation"
    priority: 95

    match:
      failure_type: generation_refusal
      failure_reason: safety_filter
      conditions:
        - field: context.rawResponse.error.code
          operator: equals
          value: "content_policy_violation"

    user_message: |
      Your request was blocked by content safety filters. This can happen if:
      - The prompt contains potentially harmful content
      - The request could generate inappropriate material
      - Certain words triggered automated filters

      Try rephrasing your request in a different way.

    internal_note: "OpenAI content filter triggered"

    recovery:
      enabled: false  # Don't retry - will fail again

Layer 3: Rule Engine Execution (NEW)

Purpose: Match errors to rules and execute recovery strategies

typescript
// /packages/core/src/services/error-rule-engine.ts

// Assumed dependencies: the 'glob' and 'yaml' packages plus Node's fs/promises
import { promises as fs } from 'fs';
import { glob } from 'glob';
import * as YAML from 'yaml';

interface ErrorRule {
  id: string;
  name: string;
  priority: number;
  match: RuleMatch;
  user_message: string;
  internal_note: string;
  recovery: RecoveryStrategy;
}

interface RuleMatch {
  failure_type: FailureType;
  failure_reason: FailureReason;
  conditions?: MatchCondition[];
}

interface MatchCondition {
  field: string;           // JSONPath to field in error context
  operator: 'equals' | 'in' | 'exists' | 'startsWith' | 'matches' | 'less_than' | 'greater_than';
  value: any;
  values?: any[];
}

interface RecoveryStrategy {
  enabled: boolean;
  max_attempts: number;
  delay?: {
    type: 'fixed' | 'exponential' | 'from_header';
    seconds?: number;
    header?: string;
    fallback_seconds?: number;
  };
  modifications: RequestModification[];
  retry_conditions?: MatchCondition[];
}

interface RequestModification {
  type: 'prepend_to_prompt' | 'append_to_prompt' | 'add_system_message' | 'set_parameter' | 'remove_parameter';
  value?: any;
  path?: string;  // JSONPath for set/remove operations
}

class ErrorRuleEngine {
  private rules: Map<string, ErrorRule[]> = new Map(); // service_type -> rules
  private ready: Promise<void>;

  constructor() {
    // loadRules() is async; callers should await `ready` before matching errors
    this.ready = this.loadRules();
  }

  /**
   * Load all error rules from YAML configs
   */
  private async loadRules(): Promise<void> {
    const ruleFiles = await glob('/config/error-rules/*.yaml');

    for (const file of ruleFiles) {
      // YAML.parse expects file contents, not a path - read the file first
      const content = await fs.readFile(file, 'utf8');
      const config = YAML.parse(content);
      this.rules.set(config.service, config.rules);
    }
  }

  /**
   * Find matching rule for an error
   */
  findMatchingRule(error: ConnectorError): ErrorRule | null {
    const serviceRules = this.rules.get(error.context.serviceType) || [];

    // Sort by priority (highest first)
    const sortedRules = serviceRules.sort((a, b) => b.priority - a.priority);

    for (const rule of sortedRules) {
      if (this.matchesRule(error, rule)) {
        logger.info(`📋 Matched error to rule: ${rule.id} - ${rule.name}`);
        return rule;
      }
    }

    logger.warn(`⚠️ No matching rule found for error: ${error.failureReason}`);
    return null;
  }

  /**
   * Check if error matches rule conditions
   */
  private matchesRule(error: ConnectorError, rule: ErrorRule): boolean {
    // Check basic match
    if (error.failureType !== rule.match.failure_type) return false;
    if (error.failureReason !== rule.match.failure_reason) return false;

    // Check additional conditions
    if (!rule.match.conditions) return true;

    for (const condition of rule.match.conditions) {
      if (!this.evaluateCondition(error, condition)) {
        return false;
      }
    }

    return true;
  }

  /**
   * Evaluate a single condition
   */
  private evaluateCondition(error: ConnectorError, condition: MatchCondition): boolean {
    const actualValue = this.getFieldValue(error, condition.field);

    switch (condition.operator) {
      case 'equals':
        return actualValue === condition.value;
      case 'in':
        return condition.values?.includes(actualValue) ?? false;
      case 'exists':
        return (actualValue !== undefined && actualValue !== null) === condition.value;
      case 'startsWith':
        return typeof actualValue === 'string' && actualValue.startsWith(condition.value);
      case 'matches':
        return typeof actualValue === 'string' && new RegExp(condition.value).test(actualValue);
      case 'less_than':
        return typeof actualValue === 'number' && actualValue < condition.value;
      case 'greater_than':
        return typeof actualValue === 'number' && actualValue > condition.value;
      default:
        return false;
    }
  }

  /**
   * Get field value from error using JSONPath
   */
  private getFieldValue(error: ConnectorError, path: string): any {
    // Simple JSONPath implementation
    const parts = path.split('.');
    let value: any = error;

    for (const part of parts) {
      if (value && typeof value === 'object') {
        value = value[part];
      } else {
        return undefined;
      }
    }

    return value;
  }

  /**
   * Apply recovery strategy and return modified request
   */
  async applyRecoveryStrategy(
    originalRequest: any,
    rule: ErrorRule,
    error: ConnectorError
  ): Promise<{ shouldRetry: boolean; modifiedRequest?: any; reason?: string }> {

    if (!rule.recovery.enabled) {
      return { shouldRetry: false, reason: 'Recovery disabled for this rule' };
    }

    // Check retry conditions
    if (rule.recovery.retry_conditions) {
      for (const condition of rule.recovery.retry_conditions) {
        if (!this.evaluateCondition(error, condition)) {
          return { shouldRetry: false, reason: `Retry condition failed: ${condition.field}` };
        }
      }
    }

    // Apply modifications
    let modifiedRequest = JSON.parse(JSON.stringify(originalRequest));

    for (const modification of rule.recovery.modifications) {
      modifiedRequest = this.applyModification(modifiedRequest, modification);
    }

    logger.info(`✨ Applied recovery modifications from rule ${rule.id}`, {
      modifications: rule.recovery.modifications.map(m => m.type)
    });

    return { shouldRetry: true, modifiedRequest };
  }

  /**
   * Apply a single modification to request
   */
  private applyModification(request: any, modification: RequestModification): any {
    switch (modification.type) {
      case 'prepend_to_prompt':
        // Find the prompt in request structure and prepend
        if (request.input && Array.isArray(request.input)) {
          const userMessage = request.input.find((msg: any) => msg.role === 'user');
          if (userMessage && userMessage.content) {
            if (typeof userMessage.content === 'string') {
              userMessage.content = modification.value + userMessage.content;
            } else if (Array.isArray(userMessage.content)) {
              const textContent = userMessage.content.find((c: any) => c.type === 'input_text');
              if (textContent) {
                textContent.text = modification.value + textContent.text;
              }
            }
          }
        }
        break;

      case 'append_to_prompt':
        // Similar to prepend but at end
        if (request.input && Array.isArray(request.input)) {
          const userMessage = request.input.find((msg: any) => msg.role === 'user');
          if (userMessage && userMessage.content) {
            if (typeof userMessage.content === 'string') {
              userMessage.content = userMessage.content + modification.value;
            } else if (Array.isArray(userMessage.content)) {
              const textContent = userMessage.content.find((c: any) => c.type === 'input_text');
              if (textContent) {
                textContent.text = textContent.text + modification.value;
              }
            }
          }
        }
        break;

      case 'add_system_message':
        if (request.input && Array.isArray(request.input)) {
          request.input.unshift({
            role: 'system',
            content: modification.value
          });
        }
        break;

      case 'set_parameter':
        if (modification.path) {
          this.setValueAtPath(request, modification.path, modification.value);
        }
        break;

      case 'remove_parameter':
        if (modification.path) {
          this.deleteValueAtPath(request, modification.path);
        }
        break;
    }

    return request;
  }

  /**
   * Set value at JSONPath
   */
  private setValueAtPath(obj: any, path: string, value: any): void {
    const parts = path.split('.');
    let current = obj;

    for (let i = 0; i < parts.length - 1; i++) {
      const part = parts[i];
      const arrayMatch = part.match(/(.+)\[(\d+)\]/);

      if (arrayMatch) {
        const [, key, index] = arrayMatch;
        if (!current[key]) current[key] = [];
        if (!current[key][parseInt(index)]) current[key][parseInt(index)] = {};
        current = current[key][parseInt(index)];
      } else {
        if (!current[part]) current[part] = {};
        current = current[part];
      }
    }

    const lastPart = parts[parts.length - 1];
    current[lastPart] = value;
  }

  /**
   * Delete value at JSONPath
   */
  private deleteValueAtPath(obj: any, path: string): void {
    const parts = path.split('.');
    let current = obj;

    for (let i = 0; i < parts.length - 1; i++) {
      if (!current[parts[i]]) return;
      current = current[parts[i]];
    }

    delete current[parts[parts.length - 1]];
  }

  /**
   * Get user-friendly error message with template substitution
   */
  getUserMessage(rule: ErrorRule, error: ConnectorError): string {
    let message = rule.user_message;

    // Replace {{context.field}} with actual values
    message = message.replace(/\{\{([^}]+)\}\}/g, (match, path) => {
      const value = this.getFieldValue(error, path);
      return value !== undefined ? String(value) : match;
    });

    return message;
  }
}

Integration with Existing Error Flow

Modified Worker Flow:

typescript
// In async-rest-connector.ts or openai-base-connector.ts

async processJob(jobData: Job): Promise<JobResult> {
  const ruleEngine = new ErrorRuleEngine();
  let currentRequest = this.buildRequest(jobData);
  let attemptCount = 0;
  const maxAttempts = 3;
  const recoveryHistory: RecoveryAttempt[] = [];

  while (attemptCount < maxAttempts) {
    try {
      const result = await this.executeRequest(currentRequest);
      return result;

    } catch (error) {
      attemptCount++;

      if (!(error instanceof ConnectorError)) {
        throw error; // Re-throw non-connector errors
      }

      // Find matching rule
      const rule = ruleEngine.findMatchingRule(error);

      if (!rule) {
        // No rule found - use default behavior
        logger.warn(`No error rule matched - using default error handling`);
        throw error;
      }

      // Log internal note for debugging
      logger.info(`📋 Error Rule Matched: ${rule.name}`, {
        rule_id: rule.id,
        internal_note: rule.internal_note
      });

      // Try recovery
      const recovery = await ruleEngine.applyRecoveryStrategy(
        currentRequest,
        rule,
        error
      );

      recoveryHistory.push({
        rule_id: rule.id,
        rule_name: rule.name,
        modifications: rule.recovery.modifications,
        result: recovery.shouldRetry ? 'attempted' : 'skipped'
      });

      if (!recovery.shouldRetry) {
        // Can't recover - enhance error with user message from rule
        error.context.user_message = ruleEngine.getUserMessage(rule, error);
        error.context.recovery_attempts = recoveryHistory;
        throw error;
      }

      // Apply delay if specified
      if (rule.recovery.delay) {
        const delayMs = this.calculateDelay(rule.recovery.delay, error);
        logger.info(`⏳ Waiting ${delayMs}ms before retry (${rule.name})`);
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }

      // Update request for retry
      currentRequest = recovery.modifiedRequest;

      logger.info(`🔄 Retrying with modifications from rule ${rule.id} (attempt ${attemptCount}/${maxAttempts})`);

      // Continue loop to retry
    }
  }

  // If we get here, all retries failed
  throw new ConnectorError(
    FailureType.SYSTEM_ERROR,
    FailureReason.UNKNOWN_ERROR,
    `Failed after ${maxAttempts} attempts with rule-based recovery`,
    false,
    { recovery_attempts: recoveryHistory }
  );
}
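
The loop above calls a calculateDelay helper that this document does not define. A minimal sketch, assuming the delay shape from RecoveryStrategy and that retryAfterSeconds has already been parsed onto the error context (in the connector this would be a method rather than a standalone function):

typescript
// Hypothetical sketch of the calculateDelay helper referenced above.
function calculateDelay(
  delay: RecoveryStrategy['delay'],
  error: ConnectorError,
  attempt = 1
): number {
  if (!delay) return 0;

  switch (delay.type) {
    case 'fixed':
      return (delay.seconds ?? 0) * 1000;
    case 'exponential':
      // seconds * 2^(attempt-1), e.g. 5s, 10s, 20s...
      return (delay.seconds ?? 1) * 1000 * Math.pow(2, attempt - 1);
    case 'from_header': {
      // Assumes the connector copied the retry-after header into error.context.retryAfterSeconds
      const fromHeader = error.context?.retryAfterSeconds;
      return (fromHeader ?? delay.fallback_seconds ?? 60) * 1000;
    }
    default:
      return 0;
  }
}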

Benefits of V2 System

1. Separation of Concerns

  • Engineers: Build error capture and rule engine (once)
  • Operations: Write YAML rules (no code changes needed)
  • Users: Get helpful, context-aware error messages

2. Continuous Improvement

yaml
# Add new rule without touching code
- id: openai-dall-e-nsfw-002
  name: "DALL-E NSFW Filter"
  match:
    failure_type: generation_refusal
    failure_reason: nsfw_content
    conditions:
      - field: context.modelUsed
        operator: equals
        value: "dall-e-3"

  user_message: |
    DALL-E detected potentially inappropriate content in your request.
    Try rephrasing without descriptive terms for people's appearance.

  recovery:
    enabled: false

3. Data-Driven Learning

Track which rules are matching most often:

typescript
// Analytics
interface RuleAnalytics {
  rule_id: string;
  match_count: number;
  recovery_success_rate: number;
  avg_retries_to_success: number;
}
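
One possible way to keep these counters current is to bump them whenever a rule matches. A sketch only; the Redis hash layout and the recordRuleMatch helper are assumptions, not existing code:

typescript
// Hypothetical analytics recorder: increments per-rule counters after each match.
async function recordRuleMatch(
  redis: { hincrby(key: string, field: string, by: number): Promise<number> },
  ruleId: string,
  recovered: boolean
): Promise<void> {
  const key = `error-rule-analytics:${ruleId}`;
  await redis.hincrby(key, 'match_count', 1);
  if (recovered) {
    await redis.hincrby(key, 'recovery_success_count', 1);
  }
  // recovery_success_rate = recovery_success_count / match_count, computed at read time
}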

4. A/B Testing Recovery Strategies

yaml
- id: openai-image-to-text-001-variant-a
  enabled: true
  traffic_percentage: 50  # Send 50% of matches here
  modifications:
    - type: prepend_to_prompt
      value: "Generate an actual image. "

- id: openai-image-to-text-001-variant-b
  enabled: true
  traffic_percentage: 50
  modifications:
    - type: add_system_message
      value: "Always use image_generation tool."

5. User Customization

Allow customers to override rules per workspace:

yaml
# /config/error-rules/overrides/customer-abc123.yaml
overrides:
  - rule_id: openai-rate-limit-001
    user_message: "Your workspace has reached its API limit. Contact billing@company.com"

Implementation Phases

Phase 1: Foundation (Week 1-2)

  • [x] V1 already captures comprehensive error data ✅
  • [ ] Build ErrorRuleEngine class
  • [ ] YAML rule loader
  • [ ] Basic rule matching (no recovery yet)
  • [ ] Template substitution for user messages

Phase 2: Recovery Engine (Week 3-4)

  • [ ] Request modification system
  • [ ] Retry logic with delays
  • [ ] Recovery attempt tracking
  • [ ] Integration with worker connectors

Phase 3: Rule Library (Week 5-6)

  • [ ] Create initial rule set for OpenAI
  • [ ] Create rules for ComfyUI
  • [ ] Create rules for other services
  • [ ] Documentation for writing rules

Phase 4: Analytics & Optimization (Week 7-8)

  • [ ] Rule match tracking
  • [ ] Success rate analytics
  • [ ] A/B testing framework
  • [ ] Auto-suggest new rules based on patterns

Example Rule Files

OpenAI Responses

/config/error-rules/openai-responses.yaml - as shown above

ComfyUI

yaml
version: 2
service: comfyui

rules:
  - id: comfyui-node-missing-001
    name: "Missing Custom Node"
    priority: 100

    match:
      failure_type: validation_error
      failure_reason: component_error
      conditions:
        - field: context.componentError
          operator: matches
          value: "Cannot find node class"

    user_message: |
      The ComfyUI workflow requires a custom node that isn't installed: {{context.componentName}}
      This workflow may have been created with extensions that are not available on this system.

    internal_note: "Custom node missing - need to track which nodes are available on which machines"

    recovery:
      enabled: false  # Can't auto-fix missing nodes

Migration Path from V1

  1. No Breaking Changes: V1 continues to work
  2. Opt-in per Service: Add rule files service by service
  3. Gradual Enhancement: Start with user messages, add recovery later
  4. Analytics on Both: Compare V1 vs V2 error rates

Success Metrics

  • User Retry Rate: Should decrease as auto-recovery improves
  • Support Tickets: Fewer "what does this error mean" tickets
  • Error Message Clarity: User surveys on message helpfulness
  • Recovery Success Rate: % of errors that auto-recover
  • Time to Add New Rule: Should be < 10 minutes

Future Enhancements

ML-Based Rule Suggestions

typescript
// Analyze error patterns and suggest new rules
interface RuleSuggestion {
  pattern: ErrorPattern;
  occurrences: number;
  suggested_rule: Partial<ErrorRule>;
  confidence: number;
}

User Feedback Loop

typescript
// Let users rate error messages
interface ErrorFeedback {
  error_id: string;
  helpful: boolean;
  comment?: string;
}

Dynamic Rule Updates

typescript
// Hot-reload rules without restart
ruleEngine.watchRuleFiles();
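
A minimal sketch of what watchRuleFiles() could look like, using Node's fs.watch; debouncing and error handling are omitted, and the method itself is an assumption layered onto the engine shown earlier:

typescript
// Sketch of a watchRuleFiles() method inside ErrorRuleEngine (requires: import { watch } from 'fs')
watchRuleFiles(dir = '/config/error-rules'): void {
  watch(dir, (_event, filename) => {
    if (filename?.endsWith('.yaml')) {
      void this.loadRules(); // reload all rules from disk; last write wins
    }
  });
}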

Decision

Recommended: Implement V2 Error System with Declarative Rules

Rationale:

  1. Separates error intelligence from code
  2. Enables rapid iteration on error handling
  3. Provides better user experience
  4. Captures comprehensive data for learning
  5. Supports automatic error recovery
  6. Scales to new services easily

Next Steps:

  1. Review and approve this ADR
  2. Create /config/error-rules/ directory structure
  3. Implement ErrorRuleEngine class
  4. Create initial OpenAI rule set
  5. Integrate with one connector (OpenAI) as proof of concept
  6. Measure impact and iterate

Appendix: Current Implementation Analysis (2025-01-07)

Overview of Existing Error Handling ADRs

Three related ADRs exist:

  1. ERROR_HANDLING_MODERNIZATION.md - Python → TypeScript error flow improvements

    • Fix Python object serialization ([object Object] → proper JSON)
    • Add structured error codes in Python (ErrorCode enum)
    • Create ComfyUIErrorParser in TypeScript
    • Status: Proposed, partially implemented
  2. connector-error-handling-standard.md - Standardized connector error handling

    • ConnectorError class with structured classification
    • BaseConnector enforcement wrapper
    • Protocol layers (HTTPConnector, WebSocketConnector)
    • Status: Accepted, partially implemented
  3. error-management-v2-declarative-rules.md (this document) - Declarative error recovery

    • YAML-based error rules
    • Automatic retry with modifications
    • Pattern learning and A/B testing
    • Status: Proposed, NOT implemented

Current Connector Error Handling State

✅ What's Working Well

BaseConnector (base-connector.ts:487-595):

  • processJob() wrapper catches ALL errors automatically
  • Converts errors to ConnectorError via ConnectorError.fromError()
  • Validates connectors don't return {success: false} (throws if they do)
  • Logs structured error data with telemetry
  • Reports errors to Redis for monitoring

AsyncHTTPConnector (async-http-connector.ts:242-245):

  • Protocol layer for HTTP-based connectors
  • handleServiceError() hook for service-specific error handling
  • Generic HTTP status mapping:
    • 401/403 → AUTH_ERROR / INVALID_API_KEY
    • 429 → RATE_LIMIT / REQUESTS_PER_MINUTE
    • 4xx → VALIDATION_ERROR
    • 5xx → SERVICE_ERROR
  • MIME type validation
  • Fetch-based with timeout handling

OpenAIResponsesConnector (openai-responses-connector.ts:156-381):

  • Semantic validation for content refusals
  • Rich error context (rawServiceOutput, rawServiceRequest)
  • MIME mismatch detection
  • Refusal pattern detection:
    typescript
    const refusalPatterns = [
      /I[''']m sorry,?\s+but I can[''']t assist with that/i,
      /against (OpenAI[''']s? )?content policy/i,
      // ...
    ];

❌ Issues & Inconsistencies

1. Inconsistent Error Throwing

  • Some connectors: throw new ConnectorError(...)
  • Some connectors: throw new Error(...) ⚠️ (gets converted by BaseConnector)
  • Some connectors: return error in result object ⚠️ (gets converted by AsyncHTTPConnector)

Example from OpenAIBaseConnector (lines 768, 804, 823):

typescript
// ❌ Throws generic Error instead of ConnectorError
throw new Error(`OpenAI job ${openaiJobId} ${currentStatus}: ${errorMessage}`);

2. Service Error Hook Underutilized

  • AsyncHTTPConnector provides handleServiceError() hook
  • OpenAI connectors don't override it (missing OpenAI-specific error patterns)
  • ComfyUI connectors likely similar
  • Opportunity for service-specific classification before generic fallback

3. Missing Rich Context

  • Not all errors include rawRequest / rawResponse
  • Some errors lack serviceJobId, retryAfterSeconds, etc.
  • Context helps with debugging and forensic analysis

4. No Declarative Rules (V2)

  • This ADR proposed but not implemented
  • No automatic retry with modifications
  • No pattern learning
  • No YAML rule files
  • No user-friendly message templates

5. Python Error Codes Not Implemented

  • ERROR_HANDLING_MODERNIZATION.md ADR not implemented
  • Still getting [object Object] serialization issues from ComfyUI
  • No structured error codes from Python layer

Connector Error Handling Patterns Summary

| Connector | Extends | Error Pattern | Service Hook | Context | Rating |
|---|---|---|---|---|---|
| BaseConnector | – | Catches all, converts to ConnectorError | – | ✅ Full | ⭐⭐⭐⭐⭐ |
| AsyncHTTPConnector | BaseConnector | Generic HTTP mapping + hook | ✅ Provides | ✅ Good | ⭐⭐⭐⭐ |
| OpenAIBaseConnector | BaseConnector | Throws generic Error | ❌ None | ⚠️ Partial | ⭐⭐⭐ |
| OpenAIResponsesConnector | AsyncHTTPConnector | Returns error in result | ❌ None | ✅ Good | ⭐⭐⭐⭐ |
| ComfyUIRestStreamConnector | AsyncHTTPConnector | Unknown | Unknown | Unknown | – |

Recommendations for V2 Implementation

1. Implement Service Error Hooks First

Before implementing declarative rules, standardize service-specific error handling:

typescript
// In OpenAIBaseConnector
protected handleServiceError(error: any, jobData: JobData): ConnectorError | null {
  // Handle OpenAI-specific error codes
  if (error.response?.data?.error?.code) {
    const errorCode = error.response.data.error.code;
    const errorMessage = error.response.data.error.message;

    switch (errorCode) {
      case 'rate_limit_exceeded':
        return new ConnectorError(
          FailureType.RATE_LIMIT,
          FailureReason.REQUESTS_PER_MINUTE,
          errorMessage,
          true,
          {
            serviceType: 'openai',
            httpStatus: error.response.status,
            rawResponse: error.response.data,
            // retry-after arrives as a string header; convert it to a number of seconds
            retryAfterSeconds: error.response.headers?.['retry-after']
              ? parseInt(error.response.headers['retry-after'], 10)
              : undefined
          }
        );

      case 'invalid_api_key':
        return new ConnectorError(
          FailureType.AUTH_ERROR,
          FailureReason.INVALID_API_KEY,
          errorMessage,
          false,
          { serviceType: 'openai', httpStatus: 401 }
        );

      case 'content_policy_violation':
        return new ConnectorError(
          FailureType.GENERATION_REFUSAL,
          FailureReason.SAFETY_FILTER,
          errorMessage,
          false,
          {
            serviceType: 'openai',
            rawRequest: jobData.payload,
            rawResponse: error.response.data
          }
        );
    }
  }

  return null; // Fall through to generic handling
}
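
For context, the intended call order in the protocol layer is: try the service-specific hook first, then fall back to generic HTTP status mapping. A sketch of that pattern; handleServiceError exists per the notes above, but the surrounding method and the mapHttpStatusToConnectorError name are assumptions:

typescript
// Hypothetical classification flow inside AsyncHTTPConnector.
protected classifyError(error: any, jobData: JobData): ConnectorError {
  // 1. Give the concrete connector a chance to classify service-specific errors
  const serviceError = this.handleServiceError(error, jobData);
  if (serviceError) return serviceError;

  // 2. Fall back to the generic HTTP status mapping (401/403, 429, 4xx, 5xx)
  return this.mapHttpStatusToConnectorError(error);
}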

2. Standardize on ConnectorError

  • Update OpenAIBaseConnector to throw ConnectorError instead of Error
  • Add rich context to all errors (rawRequest, rawResponse, serviceJobId)
  • Ensure all connectors follow the same pattern

3. Semantic Validation Framework

Generalize the refusal pattern detection from OpenAIResponsesConnector:

typescript
// packages/core/src/services/semantic-validator.ts
export class SemanticValidator {
  private static refusalPatterns = [
    /I[''']m sorry,?\s+but I can[''']t assist with that/i,
    /cannot generate/i,
    /unable to create/i,
    /policy violation/i,
    /content policy/i,
    // ... more patterns
  ];

  static detectRefusal(text: string): {
    isRefusal: boolean;
    confidence: number;
    matchedPattern?: string;
  } {
    // Return the first refusal pattern that matches the response text
    for (const pattern of this.refusalPatterns) {
      if (pattern.test(text)) {
        return { isRefusal: true, confidence: 0.9, matchedPattern: pattern.source }; // confidence value is illustrative
      }
    }
    return { isRefusal: false, confidence: 0 };
  }

  static detectMimeTypeMismatch(
    requestedType: string,
    actualContent: any
  ): boolean {
    // Simple illustrative heuristic: an image/* was requested but the service returned plain text
    if (requestedType.startsWith('image/')) {
      return typeof actualContent === 'string' && !actualContent.startsWith('data:image');
    }
    return false;
  }
}
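
Connectors could then consult the validator before accepting a response. A sketch of a possible call site; the variable names and context fields are illustrative:

typescript
// Hypothetical call site inside a connector's response handling.
const refusal = SemanticValidator.detectRefusal(responseText);
if (refusal.isRefusal) {
  throw new ConnectorError(
    FailureType.GENERATION_REFUSAL,
    FailureReason.SAFETY_FILTER,
    `Model refused the request (matched: ${refusal.matchedPattern})`,
    false,
    { rawResponse: responseText }
  );
}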

4. Phased V2 Implementation

Given the current state, we recommend the following approach:

Phase 1: Foundation (Week 1-2) - Focus on standardization first

  • [ ] Implement service error hooks in all connectors
  • [ ] Standardize ConnectorError usage
  • [ ] Add semantic validation framework
  • [ ] Ensure rich context in all errors

Phase 2: Rule Engine Core (Week 3-4) - Build the infrastructure

  • [ ] Implement ErrorRuleEngine class
  • [ ] YAML rule loader
  • [ ] Rule matching logic
  • [ ] Template substitution for user messages

Phase 3: Recovery Strategies (Week 5-6) - Add intelligence

  • [ ] Request modification system
  • [ ] Retry logic with delays
  • [ ] Recovery attempt tracking
  • [ ] Integration with connectors

Phase 4: Rule Library (Week 7-8) - Build the knowledge base

  • [ ] OpenAI rule set (based on observed patterns)
  • [ ] ComfyUI rule set
  • [ ] Documentation for writing rules
  • [ ] Rule testing framework

Phase 5: Analytics & Learning (Week 9-10) - Close the loop

  • [ ] Rule match tracking
  • [ ] Success rate analytics
  • [ ] A/B testing framework
  • [ ] Auto-suggest new rules based on patterns

5. Immediate Quick Wins

Before implementing V2, these changes provide immediate value:

  1. Add service error hooks to OpenAI connectors (1-2 days)
  2. Create semantic validator for LLM responses (1 day)
  3. Standardize error throwing in OpenAIBaseConnector (1 day)
  4. Add rich context to all errors (1 day)

Total: 1 week of work for significant improvement


Key Insights for V2 Design

1. Service Hook Pattern Works Well

  • AsyncHTTPConnector's handleServiceError() hook is a clean pattern
  • Allows service-specific logic without bloating base classes
  • Should be the integration point for declarative rules

2. Semantic Validation is Critical

  • OpenAIResponsesConnector's refusal detection prevents silent failures
  • Image-vs-text mismatch detection catches model misbehavior
  • This intelligence should live in the rule engine, not in individual connectors

3. Rich Context Enables Learning

  • Errors with rawRequest and rawResponse enable pattern analysis
  • Missing context makes it impossible to write good rules
  • Context must be standard before V2 can succeed

4. Connector Compliance is Key

  • V2 only works if connectors use ConnectorError consistently
  • BaseConnector validation catches {success: false} pattern
  • Need similar validation for generic Error throws

5. Start Simple, Evolve

  • Don't implement all V2 features at once
  • Start with static rules (no recovery)
  • Add recovery strategies incrementally
  • Add A/B testing and learning last

Migration Path to V2

Step 1: Current State Audit ✅ (Complete - this document)

  • [x] Document existing error handling patterns
  • [x] Identify gaps and inconsistencies
  • [x] Analyze connector implementations

Step 2: Standardization (Before V2)

  • [ ] Implement service error hooks in all connectors
  • [ ] Standardize ConnectorError usage
  • [ ] Add semantic validation framework
  • [ ] Ensure rich context in all errors

Step 3: V2 Foundation (Weeks 1-4)

  • [ ] Build ErrorRuleEngine core
  • [ ] Create YAML rule format
  • [ ] Implement rule matching
  • [ ] Add user message templates

Step 4: V2 Intelligence (Weeks 5-8)

  • [ ] Request modification system
  • [ ] Recovery strategies
  • [ ] Rule library for OpenAI and ComfyUI
  • [ ] Integration testing

Step 5: V2 Learning (Weeks 9-10)

  • [ ] Analytics and tracking
  • [ ] A/B testing
  • [ ] Auto-suggest rules
  • [ ] User feedback loop

Step 6: Python Integration (Optional - if ComfyUI priority)

  • [ ] Implement Python ErrorCode enum
  • [ ] Fix object serialization
  • [ ] Create ComfyUIErrorParser
  • [ ] Integrate with existing connectors

Success Metrics Tracking

To measure V2 effectiveness, track these metrics:

Before V2 (Baseline - capture now):

  • [ ] Error classification accuracy (manual audit of 100 errors)
  • [ ] % of errors with helpful user messages
  • [ ] Average retry attempts per error type
  • [ ] Support tickets related to error messages
  • [ ] Time to resolve production errors (MTTR)

After V2 Implementation:

  • [ ] Error classification accuracy improvement
  • [ ] User message helpfulness rating (survey)
  • [ ] Reduction in wasted retries (non-retryable errors)
  • [ ] Support ticket reduction
  • [ ] MTTR improvement
  • [ ] Auto-recovery success rate

Risk Assessment

Low Risk:

  • Adding service error hooks (backwards compatible)
  • Creating semantic validator (pure utility)
  • Standardizing ConnectorError usage (caught by BaseConnector)

Medium Risk:

  • Rule engine implementation (new system, needs testing)
  • Request modification (could make problems worse if buggy)
  • YAML parsing and validation (security considerations)

High Risk:

  • Automatic retries with modifications (could cause cascading failures)
  • A/B testing in production (could confuse users)
  • Python error code integration (requires ComfyUI changes)

Mitigation:

  • Feature flags for rule engine (enable per service)
  • Dry-run mode (log rule matches without taking action; see the sketch after this list)
  • Gradual rollout (one service at a time)
  • Comprehensive testing (unit + integration + E2E)
  • Monitoring and alerting (track rule effectiveness)
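
A dry-run mode could wrap the rule engine so matches are logged but recovery is never executed. A minimal sketch; the evaluateDryRun helper and log fields are assumptions:

typescript
// Hypothetical dry-run wrapper: evaluate rules and log what *would* happen, without acting.
async function evaluateDryRun(
  engine: ErrorRuleEngine,
  error: ConnectorError,
  originalRequest: any
): Promise<void> {
  const rule = engine.findMatchingRule(error);
  if (!rule) {
    logger.info('🧪 Dry-run: no rule matched', { failure_reason: error.failureReason });
    return;
  }
  const recovery = await engine.applyRecoveryStrategy(originalRequest, rule, error);
  logger.info('🧪 Dry-run: rule matched, recovery not executed', {
    rule_id: rule.id,
    would_retry: recovery.shouldRetry,
    user_message: engine.getUserMessage(rule, error),
  });
}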

Open Questions for V2

  1. Rule Priority: How do we handle overlapping rules?

    • Answer: Priority field (higher number = higher priority)
    • First matching rule wins
  2. Rule Updates: Hot-reload or restart required?

    • Answer: Start with restart, add hot-reload later
    • Use file watcher for development
  3. Rule Testing: How do we test rules without production traffic?

    • Answer: Dry-run mode + replay production errors
    • Create rule test framework
  4. Rule Versioning: How do we track rule changes over time?

    • Answer: Git version control for YAML files
    • Add version field to rule format
  5. Rule Conflicts: What if multiple services use same error pattern?

    • Answer: Service-specific rule files
    • Rules are scoped to service_type
  6. Recovery Limits: How many times should we retry with modifications?

    • Answer: Configurable per rule (max_attempts)
    • Default: 1 retry to avoid cascading
  7. User Override: Should users be able to customize rules?

    • Answer: Yes, via override files
    • Workspace-specific customization
  8. Telemetry: What data should we collect about rule effectiveness?

    • Answer: Match count, success rate, average retries
    • Store in Redis for analytics

Last Updated: 2025-01-07
Analyst: Claude Code
Status: Analysis Complete - Ready for V2 Planning
