
Redis-Driven Error Pattern Classification - ADR

Status: Proposed
Date: 2025-01-09
Author: System Architecture
Related: ERROR_HANDLING_MODERNIZATION.md, connector-error-handling-standard.md
Supersedes: database-driven-error-classification.md (PostgreSQL approach)


Context

The emp-job-queue system processes logs from multiple external services (ComfyUI, Ollama, Stable Diffusion) and must classify error messages as fatal (fail the job), non-fatal (log but continue), or ignore (regular logs).

Current Approach: Hardcoded Catalogs

Error classification logic is embedded in each connector's TypeScript code:

typescript
private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
  const messageLower = message.toLowerCase();

  // Hardcoded patterns
  if (messageLower.includes('custom validation failed')) return 'fatal';
  if (messageLower.includes('dash0.com')) return 'non-fatal';
  if (messageLower.includes('out of memory')) return 'fatal';
  // ... 20+ more patterns

  return 'ignore';
}

Problems with Hardcoded Patterns

  1. Deployment Required: Every new error pattern requires code change + rebuild + redeploy
  2. No Hot-Fixing: Can't quickly reclassify errors in production without deployment
  3. Limited Collaboration: Only engineers with repo access can update patterns
  4. No Analytics: Can't track which patterns match most frequently
  5. Environment Inconsistency: Dev/staging/prod might need different classifications
  6. Pattern Sprawl: As we discover new errors, code becomes increasingly complex

Real-World Scenario

🚨 Production Alert: Jobs failing due to new ComfyUI error message
   "RuntimeError: CUDA driver version mismatch"

Current process:
1. Engineer identifies pattern needs to be cataloged (5 min)
2. Update TypeScript code to add pattern (10 min)
3. Run tests, commit, push (15 min)
4. CI/CD pipeline builds + deploys (20 min)
5. Total time to fix: 50+ minutes

Desired process:
1. Admin runs CLI/API command to add pattern
2. Redis SET operation completes instantly
3. Workers refresh cache within 30 seconds (or immediately via Pub/Sub)
4. Total time to fix: <1 minute
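The desired flow can be sketched end-to-end. The helper below (hypothetical; `buildPatternEntry` is not an existing function in the codebase) builds the Redis hash entry for a new pattern, and the comments show the three Redis commands an admin tool would then issue:

```typescript
interface NewPatternInput {
  connectorType: string;          // e.g. 'comfyui' or 'global'
  pattern: string;
  classification: 'fatal' | 'non-fatal' | 'ignore';
}

// Hypothetical helper: build the hash entry for a new pattern.
// Persisting it is two Redis commands plus a publish (shown in comments).
function buildPatternEntry(input: NewPatternInput, now: Date = new Date()) {
  const id = `${input.connectorType}_${now.getTime()}`;
  const entry = {
    pattern: input.pattern,
    match_type: 'contains',
    case_sensitive: false,
    classification: input.classification,
    priority: 100,
    active: true,
    created_at: now.toISOString(),
  };
  return {
    key: `error_patterns:${input.connectorType}`,
    id,
    json: JSON.stringify(entry),
    // Then, against Redis:
    //   HSET <key> <id> <json>
    //   SET error_patterns:version <now.getTime()>
    //   PUBLISH error_patterns:updated <now.getTime()>
  };
}
```

For the CUDA alert above, `buildPatternEntry({ connectorType: 'comfyui', pattern: 'cuda driver version mismatch', classification: 'fatal' })` yields the entry to HSET, and the PUBLISH makes every worker reload within a second.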

Decision

Implement Redis-based error pattern classification with the following characteristics:

1. Redis Data Structure

Store error patterns in Redis hashes organized by connector type:

typescript
// Key structure: error_patterns:{connector_type}
// Value: Hash { pattern_id → JSON }

// Global patterns (all connectors)
Key: "error_patterns:global"
Hash: {
  "global_001": JSON.stringify({
    pattern: "out of memory",
    match_type: "contains",
    case_sensitive: false,
    classification: "fatal",
    priority: 95,
    description: "Memory exhaustion (all services)",
    example_message: "CUDA out of memory",
    active: true,
    created_at: "2025-01-09T12:00:00Z",
    updated_at: "2025-01-09T12:00:00Z",
    created_by: "system"
  })
}

// ComfyUI-specific patterns
Key: "error_patterns:comfyui"
Hash: {
  "comfyui_001": JSON.stringify({
    pattern: "custom validation failed",
    match_type: "contains",
    case_sensitive: false,
    classification: "fatal",
    priority: 100,
    description: "ComfyUI custom node validation",
    example_message: "Custom validation failed: missing required parameter",
    active: true,
    created_at: "2025-01-09T12:00:00Z"
  })
}

// Pattern version tracking (for cache invalidation)
Key: "error_patterns:version"
Value: "1736424000000" (Unix timestamp, milliseconds — matches Date.now())

2. Worker Implementation - Zero-Overhead Pattern Matching

In-memory cache with instant Redis updates - patterns loaded once at startup:

typescript
class ComfyUIRestStreamConnector {
  // Pre-compiled patterns in memory
  private fatalPatterns: CompiledPattern[] = [];
  private nonFatalPatterns: CompiledPattern[] = [];
  private lastPatternVersion: string = '0';
  private readonly connectorType = 'comfyui';

  async initialize() {
    // Load patterns ONCE at worker startup
    await this.loadPatternsFromRedis();

    // Subscribe to pattern update events (instant invalidation)
    this.subscribeToPatternUpdates();
  }

  private async loadPatternsFromRedis() {
    // Fetch global patterns
    const globalPatterns = await redis.hgetall('error_patterns:global');

    // Fetch connector-specific patterns
    const connectorPatterns = await redis.hgetall(`error_patterns:${this.connectorType}`);

    // Merge and compile patterns
    const allPatterns = [
      ...Object.values(globalPatterns).map(p => JSON.parse(p)),
      ...Object.values(connectorPatterns).map(p => JSON.parse(p))
    ].filter(p => p.active);

    // Pre-compile into fast lookup structures (sorted by priority)
    this.fatalPatterns = allPatterns
      .filter(p => p.classification === 'fatal')
      .sort((a, b) => b.priority - a.priority)
      .map(p => this.compilePattern(p));

    this.nonFatalPatterns = allPatterns
      .filter(p => p.classification === 'non-fatal')
      .sort((a, b) => b.priority - a.priority)
      .map(p => this.compilePattern(p));

    // Update cache version
    this.lastPatternVersion = await redis.get('error_patterns:version') || '0';

    logger.info(`Loaded ${allPatterns.length} error patterns from Redis`, {
      fatal: this.fatalPatterns.length,
      nonFatal: this.nonFatalPatterns.length,
      version: this.lastPatternVersion
    });
  }

  private subscribeToPatternUpdates() {
    // Subscribe to Redis Pub/Sub for instant cache invalidation
    const subscriber = new Redis(this.redisConfig);

    subscriber.subscribe('error_patterns:updated', (err) => {
      if (err) {
        logger.error('Failed to subscribe to pattern updates', { error: err });
        return;
      }
    });

    subscriber.on('message', async (channel, message) => {
      if (channel === 'error_patterns:updated') {
        const newVersion = message;
        if (newVersion !== this.lastPatternVersion) {
          logger.info('Pattern update detected, reloading cache', {
            oldVersion: this.lastPatternVersion,
            newVersion
          });
          await this.loadPatternsFromRedis();
        }
      }
    });
  }

  private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
    // Fast in-memory matching - NO REDIS QUERIES
    const messageLower = message.toLowerCase();

    // Check non-fatal first (common infrastructure noise)
    for (const pattern of this.nonFatalPatterns) {
      if (this.matches(messageLower, pattern)) return 'non-fatal';
    }

    // Check fatal patterns
    for (const pattern of this.fatalPatterns) {
      if (this.matches(messageLower, pattern)) return 'fatal';
    }

    return 'ignore';
  }

  private matches(message: string, pattern: CompiledPattern): boolean {
    // Pure in-memory string operations - VERY FAST (~0.001ms).
    // Note: the caller lowercases the message and compilePattern() lowercases
    // the pattern, so contains/exact matching is always case-insensitive;
    // only regex patterns honor the case_sensitive flag.
    if (pattern.regex) {
      return this.regexMatchWithTimeout(pattern.regex, message, 10); // 10ms budget
    }
    if (pattern.matchType === 'exact') {
      return message === pattern.pattern;
    }
    return message.includes(pattern.pattern);
  }

  private regexMatchWithTimeout(regex: RegExp, text: string, timeoutMs: number): boolean {
    // JS regex execution cannot be interrupted mid-flight, so this detects
    // (rather than prevents) slow patterns: it times the match and warns when
    // the budget is exceeded so the offending pattern can be deactivated.
    // Write-time complexity validation (see Mitigations) is the real guard.
    const start = Date.now();
    try {
      const match = regex.test(text);
      const elapsed = Date.now() - start;
      if (elapsed > timeoutMs) {
        logger.warn('Regex match exceeded timeout', {
          pattern: regex.source,
          elapsed
        });
      }
      return match;
    } catch (error) {
      logger.error('Regex match failed', { error, pattern: regex.source });
      return false;
    }
  }

  private compilePattern(pattern: ErrorPattern): CompiledPattern {
    return {
      pattern: pattern.pattern.toLowerCase(),
      regex: pattern.match_type === 'regex' ? new RegExp(pattern.pattern, pattern.case_sensitive ? '' : 'i') : null,
      matchType: pattern.match_type,
      caseSensitive: pattern.case_sensitive
    };
  }
}
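The hot path above is pure string work, so it can be exercised standalone. The sketch below reimstates the matching core outside the class (the `MiniPattern` shape is an illustrative stand-in for `CompiledPattern`) to show how a batch of log lines would be classified:

```typescript
type Classification = 'fatal' | 'non-fatal' | 'ignore';

interface MiniPattern {
  pattern: string;      // already lowercased, as compilePattern() does
  regex: RegExp | null; // non-null only for match_type 'regex'
}

// Mirrors classifyLogMessage(): non-fatal noise is filtered before fatal checks.
function classify(message: string, fatal: MiniPattern[], nonFatal: MiniPattern[]): Classification {
  const lower = message.toLowerCase();
  const hit = (p: MiniPattern) => (p.regex ? p.regex.test(lower) : lower.includes(p.pattern));
  if (nonFatal.some(hit)) return 'non-fatal';
  if (fatal.some(hit)) return 'fatal';
  return 'ignore';
}

const fatal: MiniPattern[] = [
  { pattern: 'out of memory', regex: null },
  { pattern: '', regex: /cuda driver version mismatch/i },
];
const nonFatal: MiniPattern[] = [{ pattern: 'dash0.com', regex: null }];
```

With these patterns, the alert message from the Context section classifies as `'fatal'`, a dash0.com export failure as `'non-fatal'`, and an ordinary progress line as `'ignore'`.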

3. Pattern Management API/CLI

Simple Redis commands for pattern management:

typescript
// Add a new pattern
async function addErrorPattern(
  connectorType: string | 'global',
  pattern: string,
  classification: 'fatal' | 'non-fatal' | 'ignore',
  options?: {
    matchType?: 'contains' | 'regex' | 'exact';
    caseSensitive?: boolean;
    priority?: number;
    description?: string;
    exampleMessage?: string;
  }
) {
  const patternId = `${connectorType}_${Date.now()}`;
  const key = `error_patterns:${connectorType}`;

  const patternData = {
    pattern,
    match_type: options?.matchType || 'contains',
    case_sensitive: options?.caseSensitive || false,
    classification,
    priority: options?.priority || 100,
    description: options?.description,
    example_message: options?.exampleMessage,
    active: true,
    created_at: new Date().toISOString(),
    updated_at: new Date().toISOString(),
    created_by: 'admin'
  };

  // Store pattern
  await redis.hset(key, patternId, JSON.stringify(patternData));

  // Update version (triggers worker cache refresh)
  const version = Date.now().toString();
  await redis.set('error_patterns:version', version);

  // Publish update event (instant worker refresh via Pub/Sub)
  await redis.publish('error_patterns:updated', version);

  logger.info('Error pattern added', { patternId, connectorType, classification });

  return patternId;
}

// Update pattern
async function updateErrorPattern(
  connectorType: string,
  patternId: string,
  updates: Partial<ErrorPattern>
) {
  const key = `error_patterns:${connectorType}`;
  const existing = await redis.hget(key, patternId);

  if (!existing) {
    throw new Error(`Pattern not found: ${patternId}`);
  }

  const patternData = {
    ...JSON.parse(existing),
    ...updates,
    updated_at: new Date().toISOString()
  };

  await redis.hset(key, patternId, JSON.stringify(patternData));

  const version = Date.now().toString();
  await redis.set('error_patterns:version', version);
  await redis.publish('error_patterns:updated', version);
}

// Delete pattern (soft delete by setting active: false)
async function deactivateErrorPattern(connectorType: string, patternId: string) {
  await updateErrorPattern(connectorType, patternId, { active: false });
}

// List all patterns
async function listErrorPatterns(connectorType?: string) {
  // Note: KEYS is O(N) and blocks Redis; acceptable for an admin command on a
  // small keyspace, but prefer SCAN on a shared instance. The filter excludes
  // the version counter and analytics sorted sets, which are not pattern
  // hashes (HGETALL on them would throw WRONGTYPE).
  const keys = connectorType
    ? [`error_patterns:${connectorType}`]
    : (await redis.keys('error_patterns:*')).filter(
        k => k !== 'error_patterns:version' && !k.startsWith('error_patterns:analytics:')
      );

  const patterns = [];
  for (const key of keys) {
    const hash = await redis.hgetall(key);
    for (const [id, data] of Object.entries(hash)) {
      patterns.push({ id, ...JSON.parse(data as string) });
    }
  }

  return patterns.sort((a, b) => b.priority - a.priority);
}
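These functions touch only a handful of Redis commands, so they can be smoke-tested against a minimal in-memory stub. The stub and the compacted `addPattern` below are illustrative (production code uses the real client and the full `addErrorPattern` above):

```typescript
// Minimal stand-in for the few Redis commands the management API uses.
class FakeRedis {
  private hashes = new Map<string, Map<string, string>>();
  private strings = new Map<string, string>();
  published: Array<[string, string]> = [];

  async hset(key: string, field: string, value: string): Promise<void> {
    if (!this.hashes.has(key)) this.hashes.set(key, new Map());
    this.hashes.get(key)!.set(field, value);
  }
  async hget(key: string, field: string): Promise<string | null> {
    return this.hashes.get(key)?.get(field) ?? null;
  }
  async set(key: string, value: string): Promise<void> {
    this.strings.set(key, value);
  }
  async publish(channel: string, message: string): Promise<void> {
    this.published.push([channel, message]);
  }
}

// Compact version of addErrorPattern() wired to the stub.
async function addPattern(
  redis: FakeRedis,
  connectorType: string,
  pattern: string,
  classification: 'fatal' | 'non-fatal' | 'ignore'
): Promise<string> {
  const id = `${connectorType}_${Date.now()}`;
  await redis.hset(`error_patterns:${connectorType}`, id, JSON.stringify({
    pattern, match_type: 'contains', case_sensitive: false,
    classification, priority: 100, active: true,
    created_at: new Date().toISOString(),
  }));
  const version = Date.now().toString();
  await redis.set('error_patterns:version', version);
  await redis.publish('error_patterns:updated', version);
  return id;
}
```

A round trip should store the JSON under the connector's hash and emit exactly one `error_patterns:updated` event.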

Alternatives Considered

Alternative 1: PostgreSQL Database (Original Proposal)

Pros:

  • Structured schema with validation
  • SQL queries for analytics
  • Transaction support

Cons:

  • Additional database connection required
  • 5-50ms query latency vs <1ms Redis
  • Complex cache refresh logic
  • Migration/schema versioning overhead
  • Doesn't leverage existing Redis infrastructure

Rejected: Redis is already in the stack, faster, and simpler

Alternative 2: Configuration File (YAML/JSON)

Pros:

  • Human-readable
  • Version controlled

Cons:

  • Still requires deployment
  • No UI for non-engineers
  • No instant updates

Rejected: Doesn't solve the deployment problem

Alternative 3: In-Memory Only (No Persistence)

Pros:

  • Ultra-fast
  • No external dependencies

Cons:

  • Data loss on restart
  • No sharing between workers
  • No persistence

Rejected: Patterns are configuration, must be persistent


Consequences

Positive

✅ Instant Hot-Fixing: Add/update patterns in <1 second via Redis
✅ Sub-Millisecond Matching: In-memory cache, zero Redis queries during job processing
✅ Instant Propagation: Pub/Sub notifies workers immediately (<1 second)
✅ No New Dependencies: Redis already used for job matching
✅ Simpler Architecture: No database migrations, no schema versioning
✅ Space Efficient: 1000 patterns = ~300KB (negligible)
✅ Connector-Specific: Separate namespaces per service
✅ Environment-Specific: Different Redis instances for dev/staging/prod
✅ Analytics Ready: Redis sorted sets for pattern match tracking
✅ Graceful Degradation: Falls back to hardcoded patterns if Redis unavailable

Negative

⚠️ Redis Dependency: Workers need Redis at startup (already required)
⚠️ No SQL Analytics: Must use Redis queries or export to database
⚠️ Manual Backups: Need to export patterns for version control
⚠️ Regex Security: Pattern injection could cause ReDoS attacks

Mitigations

  • Redis Unavailable: Workers use hardcoded fallback patterns
  • Analytics: Export patterns to PostgreSQL periodically for SQL queries
  • Backups: Automated script to dump patterns to JSON daily
  • Regex Security: Validate regex complexity before storing, timeout on matching
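The regex-security mitigation can be enforced at write time, before a pattern ever reaches Redis. A sketch of the validation gate (the heuristics here are illustrative and not a proof of safety; a vetted library is another option):

```typescript
// Reject regex patterns likely to cause catastrophic backtracking before
// they are stored. Heuristics only: length cap, nested-quantifier check,
// and a compile check for syntax errors.
function validateRegexPattern(source: string): { ok: boolean; reason?: string } {
  if (source.length > 200) {
    return { ok: false, reason: 'pattern too long' };
  }
  // A quantified group that is itself quantified, e.g. (a+)+ or (\w*)* —
  // the classic ReDoS shape.
  if (/\([^)]*[+*][^)]*\)\s*[+*{]/.test(source)) {
    return { ok: false, reason: 'nested quantifier' };
  }
  try {
    new RegExp(source);
  } catch {
    return { ok: false, reason: 'invalid regex syntax' };
  }
  return { ok: true };
}
```

`addErrorPattern()` would call this whenever `matchType === 'regex'` and refuse to store a pattern that fails, pairing write-time validation with the runtime budget in `regexMatchWithTimeout()`.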

Implementation Plan

Phase 1: Redis Schema Setup (Week 1)

1.1 Redis Key Design

typescript
// packages/core/src/types/error-patterns.ts
export interface ErrorPattern {
  pattern: string;
  match_type: 'contains' | 'regex' | 'exact';
  case_sensitive: boolean;
  classification: 'fatal' | 'non-fatal' | 'ignore';
  priority: number;
  description?: string;
  example_message?: string;
  active: boolean;
  created_at: string;
  updated_at: string;
  created_by?: string;
}

export interface CompiledPattern {
  pattern: string;
  regex: RegExp | null;
  matchType: 'contains' | 'regex' | 'exact';
  caseSensitive: boolean;
}
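Because stored values are free-form JSON strings, it is worth validating them on read so one malformed entry cannot break a cache reload. A hypothetical guard (not in the codebase) checking the fields the matcher actually relies on:

```typescript
const CLASSIFICATIONS = new Set(['fatal', 'non-fatal', 'ignore']);
const MATCH_TYPES = new Set(['contains', 'regex', 'exact']);

// Returns the parsed pattern, or null if the JSON is malformed or missing a
// field the matcher depends on. Bad entries are skipped, not fatal, so a
// single corrupt hash field cannot take down pattern loading.
function parseStoredPattern(raw: string): Record<string, unknown> | null {
  let p: any;
  try {
    p = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof p !== 'object' || p === null) return null;
  if (typeof p.pattern !== 'string' || p.pattern.length === 0) return null;
  if (!MATCH_TYPES.has(p.match_type)) return null;
  if (!CLASSIFICATIONS.has(p.classification)) return null;
  if (typeof p.priority !== 'number') return null;
  return p;
}
```

`loadPatternsFromRedis()` would map hash values through this guard and filter out the nulls before compiling.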

1.2 Seed Existing Patterns

  • [ ] Extract all hardcoded patterns from classifyLogMessage() methods
  • [ ] Create seed script: scripts/seed-error-patterns.ts
  • [ ] Seed patterns to Redis (dev/staging/prod)
typescript
// Example seed data
const globalPatterns = [
  {
    pattern: "out of memory",
    match_type: "contains",
    classification: "fatal",
    priority: 95,
    description: "Memory exhaustion (all services)"
  },
  {
    pattern: "dash0.com",
    match_type: "contains",
    classification: "non-fatal",
    priority: 100,
    description: "OpenTelemetry export (non-critical)"
  }
];

const comfyUIPatterns = [
  {
    pattern: "custom validation failed",
    match_type: "contains",
    classification: "fatal",
    priority: 100,
    description: "ComfyUI custom node validation"
  }
];
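The seed script can expand those arrays into (key, field, value) tuples and HSET each one. A sketch (sequence-based IDs like `global_001` match the examples in this ADR; the real script could equally use timestamps):

```typescript
interface SeedPattern {
  pattern: string;
  match_type: string;
  classification: string;
  priority: number;
  description?: string;
}

// Turn a seed array into the HSET arguments for one connector namespace,
// filling in the bookkeeping fields the worker expects.
function toHashEntries(connectorType: string, seeds: SeedPattern[]) {
  const now = new Date().toISOString();
  return seeds.map((seed, i) => ({
    key: `error_patterns:${connectorType}`,
    field: `${connectorType}_${String(i + 1).padStart(3, '0')}`,
    value: JSON.stringify({ ...seed, case_sensitive: false, active: true, created_at: now }),
  }));
}
```

`toHashEntries('global', globalPatterns)` yields fields `global_001`, `global_002`, ready for `redis.hset(key, field, value)` per entry, followed by one version bump and one publish.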

Phase 2: Worker Integration (Week 2)

2.1 Pattern Cache Implementation

  • [ ] Add loadPatternsFromRedis() to BaseConnector
  • [ ] Implement pattern compilation (regex pre-compilation)
  • [ ] Add Pub/Sub subscription for instant updates
  • [ ] Add fallback to hardcoded patterns if Redis fails
  • [ ] Add regex timeout protection (10ms max)

2.2 Classification Integration

  • [ ] Update classifyLogMessage() to use cached patterns
  • [ ] Add performance logging (cache load time, match time)
  • [ ] Add metrics (pattern cache hits/misses)

2.3 Testing

  • [ ] Unit tests for pattern matching
  • [ ] Integration tests with Redis
  • [ ] Performance benchmarks (target: <0.01ms per match)
  • [ ] Test Redis unavailable scenario (graceful degradation)
  • [ ] Regression tests (old vs new classifications must match 100%)

Phase 3: Management CLI/API (Week 3)

3.1 CLI Commands

bash
# Add pattern
pnpm error-patterns:add --connector=comfyui --pattern="cuda driver mismatch" --classification=fatal

# List patterns
pnpm error-patterns:list --connector=comfyui

# Update pattern
pnpm error-patterns:update --id=comfyui_001 --priority=90

# Deactivate pattern
pnpm error-patterns:deactivate --id=comfyui_001

# Export patterns (for backup)
pnpm error-patterns:export --output=patterns.json

# Import patterns
pnpm error-patterns:import --input=patterns.json

3.2 API Endpoints

typescript
// GET /api/error-patterns?connector_type=comfyui
// POST /api/error-patterns
// PUT /api/error-patterns/:id
// DELETE /api/error-patterns/:id (soft delete)

Phase 4: Observability (Week 3-4)

4.1 Pattern Analytics (Redis Sorted Sets)

typescript
// Track pattern match frequency
// Key: error_patterns:analytics:{connector_type}
// Sorted Set: { score=match_count, member=pattern_id }

async function trackPatternMatch(connectorType: string, patternId: string) {
  await redis.zincrby(`error_patterns:analytics:${connectorType}`, 1, patternId);
}

async function getTopMatchedPatterns(connectorType: string, limit: number = 10) {
  return redis.zrevrange(`error_patterns:analytics:${connectorType}`, 0, limit - 1, 'WITHSCORES');
}

4.2 Logging & Metrics

  • [ ] Log pattern cache loads
  • [ ] Metric: error_pattern_cache_load_time_ms
  • [ ] Metric: error_pattern_matches_total (by pattern_id)
  • [ ] Metric: error_pattern_cache_size (number of patterns loaded)
  • [ ] Alert: Pattern cache load failures

4.3 Monitoring Dashboard

  • [ ] Pattern match frequency chart
  • [ ] Top 10 matched patterns
  • [ ] Patterns never matched (candidates for removal)
  • [ ] Pattern update history

Phase 5: Migration & Rollout

5.1 Feature Flag

typescript
const USE_REDIS_PATTERNS = process.env.USE_REDIS_ERROR_PATTERNS === 'true';

if (USE_REDIS_PATTERNS && await this.redisAvailable()) {
  await this.loadPatternsFromRedis();
} else {
  this.useHardcodedPatterns();
}
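`redisAvailable()` above is assumed rather than defined. One way to implement it is a PING with a hard deadline, so a wedged Redis fails fast into the hardcoded fallback instead of stalling worker startup. The ping function is injected (e.g. `() => redis.ping()`) so the deadline logic stays testable; this sketch is one possible shape, not the codebase's actual helper:

```typescript
// Returns true only if ping() resolves to 'PONG' within timeoutMs.
// Errors and timeouts both resolve to false (never throws).
async function redisAvailable(
  ping: () => Promise<string>,
  timeoutMs = 2000
): Promise<boolean> {
  const deadline = new Promise<boolean>(resolve =>
    setTimeout(() => resolve(false), timeoutMs)
  );
  const attempt = ping()
    .then(reply => reply === 'PONG')
    .catch(() => false);
  return Promise.race([attempt, deadline]);
}
```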

5.2 Rollout Plan

  1. Deploy to dev with feature flag ON
  2. Run regression tests (old vs new classifications must match)
  3. Deploy to staging with flag ON
  4. Monitor for 48 hours
  5. Deploy to production with flag OFF (hardcoded)
  6. Gradually enable flag for 10% → 50% → 100% of workers
  7. Remove hardcoded patterns after 30 days of stable operation

5.3 Backup Strategy

bash
# Daily automated backup
0 2 * * * /usr/local/bin/pnpm error-patterns:export --output=/backups/error-patterns-$(date +\%Y\%m\%d).json

Performance Benchmarks

Target Metrics

| Operation | Target | Acceptable |
| --- | --- | --- |
| Pattern cache load (Redis fetch) | <10ms | <50ms |
| Pattern match (contains) | <0.001ms | <0.01ms |
| Pattern match (regex) | <0.01ms | <0.1ms |
| Pattern update propagation (Pub/Sub) | <1s | <5s |
| Redis HGETALL (100 patterns) | <5ms | <20ms |

Load Testing

Test with realistic workload:

  • 100 patterns loaded
  • 1000 log messages/second
  • Mixed fatal/non-fatal/ignore

Expected:

  • Pattern matching: <1ms total (for all 1000 messages)
  • Memory overhead: <500KB for pattern cache
  • Zero Redis queries during job processing
  • Pattern updates propagate in <1 second via Pub/Sub
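That expectation is easy to sanity-check with a micro-benchmark. Timings are machine-dependent, so the assertion worth making is on classification counts rather than wall time; the synthetic patterns and messages below are illustrative:

```typescript
// Build 100 contains-patterns and classify 1000 synthetic log lines,
// timing only the hot loop (no Redis involved, as in production).
const patterns = Array.from({ length: 100 }, (_, i) => `error code ${i}`);
const messages = Array.from({ length: 1000 }, (_, i) =>
  i % 10 === 0
    ? `RuntimeError: error code ${i % 100}`   // 1 in 10 lines is an error
    : `Prompt executed in ${i} ms`            // the rest is routine noise
);

const start = process.hrtime.bigint();
let fatalCount = 0;
for (const msg of messages) {
  const lower = msg.toLowerCase();
  if (patterns.some(p => lower.includes(p))) fatalCount++;
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`classified 1000 messages in ${elapsedMs.toFixed(2)}ms, ${fatalCount} fatal`);
```

On typical hardware this finishes well inside the 1ms budget; if it doesn't, the contains patterns can be folded into a single alternation or an Aho-Corasick-style matcher.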

Space Estimation

Per pattern: ~300 bytes (JSON)
100 patterns: 30KB
1,000 patterns: 300KB
10,000 patterns: 3MB

Redis memory usage: Negligible

Success Criteria

Technical

✅ Workers load patterns from Redis at startup (<50ms)
✅ Pattern matching is <0.01ms per message (in-memory)
✅ Pattern updates propagate in <1 second (Pub/Sub)
✅ Graceful fallback if Redis unavailable
✅ Zero regressions in existing error classification
✅ Regex timeout protection (no catastrophic backtracking)

Operational

✅ Support team can add patterns without engineering help
✅ Pattern changes take effect within 1 second
✅ Production error classification can be hot-fixed instantly
✅ Analytics show which patterns match most frequently
✅ Different patterns for dev/staging/prod environments
✅ Automated daily backups of patterns


Future Enhancements

Machine Learning Classification

Once we have pattern match analytics, train ML model:

typescript
// Analyze historical pattern matches (export Redis analytics to PostgreSQL)
const trainingData = await getPatternMatchHistory();

// Train classifier to suggest patterns
const suggestedPattern = await mlClassifier.suggest(newErrorMessage);

// Admin reviews and approves

Auto-Discovery

Detect new error patterns automatically:

typescript
// Detect frequent unclassified errors
if (errorFrequency > threshold && classification === 'ignore') {
  await redis.sadd('error_patterns:candidates', errorMessage);
  notifyAdmin(`New error pattern detected: ${errorMessage}`);
}

A/B Testing

Test different classifications:

typescript
// Route 10% of traffic to experimental classification
const experiment = await redis.hget('error_patterns:experiments', patternId);
if (experiment && Math.random() < 0.1) {
  return JSON.parse(experiment).classification;
}


Appendix: Redis CLI Examples

Add Patterns via Redis CLI

bash
# Add global pattern
redis-cli HSET error_patterns:global global_001 '{"pattern":"out of memory","match_type":"contains","case_sensitive":false,"classification":"fatal","priority":95,"active":true}'

# Add ComfyUI-specific pattern
redis-cli HSET error_patterns:comfyui comfyui_001 '{"pattern":"custom validation failed","match_type":"contains","case_sensitive":false,"classification":"fatal","priority":100,"active":true}'

# Update version (triggers worker refresh)
redis-cli SET error_patterns:version $(date +%s)000

# Publish update event (instant Pub/Sub notification)
redis-cli PUBLISH error_patterns:updated $(date +%s)000

# List all ComfyUI patterns
redis-cli HGETALL error_patterns:comfyui

# Get pattern analytics
redis-cli ZREVRANGE error_patterns:analytics:comfyui 0 9 WITHSCORES

Export/Import Patterns

bash
# Export all patterns to JSON (SCAN avoids blocking; skip version/analytics keys)
redis-cli --scan --pattern 'error_patterns:*' | grep -Ev ':(version$|analytics:)' | while read -r key; do
  redis-cli --raw HGETALL "$key" | paste - - | while IFS=$'\t' read -r id json; do
    jq -n --arg key "$key" --arg id "$id" --argjson pattern "$json" '{key: $key, id: $id, pattern: $pattern}'
  done
done | jq -s '.' > patterns.json

# Import patterns from JSON (restores each entry to its original key)
jq -c '.[]' patterns.json | while read -r entry; do
  redis-cli HSET "$(jq -r '.key' <<<"$entry")" "$(jq -r '.id' <<<"$entry")" "$(jq -c '.pattern' <<<"$entry")"
done
