
Database-Driven Error Pattern Classification - ADR

Status: Proposed
Date: 2025-11-08
Author: System Architecture
Related: ERROR_HANDLING_MODERNIZATION.md, connector-error-handling-standard.md


Context

The emp-job-queue system processes logs from multiple external services (ComfyUI, Ollama, Stable Diffusion) and must classify error messages as fatal (fail the job), non-fatal (log but continue), or ignore (regular logs).

Current Approach: Hardcoded Catalogs

Error classification logic is embedded in each connector's TypeScript code:

typescript
private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
  const messageLower = message.toLowerCase();

  // Hardcoded patterns
  if (messageLower.includes('custom validation failed')) return 'fatal';
  if (messageLower.includes('dash0.com')) return 'non-fatal';
  if (messageLower.includes('out of memory')) return 'fatal';
  // ... 20+ more patterns

  return 'ignore';
}

Problems with Hardcoded Patterns

  1. Deployment Required: Every new error pattern requires code change + rebuild + redeploy
  2. No Hot-Fixing: Can't quickly reclassify errors in production without deployment
  3. Limited Collaboration: Only engineers with repo access can update patterns
  4. No Analytics: Can't track which patterns match most frequently
  5. Environment Inconsistency: Dev/staging/prod might need different classifications
  6. Pattern Sprawl: As we discover new errors, code becomes increasingly complex

Real-World Scenario

🚨 Production Alert: Jobs failing due to new ComfyUI error message
   "RuntimeError: CUDA driver version mismatch"

Current process:
1. Engineer identifies pattern needs to be cataloged (5 min)
2. Update TypeScript code to add pattern (10 min)
3. Run tests, commit, push (15 min)
4. CI/CD pipeline builds + deploys (20 min)
5. Total time to fix: 50+ minutes

Desired process:
1. Admin opens error management UI
2. Adds pattern: "cuda driver version mismatch" → fatal
3. Workers pick up change within 5 minutes
4. Total time to fix: 5 minutes

Decision

Implement a database-driven error pattern classification system with the following characteristics:

1. Database Schema

Store error patterns in a central database accessible by all workers:

sql
CREATE TABLE error_patterns (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Connector specificity
  connector_type VARCHAR(50),  -- 'comfyui' | 'ollama' | NULL (global)

  -- Pattern matching
  pattern TEXT NOT NULL,
  match_type VARCHAR(20) DEFAULT 'contains',  -- 'contains' | 'regex' | 'exact'
  case_sensitive BOOLEAN DEFAULT false,

  -- Classification
  classification VARCHAR(20) NOT NULL,  -- 'fatal' | 'non-fatal' | 'ignore'
  priority INT DEFAULT 100,  -- Higher = checked first

  -- Metadata
  description TEXT,
  example_message TEXT,

  -- Management
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW(),
  created_by VARCHAR(255)
);

CREATE INDEX idx_connector_active_priority
  ON error_patterns (connector_type, active, priority DESC);

2. Performance-Optimized Worker Implementation

No database queries on the job-processing path. Patterns are loaded into an in-memory cache at worker startup and refreshed in the background every 5 minutes:

typescript
class ComfyUIRestStreamConnector {
  // Pre-compiled patterns in memory
  private fatalPatterns: CompiledPattern[] = [];
  private nonFatalPatterns: CompiledPattern[] = [];
  private lastPatternRefresh: number = 0;
  private readonly connectorType = 'comfyui';

  async initialize() {
    // Load patterns ONCE at worker startup
    await this.buildPatternCache();
  }

  private async buildPatternCache() {
    // Single database query
    const patterns = await prisma.error_patterns.findMany({
      where: {
        active: true,
        OR: [
          { connector_type: this.connectorType },  // ComfyUI-specific
          { connector_type: null }                 // Global patterns
        ]
      },
      orderBy: { priority: 'desc' }
    });

    // Pre-compile into fast lookup structures
    this.fatalPatterns = patterns
      .filter(p => p.classification === 'fatal')
      .map(p => this.compilePattern(p));

    this.nonFatalPatterns = patterns
      .filter(p => p.classification === 'non-fatal')
      .map(p => this.compilePattern(p));

    this.lastPatternRefresh = Date.now();
  }

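  private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
    // Background refresh (never blocks jobs). The timestamp is bumped before the
    // query starts so that log lines arriving during a refresh, or while the DB
    // is down, do not each trigger another query.
    const now = Date.now();
    if (now - this.lastPatternRefresh > 5 * 60 * 1000) {
      this.lastPatternRefresh = now;
      this.buildPatternCache().catch(() => { /* keep using existing cache */ });
    }

    // Fast in-memory matching - NO DATABASE QUERIES
    const messageLower = message.toLowerCase();

    // Check non-fatal first (common infrastructure noise).
    // Priority therefore orders patterns within each class, not across classes.
    for (const pattern of this.nonFatalPatterns) {
      if (this.matches(message, messageLower, pattern)) return 'non-fatal';
    }

    // Check fatal patterns
    for (const pattern of this.fatalPatterns) {
      if (this.matches(message, messageLower, pattern)) return 'fatal';
    }

    return 'ignore';
  }

  private matches(message: string, messageLower: string, pattern: CompiledPattern): boolean {
    // Pure in-memory string operations (target: ~0.001ms per match).
    // Assumes compilePattern() adds the 'i' flag to case-insensitive regexes
    // and lowercases case-insensitive literal patterns.
    if (pattern.regex) return pattern.regex.test(message);
    const haystack = pattern.caseSensitive ? message : messageLower;
    return pattern.matchType === 'exact'
      ? haystack === pattern.pattern
      : haystack.includes(pattern.pattern);
  }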
}
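
The connector above calls `compilePattern()` without showing it. The following is a minimal sketch of what such a helper could look like, assuming the `match_type` and `case_sensitive` columns from the schema. The names `PatternRow` and `matchesPattern`, and the choice to skip rows whose regex fails to compile, are assumptions made for illustration and not part of the ADR.

```typescript
type MatchType = 'contains' | 'regex' | 'exact';

// Subset of an error_patterns row that matters for compilation (hypothetical shape)
interface PatternRow {
  pattern: string;
  match_type: MatchType;
  case_sensitive: boolean;
}

interface CompiledPattern {
  pattern: string;
  regex: RegExp | null;
  matchType: MatchType;
  caseSensitive: boolean;
}

// Returns null when a regex pattern fails to compile, so that one bad row
// cannot break the whole cache build. Callers are expected to drop nulls.
function compilePattern(row: PatternRow): CompiledPattern | null {
  let regex: RegExp | null = null;
  if (row.match_type === 'regex') {
    try {
      regex = new RegExp(row.pattern, row.case_sensitive ? '' : 'i');
    } catch {
      return null;
    }
  }
  // Lowercase case-insensitive literal patterns once, at compile time
  const keepAsIs = row.case_sensitive || regex !== null;
  return {
    pattern: keepAsIs ? row.pattern : row.pattern.toLowerCase(),
    regex,
    matchType: row.match_type,
    caseSensitive: row.case_sensitive,
  };
}

function matchesPattern(message: string, p: CompiledPattern): boolean {
  if (p.regex) return p.regex.test(message);
  const haystack = p.caseSensitive ? message : message.toLowerCase();
  return p.matchType === 'exact' ? haystack === p.pattern : haystack.includes(p.pattern);
}
```

Compiling regexes once per cache build keeps the per-message path free of `new RegExp(...)` calls, and validating at compile time means an invalid pattern saved through the admin UI degrades to "pattern ignored" instead of a worker crash.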

3. Admin UI (Future)

Management interface for non-engineers to update patterns:

tsx
<ErrorPatternManager>
  <ConnectorFilter options={['all', 'comfyui', 'ollama']} />
  <PatternList patterns={patterns} />
  <AddPatternForm onSubmit={savePattern} />
  <PatternAnalytics topMatches={topMatchingPatterns} />
</ErrorPatternManager>

Alternatives Considered

Alternative 1: Keep Hardcoded Patterns

Pros:

  • Simple, no database required
  • Zero runtime overhead

Cons:

  • Requires deployment for every change
  • No hot-fixing in production
  • Limited to engineers with repo access

Rejected: Too inflexible for production use

Alternative 2: Configuration File (YAML/JSON)

Pros:

  • Human-readable
  • Version controlled

Cons:

  • Still requires deployment
  • No UI for non-engineers
  • No analytics capability

Rejected: Doesn't solve the deployment problem

Alternative 3: Redis Cache Only (No Database)

Pros:

  • Very fast
  • Distributed cache

Cons:

  • Data loss on Redis restart
  • No persistent history
  • No schema validation

Rejected: Patterns are configuration, not cache

Alternative 4: External Config Service (e.g., LaunchDarkly)

Pros:

  • Purpose-built for feature flags
  • Excellent admin UI

Cons:

  • Additional vendor dependency
  • Overkill for simple pattern matching
  • Cost

Rejected: Over-engineered for our needs


Consequences

Positive

✅ Hot-Fix Production: Reclassify errors in minutes, not hours
✅ Non-Engineer Access: Support/ops can manage patterns via UI
✅ Environment-Specific: Different patterns for dev/staging/prod
✅ Analytics: Track which patterns match most frequently
✅ Connector-Specific: Different classifications per service
✅ Performance: Sub-millisecond matching via in-memory cache
✅ Graceful Degradation: Falls back to defaults if DB unavailable
✅ Scalability: Can handle thousands of patterns via Trie if needed

Negative

⚠️ Database Dependency: Workers need DB connection at startup
⚠️ Cache Staleness: 5-minute delay before workers pick up changes
⚠️ Migration Required: Need to seed DB with existing patterns
⚠️ Complexity: Additional schema, queries, cache management

Mitigations

  • DB Unavailable: Workers use hardcoded fallback patterns
  • Cache Staleness: Acceptable tradeoff for operational flexibility
  • Migration: Automated script to seed from current code
  • Complexity: Well-contained in single connector method
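
The "DB Unavailable" mitigation can be sketched as a loader that prefers the database, then the last good cache, then a built-in list. This is only an illustration: `loadPatternsWithFallback`, `FALLBACK_PATTERNS`, `SimplePattern`, and the injected `fetchPatterns` function are hypothetical names, and the fallback entries are the three patterns from the hardcoded catalog shown in the Context section.

```typescript
interface SimplePattern {
  pattern: string;
  classification: 'fatal' | 'non-fatal' | 'ignore';
}

// Built-in safety net, taken from the current hardcoded catalog
const FALLBACK_PATTERNS: SimplePattern[] = [
  { pattern: 'custom validation failed', classification: 'fatal' },
  { pattern: 'dash0.com', classification: 'non-fatal' },
  { pattern: 'out of memory', classification: 'fatal' },
];

type PatternSource = 'db' | 'cache' | 'fallback';

// fetchPatterns stands in for the real database query (an assumption).
// previous is the last successfully loaded cache, or null at first startup.
async function loadPatternsWithFallback(
  fetchPatterns: () => Promise<SimplePattern[]>,
  previous: SimplePattern[] | null,
): Promise<{ patterns: SimplePattern[]; source: PatternSource }> {
  try {
    return { patterns: await fetchPatterns(), source: 'db' };
  } catch {
    // A failed refresh keeps the last good cache; only a failed first load
    // drops all the way down to the hardcoded list.
    if (previous !== null) return { patterns: previous, source: 'cache' };
    return { patterns: FALLBACK_PATTERNS, source: 'fallback' };
  }
}
```

Returning the `source` alongside the patterns makes it easy to log or alert when a worker is running on stale or hardcoded data.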

Implementation Plan

Phase 1: Foundation (Week 1)

1.1 Database Schema

  • [ ] Create error_patterns table migration
  • [ ] Add indexes for performance
  • [ ] Create seed data from current hardcoded patterns

1.2 Types & Interfaces

typescript
// packages/core/src/types/error-patterns.ts
export interface ErrorPattern {
  id: string;
  connector_type: string | null;
  pattern: string;
  match_type: 'contains' | 'regex' | 'exact';
  case_sensitive: boolean;
  classification: 'fatal' | 'non-fatal' | 'ignore';
  priority: number;
  description?: string;
  example_message?: string;
  active: boolean;
}

export interface CompiledPattern {
  pattern: string;
  regex: RegExp | null;
  matchType: 'contains' | 'regex' | 'exact';
  caseSensitive: boolean;
}

1.3 Seed Existing Patterns

sql
-- ComfyUI-specific
-- Regex rows must set match_type explicitly; the column default is 'contains'
INSERT INTO error_patterns (connector_type, pattern, match_type, classification, priority, description)
VALUES
  ('comfyui', 'custom validation failed', 'contains', 'fatal', 100, 'ComfyUI custom node validation'),
  ('comfyui', 'checkpoint.*not found', 'regex', 'fatal', 90, 'ComfyUI model loading'),
  ('comfyui', 'extra_pnginfo', 'contains', 'non-fatal', 80, 'ComfyUI metadata (non-critical)'),
  ('comfyui', 'cuda out of memory', 'contains', 'fatal', 95, 'GPU memory exhaustion'),
  ('comfyui', 'node.*does not exist', 'regex', 'fatal', 90, 'Missing custom node');

-- Global patterns
INSERT INTO error_patterns (connector_type, pattern, classification, priority, description)
VALUES
  (NULL, 'dash0.com', 'non-fatal', 100, 'OpenTelemetry export (all services)'),
  (NULL, 'opentelemetry', 'non-fatal', 90, 'OTEL infrastructure (all services)'),
  (NULL, 'out of memory', 'fatal', 85, 'Memory exhaustion (all services)');

Phase 2: Worker Integration (Week 2)

2.1 Pattern Cache Implementation

  • [ ] Add buildPatternCache() method to BaseConnector
  • [ ] Implement pattern compilation (regex pre-compilation)
  • [ ] Add background refresh logic (5-minute interval)
  • [ ] Add fallback to hardcoded patterns if DB fails

2.2 Classification Integration

  • [ ] Update classifyLogMessage() to use cached patterns
  • [ ] Add performance logging (cache build time, match time)
  • [ ] Add metrics (pattern cache hits/misses)

2.3 Testing

  • [ ] Unit tests for pattern matching
  • [ ] Integration tests with DB
  • [ ] Performance benchmarks (target: <0.01ms per match)
  • [ ] Test DB unavailable scenario (graceful degradation)

Phase 3: Observability (Week 3)

3.1 Pattern Analytics

typescript
// Track pattern match frequency
interface PatternMatchEvent {
  pattern_id: string;
  connector_type: string;
  job_id: string;
  matched_at: Date;
}

3.2 Logging & Metrics

  • [ ] Log pattern cache builds
  • [ ] Metric: error_pattern_cache_build_time_ms
  • [ ] Metric: error_pattern_matches_total (by pattern_id)
  • [ ] Metric: error_pattern_cache_size (number of patterns loaded)

3.3 Monitoring Dashboard

  • [ ] Pattern match frequency chart
  • [ ] Top 10 matched patterns
  • [ ] Patterns never matched (candidates for removal)
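
As a rough illustration of how the metrics in 3.2 and the dashboard queries in 3.3 could be backed inside a worker, here is a sketch of an in-memory counter. `PatternMetrics` and its method names are hypothetical, and exporting the values to a real metrics backend is left out.

```typescript
class PatternMetrics {
  private matches = new Map<string, number>();
  cacheSize = 0;        // would feed error_pattern_cache_size
  lastBuildTimeMs = 0;  // would feed error_pattern_cache_build_time_ms

  // Would feed error_pattern_matches_total, labelled by pattern_id
  recordMatch(patternId: string): void {
    this.matches.set(patternId, (this.matches.get(patternId) ?? 0) + 1);
  }

  recordCacheBuild(patternCount: number, buildTimeMs: number): void {
    this.cacheSize = patternCount;
    this.lastBuildTimeMs = buildTimeMs;
  }

  // Most frequently matched patterns, highest count first
  topMatches(n: number): Array<[string, number]> {
    return Array.from(this.matches.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, n);
  }

  // Patterns that are loaded but have never matched (candidates for removal)
  neverMatched(allPatternIds: string[]): string[] {
    return allPatternIds.filter(id => !this.matches.has(id));
  }
}
```

Counting in memory and flushing periodically keeps the matching path free of I/O, which is consistent with the "no database queries during job processing" goal.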

Phase 4: Admin UI (Week 4+)

4.1 API Endpoints

typescript
// GET /api/error-patterns?connector_type=comfyui
// POST /api/error-patterns
// PUT /api/error-patterns/:id
// DELETE /api/error-patterns/:id
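
Whatever framework serves these endpoints, the POST and PUT handlers would need to validate a pattern before it reaches the workers. The sketch below shows one possible set of checks. `validatePatternInput` and `PatternInput` are hypothetical names, and the rules are inferred from the schema comments rather than specified by this ADR.

```typescript
interface PatternInput {
  connector_type: string | null;
  pattern: string;
  match_type: string;
  classification: string;
  priority?: number;
}

const MATCH_TYPES = ['contains', 'regex', 'exact'];
const CLASSIFICATIONS = ['fatal', 'non-fatal', 'ignore'];

function validatePatternInput(input: PatternInput): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  if (input.pattern.trim() === '') {
    errors.push('pattern must not be empty');
  }
  if (MATCH_TYPES.indexOf(input.match_type) === -1) {
    errors.push('match_type must be one of: ' + MATCH_TYPES.join(', '));
  }
  if (CLASSIFICATIONS.indexOf(input.classification) === -1) {
    errors.push('classification must be one of: ' + CLASSIFICATIONS.join(', '));
  }
  if (input.match_type === 'regex') {
    // Reject patterns that would fail to compile on the workers
    try {
      new RegExp(input.pattern);
    } catch {
      errors.push('pattern is not a valid regular expression');
    }
  }
  if (input.priority !== undefined && !Number.isInteger(input.priority)) {
    errors.push('priority must be an integer');
  }
  return { ok: errors.length === 0, errors };
}
```

Rejecting an invalid regex at write time matters here because every worker would otherwise hit the same compile error on its next cache refresh.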

4.2 UI Components

  • [ ] Pattern list with search/filter
  • [ ] Add/edit pattern form
  • [ ] Pattern preview (test against sample messages)
  • [ ] Analytics dashboard

4.3 Cache Invalidation

typescript
// Option 1: Time-based (current approach)
// Workers refresh every 5 minutes

// Option 2: Event-based (future)
// Publish Redis event when pattern updated
// Workers subscribe and refresh immediately
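
Option 2 can be illustrated without committing to a transport. In the sketch below, `PatternUpdateBus` is an in-process stand-in for the Redis channel (it is not a Redis client), and `CacheRefresher` is a hypothetical wrapper around a worker's cache. Both names are assumptions.

```typescript
type UpdateListener = (connectorType: string | null) => void;

// In-process stand-in for a pub/sub channel; a real deployment would
// replace this with the Redis subscription mentioned above.
class PatternUpdateBus {
  private listeners: UpdateListener[] = [];

  subscribe(listener: UpdateListener): () => void {
    this.listeners.push(listener);
    // Return an unsubscribe function
    return () => {
      this.listeners = this.listeners.filter(l => l !== listener);
    };
  }

  publish(connectorType: string | null): void {
    for (const listener of this.listeners) listener(connectorType);
  }
}

class CacheRefresher {
  refreshCount = 0;
  private connectorType: string;

  constructor(connectorType: string, bus: PatternUpdateBus) {
    this.connectorType = connectorType;
    bus.subscribe(changed => {
      // A null connector type means a global pattern changed, which
      // affects every connector.
      if (changed === null || changed === this.connectorType) this.refresh();
    });
  }

  private refresh(): void {
    // A real worker would call buildPatternCache() here
    this.refreshCount++;
  }
}
```

Keeping the 5-minute timer as a backstop would still be sensible, since a worker that misses an event (for example during a reconnect) would otherwise stay stale indefinitely.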

Phase 5: Migration & Rollout

5.1 Migration Script

typescript
// scripts/migrate-error-patterns.ts
// Reads current hardcoded patterns from code
// Seeds database with initial patterns

5.2 Feature Flag

typescript
const USE_DB_PATTERNS = process.env.USE_DB_ERROR_PATTERNS === 'true';

if (USE_DB_PATTERNS) {
  await this.buildPatternCache();
} else {
  this.useHardcodedPatterns();
}

5.3 Rollout Plan

  1. Deploy to dev with feature flag ON
  2. Verify performance metrics
  3. Deploy to staging with flag ON
  4. Monitor for 48 hours
  5. Deploy to production with flag OFF (hardcoded)
  6. Gradually enable flag for 10% → 50% → 100% of workers
  7. Remove hardcoded patterns after 30 days
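
Step 6 needs a way to enable the flag for a stable subset of workers. One possible approach, sketched here, is to hash the worker ID into a bucket from 0 to 99 and compare it with the rollout percentage. `isDbPatternsEnabled` and the use of a worker ID as the bucketing key are assumptions; the ADR itself only defines the boolean `USE_DB_ERROR_PATTERNS` flag.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units. It is deterministic, so a
// worker stays in the same bucket across restarts.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// rolloutPercent is expected to be between 0 and 100
function isDbPatternsEnabled(workerId: string, rolloutPercent: number): boolean {
  if (rolloutPercent <= 0) return false;
  if (rolloutPercent >= 100) return true;
  return fnv1a(workerId) % 100 < rolloutPercent;
}
```

Because the bucket depends only on the worker ID, raising the percentage from 10 to 50 to 100 only ever adds workers to the enabled set, which keeps the comparison between cohorts clean during the rollout.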

Performance Benchmarks

Target Metrics

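| Operation                  | Target       | Acceptable   |
|----------------------------|--------------|--------------|
| Pattern cache build        | <50ms        | <200ms       |
| Pattern match (contains)   | <0.001ms     | <0.01ms      |
| Pattern match (regex)      | <0.01ms      | <0.1ms       |
| Cache refresh (background) | Non-blocking | Non-blocking |
| DB query (startup)         | <100ms       | <500ms       |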

Load Testing

Test with realistic workload:

  • 100 patterns loaded
  • 1000 log messages/second
  • Mixed fatal/non-fatal/ignore

Expected:

  • Pattern matching: at most ~100ms of CPU per 1000 messages in the worst case (1000 messages × 100 patterns × the <0.001ms "contains" target)
  • Memory overhead: <1MB for pattern cache
  • Zero database queries during job processing
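
A load test along these lines could start from a small harness like the one below. It is a sketch only: the patterns and messages are synthetic, `runLoadTest` and `BenchPattern` are hypothetical names, and `Date.now()` has millisecond resolution, so the result is a coarse total and not a precise per-match figure.

```typescript
interface BenchPattern {
  needle: string;
  classification: 'fatal' | 'non-fatal';
}

function classify(message: string, patterns: BenchPattern[]): 'fatal' | 'non-fatal' | 'ignore' {
  const lower = message.toLowerCase();
  for (const p of patterns) {
    if (lower.includes(p.needle)) return p.classification;
  }
  return 'ignore';
}

function runLoadTest(patternCount: number, messageCount: number) {
  // Only the last pattern can match, which forces a worst-case full scan
  const patterns: BenchPattern[] = [];
  for (let i = 0; i < patternCount - 1; i++) {
    patterns.push({ needle: 'synthetic-error-' + i + '-never-logged', classification: 'non-fatal' });
  }
  patterns.push({ needle: 'out of memory', classification: 'fatal' });

  // Every tenth message is a fatal error, the rest are regular logs
  const messages: string[] = [];
  for (let i = 0; i < messageCount; i++) {
    messages.push(i % 10 === 0
      ? 'RuntimeError: CUDA out of memory (step ' + i + ')'
      : 'Prompt executed, step ' + i);
  }

  let fatal = 0;
  const start = Date.now();
  for (const message of messages) {
    if (classify(message, patterns) === 'fatal') fatal++;
  }
  const totalMs = Date.now() - start;
  return { fatal, totalMs, perMessageMs: totalMs / messageCount };
}
```

Calling `runLoadTest(100, 1000)` exercises the workload described above. Running it several times and discarding the first run helps avoid measuring JIT warm-up.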

Success Criteria

Technical

✅ Workers load patterns from database at startup
✅ Pattern matching is <0.01ms per message (in-memory)
✅ Background refresh doesn't impact job processing
✅ Graceful fallback if database unavailable
✅ Zero regressions in existing error classification

Operational

✅ Support team can add patterns without engineering help
✅ Pattern changes take effect within 5 minutes
✅ Production error classification can be hot-fixed
✅ Analytics show which patterns match most frequently
✅ Different patterns for dev/staging/prod environments


Future Enhancements

Machine Learning Classification

Once pattern match analytics are available, train an ML model to suggest patterns for new error messages:

typescript
// Analyze historical pattern matches
const trainingData = await getPatternMatchHistory();

// Train classifier to suggest patterns
const suggestedPattern = await mlClassifier.suggest(newErrorMessage);

// Admin reviews and approves

Auto-Discovery

Detect new error patterns automatically:

typescript
// Detect frequent unclassified errors
if (errorFrequency > threshold && classification === 'ignore') {
  notifyAdmin(`New error pattern detected: ${errorMessage}`);
}
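
The `errorFrequency` check above presupposes some way of counting similar messages. A minimal sketch of that bookkeeping follows, assuming that numbers and hex addresses are the main source of variation between otherwise identical messages. `normalizeMessage` and `UnclassifiedTracker` are hypothetical names.

```typescript
// Collapse variable parts so that similar messages share one key.
// Hex values are replaced first; otherwise the digit rule would split them.
function normalizeMessage(message: string): string {
  return message
    .toLowerCase()
    .replace(/0x[0-9a-f]+/g, '<hex>')
    .replace(/\d+/g, '<n>')
    .trim();
}

class UnclassifiedTracker {
  private counts = new Map<string, number>();
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Call for messages classified as 'ignore'. Returns the normalized key
  // exactly once, when its count first reaches the threshold, so the admin
  // is notified a single time per candidate pattern.
  record(message: string): string | null {
    const key = normalizeMessage(message);
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count);
    return count === this.threshold ? key : null;
  }
}
```

The map grows with every distinct normalized message, so a production version would need an eviction policy or a time window. That is outside the scope of this sketch.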

A/B Testing

Test different classifications:

typescript
// Route 10% of traffic to new classification
if (Math.random() < 0.1) {
  return experimentalClassification;
}


Appendix: Example Patterns

ComfyUI Patterns

sql
INSERT INTO error_patterns (connector_type, pattern, match_type, classification, priority)
VALUES
  -- Fatal errors
  ('comfyui', 'custom validation failed', 'contains', 'fatal', 100),
  ('comfyui', 'is required for', 'contains', 'fatal', 95),
  ('comfyui', 'cuda out of memory', 'contains', 'fatal', 95),
  ('comfyui', 'checkpoint.*not found', 'regex', 'fatal', 90),
  ('comfyui', 'node.*does not exist', 'regex', 'fatal', 90),
  ('comfyui', 'keyerror:', 'contains', 'fatal', 85),
  ('comfyui', 'valueerror:', 'contains', 'fatal', 85),

  -- Non-fatal errors
  ('comfyui', 'extra_pnginfo', 'contains', 'non-fatal', 80);

Global Patterns

sql
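INSERT INTO error_patterns (connector_type, pattern, match_type, classification, priority)
VALUES
  -- Non-fatal infrastructure
  (NULL, 'dash0.com', 'contains', 'non-fatal', 100),
  (NULL, 'opentelemetry', 'contains', 'non-fatal', 90),
  (NULL, 'statuscode.unavailable', 'contains', 'non-fatal', 85),

  -- Fatal resource errors
  (NULL, 'out of memory', 'contains', 'fatal', 95),
  (NULL, 'memoryerror', 'contains', 'fatal', 90);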
