Database-Driven Error Pattern Classification - ADR
Status: Proposed
Date: 2025-11-08
Author: System Architecture
Related: ERROR_HANDLING_MODERNIZATION.md, connector-error-handling-standard.md
Context
The emp-job-queue system processes logs from multiple external services (ComfyUI, Ollama, Stable Diffusion) and must classify error messages as fatal (fail the job), non-fatal (log but continue), or ignore (regular logs).
Current Approach: Hardcoded Catalogs
Error classification logic is embedded in each connector's TypeScript code:
    private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
      const messageLower = message.toLowerCase();

      // Hardcoded patterns
      if (messageLower.includes('custom validation failed')) return 'fatal';
      if (messageLower.includes('dash0.com')) return 'non-fatal';
      if (messageLower.includes('out of memory')) return 'fatal';
      // ... 20+ more patterns

      return 'ignore';
    }

Problems with Hardcoded Patterns
- Deployment Required: Every new error pattern requires code change + rebuild + redeploy
- No Hot-Fixing: Can't quickly reclassify errors in production without deployment
- Limited Collaboration: Only engineers with repo access can update patterns
- No Analytics: Can't track which patterns match most frequently
- Environment Inconsistency: Dev/staging/prod might need different classifications
- Pattern Sprawl: As we discover new errors, code becomes increasingly complex
Real-World Scenario
🚨 Production Alert: Jobs failing due to new ComfyUI error message
"RuntimeError: CUDA driver version mismatch"
Current process:
1. Engineer identifies pattern needs to be cataloged (5 min)
2. Update TypeScript code to add pattern (10 min)
3. Run tests, commit, push (15 min)
4. CI/CD pipeline builds + deploys (20 min)
5. Total time to fix: 50+ minutes
Desired process:
1. Admin opens error management UI
2. Adds pattern: "cuda driver version mismatch" → fatal
3. Workers pick up change within 5 minutes
4. Total time to fix: 5 minutes

Decision
Implement a database-driven error pattern classification system with the following characteristics:
1. Database Schema
Store error patterns in a central database accessible by all workers:
    CREATE TABLE error_patterns (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

      -- Connector specificity
      connector_type VARCHAR(50),                 -- 'comfyui' | 'ollama' | NULL (global)

      -- Pattern matching
      pattern TEXT NOT NULL,
      match_type VARCHAR(20) DEFAULT 'contains',  -- 'contains' | 'regex' | 'exact'
      case_sensitive BOOLEAN DEFAULT false,

      -- Classification
      classification VARCHAR(20) NOT NULL,        -- 'fatal' | 'non-fatal' | 'ignore'
      priority INT DEFAULT 100,                   -- Higher = checked first

      -- Metadata
      description TEXT,
      example_message TEXT,

      -- Management
      active BOOLEAN DEFAULT true,
      created_at TIMESTAMP DEFAULT NOW(),
      updated_at TIMESTAMP DEFAULT NOW(),
      created_by VARCHAR(255)
    );

    CREATE INDEX idx_connector_active_priority
      ON error_patterns (connector_type, active, priority DESC);

2. Performance-Optimized Worker Implementation
No database queries on the classification path: patterns are loaded into an in-memory cache at worker startup and refreshed in the background every 5 minutes:
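    // Assumes a shared Prisma client (`prisma`) and the ErrorPattern / CompiledPattern
    // types from packages/core/src/types/error-patterns.ts are in scope.
    const PATTERN_REFRESH_INTERVAL_MS = 5 * 60 * 1000;

    class ComfyUIRestStreamConnector {
      // Pre-compiled patterns in memory, each list ordered by priority (highest first)
      private fatalPatterns: CompiledPattern[] = [];
      private nonFatalPatterns: CompiledPattern[] = [];
      private lastPatternRefresh = 0;
      private readonly connectorType = 'comfyui';

      async initialize() {
        // Load patterns once at worker startup
        await this.buildPatternCache();
      }

      private async buildPatternCache() {
        // Single database query
        const patterns = await prisma.error_patterns.findMany({
          where: {
            active: true,
            OR: [
              { connector_type: this.connectorType }, // ComfyUI-specific
              { connector_type: null }                // Global patterns
            ]
          },
          orderBy: { priority: 'desc' }
        });

        // Pre-compile into local lists first, so an invalid regex cannot leave the
        // cache half-updated. 'ignore' rows are not loaded: unmatched messages are
        // already classified as 'ignore'.
        const fatal = patterns
          .filter(p => p.classification === 'fatal')
          .map(p => this.compilePattern(p));
        const nonFatal = patterns
          .filter(p => p.classification === 'non-fatal')
          .map(p => this.compilePattern(p));

        this.fatalPatterns = fatal;
        this.nonFatalPatterns = nonFatal;
        this.lastPatternRefresh = Date.now();
      }

      private compilePattern(p: ErrorPattern): CompiledPattern {
        return {
          // Literal patterns are lowercased once here unless they are case-sensitive
          pattern: p.case_sensitive ? p.pattern : p.pattern.toLowerCase(),
          // Regexes are compiled once; the 'i' flag handles case-insensitive matching
          regex: p.match_type === 'regex'
            ? new RegExp(p.pattern, p.case_sensitive ? '' : 'i')
            : null,
          matchType: p.match_type,
          caseSensitive: p.case_sensitive
        };
      }

      private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
        // Background refresh (never blocks jobs). Stamp the attempt first so a slow or
        // failing refresh is not re-triggered by every subsequent log message.
        const now = Date.now();
        if (now - this.lastPatternRefresh > PATTERN_REFRESH_INTERVAL_MS) {
          this.lastPatternRefresh = now;
          this.buildPatternCache().catch(() => { /* keep using existing cache */ });
        }

        // Fast in-memory matching - NO DATABASE QUERIES
        const messageLower = message.toLowerCase();

        // Check non-fatal first (common infrastructure noise)
        for (const pattern of this.nonFatalPatterns) {
          if (this.matches(message, messageLower, pattern)) return 'non-fatal';
        }

        // Check fatal patterns
        for (const pattern of this.fatalPatterns) {
          if (this.matches(message, messageLower, pattern)) return 'fatal';
        }

        return 'ignore';
      }

      private matches(message: string, messageLower: string, pattern: CompiledPattern): boolean {
        // Pure in-memory string operations
        if (pattern.regex) return pattern.regex.test(message);
        const haystack = pattern.caseSensitive ? message : messageLower;
        return pattern.matchType === 'exact'
          ? haystack === pattern.pattern
          : haystack.includes(pattern.pattern);
      }
    }

3. Admin UI (Future)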
Management interface for non-engineers to update patterns:

    <ErrorPatternManager>
      <ConnectorFilter options={['all', 'comfyui', 'ollama']} />
      <PatternList patterns={patterns} />
      <AddPatternForm onSubmit={savePattern} />
      <PatternAnalytics topMatches={topMatchingPatterns} />
    </ErrorPatternManager>

Alternatives Considered
Alternative 1: Keep Hardcoded Patterns
Pros:
- Simple, no database required
- Zero runtime overhead
Cons:
- Requires deployment for every change
- No hot-fixing in production
- Limited to engineers with repo access
Rejected: Too inflexible for production use
Alternative 2: Configuration File (YAML/JSON)
Pros:
- Human-readable
- Version controlled
Cons:
- Still requires deployment
- No UI for non-engineers
- No analytics capability
Rejected: Doesn't solve the deployment problem
Alternative 3: Redis Cache Only (No Database)
Pros:
- Very fast
- Distributed cache
Cons:
- Data loss on Redis restart
- No persistent history
- No schema validation
Rejected: Patterns are configuration, not cache
Alternative 4: External Config Service (e.g., LaunchDarkly)
Pros:
- Purpose-built for feature flags
- Excellent admin UI
Cons:
- Additional vendor dependency
- Overkill for simple pattern matching
- Cost
Rejected: Over-engineered for our needs
Consequences
Positive
- ✅ Hot-Fix Production: Reclassify errors in minutes, not hours
- ✅ Non-Engineer Access: Support/ops can manage patterns via UI
- ✅ Environment-Specific: Different patterns for dev/staging/prod
- ✅ Analytics: Track which patterns match most frequently
- ✅ Connector-Specific: Different classifications per service
- ✅ Performance: Sub-millisecond matching via in-memory cache
- ✅ Graceful Degradation: Falls back to defaults if DB unavailable
- ✅ Scalability: Can handle thousands of patterns via Trie if needed
Negative
- ⚠️ Database Dependency: Workers need DB connection at startup
- ⚠️ Cache Staleness: 5-minute delay before workers pick up changes
- ⚠️ Migration Required: Need to seed DB with existing patterns
- ⚠️ Complexity: Additional schema, queries, cache management
Mitigations
- DB Unavailable: Workers use hardcoded fallback patterns
- Cache Staleness: Acceptable tradeoff for operational flexibility
- Migration: Automated script to seed from current code
- Complexity: Well-contained in single connector method
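The DB-unavailable mitigation could look roughly like the sketch below. It is an illustration under assumptions, not the connector's actual code: `loadPatternsWithFallback`, `PatternRow`, `FALLBACK_PATTERNS`, and the injected `fetchPatterns` function are hypothetical names, and treating an empty result as a failure is an assumed policy. The fallback entries reuse patterns and priorities shown elsewhere in this document.

```typescript
type Classification = 'fatal' | 'non-fatal' | 'ignore';

interface PatternRow {
  pattern: string;
  classification: Classification;
  priority: number;
}

// Hardcoded safety net (hypothetical constant), mirroring today's in-code patterns.
const FALLBACK_PATTERNS: PatternRow[] = [
  { pattern: 'custom validation failed', classification: 'fatal', priority: 100 },
  { pattern: 'dash0.com', classification: 'non-fatal', priority: 100 },
  { pattern: 'out of memory', classification: 'fatal', priority: 85 },
];

interface LoadResult {
  patterns: PatternRow[];
  source: 'database' | 'fallback';
}

// fetchPatterns stands in for the real database query (hypothetical signature).
async function loadPatternsWithFallback(
  fetchPatterns: () => Promise<PatternRow[]>,
): Promise<LoadResult> {
  try {
    const rows = await fetchPatterns();
    // Assumed policy: an empty table is suspicious, so keep the fallback
    // rather than silently classifying nothing.
    if (rows.length === 0) {
      return { patterns: FALLBACK_PATTERNS, source: 'fallback' };
    }
    return { patterns: rows, source: 'database' };
  } catch {
    // Database unreachable or query failed: degrade to the hardcoded set.
    return { patterns: FALLBACK_PATTERNS, source: 'fallback' };
  }
}
```

Returning the `source` alongside the patterns lets the worker log or emit a metric when it is running on fallback data, which makes the degraded state visible instead of silent.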
Implementation Plan
Phase 1: Foundation (Week 1)
1.1 Database Schema
- [ ] Create error_patterns table migration
- [ ] Add indexes for performance
- [ ] Create seed data from current hardcoded patterns
1.2 Types & Interfaces
    // packages/core/src/types/error-patterns.ts
    export interface ErrorPattern {
      id: string;
      connector_type: string | null;
      pattern: string;
      match_type: 'contains' | 'regex' | 'exact';
      case_sensitive: boolean;
      classification: 'fatal' | 'non-fatal' | 'ignore';
      priority: number;
      description?: string;
      example_message?: string;
      active: boolean;
    }

    export interface CompiledPattern {
      pattern: string;
      regex: RegExp | null;
      matchType: 'contains' | 'regex' | 'exact';
      caseSensitive: boolean;
    }

1.3 Seed Existing Patterns
    -- ComfyUI-specific
    -- match_type is set explicitly: 'node.*does not exist' is a regular expression
    -- and would never match under the default 'contains' match type.
    INSERT INTO error_patterns (connector_type, pattern, match_type, classification, priority, description)
    VALUES
      ('comfyui', 'custom validation failed', 'contains', 'fatal', 100, 'ComfyUI custom node validation'),
      ('comfyui', 'checkpoint', 'contains', 'fatal', 90, 'ComfyUI model loading'),
      ('comfyui', 'extra_pnginfo', 'contains', 'non-fatal', 80, 'ComfyUI metadata (non-critical)'),
      ('comfyui', 'cuda out of memory', 'contains', 'fatal', 95, 'GPU memory exhaustion'),
      ('comfyui', 'node.*does not exist', 'regex', 'fatal', 90, 'Missing custom node');

    -- Global patterns
    INSERT INTO error_patterns (connector_type, pattern, classification, priority, description)
    VALUES
      (NULL, 'dash0.com', 'non-fatal', 100, 'OpenTelemetry export (all services)'),
      (NULL, 'opentelemetry', 'non-fatal', 90, 'OTEL infrastructure (all services)'),
      (NULL, 'out of memory', 'fatal', 85, 'Memory exhaustion (all services)');

Phase 2: Worker Integration (Week 2)
2.1 Pattern Cache Implementation
- [ ] Add buildPatternCache() method to BaseConnector
- [ ] Implement pattern compilation (regex pre-compilation)
- [ ] Add background refresh logic (5-minute interval)
- [ ] Add fallback to hardcoded patterns if DB fails
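Pattern compilation for the three match types might be sketched as follows. This is a standalone illustration with hypothetical names (`PatternInput`, `Compiled`, `compilePattern`, `matchesPattern`), not the final BaseConnector code; it assumes case-insensitive literals are normalized by lowercasing and case-insensitive regexes use the `i` flag.

```typescript
type MatchType = 'contains' | 'regex' | 'exact';

interface PatternInput {
  pattern: string;
  matchType: MatchType;
  caseSensitive: boolean;
}

interface Compiled {
  matchType: MatchType;
  caseSensitive: boolean;
  needle: string;        // normalized literal used by 'contains' and 'exact'
  regex: RegExp | null;  // compiled once, only for 'regex'
}

// Do the expensive work (lowercasing, RegExp construction) once at cache build time.
function compilePattern(p: PatternInput): Compiled {
  return {
    matchType: p.matchType,
    caseSensitive: p.caseSensitive,
    needle: p.caseSensitive ? p.pattern : p.pattern.toLowerCase(),
    regex: p.matchType === 'regex'
      ? new RegExp(p.pattern, p.caseSensitive ? '' : 'i')
      : null,
  };
}

// Per-message check: a regex test, an equality check, or a substring search.
function matchesPattern(message: string, c: Compiled): boolean {
  if (c.regex) return c.regex.test(message);
  const haystack = c.caseSensitive ? message : message.toLowerCase();
  return c.matchType === 'exact' ? haystack === c.needle : haystack.includes(c.needle);
}
```

In a hot path the message would be lowercased once per log line rather than once per pattern; the sketch lowercases inside `matchesPattern` only to stay self-contained. Note that `new RegExp` throws on an invalid pattern, so a cache build should decide whether to skip or reject such rows.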
2.2 Classification Integration
- [ ] Update classifyLogMessage() to use cached patterns
- [ ] Add performance logging (cache build time, match time)
- [ ] Add metrics (pattern cache hits/misses)
2.3 Testing
- [ ] Unit tests for pattern matching
- [ ] Integration tests with DB
- [ ] Performance benchmarks (target: <0.01ms per match)
- [ ] Test DB unavailable scenario (graceful degradation)
Phase 3: Observability (Week 3)
3.1 Pattern Analytics
    // Track pattern match frequency
    interface PatternMatchEvent {
      pattern_id: string;
      connector_type: string;
      job_id: string;
      matched_at: Date;
    }

3.2 Logging & Metrics
- [ ] Log pattern cache builds
- [ ] Metric: error_pattern_cache_build_time_ms
- [ ] Metric: error_pattern_matches_total (by pattern_id)
- [ ] Metric: error_pattern_cache_size (number of patterns loaded)
3.3 Monitoring Dashboard
- [ ] Pattern match frequency chart
- [ ] Top 10 matched patterns
- [ ] Patterns never matched (candidates for removal)
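The dashboard items above can be derived from per-pattern match counts. The sketch below assumes the counts are already aggregated into a map keyed by pattern_id; `summarizePatternUsage` and `PatternUsage` are hypothetical names, and how the counts are collected (metrics backend, database table) is left open.

```typescript
interface PatternUsage {
  patternId: string;
  matches: number;
}

// matchCounts: pattern_id -> number of matches in the reporting window.
// loadedPatternIds: every active pattern currently known to the workers.
function summarizePatternUsage(
  matchCounts: Map<string, number>,
  loadedPatternIds: string[],
  topN = 10,
): { top: PatternUsage[]; neverMatched: string[] } {
  // Top N patterns by match count, highest first.
  const top = Array.from(matchCounts.entries())
    .filter(([, matches]) => matches > 0)
    .map(([patternId, matches]) => ({ patternId, matches }))
    .sort((a, b) => b.matches - a.matches)
    .slice(0, topN);

  // Patterns with no recorded matches are candidates for removal.
  const neverMatched = loadedPatternIds.filter(
    (id) => (matchCounts.get(id) ?? 0) === 0,
  );

  return { top, neverMatched };
}
```

A pattern that never matches within one window is not necessarily dead; rare fatal errors may simply not have occurred, so removal would still warrant a manual review.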
Phase 4: Admin UI (Week 4+)
4.1 API Endpoints
    // GET /api/error-patterns?connector_type=comfyui
    // POST /api/error-patterns
    // PUT /api/error-patterns/:id
    // DELETE /api/error-patterns/:id

4.2 UI Components
- [ ] Pattern list with search/filter
- [ ] Add/edit pattern form
- [ ] Pattern preview (test against sample messages)
- [ ] Analytics dashboard
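The pattern preview could be backed by a small pure function that validates a draft and reports which sample messages it would match. The sketch below uses hypothetical names (`PatternDraft`, `previewPattern`, `buildMatcher`) and assumes the same matching rules as the schema's match_type and case_sensitive columns.

```typescript
type MatchType = 'contains' | 'regex' | 'exact';

interface PatternDraft {
  pattern: string;
  matchType: MatchType;
  caseSensitive: boolean;
}

interface PreviewResult {
  valid: boolean;
  error?: string;
  matched: string[];    // sample messages the draft would match
  unmatched: string[];  // sample messages it would not match
}

type Matcher = (message: string) => boolean;

// Returns a matcher, or an error string when the draft cannot be compiled.
function buildMatcher(draft: PatternDraft): Matcher | string {
  if (draft.pattern.trim() === '') return 'Pattern must not be empty';
  if (draft.matchType === 'regex') {
    try {
      const regex = new RegExp(draft.pattern, draft.caseSensitive ? '' : 'i');
      return (message: string) => regex.test(message);
    } catch (err) {
      return `Invalid regex: ${(err as Error).message}`;
    }
  }
  const normalize = (s: string) => (draft.caseSensitive ? s : s.toLowerCase());
  const needle = normalize(draft.pattern);
  return draft.matchType === 'exact'
    ? (message: string) => normalize(message) === needle
    : (message: string) => normalize(message).includes(needle);
}

function previewPattern(draft: PatternDraft, samples: string[]): PreviewResult {
  const matcher = buildMatcher(draft);
  if (typeof matcher === 'string') {
    return { valid: false, error: matcher, matched: [], unmatched: samples.slice() };
  }
  return {
    valid: true,
    matched: samples.filter((m) => matcher(m)),
    unmatched: samples.filter((m) => !matcher(m)),
  };
}
```

Running the same validation in the POST and PUT handlers would keep an invalid regex from ever reaching the workers' cache build.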
4.3 Cache Invalidation
    // Option 1: Time-based (current approach)
    // Workers refresh every 5 minutes

    // Option 2: Event-based (future)
    // Publish Redis event when pattern updated
    // Workers subscribe and refresh immediately

Phase 5: Migration & Rollout
5.1 Migration Script
    // scripts/migrate-error-patterns.ts
    // Reads current hardcoded patterns from code
    // Seeds database with initial patterns

5.2 Feature Flag
    const USE_DB_PATTERNS = process.env.USE_DB_ERROR_PATTERNS === 'true';

    if (USE_DB_PATTERNS) {
      await this.buildPatternCache();
    } else {
      this.useHardcodedPatterns();
    }

5.3 Rollout Plan
- Deploy to dev with feature flag ON
- Verify performance metrics
- Deploy to staging with flag ON
- Monitor for 48 hours
- Deploy to production with flag OFF (hardcoded)
- Gradually enable flag for 10% → 50% → 100% of workers
- Remove hardcoded patterns after 30 days
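One way to realize the 10% → 50% → 100% step is deterministic bucketing by worker ID, so a given worker stays enabled as the percentage grows. This is a sketch under assumptions: the plan above does not say how the percentage is applied, and `isDbPatternsEnabled`, `fnv1a32`, and the notion of a stable string worker ID are hypothetical.

```typescript
// FNV-1a 32-bit hash: small, dependency-free, and stable across processes.
// Operates on UTF-16 code units, which equals the byte-wise hash for ASCII IDs.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned 32-bit
  }
  return hash >>> 0;
}

// Returns true when this worker falls inside the enabled percentage (0-100).
// A worker's bucket never changes, so raising the percentage only adds workers.
function isDbPatternsEnabled(workerId: string, rolloutPercent: number): boolean {
  if (rolloutPercent <= 0) return false;
  if (rolloutPercent >= 100) return true;
  return fnv1a32(workerId) % 100 < rolloutPercent;
}
```

Compared with `Math.random()`, hashing keeps each worker's behavior stable across restarts, which makes regressions easier to attribute during the rollout.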
Performance Benchmarks
Target Metrics
| Operation | Target | Acceptable |
|---|---|---|
| Pattern cache build | <50ms | <200ms |
| Pattern match (contains) | <0.001ms | <0.01ms |
| Pattern match (regex) | <0.01ms | <0.1ms |
| Cache refresh (background) | Non-blocking | Non-blocking |
| DB query (startup) | <100ms | <500ms |
Load Testing
Test with realistic workload:
- 100 patterns loaded
- 1000 log messages/second
- Mixed fatal/non-fatal/ignore
Expected:
- Pattern matching: <10ms total for all 1000 messages (consistent with the <0.01ms per-message target)
- Memory overhead: <1MB for pattern cache
- Zero database queries during job processing
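A load test along these lines could start from the sketch below. The patterns and messages are synthetic and the names (`makePatterns`, `classify`, `runLoadTest`) are hypothetical; it only exercises 'contains' matching, and `Date.now()` has millisecond resolution, so the timing is a coarse indication that will vary by machine.

```typescript
type Classification = 'fatal' | 'non-fatal' | 'ignore';

interface BenchPattern {
  needle: string;
  classification: Classification;
}

// Synthetic lowercase 'contains' patterns: first half non-fatal, second half fatal.
function makePatterns(count: number): BenchPattern[] {
  const patterns: BenchPattern[] = [];
  for (let i = 0; i < count; i++) {
    patterns.push({
      needle: `synthetic-error-${i}:`,
      classification: i < count / 2 ? 'non-fatal' : 'fatal',
    });
  }
  return patterns;
}

// First matching pattern wins; unmatched messages are ignored.
function classify(message: string, patterns: BenchPattern[]): Classification {
  const lower = message.toLowerCase();
  for (const p of patterns) {
    if (lower.includes(p.needle)) return p.classification;
  }
  return 'ignore';
}

function runLoadTest(patternCount = 100, messageCount = 1000) {
  const patterns = makePatterns(patternCount);
  const messages: string[] = [];
  for (let i = 0; i < messageCount; i++) {
    // Every 10th message hits a pattern; the rest scan all patterns and are ignored.
    messages.push(
      i % 10 === 0
        ? `Worker log: Synthetic-Error-${i % patternCount}: occurred`
        : `INFO step ${i} completed`,
    );
  }

  const counts: Record<Classification, number> = { fatal: 0, 'non-fatal': 0, ignore: 0 };
  const start = Date.now();
  for (const m of messages) counts[classify(m, patterns)]++;
  const totalMs = Date.now() - start;

  return { counts, totalMs, avgMsPerMessage: totalMs / messageCount };
}
```

Most messages in this workload match nothing and therefore scan every pattern, which is the worst case for a linear list and a reasonable baseline before considering a Trie.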
Success Criteria
Technical
- ✅ Workers load patterns from database at startup
- ✅ Pattern matching is <0.01ms per message (in-memory)
- ✅ Background refresh doesn't impact job processing
- ✅ Graceful fallback if database unavailable
- ✅ Zero regressions in existing error classification
Operational
- ✅ Support team can add patterns without engineering help
- ✅ Pattern changes take effect within 5 minutes
- ✅ Production error classification can be hot-fixed
- ✅ Analytics show which patterns match most frequently
- ✅ Different patterns for dev/staging/prod environments
Future Enhancements
Machine Learning Classification
Once we have pattern match analytics, train ML model:
    // Analyze historical pattern matches
    const trainingData = await getPatternMatchHistory();

    // Train classifier to suggest patterns
    const suggestedPattern = await mlClassifier.suggest(newErrorMessage);

    // Admin reviews and approves

Auto-Discovery
Detect new error patterns automatically:
    // Detect frequent unclassified errors
    if (errorFrequency > threshold && classification === 'ignore') {
      notifyAdmin(`New error pattern detected: ${errorMessage}`);
    }

A/B Testing
Test different classifications:
    // Route 10% of traffic to new classification
    if (Math.random() < 0.1) {
      return experimentalClassification;
    }

References
- ERROR_HANDLING_MODERNIZATION.md - Error handling architecture
- connector-error-handling-standard.md - Connector standards
- Pattern Matching Performance - Benchmarks
Appendix: Example Patterns
ComfyUI Patterns
    -- Tuples are (connector_type, pattern, classification, priority).
    -- Patterns containing '.*' are regular expressions and need match_type = 'regex'.

    -- Fatal errors
    ('comfyui', 'custom validation failed', 'fatal', 100),
    ('comfyui', 'is required for', 'fatal', 95),
    ('comfyui', 'cuda out of memory', 'fatal', 95),
    ('comfyui', 'checkpoint.*not found', 'fatal', 90),  -- regex
    ('comfyui', 'node.*does not exist', 'fatal', 90),   -- regex
    ('comfyui', 'keyerror:', 'fatal', 85),
    ('comfyui', 'valueerror:', 'fatal', 85),

    -- Non-fatal errors
    ('comfyui', 'extra_pnginfo', 'non-fatal', 80);

Global Patterns

    -- Non-fatal infrastructure
    (NULL, 'dash0.com', 'non-fatal', 100),
    (NULL, 'opentelemetry', 'non-fatal', 90),
    (NULL, 'statuscode.unavailable', 'non-fatal', 85),

    -- Fatal resource errors
    (NULL, 'out of memory', 'fatal', 95),
    (NULL, 'memoryerror', 'fatal', 90);