Redis-Driven Error Pattern Classification - ADR
Status: Proposed
Date: 2025-01-09
Author: System Architecture
Related: ERROR_HANDLING_MODERNIZATION.md, connector-error-handling-standard.md
Supersedes: database-driven-error-classification.md (PostgreSQL approach)
Context
The emp-job-queue system processes logs from multiple external services (ComfyUI, Ollama, Stable Diffusion) and must classify error messages as fatal (fail the job), non-fatal (log but continue), or ignore (regular logs).
Current Approach: Hardcoded Catalogs
Error classification logic is embedded in each connector's TypeScript code:
private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
const messageLower = message.toLowerCase();
// Hardcoded patterns
if (messageLower.includes('custom validation failed')) return 'fatal';
if (messageLower.includes('dash0.com')) return 'non-fatal';
if (messageLower.includes('out of memory')) return 'fatal';
// ... 20+ more patterns
return 'ignore';
}

Problems with Hardcoded Patterns
- Deployment Required: Every new error pattern requires code change + rebuild + redeploy
- No Hot-Fixing: Can't quickly reclassify errors in production without deployment
- Limited Collaboration: Only engineers with repo access can update patterns
- No Analytics: Can't track which patterns match most frequently
- Environment Inconsistency: Dev/staging/prod might need different classifications
- Pattern Sprawl: As we discover new errors, code becomes increasingly complex
Real-World Scenario
🚨 Production Alert: Jobs failing due to new ComfyUI error message
"RuntimeError: CUDA driver version mismatch"
Current process:
1. Engineer identifies pattern needs to be cataloged (5 min)
2. Update TypeScript code to add pattern (10 min)
3. Run tests, commit, push (15 min)
4. CI/CD pipeline builds + deploys (20 min)
5. Total time to fix: 50+ minutes
Desired process:
1. Admin runs CLI/API command to add pattern
2. Redis SET operation completes instantly
3. Workers refresh cache within 30 seconds (or immediately via Pub/Sub)
4. Total time to fix: <1 minute

Decision
Implement Redis-based error pattern classification with the following characteristics:
1. Redis Data Structure
Store error patterns in Redis hashes organized by connector type:
// Key structure: error_patterns:{connector_type}
// Value: Hash { pattern_id → JSON }
// Global patterns (all connectors)
Key: "error_patterns:global"
Hash: {
"global_001": JSON.stringify({
pattern: "out of memory",
match_type: "contains",
case_sensitive: false,
classification: "fatal",
priority: 95,
description: "Memory exhaustion (all services)",
example_message: "CUDA out of memory",
active: true,
created_at: "2025-01-09T12:00:00Z",
updated_at: "2025-01-09T12:00:00Z",
created_by: "system"
})
}
// ComfyUI-specific patterns
Key: "error_patterns:comfyui"
Hash: {
"comfyui_001": JSON.stringify({
pattern: "custom validation failed",
match_type: "contains",
case_sensitive: false,
classification: "fatal",
priority: 100,
description: "ComfyUI custom node validation",
example_message: "Custom validation failed: missing required parameter",
active: true,
created_at: "2025-01-09T12:00:00Z"
})
}
// Pattern version tracking (for cache invalidation)
Key: "error_patterns:version"
Value: "1736424000000" (Unix timestamp)2. Worker Implementation - Zero-Overhead Pattern Matching
In-memory cache with instant Redis updates - patterns loaded once at startup:
class ComfyUIRestStreamConnector {
// Pre-compiled patterns in memory
private fatalPatterns: CompiledPattern[] = [];
private nonFatalPatterns: CompiledPattern[] = [];
private lastPatternVersion: string = '0';
private readonly connectorType = 'comfyui';
async initialize() {
// Load patterns ONCE at worker startup
await this.loadPatternsFromRedis();
// Subscribe to pattern update events (instant invalidation)
this.subscribeToPatternUpdates();
}
private async loadPatternsFromRedis() {
// Fetch global patterns
const globalPatterns = await redis.hgetall('error_patterns:global');
// Fetch connector-specific patterns
const connectorPatterns = await redis.hgetall(`error_patterns:${this.connectorType}`);
// Merge and compile patterns
const allPatterns = [
...Object.values(globalPatterns).map(p => JSON.parse(p)),
...Object.values(connectorPatterns).map(p => JSON.parse(p))
].filter(p => p.active);
// Pre-compile into fast lookup structures (sorted by priority)
this.fatalPatterns = allPatterns
.filter(p => p.classification === 'fatal')
.sort((a, b) => b.priority - a.priority)
.map(p => this.compilePattern(p));
this.nonFatalPatterns = allPatterns
.filter(p => p.classification === 'non-fatal')
.sort((a, b) => b.priority - a.priority)
.map(p => this.compilePattern(p));
// Update cache version
this.lastPatternVersion = await redis.get('error_patterns:version') || '0';
logger.info(`Loaded ${allPatterns.length} error patterns from Redis`, {
fatal: this.fatalPatterns.length,
nonFatal: this.nonFatalPatterns.length,
version: this.lastPatternVersion
});
}
private subscribeToPatternUpdates() {
// Subscribe to Redis Pub/Sub for instant cache invalidation
const subscriber = new Redis(this.redisConfig);
subscriber.subscribe('error_patterns:updated', (err) => {
if (err) {
logger.error('Failed to subscribe to pattern updates', { error: err });
return;
}
});
subscriber.on('message', async (channel, message) => {
if (channel === 'error_patterns:updated') {
const newVersion = message;
if (newVersion !== this.lastPatternVersion) {
logger.info('Pattern update detected, reloading cache', {
oldVersion: this.lastPatternVersion,
newVersion
});
await this.loadPatternsFromRedis();
}
}
});
}
private classifyLogMessage(message: string): 'fatal' | 'non-fatal' | 'ignore' {
// Fast in-memory matching - NO REDIS QUERIES
const messageLower = message.toLowerCase();
// Check non-fatal first (common infrastructure noise). Note: this ordering
// gives any non-fatal match precedence over fatal patterns, regardless of priority.
for (const pattern of this.nonFatalPatterns) {
if (this.matches(message, messageLower, pattern)) return 'non-fatal';
}
// Check fatal patterns
for (const pattern of this.fatalPatterns) {
if (this.matches(message, messageLower, pattern)) return 'fatal';
}
return 'ignore';
}
private matches(message: string, messageLower: string, pattern: CompiledPattern): boolean {
// Pure in-memory string operations - VERY FAST (~0.001ms)
if (pattern.regex) {
return this.regexMatchWithTimeout(pattern.regex, message, 10); // 10ms warning threshold
}
// Case-insensitive patterns are pre-lowercased in compilePattern(),
// so pick the matching haystack here
const haystack = pattern.caseSensitive ? message : messageLower;
if (pattern.matchType === 'exact') {
return haystack === pattern.pattern;
}
return haystack.includes(pattern.pattern);
}
private regexMatchWithTimeout(regex: RegExp, text: string, timeoutMs: number): boolean {
// JS regex evaluation is synchronous and cannot be aborted mid-match, so this
// warns on slow matches rather than enforcing a hard timeout; catastrophic
// backtracking is prevented by validating pattern complexity before storage
// (see Mitigations)
const start = Date.now();
try {
const match = regex.test(text);
const elapsed = Date.now() - start;
if (elapsed > timeoutMs) {
logger.warn('Regex match exceeded time budget', {
pattern: regex.source,
elapsed
});
}
return match;
} catch (error) {
logger.error('Regex match failed', { error, pattern: regex.source });
return false;
}
}
private compilePattern(pattern: ErrorPattern): CompiledPattern {
return {
// Pre-lowercase only when matching is case-insensitive
pattern: pattern.case_sensitive ? pattern.pattern : pattern.pattern.toLowerCase(),
regex: pattern.match_type === 'regex' ? new RegExp(pattern.pattern, pattern.case_sensitive ? '' : 'i') : null,
matchType: pattern.match_type,
caseSensitive: pattern.case_sensitive
};
}
}

3. Pattern Management API/CLI
Simple Redis commands for pattern management:
// Add a new pattern
async function addErrorPattern(
connectorType: string | 'global',
pattern: string,
classification: 'fatal' | 'non-fatal' | 'ignore',
options?: {
matchType?: 'contains' | 'regex' | 'exact';
caseSensitive?: boolean;
priority?: number;
description?: string;
exampleMessage?: string;
}
) {
const patternId = `${connectorType}_${Date.now()}`;
const key = `error_patterns:${connectorType}`;
const patternData = {
pattern,
match_type: options?.matchType || 'contains',
case_sensitive: options?.caseSensitive || false,
classification,
priority: options?.priority ?? 100, // ?? so an explicit priority of 0 is honored
description: options?.description,
example_message: options?.exampleMessage,
active: true,
created_at: new Date().toISOString(),
updated_at: new Date().toISOString(),
created_by: 'admin'
};
// Store pattern
await redis.hset(key, patternId, JSON.stringify(patternData));
// Update version (triggers worker cache refresh)
const version = Date.now().toString();
await redis.set('error_patterns:version', version);
// Publish update event (instant worker refresh via Pub/Sub)
await redis.publish('error_patterns:updated', version);
logger.info('Error pattern added', { patternId, connectorType, classification });
return patternId;
}
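For example, the new ComfyUI error from the Real-World Scenario above could be hot-fixed with a single call (field values are illustrative):

// Hot-fix the "CUDA driver version mismatch" scenario from the Context section
await addErrorPattern('comfyui', 'cuda driver version mismatch', 'fatal', {
  description: 'CUDA driver/runtime version mismatch',
  exampleMessage: 'RuntimeError: CUDA driver version mismatch'
});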
// Update pattern
async function updateErrorPattern(
connectorType: string,
patternId: string,
updates: Partial<ErrorPattern>
) {
const key = `error_patterns:${connectorType}`;
const existing = await redis.hget(key, patternId);
if (!existing) {
throw new Error(`Pattern not found: ${patternId}`);
}
const patternData = {
...JSON.parse(existing),
...updates,
updated_at: new Date().toISOString()
};
await redis.hset(key, patternId, JSON.stringify(patternData));
const version = Date.now().toString();
await redis.set('error_patterns:version', version);
await redis.publish('error_patterns:updated', version);
}
// Delete pattern (soft delete by setting active: false)
async function deactivateErrorPattern(connectorType: string, patternId: string) {
await updateErrorPattern(connectorType, patternId, { active: false });
}
// List all patterns
async function listErrorPatterns(connectorType?: string) {
  // KEYS is O(N) and blocks Redis; acceptable for an admin command, but use
  // SCAN if pattern counts grow large. Exclude the non-hash bookkeeping keys
  // (version string, analytics sorted sets) so HGETALL doesn't hit WRONGTYPE.
  const keys = connectorType
    ? [`error_patterns:${connectorType}`]
    : (await redis.keys('error_patterns:*')).filter(
        k => k !== 'error_patterns:version' && !k.startsWith('error_patterns:analytics:')
      );
  const patterns = [];
  for (const key of keys) {
    const hash = await redis.hgetall(key);
    for (const [id, data] of Object.entries(hash)) {
      patterns.push({ id, ...JSON.parse(data as string) });
    }
  }
  return patterns.sort((a, b) => b.priority - a.priority);
}

Alternatives Considered
Alternative 1: PostgreSQL Database (Original Proposal)
Pros:
- Structured schema with validation
- SQL queries for analytics
- Transaction support
Cons:
- Additional database connection required
- 5-50ms query latency vs <1ms Redis
- Complex cache refresh logic
- Migration/schema versioning overhead
- Doesn't leverage existing Redis infrastructure
Rejected: Redis is already in the stack, faster, and simpler
Alternative 2: Configuration File (YAML/JSON)
Pros:
- Human-readable
- Version controlled
Cons:
- Still requires deployment
- No UI for non-engineers
- No instant updates
Rejected: Doesn't solve the deployment problem
Alternative 3: In-Memory Only (No Persistence)
Pros:
- Ultra-fast
- No external dependencies
Cons:
- Data loss on restart
- No sharing between workers
Rejected: Patterns are configuration and must persist across restarts
Consequences
Positive
✅ Instant Hot-Fixing: Add/update patterns in <1 second via Redis
✅ Sub-Millisecond Matching: In-memory cache, zero Redis queries during job processing
✅ Instant Propagation: Pub/Sub notifies workers immediately (<1 second)
✅ No New Dependencies: Redis already used for job matching
✅ Simpler Architecture: No database migrations, no schema versioning
✅ Space Efficient: 1000 patterns = ~300KB (negligible)
✅ Connector-Specific: Separate namespaces per service
✅ Environment-Specific: Different Redis instances for dev/staging/prod
✅ Analytics Ready: Redis sorted sets for pattern match tracking
✅ Graceful Degradation: Falls back to hardcoded patterns if Redis unavailable
Negative
⚠️ Redis Dependency: Workers need Redis at startup (already required)
⚠️ No SQL Analytics: Must use Redis queries or export to a database
⚠️ Manual Backups: Need to export patterns for version control
⚠️ Regex Security: Pattern injection could cause ReDoS attacks
Mitigations
- Redis Unavailable: Workers use hardcoded fallback patterns
- Analytics: Export patterns to PostgreSQL periodically for SQL queries
- Backups: Automated script to dump patterns to JSON daily
- Regex Security: Validate regex complexity before storing, warn on slow matches (see the sketch below)
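A minimal sketch of that pre-storage validation (the heuristics and the validateRegexPattern name are illustrative, not a complete ReDoS defense):

function validateRegexPattern(pattern: string): void {
  if (pattern.length > 200) {
    throw new Error('Regex pattern too long (max 200 chars)');
  }
  // Reject a quantified group that itself contains a quantifier, e.g. (a+)+,
  // the classic catastrophic-backtracking shape
  if (/\([^()]*[+*][^()]*\)[+*{]/.test(pattern)) {
    throw new Error('Nested quantifiers are not allowed');
  }
  new RegExp(pattern); // throws SyntaxError if the pattern is invalid
}

addErrorPattern() would call this before the HSET whenever matchType is 'regex'.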
Implementation Plan
Phase 1: Redis Schema Setup (Week 1)
1.1 Redis Key Design
// packages/core/src/types/error-patterns.ts
export interface ErrorPattern {
pattern: string;
match_type: 'contains' | 'regex' | 'exact';
case_sensitive: boolean;
classification: 'fatal' | 'non-fatal' | 'ignore';
priority: number;
description?: string;
example_message?: string;
active: boolean;
created_at: string;
updated_at: string;
created_by?: string;
}
export interface CompiledPattern {
pattern: string;
regex: RegExp | null;
matchType: 'contains' | 'regex' | 'exact';
caseSensitive: boolean;
}

1.2 Seed Existing Patterns
- [ ] Extract all hardcoded patterns from classifyLogMessage() methods
- [ ] Create seed script: scripts/seed-error-patterns.ts
- [ ] Seed patterns to Redis (dev/staging/prod)
// Example seed data
const globalPatterns = [
{
pattern: "out of memory",
match_type: "contains",
classification: "fatal",
priority: 95,
description: "Memory exhaustion (all services)"
},
{
pattern: "dash0.com",
match_type: "contains",
classification: "non-fatal",
priority: 100,
description: "OpenTelemetry export (non-critical)"
}
];
const comfyUIPatterns = [
{
pattern: "custom validation failed",
match_type: "contains",
classification: "fatal",
priority: 100,
description: "ComfyUI custom node validation"
}
];

Phase 2: Worker Integration (Week 2)
2.1 Pattern Cache Implementation
- [ ] Add loadPatternsFromRedis() to BaseConnector
- [ ] Implement pattern compilation (regex pre-compilation)
- [ ] Add Pub/Sub subscription for instant updates
- [ ] Add fallback to hardcoded patterns if Redis fails (see the sketch after this list)
- [ ] Add regex timeout protection (10ms max)
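A minimal sketch of the fallback item above, assuming the legacy catalogs are kept as exported constants (HARDCODED_FATAL_PATTERNS and HARDCODED_NON_FATAL_PATTERNS are assumed names):

private async loadPatternsWithFallback(): Promise<void> {
  try {
    await this.loadPatternsFromRedis();
  } catch (error) {
    logger.warn('Redis pattern load failed, falling back to hardcoded patterns', { error });
    this.fatalPatterns = HARDCODED_FATAL_PATTERNS.map(p => this.compilePattern(p));
    this.nonFatalPatterns = HARDCODED_NON_FATAL_PATTERNS.map(p => this.compilePattern(p));
  }
}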
2.2 Classification Integration
- [ ] Update classifyLogMessage() to use cached patterns
- [ ] Add performance logging (cache load time, match time)
- [ ] Add metrics (pattern cache hits/misses)
2.3 Testing
- [ ] Unit tests for pattern matching
- [ ] Integration tests with Redis
- [ ] Performance benchmarks (target: <0.01ms per match)
- [ ] Test Redis unavailable scenario (graceful degradation)
- [ ] Regression tests (old vs new classifications must match 100%; see the sketch after this list)
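A regression sketch, assuming a Vitest-style runner, a recorded log corpus, and access to the classifier (buildConnectorWithRedisPatterns, RECORDED_LOG_CORPUS, and legacyClassify are assumed names):

import { describe, expect, it } from 'vitest';

describe('error pattern classification regression', () => {
  it('matches the legacy hardcoded classifier on every recorded message', async () => {
    const connector = await buildConnectorWithRedisPatterns(); // assumed test helper
    for (const message of RECORDED_LOG_CORPUS) {               // assumed fixture
      expect(connector.classifyLogMessage(message)).toBe(legacyClassify(message));
    }
  });
});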
Phase 3: Management CLI/API (Week 3)
3.1 CLI Commands
# Add pattern
pnpm error-patterns:add --connector=comfyui --pattern="cuda driver mismatch" --classification=fatal
# List patterns
pnpm error-patterns:list --connector=comfyui
# Update pattern
pnpm error-patterns:update --id=comfyui_001 --priority=90
# Deactivate pattern
pnpm error-patterns:deactivate --id=comfyui_001
# Export patterns (for backup)
pnpm error-patterns:export --output=patterns.json
# Import patterns
pnpm error-patterns:import --input=patterns.json

3.2 API Endpoints
// GET /api/error-patterns?connector_type=comfyui
// POST /api/error-patterns
// PUT /api/error-patterns/:id
// DELETE /api/error-patterns/:id (soft delete)
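A sketch of the POST endpoint, assuming an Express-style server and the addErrorPattern() helper above (route wiring and validation are illustrative):

app.post('/api/error-patterns', async (req, res) => {
  const { connectorType, pattern, classification, ...options } = req.body;
  if (!pattern || !['fatal', 'non-fatal', 'ignore'].includes(classification)) {
    return res.status(400).json({ error: 'pattern and a valid classification are required' });
  }
  const patternId = await addErrorPattern(connectorType || 'global', pattern, classification, options);
  res.status(201).json({ id: patternId });
});

Phase 4: Observability (Week 3-4)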
4.1 Pattern Analytics (Redis Sorted Sets)
// Track pattern match frequency
// Key: error_patterns:analytics:{connector_type}
// Sorted Set: { score=match_count, member=pattern_id }
async function trackPatternMatch(connectorType: string, patternId: string) {
await redis.zincrby(`error_patterns:analytics:${connectorType}`, 1, patternId);
}
async function getTopMatchedPatterns(connectorType: string, limit: number = 10) {
return redis.zrevrange(`error_patterns:analytics:${connectorType}`, 0, limit - 1, 'WITHSCORES');
}

4.2 Logging & Metrics
- [ ] Log pattern cache loads
- [ ] Metric: error_pattern_cache_load_time_ms
- [ ] Metric: error_pattern_matches_total (by pattern_id)
- [ ] Metric: error_pattern_cache_size (number of patterns loaded)
- [ ] Alert: Pattern cache load failures
4.3 Monitoring Dashboard
- [ ] Pattern match frequency chart
- [ ] Top 10 matched patterns
- [ ] Patterns never matched (candidates for removal; see the sketch after this list)
- [ ] Pattern update history
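A sketch for the never-matched item (getUnmatchedPatterns is an assumed name): a pattern id with no entry in the analytics sorted set has never matched.

async function getUnmatchedPatterns(connectorType: string): Promise<string[]> {
  const ids = await redis.hkeys(`error_patterns:${connectorType}`);
  const unmatched: string[] = [];
  for (const id of ids) {
    // ZSCORE returns null when the member has never been incremented
    if ((await redis.zscore(`error_patterns:analytics:${connectorType}`, id)) === null) {
      unmatched.push(id);
    }
  }
  return unmatched;
}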
Phase 5: Migration & Rollout
5.1 Feature Flag
const USE_REDIS_PATTERNS = process.env.USE_REDIS_ERROR_PATTERNS === 'true';
if (USE_REDIS_PATTERNS && await this.redisAvailable()) {
await this.loadPatternsFromRedis();
} else {
this.useHardcodedPatterns();
}

5.2 Rollout Plan
- Deploy to dev with feature flag ON
- Run regression tests (old vs new classifications must match)
- Deploy to staging with flag ON
- Monitor for 48 hours
- Deploy to production with flag OFF (hardcoded)
- Gradually enable flag for 10% → 50% → 100% of workers (see the sketch after this list)
- Remove hardcoded patterns after 30 days of stable operation
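One possible mechanism for the gradual enablement step (the error_patterns:rollout_pct key and function name are assumed): hash the worker id into a stable bucket and compare it to a rollout percentage stored in Redis.

import { createHash } from 'node:crypto';

async function shouldUseRedisPatterns(workerId: string): Promise<boolean> {
  const pct = Number(await redis.get('error_patterns:rollout_pct')) || 0; // e.g. 10, 50, 100
  const bucket = createHash('sha1').update(workerId).digest()[0] % 100;   // stable 0-99 per worker
  return bucket < pct;
}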
5.3 Backup Strategy
# Daily automated backup
0 2 * * * /usr/local/bin/pnpm error-patterns:export --output=/backups/error-patterns-$(date +\%Y\%m\%d).json

Performance Benchmarks
Target Metrics
| Operation | Target | Acceptable |
|---|---|---|
| Pattern cache load (Redis fetch) | <10ms | <50ms |
| Pattern match (contains) | <0.001ms | <0.01ms |
| Pattern match (regex) | <0.01ms | <0.1ms |
| Pattern update propagation (Pub/Sub) | <1s | <5s |
| Redis HGETALL (100 patterns) | <5ms | <20ms |
Load Testing
Test with realistic workload (see the benchmark sketch after these lists):
- 100 patterns loaded
- 1000 log messages/second
- Mixed fatal/non-fatal/ignore
Expected:
- Pattern matching: <1ms total (for all 1000 messages)
- Memory overhead: <500KB for pattern cache
- Zero Redis queries during job processing
- Pattern updates propagate in <1 second via Pub/Sub
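A micro-benchmark sketch for these targets (buildSyntheticLogCorpus is an assumed helper; connector is an initialized instance of the class above):

const messages = buildSyntheticLogCorpus(1000); // assumed helper: mixed fatal/non-fatal/ignore
const start = process.hrtime.bigint();
for (const message of messages) {
  connector.classifyLogMessage(message);
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`classified ${messages.length} messages in ${elapsedMs.toFixed(3)}ms`);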
Space Estimation
Per pattern: ~300 bytes (JSON)
100 patterns: 30KB
1,000 patterns: 300KB
10,000 patterns: 3MB
Redis memory usage: Negligible

Success Criteria
Technical
✅ Workers load patterns from Redis at startup (<50ms)
✅ Pattern matching is <0.01ms per message (in-memory)
✅ Pattern updates propagate in <1 second (Pub/Sub)
✅ Graceful fallback if Redis unavailable
✅ Zero regressions in existing error classification
✅ Regex safety: complexity validation plus slow-match detection (no catastrophic backtracking)
Operational
✅ Support team can add patterns without engineering help
✅ Pattern changes take effect within 1 second
✅ Production error classification can be hot-fixed instantly
✅ Analytics show which patterns match most frequently
✅ Different patterns for dev/staging/prod environments
✅ Automated daily backups of patterns
Future Enhancements
Machine Learning Classification
Once we have pattern match analytics, train ML model:
// Analyze historical pattern matches (export Redis analytics to PostgreSQL)
const trainingData = await getPatternMatchHistory();
// Train classifier to suggest patterns
const suggestedPattern = await mlClassifier.suggest(newErrorMessage);
// Admin reviews and approves

Auto-Discovery
Detect new error patterns automatically:
// Detect frequent unclassified errors
if (errorFrequency > threshold && classification === 'ignore') {
await redis.sadd('error_patterns:candidates', errorMessage);
notifyAdmin(`New error pattern detected: ${errorMessage}`);
}

A/B Testing
Test different classifications:
// Route 10% of traffic to experimental classification
const experiment = await redis.hget('error_patterns:experiments', patternId);
if (experiment && Math.random() < 0.1) {
return JSON.parse(experiment).classification;
}

References
- ERROR_HANDLING_MODERNIZATION.md - Error handling architecture
- connector-error-handling-standard.md - Connector standards
- Redis Hash Commands - Redis documentation
- Redis Pub/Sub - Redis documentation
Appendix: Redis CLI Examples
Add Patterns via Redis CLI
# Add global pattern
redis-cli HSET error_patterns:global global_001 '{"pattern":"out of memory","match_type":"contains","case_sensitive":false,"classification":"fatal","priority":95,"active":true}'
# Add ComfyUI-specific pattern
redis-cli HSET error_patterns:comfyui comfyui_001 '{"pattern":"custom validation failed","match_type":"contains","case_sensitive":false,"classification":"fatal","priority":100,"active":true}'
# Update version (triggers worker refresh)
redis-cli SET error_patterns:version $(date +%s)000
# Publish update event (instant Pub/Sub notification)
redis-cli PUBLISH error_patterns:updated $(date +%s)000
# List all ComfyUI patterns
redis-cli HGETALL error_patterns:comfyui
# Get pattern analytics
redis-cli ZREVRANGE error_patterns:analytics:comfyui 0 9 WITHSCORES

Export/Import Patterns
# Export all patterns to JSON
redis-cli --raw KEYS 'error_patterns:*' | grep -Ev ':(version$|analytics:|candidates$)' | while read -r key; do
  # HGETALL prints field and value on alternating lines; paste joins each pair with a tab
  redis-cli --raw HGETALL "$key" | paste - - | while IFS=$'\t' read -r id json; do
    jq -cn --arg key "$key" --arg id "$id" --argjson pattern "$json" '{key: $key, id: $id, pattern: $pattern}'
  done
done > patterns.json
# Import patterns from JSON (one entry per line, as produced by the export above)
while read -r entry; do
  redis-cli HSET "$(jq -r '.key' <<<"$entry")" "$(jq -r '.id' <<<"$entry")" "$(jq -c '.pattern' <<<"$entry")"
done < patterns.json