ADR: Job Retry/Reset Mechanism Simplification
Date: 2025-11-27
Status: Proposed
Deciders: Engineering Team
Tags: api, emprops-api, monitor, job-retry, reliability
Context
Problem Statement
The current job retry mechanism has become complex and confusing. The job forensics page offers two separate buttons, Retry and Reset, that do fundamentally different things, which leads to operator confusion and unreliable job recovery.
Current System:
| Button | API Route | Location | What It Does |
|---|---|---|---|
| Retry | POST /api/jobs/{id}/retry | Monitor -> EmProps API | Full re-execution: backs up job state, resets to pending, starts GeneratorV2, fires webhooks on completion |
| Reset | POST /api/jobs/{id}/reset | Monitor -> Direct DB | Simple status flip: sets status='failed', clears timestamps, sets progress=0 |
Issues with Current Architecture
Semantic Confusion:
- "Retry" implies re-running the job (and does)
- "Reset" sounds like it would re-run the job, but it only changes status
- After "Reset", the job just sits there - nothing picks it up
Reset Does Nothing Useful:
```typescript
// Reset endpoint - just flips DB state
await prisma.job.update({
  where: { id: jobId },
  data: {
    status: 'failed', // Set to 'failed', not 'pending'
    started_at: null,
    completed_at: null,
    error_message: 'Job manually reset for retry',
    progress: 0
  }
});
```
- Sets status to `'failed'` - no job processor picks up failed jobs automatically
- Comment says "reset to failed so it can be retried" but no automatic retry happens
- Operator must then click Retry anyway
Retry is Complex and Fragile:
- Roughly 380 lines of code in the `retryJob()` function
- Creates backup records (`job_retry_backup`)
- Instantiates new `GeneratorV2` with full event handlers
- Has its own completion/failure handlers with webhook logic
- Duplicates webhook prevention logic inline
- Only works for `collection_generation` job type
- Has retry count limits that can block legitimate retries
Dual Webhook Paths:
- Retry completion sends webhooks via inline handlers
- Normal job completion sends webhooks via different path
- Both have duplicate prevention logic implemented separately
Code Complexity Analysis
Monitor UI (JobForensics.tsx):
```typescript
// Two separate functions, nearly identical in structure
const retryJob = async (jobId: string) => { /* ... */ } // 50 lines
const resetJob = async (jobId: string) => { /* ... */ } // 45 lines
```
EmProps API (routes/jobs/index.ts):
- `retryJob()`: Lines 302-682 (380 lines!)
- Creates its own GeneratorV2 instance
- Has full event handling for `node_started`, `node_progress`, `node_completed`, `complete`, `error`
- Manually manages job state transitions
- Duplicates webhook sending logic
Monitor Reset Route (api/jobs/[jobId]/reset/route.ts):
- 53 lines
- Just does a simple Prisma update
- No connection to job queue or processor
Impact
- Operator Confusion: "Which button do I click?"
- Failed Retries: Reset doesn't actually retry anything
- Maintenance Burden: 400+ lines of duplicated job execution logic
- Inconsistent Behavior: Retry behaves differently than initial job submission
- Hard to Debug: Two completely different code paths for the same goal
Decision
Proposed Simplification
Replace both buttons with a single "Resubmit Job" action that:
- Creates a new job with the same parameters (like a fresh submission)
- Links to the original job for traceability (stored as `resubmitted_from` in the new job's data and returned as `original_job_id` in the response)
- Uses existing job submission flow (no special retry logic)
- Archives the old job to a terminal state
New Architecture
```
User clicks "Resubmit Job"
        ↓
Monitor calls POST /api/jobs/{id}/resubmit
        ↓
EmProps API:
  1. Load original job data (collectionId, variables)
  2. Create NEW job with same parameters
  3. Update original job: status='superseded', superseded_by=new_job_id
  4. Return new job ID
        ↓
Job enters normal queue → normal execution → normal webhooks
```
Proposed Implementation
New Endpoint: POST /api/jobs/{id}/resubmit
```typescript
// Import paths assume Express and the standard Prisma client package
import type { Request, Response } from 'express';
import type { PrismaClient } from '@prisma/client';

export const resubmitJob = (prisma: PrismaClient) => {
  return async (req: Request, res: Response) => {
    const originalJobId = req.params.id;

    try {
      // 1. Load original job
      const originalJob = await prisma.job.findUnique({
        where: { id: originalJobId }
      });

      if (!originalJob) {
        return res.status(404).json({ error: 'Job not found' });
      }

      // 2. Extract parameters for resubmission (guard against a null data column)
      const jobData = (originalJob.data ?? {}) as {
        collectionId?: string;
        variables?: Record<string, any>;
      };

      if (!jobData.collectionId) {
        return res.status(400).json({ error: 'Cannot resubmit: missing collection ID' });
      }

      // 3 + 4. Create the new job and mark the original as superseded in a
      // single transaction, so a failure cannot leave the pair half-updated
      const newJob = await prisma.$transaction(async (tx) => {
        // 3. Create new job (same as normal job creation)
        const created = await tx.job.create({
          data: {
            name: `${originalJob.name} (resubmit)`,
            description: originalJob.description,
            job_type: originalJob.job_type,
            user_id: originalJob.user_id,
            priority: originalJob.priority,
            data: {
              collectionId: jobData.collectionId,
              variables: jobData.variables,
              resubmitted_from: originalJobId,
            },
            status: 'pending',
            retry_count: 0,
            max_retries: originalJob.max_retries,
          }
        });

        // 4. Mark original as superseded
        await tx.job.update({
          where: { id: originalJobId },
          data: {
            status: 'superseded',
            data: {
              ...jobData,
              superseded_by: created.id,
              superseded_at: new Date().toISOString(),
            }
          }
        });

        return created;
      });

      // 5. Submit to normal job queue
      // (submitJobToQueue stands in for the existing job submission logic)
      await submitJobToQueue(newJob);

      return res.status(200).json({
        data: {
          original_job_id: originalJobId,
          new_job_id: newJob.id,
          message: 'Job resubmitted successfully'
        }
      });
    } catch (error) {
      // Async handlers must catch their own errors so the request does not hang
      console.error(`Failed to resubmit job ${originalJobId}:`, error);
      return res.status(500).json({ error: 'Failed to resubmit job' });
    }
  };
};
```
Database Changes
Add new status value:
```sql
-- Add 'superseded' to valid job statuses
ALTER TYPE job_status ADD VALUE 'superseded';
```
UI Changes
Replace two buttons with one:
```tsx
// Before: Two confusing buttons
<Button onClick={() => retryJob(jobId)}>Retry</Button>
<Button onClick={() => resetJob(jobId)}>Reset</Button>

// After: One clear action
<Button onClick={() => resubmitJob(jobId)}>
  Resubmit Job
</Button>
```
Rationale
Why Resubmit Instead of Retry?
- Semantic Clarity: "Resubmit" clearly means "try again from scratch"
- Uses Proven Path: New job goes through same flow as original
- No Special Logic: No 400-line retry handler needed
- Full Traceability: Original job preserved, linked to new one
- No Retry Limits: No artificial "max_retries" blocking legitimate attempts
Why Archive Original Instead of Modifying?
- Audit Trail: Original job state preserved for debugging
- No State Corruption: Don't risk corrupting existing job data
- Clear Lineage: Can trace resubmission chain via IDs
- Simple Rollback: Easy to identify what was superseded
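The lineage point can be made concrete with a short sketch. It assumes each job record carries the `superseded_by` and `resubmitted_from` values that the proposed endpoint writes into `data`; the `JobRecord` type and the `resubmissionChain` helper are hypothetical names used for illustration, and the jobs live in an in-memory map rather than the database.

```typescript
// Hypothetical, trimmed-down job shape: only the fields needed for lineage.
interface JobRecord {
  id: string;
  status: string;
  data: { superseded_by?: string; resubmitted_from?: string };
}

// Follow superseded_by links from a starting job to its newest resubmission.
// Returns job ids oldest first. Stops at a missing job or if a cycle is found.
function resubmissionChain(jobs: Map<string, JobRecord>, startId: string): string[] {
  const chain: string[] = [];
  const seen = new Set<string>();
  let currentId: string | undefined = startId;

  while (currentId !== undefined && !seen.has(currentId)) {
    const job: JobRecord | undefined = jobs.get(currentId);
    if (!job) break;
    chain.push(job.id);
    seen.add(job.id);
    currentId = job.data.superseded_by;
  }
  return chain;
}

// Example data: job-1 was resubmitted as job-2, which was resubmitted as job-3.
const jobs = new Map<string, JobRecord>([
  ["job-1", { id: "job-1", status: "superseded", data: { superseded_by: "job-2" } }],
  ["job-2", { id: "job-2", status: "superseded", data: { resubmitted_from: "job-1", superseded_by: "job-3" } }],
  ["job-3", { id: "job-3", status: "pending", data: { resubmitted_from: "job-2" } }],
]);

console.log(resubmissionChain(jobs, "job-1"));
```

The last id in the chain is the job that is currently live, which is also what an external integration holding an old job ID would need to look up.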
Why Not Just Fix Reset?
Reset has a fundamental design problem: it assumes something will pick up failed jobs. Our architecture doesn't have that - jobs only enter the queue on initial submission.
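A toy model illustrates the gap. This is a minimal in-memory sketch, not the real queue: `ToyQueue`, `submit`, `reset`, and `nextJob` are hypothetical names, and the only behavior carried over from this ADR is that jobs are enqueued at submission time and nowhere else.

```typescript
type Status = "pending" | "processing" | "failed";

interface ToyJob {
  id: string;
  status: Status;
}

class ToyQueue {
  private readonly jobs = new Map<string, ToyJob>();
  private readonly queue: string[] = [];

  // Submission is the only operation that enqueues work.
  submit(id: string): void {
    this.jobs.set(id, { id, status: "pending" });
    this.queue.push(id);
  }

  // Worker side: take the next queued job, if there is one.
  nextJob(): ToyJob | undefined {
    const id = this.queue.shift();
    if (id === undefined) return undefined;
    const job = this.jobs.get(id);
    if (job) job.status = "processing";
    return job;
  }

  // Mirrors the Reset route: a status flip in storage, with no enqueue.
  reset(id: string): void {
    const job = this.jobs.get(id);
    if (job) job.status = "failed";
  }

  statusOf(id: string): Status | undefined {
    return this.jobs.get(id)?.status;
  }
}

const q = new ToyQueue();
q.submit("job-1");
const firstRun = q.nextJob();  // the worker picks the job up
q.reset("job-1");              // the operator clicks Reset
const secondRun = q.nextJob(); // the queue is empty, so nothing runs

console.log(firstRun?.id, q.statusOf("job-1"), secondRun);
```

After the reset, the job is stored as failed but the worker has nothing to pull, which is the situation operators see today.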
Consequences
Positive
- Single Clear Action: One button, one mental model
- Code Reduction: Delete ~500 lines of retry/reset code
- Consistent Behavior: Resubmitted jobs behave exactly like new jobs
- Better Debugging: Clear job lineage via `superseded_by`/`resubmitted_from`
- No Retry Limits: Operators can resubmit as many times as needed
Negative
- New Job IDs: Each resubmit creates a new ID (could affect external integrations)
- Migration: Need to update UI and educate operators
- Status Addition: New `superseded` status needs schema migration
Neutral
- Historical Jobs: Old retry_count fields become less meaningful
- Job Backups: `job_retry_backup` table becomes deprecated
Migration Plan
Phase 1: Add Resubmit (Non-Breaking)
- Add `POST /api/jobs/{id}/resubmit` endpoint
- Add "Resubmit" button to forensics UI (alongside existing)
- Add `superseded` status to schema
Phase 2: Deprecate Old Buttons
- Add deprecation warning to Retry/Reset buttons
- Log usage of old endpoints
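One way to cover both Phase 2 items on the API side is a small wrapper around the old routes. The sketch below is an assumption-laden illustration: `deprecatedRoute` is a hypothetical helper, the `X-Deprecated-Use-Instead` header name is invented for the example, and `ReqLike`/`ResLike` are minimal stand-ins for the Express request and response types so the snippet runs without dependencies.

```typescript
// Minimal structural stand-ins for the parts of Express's Request/Response used here.
interface ReqLike { method: string; originalUrl: string }
interface ResLike { setHeader(name: string, value: string): void }
type Next = () => void;
type Logger = (message: string) => void;

// Wraps a deprecated route: logs each call, points at the replacement, then continues.
function deprecatedRoute(replacement: string, log: Logger = console.warn) {
  return (req: ReqLike, res: ResLike, next: Next): void => {
    log(`[deprecated] ${req.method} ${req.originalUrl} called; use ${replacement}`);
    res.setHeader("X-Deprecated-Use-Instead", replacement); // illustrative header name
    next();
  };
}

// Exercise the wrapper with fakes instead of a running server.
const logged: string[] = [];
const headers: Record<string, string> = {};
const state = { nextCalled: false };

const warnOnRetry = deprecatedRoute("POST /api/jobs/{id}/resubmit", (message) => {
  logged.push(message);
});

warnOnRetry(
  { method: "POST", originalUrl: "/api/jobs/abc/retry" },
  { setHeader: (name, value) => { headers[name] = value; } },
  () => { state.nextCalled = true; },
);

console.log(logged[0], headers, state.nextCalled);
```

Because the wrapper always calls `next()`, the old Retry and Reset handlers keep working during Phase 2; the logged lines show how often they are still used before Phase 3 removes them.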
Phase 3: Remove Old Code
- Remove Retry button and `retryJob()` function
- Remove Reset button and `resetJob()` function
- Remove 380-line `retryJob` endpoint from emprops-api
- Archive (don't delete) `job_retry_backup` table
Files to Modify
| File | Change |
|---|---|
| `apps/emprops-api/src/routes/jobs/index.ts` | Add `resubmitJob()` (~50 lines), eventually delete `retryJob()` (~380 lines) |
| `apps/emprops-api/src/index.ts` | Add route registration |
| `apps/monitor/src/app/api/jobs/[jobId]/resubmit/route.ts` | New proxy route |
| `apps/monitor/src/components/JobForensics.tsx` | Replace buttons |
| `packages/database-schema/schema.prisma` | Add `superseded` status |
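The new Monitor proxy route can likely stay a thin pass-through to the EmProps API. The following is a rough sketch of that core logic only, not the route file itself: `buildResubmitRequest` and `proxyResubmit` are hypothetical helpers, the base URL is passed in rather than read from any particular config, and the fetch function is injected so the logic can be exercised without a network.

```typescript
// Shape of the upstream call the proxy would make to the EmProps API.
interface UpstreamRequest {
  url: string;
  init: { method: "POST" };
}

// Builds the resubmit call for a given job id.
function buildResubmitRequest(apiBaseUrl: string, jobId: string): UpstreamRequest {
  const base = apiBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/api/jobs/${encodeURIComponent(jobId)}/resubmit`,
    init: { method: "POST" },
  };
}

// Minimal view of a fetch-style response: just what the proxy relays.
interface UpstreamResponse {
  status: number;
  json(): Promise<unknown>;
}

// Forwards the resubmit call and relays the upstream status and body unchanged.
async function proxyResubmit(
  apiBaseUrl: string,
  jobId: string,
  fetchFn: (url: string, init: UpstreamRequest["init"]) => Promise<UpstreamResponse>,
): Promise<{ status: number; body: unknown }> {
  const { url, init } = buildResubmitRequest(apiBaseUrl, jobId);
  const upstream = await fetchFn(url, init);
  return { status: upstream.status, body: await upstream.json() };
}

const example = buildResubmitRequest("http://localhost:3000/", "job-123");
console.log(example.url, example.init.method);
```

Unlike the current Reset route, this path never touches the database directly, so every state change goes through the EmProps API.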
Success Criteria
- Operators have single, clear action for job recovery
- Resubmitted jobs complete at same rate as new jobs
- Job lineage clearly visible in forensics
- 400+ lines of retry code removed
Future Considerations
- Bulk resubmit for failed jobs
- Automatic resubmit rules (e.g., "resubmit on timeout")
- Rate limiting on resubmits
- Cost tracking for resubmitted jobs
