ADR: Job Retry/Reset Mechanism Simplification

Date: 2025-11-27
Status: Proposed
Deciders: Engineering Team
Tags: api, emprops-api, monitor, job-retry, reliability

Context

Problem Statement

The current job retry mechanism has become complex and confusing. The job forensics page exposes two separate buttons (Retry and Reset) that do fundamentally different things, which leads to operator confusion and unreliable job recovery.

Current System:

Button	API Route	Location	What It Does
Retry	`POST /api/jobs/{id}/retry`	Monitor -> EmProps API	Full re-execution: backs up job state, resets to pending, starts GeneratorV2, fires webhooks on completion
Reset	`POST /api/jobs/{id}/reset`	Monitor -> Direct DB	Simple status flip: sets `status='failed'`, clears timestamps, sets `progress=0`

Issues with Current Architecture

Semantic Confusion:
- "Retry" implies re-running the job (and does)
- "Reset" sounds like it would re-run the job, but it only changes status
- After "Reset", the job just sits there - nothing picks it up

Reset Does Nothing Useful:

typescript

// Reset endpoint - just flips DB state
await prisma.job.update({
  where: { id: jobId },
  data: {
    status: 'failed',  // Set to 'failed', not 'pending'
    started_at: null,
    completed_at: null,
    error_message: 'Job manually reset for retry',
    progress: 0
  }
});

- Sets status to 'failed', but no job processor picks up failed jobs automatically
- The code comment says "reset to failed so it can be retried", but no automatic retry happens
- The operator must then click Retry anyway

Retry is Complex and Fragile:
- Roughly 380 lines of code in the retryJob() function
- Creates backup records (job_retry_backup)
- Instantiates new GeneratorV2 with full event handlers
- Has its own completion/failure handlers with webhook logic
- Duplicates webhook prevention logic inline
- Only works for collection_generation job type
- Has retry count limits that can block legitimate retries

Dual Webhook Paths:
- Retry completion sends webhooks via inline handlers
- Normal job completion sends webhooks via different path
- Both have duplicate prevention logic implemented separately

Code Complexity Analysis

Monitor UI (JobForensics.tsx):

typescript

// Two separate functions, nearly identical in structure
const retryJob = async (jobId: string) => { ... }  // 50 lines
const resetJob = async (jobId: string) => { ... }  // 45 lines

EmProps API (routes/jobs/index.ts):

retryJob(): Lines 302-682 (380 lines!)
Creates its own GeneratorV2 instance
Has full event handling for node_started, node_progress, node_completed, complete, error
Manually manages job state transitions
Duplicates webhook sending logic

Monitor Reset Route (api/jobs/[jobId]/reset/route.ts):

53 lines
Just does a simple Prisma update
No connection to job queue or processor

Impact

Operator Confusion: "Which button do I click?"
Failed Retries: Reset doesn't actually retry anything
Maintenance Burden: 400+ lines of duplicated job execution logic
Inconsistent Behavior: Retry behaves differently than initial job submission
Hard to Debug: Two completely different code paths for the same goal

Decision

Proposed Simplification

Replace both buttons with a single "Resubmit Job" action that:

1. Creates a new job with the same parameters (like a fresh submission)
2. Links to the original via a resubmitted_from field in the new job's data, for traceability
3. Uses the existing job submission flow (no special retry logic)
4. Archives the old job to a terminal state (superseded)

New Architecture

User clicks "Resubmit Job"
    ↓
Monitor calls POST /api/jobs/{id}/resubmit
    ↓
EmProps API:
    1. Load original job data (collectionId, variables)
    2. Create NEW job with same parameters
    3. Update original job: status='superseded', superseded_by=new_job_id
    4. Return new job ID
    ↓
Job enters normal queue → normal execution → normal webhooks
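
The state change in steps 2 and 3 can be sketched as a pure function, independent of Prisma and the queue. This is a minimal illustration, not the actual implementation: `JobRecord`, `ResubmitPlan`, and `planResubmit` are hypothetical names, and passing in the new ID and timestamp is an assumption made so the logic can be exercised in isolation. The field names follow the proposal in this ADR.

```typescript
// Hypothetical in-memory shape; the real job model lives in schema.prisma.
interface JobRecord {
  id: string;
  status: string;
  data: {
    collectionId: string;
    variables?: Record<string, unknown>;
    resubmitted_from?: string;
    superseded_by?: string;
    superseded_at?: string;
  };
}

interface ResubmitPlan {
  newJob: JobRecord;
  supersededOriginal: JobRecord;
}

// Builds both records without mutating the original, so the caller can
// persist them together (for example in a single transaction).
function planResubmit(original: JobRecord, newJobId: string, now: Date): ResubmitPlan {
  const newJob: JobRecord = {
    id: newJobId,
    status: 'pending',
    data: {
      collectionId: original.data.collectionId,
      variables: original.data.variables,
      resubmitted_from: original.id,
    },
  };
  const supersededOriginal: JobRecord = {
    ...original,
    status: 'superseded',
    data: {
      ...original.data,
      superseded_by: newJobId,
      superseded_at: now.toISOString(),
    },
  };
  return { newJob, supersededOriginal };
}
```

Keeping this step free of I/O makes the lineage fields easy to unit test before wiring them into the endpoint.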

Proposed Implementation

New Endpoint: POST /api/jobs/{id}/resubmit

typescript

// Assumes Express-style handlers, as the (req, res) usage below implies
import type { Request, Response } from 'express';
import type { PrismaClient } from '@prisma/client';

export const resubmitJob = (prisma: PrismaClient) => {
  return async (req: Request, res: Response) => {
    const originalJobId = req.params.id;

    // 1. Load original job
    const originalJob = await prisma.job.findUnique({
      where: { id: originalJobId }
    });

    if (!originalJob) {
      return res.status(404).json({ error: 'Job not found' });
    }

    // 2. Extract parameters for resubmission (tolerate a null data column)
    const jobData = (originalJob.data ?? {}) as { collectionId?: string; variables?: Record<string, any> };

    if (!jobData.collectionId) {
      return res.status(400).json({ error: 'Cannot resubmit: missing collection ID' });
    }

    // Steps 3 and 4 run in one transaction so that a failure cannot leave
    // a new job without its lineage link, or an original marked superseded
    // with no replacement.
    const newJob = await prisma.$transaction(async (tx) => {
      // 3. Create new job (same as normal job creation)
      const created = await tx.job.create({
        data: {
          name: `${originalJob.name} (resubmit)`,
          description: originalJob.description,
          job_type: originalJob.job_type,
          user_id: originalJob.user_id,
          priority: originalJob.priority,
          data: {
            collectionId: jobData.collectionId,
            variables: jobData.variables,
            resubmitted_from: originalJobId,
          },
          status: 'pending',
          retry_count: 0,
          max_retries: originalJob.max_retries,
        }
      });

      // 4. Mark original as superseded
      await tx.job.update({
        where: { id: originalJobId },
        data: {
          status: 'superseded',
          data: {
            ...jobData,
            superseded_by: created.id,
            superseded_at: new Date().toISOString(),
          }
        }
      });

      return created;
    });

    // 5. Submit to normal job queue
    // (Use existing job submission logic)
    await submitJobToQueue(newJob);

    return res.status(200).json({
      data: {
        original_job_id: originalJobId,
        new_job_id: newJob.id,
        message: 'Job resubmitted successfully'
      }
    });
  };
};

Database Changes

Add new status value:

sql

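-- Add 'superseded' to valid job statuses (PostgreSQL enum).
-- The new value cannot be used until the transaction that adds it has committed.
ALTER TYPE job_status ADD VALUE IF NOT EXISTS 'superseded';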

UI Changes

Replace two buttons with one:

tsx

// Before: Two confusing buttons
<Button onClick={() => retryJob(jobId)}>Retry</Button>
<Button onClick={() => resetJob(jobId)}>Reset</Button>

// After: One clear action
<Button onClick={() => resubmitJob(jobId)}>
  Resubmit Job
</Button>
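
The `resubmitJob` handler behind the new button is not shown in this ADR. One possible shape is sketched here, assuming the Monitor proxy route mirrors the API path and passes through the response body of the proposed endpoint. `FetchLike`, `ResubmitResult`, and the injectable `fetchImpl` parameter are hypothetical, introduced only so the function can be exercised without a network.

```typescript
// Minimal subset of the fetch API that this sketch relies on.
type FetchLike = (
  url: string,
  init: { method: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Response payload as described in the proposed endpoint.
interface ResubmitResult {
  original_job_id: string;
  new_job_id: string;
  message: string;
}

// Calls the (assumed) Monitor proxy route and returns the new job's ID.
async function resubmitJob(jobId: string, fetchImpl: FetchLike): Promise<string> {
  const response = await fetchImpl(`/api/jobs/${encodeURIComponent(jobId)}/resubmit`, {
    method: 'POST',
  });
  if (!response.ok) {
    throw new Error(`Resubmit failed with HTTP ${response.status}`);
  }
  const body = (await response.json()) as { data: ResubmitResult };
  return body.data.new_job_id;
}
```

In the component, `fetchImpl` would typically be the global `fetch`, and the returned ID could be used to navigate the operator to the new job's forensics view.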

Rationale

Why Resubmit Instead of Retry?

Semantic Clarity: "Resubmit" clearly means "try again from scratch"
Uses Proven Path: New job goes through same flow as original
No Special Logic: No ~380-line retry handler needed
Full Traceability: Original job preserved, linked to new one
No Retry Limits: No artificial "max_retries" blocking legitimate attempts

Why Archive Original Instead of Modifying?

Audit Trail: Original job state preserved for debugging
No State Corruption: Don't risk corrupting existing job data
Clear Lineage: Can trace resubmission chain via IDs
Simple Rollback: Easy to identify what was superseded
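
To make the lineage point concrete, the following sketch walks a resubmission chain over an in-memory map. It is illustrative only: `JobLink` and `resubmissionChain` are hypothetical names, and `superseded_by` is shown as a top-level property for brevity, whereas the proposal stores it inside the job's `data` field.

```typescript
// Hypothetical minimal view of a job: only the fields needed to follow lineage.
interface JobLink {
  id: string;
  status: string;
  superseded_by?: string;
}

// Follows superseded_by links from a starting job to the most recent
// resubmission. The visited set guards against accidental cycles in bad data.
function resubmissionChain(startId: string, jobs: Map<string, JobLink>): string[] {
  const chain: string[] = [];
  const visited = new Set<string>();
  let currentId: string | undefined = startId;
  while (currentId !== undefined && !visited.has(currentId)) {
    const job: JobLink | undefined = jobs.get(currentId);
    if (!job) break; // dangling link: stop at the last known job
    chain.push(job.id);
    visited.add(currentId);
    currentId = job.superseded_by;
  }
  return chain;
}
```

The last element of the returned array is the job an operator would normally want to inspect, since every earlier entry has been superseded.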

Why Not Just Fix Reset?

Reset has a fundamental design problem: it assumes something will pick up failed jobs. Our architecture doesn't have that - jobs only enter the queue on initial submission.

Consequences

Positive

Single Clear Action: One button, one mental model
Code Reduction: Delete ~500 lines of retry/reset code
Consistent Behavior: Resubmitted jobs behave exactly like new jobs
Better Debugging: Clear job lineage via superseded_by/resubmitted_from
No Retry Limits: Operators can resubmit as many times as needed

Negative

New Job IDs: Each resubmit creates a new ID (could affect external integrations)
Migration: Need to update UI and educate operators
Status Addition: New superseded status needs schema migration

Neutral

Historical Jobs: Old retry_count fields become less meaningful
Job Backups: job_retry_backup table becomes deprecated

Migration Plan

Phase 1: Add Resubmit (Non-Breaking)

Add POST /api/jobs/{id}/resubmit endpoint
Add "Resubmit Job" button to forensics UI (alongside the existing Retry and Reset buttons)
Add superseded status to schema

Phase 2: Deprecate Old Buttons

Add deprecation warning to Retry/Reset buttons
Log usage of old endpoints
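
Usage logging for the old endpoints could be as small as a wrapper placed in front of the existing handlers. The sketch below is framework-agnostic and makes several assumptions: `createDeprecationLogger`, its parameters, and the log message format are hypothetical, and the `(req, res, next)` shape simply mirrors common Express-style middleware.

```typescript
// Minimal middleware shape so the sketch does not depend on a framework.
type Next = () => void;
type Handler = (req: unknown, res: unknown, next: Next) => void;

// Counts and logs every call to a deprecated route, then lets the
// request continue to the existing handler unchanged.
function createDeprecationLogger(
  routeName: string,
  replacement: string,
  log: (message: string) => void,
): { middleware: Handler; callCount: () => number } {
  let calls = 0;
  const middleware: Handler = (_req, _res, next) => {
    calls += 1;
    log(`Deprecated endpoint called: ${routeName}; use ${replacement} instead`);
    next();
  };
  return { middleware, callCount: () => calls };
}
```

A call count that stays at zero over an agreed observation window would be a reasonable signal that Phase 3 can begin.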

Phase 3: Remove Old Code

Remove Retry button and retryJob() function
Remove Reset button and resetJob() function
Remove 380-line retryJob endpoint from emprops-api
Archive (don't delete) job_retry_backup table

Files to Modify

File	Change
`apps/emprops-api/src/routes/jobs/index.ts`	Add `resubmitJob()` (~50 lines), eventually delete `retryJob()` (~380 lines)
`apps/emprops-api/src/index.ts`	Add route registration
`apps/monitor/src/app/api/jobs/[jobId]/resubmit/route.ts`	New proxy route
`apps/monitor/src/components/JobForensics.tsx`	Replace buttons
`packages/database-schema/schema.prisma`	Add `superseded` status

Success Criteria

Operators have single, clear action for job recovery
Resubmitted jobs complete at same rate as new jobs
Job lineage clearly visible in forensics
400+ lines of retry code removed
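
The completion-rate criterion can be checked with a small calculation over job records. This is a sketch under stated assumptions: `JobOutcome`, `completionRate`, and `compareCompletionRates` are hypothetical names, the `'completed'` status value is assumed rather than taken from this ADR, and superseded jobs are counted as finished but not completed.

```typescript
// Hypothetical flattened view of a job for reporting purposes.
interface JobOutcome {
  status: string;
  resubmittedFrom?: string; // set for jobs created by a resubmit
}

interface RateComparison {
  freshRate: number;       // completion rate of first-time submissions
  resubmittedRate: number; // completion rate of resubmitted jobs
}

// Statuses treated as finished; 'completed' is an assumed status name.
const FINISHED = new Set(['completed', 'failed', 'superseded']);

// Share of finished jobs that completed; NaN when nothing has finished yet.
function completionRate(jobs: JobOutcome[]): number {
  const finished = jobs.filter((j) => FINISHED.has(j.status));
  const completed = finished.filter((j) => j.status === 'completed');
  return finished.length === 0 ? NaN : completed.length / finished.length;
}

// Splits jobs by whether they came from a resubmit and compares the rates.
function compareCompletionRates(jobs: JobOutcome[]): RateComparison {
  return {
    freshRate: completionRate(jobs.filter((j) => j.resubmittedFrom === undefined)),
    resubmittedRate: completionRate(jobs.filter((j) => j.resubmittedFrom !== undefined)),
  };
}
```

Jobs that are still pending or running are excluded from both rates, so the comparison is only meaningful once enough resubmitted jobs have finished.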

Future Considerations

Bulk resubmit for failed jobs
Automatic resubmit rules (e.g., "resubmit on timeout")
Rate limiting on resubmits
Cost tracking for resubmitted jobs
