ADR: Job Retry/Reset Mechanism Simplification

Date: 2025-11-27
Status: Proposed
Deciders: Engineering Team
Tags: api, emprops-api, monitor, job-retry, reliability

Context

Problem Statement

The current job retry mechanism has become hard to reason about. The job forensics page offers two separate buttons, Retry and Reset, that do fundamentally different things. Operators cannot tell which one to use, and job recovery is unreliable as a result.

Current System:

| Button | API Route | Location | What It Does |
| --- | --- | --- | --- |
| Retry | POST /api/jobs/{id}/retry | Monitor -> EmProps API | Full re-execution: backs up job state, resets to pending, starts GeneratorV2, fires webhooks on completion |
| Reset | POST /api/jobs/{id}/reset | Monitor -> Direct DB | Simple status flip: sets status='failed', clears timestamps, sets progress=0 |

Issues with Current Architecture

  1. Semantic Confusion:

    • "Retry" implies re-running the job (and does)
    • "Reset" sounds like it would re-run the job, but it only changes status
    • After "Reset", the job just sits there - nothing picks it up
  2. Reset Does Nothing Useful:

    typescript
    // Reset endpoint - just flips DB state
    await prisma.job.update({
      where: { id: jobId },
      data: {
        status: 'failed',  // Set to 'failed', not 'pending'
        started_at: null,
        completed_at: null,
        error_message: 'Job manually reset for retry',
        progress: 0
      }
    });
    • Sets status to 'failed' - no job processor picks up failed jobs automatically
    • Comment says "reset to failed so it can be retried" but no automatic retry happens
    • Operator must then click Retry anyway
  3. Retry is Complex and Fragile:

    • Roughly 380 lines of code in the retryJob() function (lines 302-682 of routes/jobs/index.ts)
    • Creates backup records (job_retry_backup)
    • Instantiates new GeneratorV2 with full event handlers
    • Has its own completion/failure handlers with webhook logic
    • Duplicates webhook prevention logic inline
    • Only works for collection_generation job type
    • Has retry count limits that can block legitimate retries
  4. Dual Webhook Paths:

    • Retry completion sends webhooks via inline handlers
    • Normal job completion sends webhooks via different path
    • Both have duplicate prevention logic implemented separately

Code Complexity Analysis

Monitor UI (JobForensics.tsx):

typescript
// Two separate functions, nearly identical in structure
const retryJob = async (jobId: string) => { ... }  // 50 lines
const resetJob = async (jobId: string) => { ... }  // 45 lines

EmProps API (routes/jobs/index.ts):

  • retryJob(): Lines 302-682 (380 lines!)
  • Creates its own GeneratorV2 instance
  • Has full event handling for node_started, node_progress, node_completed, complete, error
  • Manually manages job state transitions
  • Duplicates webhook sending logic

Monitor Reset Route (api/jobs/[jobId]/reset/route.ts):

  • 53 lines
  • Just does a simple Prisma update
  • No connection to job queue or processor

Impact

  1. Operator Confusion: "Which button do I click?"
  2. Failed Retries: Reset doesn't actually retry anything
  3. Maintenance Burden: 400+ lines of duplicated job execution logic
  4. Inconsistent Behavior: Retry behaves differently than initial job submission
  5. Hard to Debug: Two completely different code paths for the same goal

Decision

Proposed Simplification

Replace both buttons with a single "Resubmit Job" action that:

  1. Creates a new job with the same parameters (like a fresh submission)
  2. Links to the original via a resubmitted_from field in the new job's data, for traceability
  3. Uses existing job submission flow (no special retry logic)
  4. Archives the old job to a terminal state

New Architecture

User clicks "Resubmit Job"
    ↓
Monitor calls POST /api/jobs/{id}/resubmit
    ↓
EmProps API:
    1. Load original job data (collectionId, variables)
    2. Create NEW job with same parameters
    3. Update original job: status='superseded', superseded_by=new_job_id
    4. Return new job ID
    ↓
Job enters normal queue → normal execution → normal webhooks

Proposed Implementation

New Endpoint: POST /api/jobs/{id}/resubmit

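typescript
// Assumed imports: the req/res usage below matches Express, and PrismaClient
// comes from the generated Prisma client.
import type { Request, Response } from 'express';
import type { PrismaClient } from '@prisma/client';

// submitJobToQueue (used in step 5) stands in for the existing job submission logic.
export const resubmitJob = (prisma: PrismaClient) => {
  return async (req: Request, res: Response) => {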
    const originalJobId = req.params.id;

    // 1. Load original job
    const originalJob = await prisma.job.findUnique({
      where: { id: originalJobId }
    });

    if (!originalJob) {
      return res.status(404).json({ error: 'Job not found' });
    }

    // 2. Extract parameters for resubmission
    const jobData = originalJob.data as { collectionId?: string; variables?: Record<string, any> };

    if (!jobData.collectionId) {
      return res.status(400).json({ error: 'Cannot resubmit: missing collection ID' });
    }

    // 3. Create new job (same as normal job creation)
    const newJob = await prisma.job.create({
      data: {
        name: `${originalJob.name} (resubmit)`,
        description: originalJob.description,
        job_type: originalJob.job_type,
        user_id: originalJob.user_id,
        priority: originalJob.priority,
        data: {
          collectionId: jobData.collectionId,
          variables: jobData.variables,
          resubmitted_from: originalJobId,
        },
        status: 'pending',
        retry_count: 0,
        max_retries: originalJob.max_retries,
      }
    });

    // 4. Mark original as superseded
    await prisma.job.update({
      where: { id: originalJobId },
      data: {
        status: 'superseded',
        data: {
          ...jobData,
          superseded_by: newJob.id,
          superseded_at: new Date().toISOString(),
        }
      }
    });

    // 5. Submit to normal job queue
    // (Use existing job submission logic)
    await submitJobToQueue(newJob);

    return res.status(200).json({
      data: {
        original_job_id: originalJobId,
        new_job_id: newJob.id,
        message: 'Job resubmitted successfully'
      }
    });
  };
};
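
The handler above mixes HTTP, database, and queue concerns, which makes the field-copying rules awkward to unit test. One option, offered as a sketch rather than as part of the proposal, is to isolate those rules in a pure function. The names `JobRecord`, `JobData`, `NewJobInput`, `SupersedePatch`, and `planResubmission` are hypothetical, and the job shape only covers the fields the handler reads.

```typescript
// Hypothetical job shape, limited to the fields the handler above touches.
type JobData = {
  collectionId?: string;
  variables?: Record<string, unknown>;
  [key: string]: unknown;
};

interface JobRecord {
  id: string;
  name: string;
  description: string | null;
  job_type: string;
  user_id: string;
  priority: number;
  max_retries: number;
  data: JobData;
}

interface NewJobInput extends Omit<JobRecord, "id"> {
  status: "pending";
  retry_count: number;
}

interface SupersedePatch {
  status: "superseded";
  data: JobData;
}

type ResubmissionPlan =
  | { ok: false; error: string }
  | {
      ok: true;
      create: NewJobInput;
      // Called once the new job's ID is known.
      supersede: (newJobId: string, now: Date) => SupersedePatch;
    };

// Computes what to write without performing any I/O.
function planResubmission(original: JobRecord): ResubmissionPlan {
  const { collectionId, variables } = original.data;
  if (!collectionId) {
    return { ok: false, error: "Cannot resubmit: missing collection ID" };
  }
  return {
    ok: true,
    create: {
      name: `${original.name} (resubmit)`,
      description: original.description,
      job_type: original.job_type,
      user_id: original.user_id,
      priority: original.priority,
      max_retries: original.max_retries,
      data: { collectionId, variables, resubmitted_from: original.id },
      status: "pending",
      retry_count: 0,
    },
    supersede: (newJobId: string, now: Date): SupersedePatch => ({
      status: "superseded",
      // Keep the original payload intact and add only the lineage fields.
      data: { ...original.data, superseded_by: newJobId, superseded_at: now.toISOString() },
    }),
  };
}
```

With this split, the route handler would only need to call `planResubmission`, pass `create` to `prisma.job.create`, and pass the result of `supersede(newJob.id, new Date())` to `prisma.job.update`.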

Database Changes

Add new status value:

sql
-- Add 'superseded' to valid job statuses
ALTER TYPE job_status ADD VALUE 'superseded';
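
Because superseded is intended to be terminal, the resubmit endpoint may want to reject a job that is already in that state. Otherwise one original could end up with several successors. The check below is a minimal sketch of that idea and is not part of the proposed handler; `ResubmitCandidate` and `checkResubmittable` are hypothetical names.

```typescript
// Hypothetical slice of a job row: only the fields this check needs.
interface ResubmitCandidate {
  id: string;
  status: string;
  data: { superseded_by?: string };
}

type ResubmitCheck = { ok: true } | { ok: false; error: string };

// A superseded job already has a successor, so resubmitting it again would
// fork the lineage chain. Point the operator at the successor instead.
function checkResubmittable(job: ResubmitCandidate): ResubmitCheck {
  if (job.status === "superseded") {
    const successor = job.data.superseded_by ?? "an unknown job";
    return { ok: false, error: `Job ${job.id} was already superseded by ${successor}` };
  }
  return { ok: true };
}
```

Whether this should be a hard rejection (for example a 409 response) or only a warning in the UI is a design choice this ADR leaves open.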

UI Changes

Replace two buttons with one:

tsx
// Before: Two confusing buttons
<Button onClick={() => retryJob(jobId)}>Retry</Button>
<Button onClick={() => resetJob(jobId)}>Reset</Button>

// After: One clear action
<Button onClick={() => resubmitJob(jobId)}>
  Resubmit Job
</Button>
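
The `resubmitJob` click handler is not defined in this ADR. A minimal sketch of what it might look like follows, assuming the Monitor proxy route is reachable at `/api/jobs/{id}/resubmit` and returns the JSON shape from the proposed endpoint. `ResubmitResponse`, `FetchLike`, and `buildResubmitRequest` are hypothetical names, and the fetch function is passed in so the helper can be tested with a stub.

```typescript
// Hypothetical success payload, mirroring the JSON returned by the proposed endpoint.
interface ResubmitResponse {
  data: { original_job_id: string; new_job_id: string; message: string };
}

// Narrow fetch-like signature so the helper can be exercised without a network.
type FetchLike = (
  url: string,
  init: { method: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the request separately from sending it.
function buildResubmitRequest(jobId: string): { url: string; init: { method: string } } {
  return {
    url: `/api/jobs/${encodeURIComponent(jobId)}/resubmit`,
    init: { method: "POST" },
  };
}

// Sends the request and resolves to the new job ID, or rejects with the
// server's error text when the response is not successful.
async function resubmitJob(jobId: string, fetchImpl: FetchLike): Promise<string> {
  const { url, init } = buildResubmitRequest(jobId);
  const res = await fetchImpl(url, init);
  const body = (await res.json()) as Partial<ResubmitResponse> & { error?: string };
  if (!res.ok || !body.data) {
    throw new Error(body.error ?? `Resubmit failed with status ${res.status}`);
  }
  return body.data.new_job_id;
}
```

In the component, the button would call something like `resubmitJob(jobId, fetch)` and then navigate to or display the returned job ID.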

Rationale

Why Resubmit Instead of Retry?

  1. Semantic Clarity: "Resubmit" clearly means "try again from scratch"
  2. Uses Proven Path: New job goes through same flow as original
  3. No Special Logic: No ~380-line retry handler needed
  4. Full Traceability: Original job preserved, linked to new one
  5. No Retry Limits: No artificial "max_retries" blocking legitimate attempts

Why Archive Original Instead of Modifying?

  1. Audit Trail: Original job state preserved for debugging
  2. No State Corruption: Don't risk corrupting existing job data
  3. Clear Lineage: Can trace resubmission chain via IDs
  4. Simple Rollback: Easy to identify what was superseded
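
To illustrate the lineage point, the sketch below walks a resubmission chain by following `superseded_by` links. It assumes jobs are available in an in-memory lookup (a stand-in for database reads); `JobNode` and `traceLineage` are hypothetical names.

```typescript
// Hypothetical minimal job shape carrying only the lineage fields.
interface JobNode {
  id: string;
  data: { superseded_by?: string; resubmitted_from?: string };
}

// Follows superseded_by links from a starting job to the newest resubmission.
// Stops on a missing job or a repeated ID so bad data cannot loop forever.
function traceLineage(startId: string, jobs: Record<string, JobNode>): string[] {
  const chain: string[] = [];
  let currentId: string | undefined = startId;
  while (currentId !== undefined && chain.indexOf(currentId) === -1) {
    const job: JobNode | undefined = jobs[currentId];
    if (!job) break;
    chain.push(job.id);
    currentId = job.data.superseded_by;
  }
  return chain;
}
```

The same walk works in reverse by following `resubmitted_from` from the newest job back to the original.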

Why Not Just Fix Reset?

Reset has a fundamental design problem: it assumes something will pick up failed jobs. Our architecture doesn't have that - jobs only enter the queue on initial submission.
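
The toy model below makes that gap concrete. It is a deliberately simplified in-memory stand-in, not the real queue or processor: submission is the only operation that enqueues work, so a reset job changes state in the table but never reaches a worker. `JobSystem` and its methods are hypothetical.

```typescript
type ToyStatus = "pending" | "failed";

interface ToyJob {
  id: string;
  status: ToyStatus;
}

class JobSystem {
  private queue: string[] = [];
  readonly jobs: Record<string, ToyJob> = {};

  // Initial submission is the only path that enqueues work.
  submit(id: string): void {
    this.jobs[id] = { id, status: "pending" };
    this.queue.push(id);
  }

  // Mirrors the Reset route: a record update with no queue interaction.
  reset(id: string): void {
    const job = this.jobs[id];
    if (job) job.status = "failed";
  }

  // A worker takes the next queued ID; it never scans for failed jobs.
  takeNext(): string | undefined {
    return this.queue.shift();
  }
}
```

After `submit`, one `takeNext`, and a `reset`, a second `takeNext` returns `undefined`: nothing is left for a worker to pick up.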

Consequences

Positive

  1. Single Clear Action: One button, one mental model
  2. Code Reduction: Delete ~500 lines of retry/reset code
  3. Consistent Behavior: Resubmitted jobs behave exactly like new jobs
  4. Better Debugging: Clear job lineage via superseded_by/resubmitted_from
  5. No Retry Limits: Operators can resubmit as many times as needed

Negative

  1. New Job IDs: Each resubmit creates a new ID, so external integrations keyed on the original job ID would need to follow the superseded_by link
  2. Migration: Need to update UI and educate operators
  3. Status Addition: New superseded status needs schema migration

Neutral

  1. Historical Jobs: Old retry_count fields become less meaningful
  2. Job Backups: job_retry_backup table becomes deprecated

Migration Plan

Phase 1: Add Resubmit (Non-Breaking)

  1. Add POST /api/jobs/{id}/resubmit endpoint
  2. Add "Resubmit" button to forensics UI (alongside existing)
  3. Add superseded status to schema

Phase 2: Deprecate Old Buttons

  1. Add deprecation warning to Retry/Reset buttons
  2. Log usage of old endpoints
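
One way to log usage of the old endpoints is to wrap the existing handlers, as sketched below. The request shape is reduced to two fields so the example does not depend on a web framework; `LegacyRequest`, `LegacyHandler`, `legacyUsage`, and `withDeprecationLog` are hypothetical names, and a real version would use the project's own logger.

```typescript
// Minimal request shape so the sketch stays framework-independent.
interface LegacyRequest {
  method: string;
  path: string;
}

type LegacyHandler = (req: LegacyRequest) => void;

// Per-endpoint call counts, useful for judging when Phase 3 removal is safe.
const legacyUsage: Record<string, number> = {};

// Wraps a legacy handler: counts and logs the call, then delegates unchanged.
function withDeprecationLog(
  replacement: string,
  handler: LegacyHandler,
  log: (message: string) => void
): LegacyHandler {
  return (req) => {
    const key = `${req.method} ${req.path}`;
    legacyUsage[key] = (legacyUsage[key] ?? 0) + 1;
    log(`Deprecated endpoint called: ${key} (use ${replacement} instead)`);
    handler(req);
  };
}
```

Because the wrapper delegates without altering the request, the old Retry and Reset behavior stays the same during Phase 2.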

Phase 3: Remove Old Code

  1. Remove Retry button and retryJob() function
  2. Remove Reset button and resetJob() function
  3. Remove 380-line retryJob endpoint from emprops-api
  4. Archive (don't delete) job_retry_backup table

Files to Modify

| File | Change |
| --- | --- |
| apps/emprops-api/src/routes/jobs/index.ts | Add resubmitJob() (~50 lines), eventually delete retryJob() (~380 lines) |
| apps/emprops-api/src/index.ts | Add route registration |
| apps/monitor/src/app/api/jobs/[jobId]/resubmit/route.ts | New proxy route |
| apps/monitor/src/components/JobForensics.tsx | Replace buttons |
| packages/database-schema/schema.prisma | Add superseded status |

Success Criteria

  1. Operators have single, clear action for job recovery
  2. Resubmitted jobs complete at same rate as new jobs
  3. Job lineage clearly visible in forensics
  4. 400+ lines of retry code removed

Future Considerations

  1. Bulk resubmit for failed jobs
  2. Automatic resubmit rules (e.g., "resubmit on timeout")
  3. Rate limiting on resubmits
  4. Cost tracking for resubmitted jobs

Released under the MIT License.