ADR: Job Output Recovery Endpoint

Date: 2025-11-18
Status: Proposed
Deciders: Engineering Team
Tags: api, emprops-api, data-recovery, workflow-output

Context

Problem Statement

We have a recurring production issue where jobs complete successfully but have workflow_output set to null in the database, even though the output exists in Redis. This breaks the user experience: users cannot access their generated content.

Symptoms:

  • Job status: completed
  • Job workflow_output: null
  • Output does exist in Redis at api:workflow:completion:{workflow_id}
  • User sees "Job completed" but no image/video to download

Suspected Root Causes (require separate investigation):

  1. Race condition during job completion webhook processing
  2. Network failure between job-queue API and emprops-API during webhook
  3. Database transaction failure after Redis write but before DB update
  4. Webhook retry giving up before workflow_output is captured

Impact

  • User Experience: Users see completed jobs with no output - appears broken
  • Support Burden: Manual database updates required to fix individual jobs
  • Data Loss Risk: Redis data expires (30-day TTL), after which recovery is impossible
  • Trust Erosion: Users lose confidence in the platform

Constraints

  1. Time Constraint: A deep root-cause investigation is not feasible right now; a quick fix is needed
  2. Data Availability: Output data IS available in Redis (30-day TTL)
  3. Recovery Window: Must recover before Redis expires the attestation data
  4. API Contract: External callers shouldn't know they're triggering a fix
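The recovery window in constraints 2 and 3 reduces to date arithmetic. The following is a minimal sketch, assuming the Redis key is written at roughly the moment the job completes (this ADR does not state when the TTL clock starts). The names `recoveryDeadline` and `isWithinRecoveryWindow` are hypothetical.

```typescript
// Assumption: the 30-day TTL is measured from the job's completion time.
const REDIS_TTL_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Latest moment at which the Redis entry can still be expected to exist.
function recoveryDeadline(completedAt: Date): Date {
  return new Date(completedAt.getTime() + REDIS_TTL_DAYS * MS_PER_DAY);
}

// True while recovery from Redis is still possible under the assumption above.
function isWithinRecoveryWindow(completedAt: Date, now: Date): boolean {
  return now.getTime() < recoveryDeadline(completedAt).getTime();
}
```

A check like this could be used to prioritize jobs that are close to the deadline, since those are the ones at risk of permanent data loss.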

Decision

Create a GET /jobs/{id}/output endpoint in emprops-API that:

  1. Returns workflow_output if it exists (fast path)

  2. Attempts recovery from Redis if workflow_output is null (slow path):

    • Fetches workflow completion attestation from job-queue API
    • Extracts workflow_outputs array from attestation
    • Validates MIME type (image or video expected)
    • Updates job record with recovered URL
    • Returns the recovered output
  3. Marks the job as failed if recovery is impossible:

    • Redis data not found (expired or never written)
    • workflow_outputs array empty
    • No valid URL in outputs
    • Invalid MIME type

API Design

GET /jobs/{id}/output

Response (Success - Existing):
{
  "data": {
    "job_id": "91c43718...",
    "workflow_output": "https://...",
    "status": "completed"
  },
  "error": null
}

Response (Success - Recovered):
{
  "data": {
    "job_id": "91c43718...",
    "workflow_output": "https://...",
    "mime_type": "image/png",
    "status": "completed",
    "recovered": true
  },
  "error": null
}

Response (Failure):
{
  "data": null,
  "error": "Workflow output not found in Redis"
}
// The job status is also updated to "failed" in the database, and the endpoint returns HTTP 404.
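The three response shapes above share one envelope, which can be captured as a discriminated union. This is a sketch only: the type and function names are hypothetical, while the field names are taken from the JSON examples.

```typescript
// Hypothetical type names; field names mirror the JSON examples.
interface JobOutputData {
  job_id: string;
  workflow_output: string;
  status: "completed";
  mime_type?: string; // present only on recovered responses
  recovered?: true;   // present only on recovered responses
}

// Exactly one of `data` and `error` is non-null.
type JobOutputEnvelope =
  | { data: JobOutputData; error: null }
  | { data: null; error: string };

// Fast path: workflow_output was already stored on the job.
function existingOutput(jobId: string, url: string): JobOutputEnvelope {
  return { data: { job_id: jobId, workflow_output: url, status: "completed" }, error: null };
}

// Slow path: workflow_output was recovered from Redis.
function recoveredOutput(jobId: string, url: string, mimeType: string): JobOutputEnvelope {
  return {
    data: { job_id: jobId, workflow_output: url, mime_type: mimeType, status: "completed", recovered: true },
    error: null,
  };
}

// Recovery was impossible.
function outputFailure(message: string): JobOutputEnvelope {
  return { data: null, error: message };
}
```

Callers that only read `data.workflow_output` handle the existing and recovered cases identically, which matches the goal of keeping recovery out of the API contract.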

Rationale

Why GET instead of POST?

From the caller's perspective, they're simply retrieving the job output. They don't know (and shouldn't need to know) that the endpoint might perform recovery internally. The recovery is an implementation detail, not part of the API contract.

Why Update Job Status on Failure?

If we can't recover the output from Redis, the job is effectively failed even though it was marked completed. Updating the status reflects reality and prevents repeated recovery attempts.

Why Not Fix Root Cause First?

Time vs. Risk Trade-off: Root cause investigation takes days or weeks, but users need relief immediately. The root-cause fix can proceed in parallel.

Implementation

Endpoint Logic

typescript
export const getJobOutput = (prisma: PrismaClient) => {
  return async (req: Request, res: Response) => {
    // 1. Load job from database
    // 2. Check permissions
    // 3. Verify job is completed
    // 4. If workflow_output exists → return immediately
    // 5. If workflow_output is null:
    //    a. Fetch from Redis: api:workflow:completion:{workflow_id}
    //    b. Parse workflow_outputs array
    //    c. Validate MIME type
    //    d. Update job.workflow_output
    //    e. Add job_history entry
    //    f. Return recovered URL
    // 6. If recovery fails → mark job failed, return 404
  };
};
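Steps 3 through 6 of the outline above can be sketched as a pure function with injected dependencies, which keeps the decision logic testable without Prisma or a live Job Queue API. Everything named here (`JobRecord`, `RecoveryDeps`, `recoverJobOutput`, and the shape of a workflow_outputs entry) is an assumption for illustration, not the actual emprops-api code; loading the job and checking permissions (steps 1 and 2) are left to the route handler.

```typescript
// Hypothetical shapes; only the field names status, workflow_id and
// workflow_output come from this ADR.
interface JobRecord {
  id: string;
  workflow_id: string;
  status: string;
  workflow_output: string | null;
}

// Assumed shape of one entry in the attestation's workflow_outputs array.
interface WorkflowOutputEntry { url?: string; mime_type?: string }

interface RecoveryDeps {
  // Stand-in for the Job Queue API lookup; resolves to null if the key is missing.
  fetchOutputs(workflowId: string): Promise<WorkflowOutputEntry[] | null>;
  // Stand-ins for the database writes (job update plus job_history entry).
  saveOutput(jobId: string, url: string): Promise<void>;
  markFailed(jobId: string, reason: string): Promise<void>;
  isValidMime(mime: string): boolean;
}

type RecoveryOutcome =
  | { kind: "not_completed" }
  | { kind: "existing"; url: string }
  | { kind: "recovered"; url: string; mimeType: string }
  | { kind: "failed"; error: string };

async function recoverJobOutput(job: JobRecord, deps: RecoveryDeps): Promise<RecoveryOutcome> {
  if (job.status !== "completed") return { kind: "not_completed" };
  // Fast path: nothing to recover.
  if (job.workflow_output !== null) return { kind: "existing", url: job.workflow_output };

  // Every unrecoverable case marks the job failed before reporting the error.
  const fail = async (reason: string): Promise<RecoveryOutcome> => {
    await deps.markFailed(job.id, reason);
    return { kind: "failed", error: reason };
  };

  const outputs = await deps.fetchOutputs(job.workflow_id);
  if (outputs === null) return fail("Workflow output not found in Redis");
  if (outputs.length === 0) return fail("workflow_outputs array is empty");

  // Use the first entry that carries a non-empty URL.
  const entry = outputs.find((o) => typeof o.url === "string" && o.url !== "");
  const url = entry?.url;
  const mimeType = entry?.mime_type;
  if (!url) return fail("No valid URL in workflow_outputs");
  if (!mimeType || !deps.isValidMime(mimeType)) return fail("Invalid MIME type");

  await deps.saveOutput(job.id, url);
  return { kind: "recovered", url, mimeType };
}
```

In this sketch the handler would map `existing` and `recovered` to a success response and `failed` to the 404 described in step 6. The response for `not_completed` is not specified by this ADR and is left open.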

Files to Modify

  1. apps/emprops-api/src/routes/jobs/index.ts

    • Add getJobOutput() function
  2. apps/emprops-api/src/index.ts

    • Register GET /jobs/:id/output route

Dependencies

  • Job Queue API: POST {JOB_QUEUE_API_URL}/redis/get
  • Redis key: api:workflow:completion:{workflow_id}
  • Environment: JOB_QUEUE_API_URL
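The lookup against the Job Queue API could be assembled as shown below. The request body format is an assumption: this ADR names the route and the Redis key pattern but not the payload, so the `{ key }` body and the helper names `completionKey` and `buildRedisGetRequest` are hypothetical.

```typescript
// Key pattern taken from this ADR.
function completionKey(workflowId: string): string {
  return `api:workflow:completion:${workflowId}`;
}

// Assumption: POST /redis/get accepts a JSON body of the form { key }.
function buildRedisGetRequest(baseUrl: string, workflowId: string) {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/redis/get`,
    init: {
      method: "POST" as const,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ key: completionKey(workflowId) }),
    },
  };
}
```

The returned `url` and `init` can be passed to `fetch(url, init)`, with `baseUrl` read from the JOB_QUEUE_API_URL environment variable.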

Validation Rules

typescript
const validMimeTypes = [
  "image/png", "image/jpeg", "image/jpg",
  "image/webp", "image/gif",
  "video/mp4", "video/webm", "video/quicktime"
];
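A validator built on the list above might look like the following sketch. Lowercasing the value and stripping parameters such as `; codecs=...` are assumptions about what the stored MIME strings could look like; the ADR does not specify any normalization, and `isValidOutputMime` is a hypothetical name.

```typescript
const VALID_MIME_TYPES = new Set([
  "image/png", "image/jpeg", "image/jpg",
  "image/webp", "image/gif",
  "video/mp4", "video/webm", "video/quicktime",
]);

// Compares only the type/subtype part, case-insensitively.
function isValidOutputMime(mime: string): boolean {
  const [essence = ""] = mime.split(";");
  return VALID_MIME_TYPES.has(essence.trim().toLowerCase());
}
```

Using a Set keeps the lookup constant-time and makes it harder to introduce duplicates when the list grows.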

Consequences

Positive

  1. Immediate User Relief: Users can access outputs without support intervention
  2. Self-Healing: System automatically recovers from transient failures
  3. Audit Trail: Recovery logged in job_history
  4. Transparent: API callers don't need code changes

Negative

  1. Not a Root Cause Fix: Problem will continue occurring
  2. Performance: The first call for a job with null workflow_output incurs an extra round trip to the Job Queue API to read Redis
  3. TTL Dependency: Cannot recover after 30-day Redis expiration
  4. Unsafe GET: The slow path mutates database state, so the request is not "safe" in the HTTP sense

Alternatives Considered

  1. Manual Database Updates - Doesn't scale
  2. Background Batch Job - Doesn't help users immediately
  3. Improved Webhook Retry - Doesn't fix existing broken jobs

Success Criteria

  1. Users can access outputs for jobs with null workflow_output
  2. Recovery success rate >90% (within Redis TTL window)
  3. No user-reported "missing output" tickets

Future Work

  1. Root cause investigation
  2. Improved webhook reliability
  3. Proactive detection and fixing
  4. Remove endpoint once root cause fixed

Released under the MIT License.