ADR: Job Output Recovery Endpoint

Date: 2025-11-18
Status: Proposed
Deciders: Engineering Team
Tags: api, emprops-api, data-recovery, workflow-output

Context

Problem Statement

We have a recurring production issue where jobs complete successfully but have workflow_output set to null in the database, even though the output exists in Redis. Affected users cannot access their generated content.

Symptoms:

Job status: completed
Job workflow_output: null
Output does exist in Redis at api:workflow:completion:{workflow_id}
User sees "Job completed" but no image/video to download

Suspected Root Causes (require separate investigation):

Race condition during job completion webhook processing
Network failure between job-queue API and emprops-API during webhook
Database transaction failure after Redis write but before DB update
Webhook retry giving up before workflow_output is captured

Impact

User Experience: Users see completed jobs with no output - appears broken
Support Burden: Manual database updates required to fix individual jobs
Data Loss Risk: Redis data expires (30-day TTL), after which recovery is impossible
Trust Erosion: Users lose confidence in the platform

Constraints

Time Constraint: Cannot afford deep investigation right now - need quick fix
Data Availability: Output data IS available in Redis (30-day TTL)
Recovery Window: Must recover before Redis expires the attestation data
API Contract: External callers shouldn't know they're triggering a fix
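
The recovery window follows directly from the 30-day TTL. As a rough sketch, assuming the TTL clock starts when the completion record is written to Redis and that this moment is close to the job's completion timestamp (neither is confirmed above; `completedAt` and both function names are hypothetical), the deadline could be computed like this:

```typescript
// 30-day Redis TTL from the constraints above, in milliseconds.
const REDIS_TTL_MS = 30 * 24 * 60 * 60 * 1000;

// Assumption: the TTL starts at (roughly) the job's completion time.
function recoveryDeadline(completedAt: Date): Date {
  return new Date(completedAt.getTime() + REDIS_TTL_MS);
}

// True while the Redis record should still exist and recovery can be attempted.
function isRecoverable(completedAt: Date, now: Date): boolean {
  return now.getTime() < recoveryDeadline(completedAt).getTime();
}
```

A check like this could let the endpoint skip the Redis round trip for jobs that are clearly past the window.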

Decision

Create a GET /jobs/{id}/output endpoint in emprops-API that:

Returns workflow_output if it exists (fast path)
Attempts recovery from Redis if null (slow path):
- Fetches workflow completion attestation from job-queue API
- Extracts workflow_outputs array from attestation
- Validates MIME type (image or video expected)
- Updates job record with recovered URL
- Returns the recovered output
Marks job as failed if recovery impossible:
- Redis data not found (expired or never written)
- workflow_outputs array empty
- No valid URL in outputs
- Invalid MIME type
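
The three branches above can be sketched as a pure classification function. This is illustrative only: the attestation schema is not specified in this ADR, so the `url` and `mime_type` field names, the `Outcome` type, and `decideOutcome` itself are assumptions.

```typescript
// Hypothetical shapes: the ADR does not define the attestation schema.
type WorkflowOutput = { url?: string; mime_type?: string };
type Attestation = { workflow_outputs?: WorkflowOutput[] };

type Outcome =
  | { kind: "existing"; url: string }
  | { kind: "recovered"; url: string; mimeType: string }
  | { kind: "failed"; reason: string };

function decideOutcome(
  workflowOutput: string | null,
  attestation: Attestation | null,
): Outcome {
  // Fast path: the database already holds the output.
  if (workflowOutput !== null) return { kind: "existing", url: workflowOutput };

  // Slow path: each failure below corresponds to one bullet in the list above.
  if (attestation === null) {
    return { kind: "failed", reason: "Workflow output not found in Redis" };
  }
  const outputs = attestation.workflow_outputs ?? [];
  if (outputs.length === 0) {
    return { kind: "failed", reason: "workflow_outputs array is empty" };
  }
  const candidate = outputs.find((o) => typeof o.url === "string" && o.url.length > 0);
  const url = candidate?.url;
  if (candidate === undefined || url === undefined) {
    return { kind: "failed", reason: "No valid URL in outputs" };
  }
  // "image or video expected": checked here by top-level type only.
  const mimeType = candidate.mime_type ?? "";
  if (!/^(image|video)\//.test(mimeType)) {
    return { kind: "failed", reason: "Invalid MIME type" };
  }
  return { kind: "recovered", url, mimeType };
}
```

Keeping this step free of I/O makes each failure reason easy to unit test before wiring in the database and the job-queue API.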

API Design

GET /jobs/{id}/output

Response (Success - Existing):
{
  "data": {
    "job_id": "91c43718...",
    "workflow_output": "https://...",
    "status": "completed"
  },
  "error": null
}

Response (Success - Recovered):
{
  "data": {
    "job_id": "91c43718...",
    "workflow_output": "https://...",
    "mime_type": "image/png",
    "status": "completed",
    "recovered": true
  },
  "error": null
}

Response (Failure):
{
  "data": null,
  "error": "Workflow output not found in Redis"
}
// Job status updated to "failed" in database
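
For callers, the three responses share one envelope. The types below are a sketch derived from the example payloads above; treating `mime_type` and `recovered` as optional is an inference from those examples, and `unwrapOutputUrl` is a hypothetical client-side helper, not part of the API.

```typescript
// Field names come from the example responses; the optional fields
// appear only on the recovered path.
type JobOutputData = {
  job_id: string;
  workflow_output: string;
  status: string;
  mime_type?: string;
  recovered?: boolean;
};

type JobOutputResponse = {
  data: JobOutputData | null;
  error: string | null;
};

// Returns the output URL, or throws with the API's error message.
function unwrapOutputUrl(response: JobOutputResponse): string {
  if (response.data === null) {
    throw new Error(response.error ?? "Unknown error");
  }
  return response.data.workflow_output;
}
```

Because the existing and recovered responses differ only in optional fields, a caller written against the fast path keeps working when recovery happens.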

Rationale

Why GET instead of POST?

From the caller's perspective, they're simply retrieving the job output. They don't know (and shouldn't need to know) that the endpoint might perform recovery internally. The recovery is an implementation detail, not part of the API contract.

Why Update Job Status on Failure?

If we can't recover the output from Redis, the job is effectively failed even though it was marked completed. Updating the status reflects reality and prevents repeated recovery attempts.

Why Not Fix Root Cause First?

Time vs. Risk Trade-off: Root cause investigation takes days/weeks, but users need relief immediately. We can pursue root cause fix in parallel.

Implementation

Endpoint Logic

typescript

export const getJobOutput = (prisma: PrismaClient) => {
  return async (req: Request, res: Response) => {
    // 1. Load job from database
    // 2. Check permissions
    // 3. Verify job is completed
    // 4. If workflow_output exists → return immediately
    // 5. If workflow_output is null:
    //    a. Fetch from Redis: api:workflow:completion:{workflow_id}
    //    b. Parse workflow_outputs array
    //    c. Validate MIME type
    //    d. Update job.workflow_output
    //    e. Add job_history entry
    //    f. Return recovered URL
    // 6. If recovery fails → mark job failed, return 404
  };
};
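
The numbered steps can be fleshed out as follows. This is a sketch, not the actual handler: it is framework-independent, every name (`Deps`, `handleGetJobOutput`, the `user_id` and `workflow_id` fields) is hypothetical, and the 403 and 409 status codes are assumptions, since the outline above only specifies 404 for failed recovery.

```typescript
// Hypothetical types standing in for the Prisma model and the job-queue client.
type Job = {
  id: string;
  user_id: string;
  status: string;
  workflow_id: string;
  workflow_output: string | null;
};
type Recovered = { url: string; mimeType: string };

interface Deps {
  findJob(id: string): Promise<Job | null>;               // step 1
  recover(workflowId: string): Promise<Recovered | null>; // steps 5a-5c
  saveOutput(id: string, url: string): Promise<void>;     // steps 5d-5e
  markFailed(id: string): Promise<void>;                  // step 6
}

type Result = { status: number; data: Record<string, unknown> | null; error: string | null };

const fail = (status: number, error: string): Result => ({ status, data: null, error });

async function handleGetJobOutput(deps: Deps, jobId: string, userId: string): Promise<Result> {
  const job = await deps.findJob(jobId);
  if (job === null) return fail(404, "Job not found");
  // Assumption: a simple ownership check; the outline only says "check permissions".
  if (job.user_id !== userId) return fail(403, "Forbidden");
  if (job.status !== "completed") return fail(409, "Job is not completed");

  const base = { job_id: job.id, status: job.status };
  if (job.workflow_output !== null) {
    // Step 4: fast path, no Redis access.
    return { status: 200, data: { ...base, workflow_output: job.workflow_output }, error: null };
  }

  const recovered = await deps.recover(job.workflow_id);
  if (recovered === null) {
    await deps.markFailed(job.id);
    return fail(404, "Workflow output not found in Redis");
  }
  await deps.saveOutput(job.id, recovered.url);
  return {
    status: 200,
    data: { ...base, workflow_output: recovered.url, mime_type: recovered.mimeType, recovered: true },
    error: null,
  };
}
```

Injecting the database and job-queue calls through `Deps` keeps the control flow testable with in-memory fakes; a route handler would then translate `Result` into the HTTP response.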

Files to Modify

apps/emprops-api/src/routes/jobs/index.ts
- Add getJobOutput() function
apps/emprops-api/src/index.ts
- Register GET /jobs/:id/output route

Dependencies

Job Queue API: POST {JOB_QUEUE_API_URL}/redis/get
Redis key: api:workflow:completion:{workflow_id}
Environment: JOB_QUEUE_API_URL
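
A request to the job-queue API might be assembled as shown below. The key format comes from this ADR, but the request body is an assumption: the schema of `/redis/get` is not documented here, so the `{ key }` payload and both function names are hypothetical.

```typescript
// Key format taken from the Dependencies list above.
function completionKey(workflowId: string): string {
  return `api:workflow:completion:${workflowId}`;
}

// Assumption: /redis/get accepts a JSON body of the form { key }.
// The real request schema may differ.
function buildRedisGetRequest(baseUrl: string, workflowId: string) {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/redis/get`,
    method: "POST" as const,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ key: completionKey(workflowId) }),
  };
}
```

The returned object can be handed to whichever HTTP client the service already uses, with `JOB_QUEUE_API_URL` supplied as `baseUrl`.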

Validation Rules

typescript

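// Accepted output MIME types. "image/jpg" is not an IANA-registered type
// (the registered form is "image/jpeg") but is accepted here as an alias.
const validMimeTypes = [
  "image/png", "image/jpeg", "image/jpg",
  "image/webp", "image/gif",
  "video/mp4", "video/webm", "video/quicktime",
];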
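
One way to apply the allow-list is sketched below. It assumes, without confirmation from this ADR, that upstream values may differ in case or carry parameters such as `; codecs=...`, so the value is normalized before the lookup. `isValidMimeType` is a hypothetical helper name.

```typescript
const VALID_MIME_TYPES = new Set([
  "image/png", "image/jpeg", "image/jpg",
  "image/webp", "image/gif",
  "video/mp4", "video/webm", "video/quicktime",
]);

// Assumption: values may arrive as "Image/PNG" or "video/mp4; codecs=avc1",
// so drop any parameters, trim, and lowercase before checking the allow-list.
function isValidMimeType(mimeType: unknown): boolean {
  if (typeof mimeType !== "string") return false;
  const essence = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  return VALID_MIME_TYPES.has(essence);
}
```

If upstream values are known to be exact matches already, a plain `includes` check on the array is enough.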

Consequences

Positive

Immediate User Relief: Users can access outputs without support intervention
Self-Healing: System automatically recovers from transient failures
Audit Trail: Recovery logged in job_history
Transparent: API callers don't need code changes

Negative

Not a Root Cause Fix: Problem will continue occurring
Performance: Additional Redis fetch on first call for null outputs
TTL Dependency: Cannot recover after 30-day Redis expiration
Unsafe GET: Mutates database state, although HTTP defines GET as a safe (read-only) method
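
Because the GET mutates database state, two concurrent requests for the same job could both attempt recovery. One way to keep the write safe is a compare-and-set that only writes while workflow_output is still null. The sketch below uses a hypothetical in-memory store rather than the real database; with Prisma, the same idea could be expressed through `updateMany` with a `where` clause that includes `workflow_output: null`, then checking the returned `count`.

```typescript
type JobRow = { id: string; workflow_output: string | null };

// Writes the URL only if no output has been stored yet.
// Returns true when this caller performed the write.
function setOutputIfNull(rows: Map<string, JobRow>, id: string, url: string): boolean {
  const row = rows.get(id);
  if (row === undefined || row.workflow_output !== null) return false;
  row.workflow_output = url;
  return true;
}
```

A caller that gets `false` can re-read the job and return the stored output instead of writing a second job_history entry.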

Alternatives Considered

Manual Database Updates - Doesn't scale
Background Batch Job - Doesn't help users immediately
Improved Webhook Retry - Doesn't fix existing broken jobs

Success Criteria

Users can access outputs for jobs with null workflow_output
Recovery success rate >90% (within Redis TTL window)
No user-reported "missing output" tickets

Future Work

Root cause investigation
Improved webhook reliability
Proactive detection and fixing
Remove endpoint once root cause fixed
