ADR: Job Output Recovery Endpoint
Date: 2025-11-18
Status: Proposed
Deciders: Engineering Team
Tags: api, emprops-api, data-recovery, workflow-output
Context
Problem Statement
We have a recurring production issue where jobs complete successfully but have workflow_output set to null in the database, even though the output exists in Redis. This breaks the user experience because users cannot access their generated content.
Symptoms:
- Job status: completed
- Job workflow_output: null
- Output does exist in Redis at api:workflow:completion:{workflow_id}
- User sees "Job completed" but no image/video to download
Suspected Root Causes (require separate investigation):
- Race condition during job completion webhook processing
- Network failure between job-queue API and emprops-API during webhook
- Database transaction failure after Redis write but before DB update
- Webhook retry giving up before workflow_output is captured
Impact
- User Experience: Users see completed jobs with no output - appears broken
- Support Burden: Manual database updates required to fix individual jobs
- Data Loss Risk: Redis data expires (30-day TTL), after which recovery is impossible
- Trust Erosion: Users lose confidence in the platform
Constraints
- Time Constraint: Cannot afford deep investigation right now - need quick fix
- Data Availability: Output data IS available in Redis (30-day TTL)
- Recovery Window: Must recover before Redis expires the attestation data
- API Contract: External callers shouldn't know they're triggering a fix
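The recovery-window constraint can be made concrete with a small helper. This is a minimal sketch, assuming the 30-day TTL starts when the job completes (the ADR does not say when the Redis key is written); the names REDIS_TTL_DAYS, isRecoverable, and daysUntilExpiry are hypothetical.

```typescript
// Assumption: the Redis key is written at job completion and lives for 30 days.
const REDIS_TTL_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** True while the Redis copy of the output is still expected to exist. */
function isRecoverable(completedAt: Date, now: Date): boolean {
  return now.getTime() < completedAt.getTime() + REDIS_TTL_DAYS * MS_PER_DAY;
}

/** Whole days left before the expected expiry; 0 once the window has closed. */
function daysUntilExpiry(completedAt: Date, now: Date): number {
  const remainingMs =
    completedAt.getTime() + REDIS_TTL_DAYS * MS_PER_DAY - now.getTime();
  return remainingMs <= 0 ? 0 : Math.floor(remainingMs / MS_PER_DAY);
}
```

A helper like this could also feed an alert for jobs that are close to leaving the recovery window.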
Decision
Create a GET /jobs/{id}/output endpoint in emprops-API that:
1. Returns workflow_output if it exists (fast path)
2. Attempts recovery from Redis if null (slow path):
- Fetches workflow completion attestation from job-queue API
- Extracts workflow_outputs array from attestation
- Validates MIME type (image or video expected)
- Updates job record with recovered URL
- Returns the recovered output
3. Marks job as failed if recovery impossible:
- Redis data not found (expired or never written)
- workflow_outputs array empty
- No valid URL in outputs
- Invalid MIME type
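The extraction and failure rules above can be sketched as one pure function. This is an illustration under assumptions, not the actual implementation: the ADR names the workflow_outputs array but not its fields, so the url and mime_type field names, the RecoveryResult type, and selectRecoverableOutput are hypothetical. The MIME list mirrors the Validation Rules section.

```typescript
// Hypothetical attestation shape; only the workflow_outputs name comes from the ADR.
interface WorkflowOutputEntry { url?: string; mime_type?: string; }
interface CompletionAttestation { workflow_outputs?: WorkflowOutputEntry[]; }

type RecoveryResult =
  | { ok: true; url: string; mimeType: string }
  | { ok: false; reason: string };

const validMimeTypes = [
  "image/png", "image/jpeg", "image/jpg", "image/webp", "image/gif",
  "video/mp4", "video/webm", "video/quicktime",
];
const HTTP_URL = /^https?:\/\/\S+$/i;

const fail = (reason: string): RecoveryResult => ({ ok: false, reason });

/** Picks the first output with an http(s) URL and an accepted MIME type. */
function selectRecoverableOutput(
  attestation: CompletionAttestation | null
): RecoveryResult {
  // null models "Redis data not found (expired or never written)".
  if (attestation === null) return fail("Workflow output not found in Redis");
  const outputs = attestation.workflow_outputs ?? [];
  if (outputs.length === 0) return fail("workflow_outputs array is empty");

  let sawUrl = false;
  for (const entry of outputs) {
    if (typeof entry.url !== "string" || !HTTP_URL.test(entry.url)) continue;
    sawUrl = true;
    const mimeType = (entry.mime_type ?? "").toLowerCase();
    if (validMimeTypes.indexOf(mimeType) !== -1) {
      return { ok: true, url: entry.url, mimeType };
    }
  }
  return fail(sawUrl ? "Invalid MIME type" : "No valid URL in outputs");
}
```

Keeping this step free of I/O makes each of the four failure reasons straightforward to unit test.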
API Design
GET /jobs/{id}/output
Response (Success - Existing):
{
  "data": {
    "job_id": "91c43718...",
    "workflow_output": "https://...",
    "status": "completed"
  },
  "error": null
}
Response (Success - Recovered):
{
  "data": {
    "job_id": "91c43718...",
    "workflow_output": "https://...",
    "mime_type": "image/png",
    "status": "completed",
    "recovered": true
  },
  "error": null
}
Response (Failure):
{
  "data": null,
  "error": "Workflow output not found in Redis"
}
// Job status updated to "failed" in database
Rationale
Why GET instead of POST?
From the caller's perspective, they're simply retrieving the job output. They don't know (and shouldn't need to know) that the endpoint might perform recovery internally. The recovery is an implementation detail, not part of the API contract.
Why Update Job Status on Failure?
If we can't recover the output from Redis, the job is effectively failed even though it was marked completed. Updating the status reflects reality and prevents repeated recovery attempts.
Why Not Fix Root Cause First?
Time vs. Risk Trade-off: Root cause investigation takes days/weeks, but users need relief immediately. We can pursue root cause fix in parallel.
Implementation
Endpoint Logic
// Assumes an Express-style handler; PrismaClient comes from @prisma/client.
import type { Request, Response } from "express";
import type { PrismaClient } from "@prisma/client";

export const getJobOutput = (prisma: PrismaClient) => {
  return async (req: Request, res: Response) => {
    // 1. Load job from database
    // 2. Check permissions
    // 3. Verify job is completed
    // 4. If workflow_output exists → return immediately
    // 5. If workflow_output is null:
    //    a. Fetch from Redis: api:workflow:completion:{workflow_id}
    //    b. Parse workflow_outputs array
    //    c. Validate MIME type
    //    d. Update job.workflow_output
    //    e. Add job_history entry
    //    f. Return recovered URL
    // 6. If recovery fails → mark job failed, return 404
  };
};
Files to Modify
apps/emprops-api/src/routes/jobs/index.ts
- Add getJobOutput() function
apps/emprops-api/src/index.ts
- Register GET /jobs/:id/output route
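As a rough illustration of the endpoint logic outlined in this section, the fast path, slow path, and failure path can be written against injected dependencies so the flow is testable without Prisma or an HTTP server. Everything here is a sketch under assumptions: the OutputDeps interface, resolveJobOutput, the JobRecord fields other than workflow_output and status, and the 409 response for non-completed jobs are hypothetical. The permission check and the job_history entry are omitted for brevity.

```typescript
interface JobRecord {
  id: string;
  status: string;
  workflow_id: string;
  workflow_output: string | null;
}
interface RecoveredOutput { url: string; mimeType: string; }

// Hypothetical ports; in the service these would wrap Prisma and the Job Queue API.
interface OutputDeps {
  findJob(id: string): Promise<JobRecord | null>;
  recoverFromRedis(workflowId: string): Promise<RecoveredOutput | null>;
  saveOutput(jobId: string, url: string): Promise<void>;
  markFailed(jobId: string, reason: string): Promise<void>;
}
interface OutputResponse {
  httpStatus: number;
  body: { data: Record<string, unknown> | null; error: string | null };
}

const errorResponse = (httpStatus: number, error: string): OutputResponse =>
  ({ httpStatus, body: { data: null, error } });

async function resolveJobOutput(deps: OutputDeps, jobId: string): Promise<OutputResponse> {
  const job = await deps.findJob(jobId);
  if (!job) return errorResponse(404, "Job not found");
  // Assumption: 409 for jobs that are not completed; the ADR does not specify a code.
  if (job.status !== "completed") return errorResponse(409, "Job is not completed");

  // Fast path: the output is already stored on the job.
  if (job.workflow_output !== null) {
    const data = { job_id: job.id, workflow_output: job.workflow_output, status: job.status };
    return { httpStatus: 200, body: { data, error: null } };
  }

  // Slow path: try to recover the output, then persist it.
  const recovered = await deps.recoverFromRedis(job.workflow_id);
  if (!recovered) {
    const reason = "Workflow output not found in Redis";
    await deps.markFailed(job.id, reason);
    return errorResponse(404, reason);
  }
  await deps.saveOutput(job.id, recovered.url);
  const data = {
    job_id: job.id,
    workflow_output: recovered.url,
    mime_type: recovered.mimeType,
    status: "completed",
    recovered: true,
  };
  return { httpStatus: 200, body: { data, error: null } };
}
```

The route handler would then only translate the returned httpStatus and body into the framework's response object.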
Dependencies
- Job Queue API: POST {JOB_QUEUE_API_URL}/redis/get
- Redis key: api:workflow:completion:{workflow_id}
- Environment: JOB_QUEUE_API_URL
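The lookup through the Job Queue API might look like the sketch below. The ADR gives the route and the key format but not the request or response shape, so the { key } request body and the { value } response field are assumptions, as are the PostJson transport type and the function names. The transport is injected so the example runs without network access; in Node 18+ it could wrap the built-in fetch.

```typescript
// Hypothetical transport: POSTs a JSON body, resolves with HTTP status and parsed JSON.
type PostJson = (url: string, body: unknown) => Promise<{ status: number; json: unknown }>;

/** Builds the Redis key named in the Dependencies list. */
function completionKey(workflowId: string): string {
  return `api:workflow:completion:${workflowId}`;
}

/**
 * Resolves with the parsed attestation, or null when the key is absent.
 * Throws on a non-200 response so callers can tell an outage from missing data.
 * Assumption: the API accepts { key } and answers { value: <JSON string> | null }.
 */
async function fetchCompletionAttestation(
  postJson: PostJson,
  jobQueueApiUrl: string,
  workflowId: string
): Promise<unknown> {
  const url = `${jobQueueApiUrl.replace(/\/+$/, "")}/redis/get`;
  const response = await postJson(url, { key: completionKey(workflowId) });
  if (response.status !== 200) {
    throw new Error(`Job Queue API returned HTTP ${response.status}`);
  }
  const value = (response.json as { value?: unknown } | null)?.value;
  if (value === null || value === undefined) return null; // expired or never written
  return typeof value === "string" ? JSON.parse(value) : value;
}
```

Separating "key missing" (null) from "API unreachable" (thrown error) matters here: marking a job failed because of a transient outage would discard an output that is still recoverable.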
Validation Rules
const validMimeTypes = [
  "image/png", "image/jpeg", "image/jpg",
  "image/webp", "image/gif",
  "video/mp4", "video/webm", "video/quicktime"
];
Consequences
Positive
- Immediate User Relief: Users can access outputs without support intervention
- Self-Healing: System automatically recovers from transient failures
- Audit Trail: Recovery logged in job_history
- Transparent: API callers don't need code changes
Negative
- Not a Root Cause Fix: Problem will continue occurring
- Performance: Additional Redis fetch on first call for null outputs
- TTL Dependency: Cannot recover after 30-day Redis expiration
- Non-Safe GET: Mutates database state, which departs from the HTTP convention that GET requests are read-only
Alternatives Considered
- Manual Database Updates - Doesn't scale
- Background Batch Job - Doesn't help users immediately
- Improved Webhook Retry - Doesn't fix existing broken jobs
Success Criteria
- Users can access outputs for jobs with null workflow_output
- Recovery success rate >90% (within Redis TTL window)
- No user-reported "missing output" tickets
Future Work
- Root cause investigation
- Improved webhook reliability
- Proactive detection and fixing
- Remove endpoint once root cause fixed
