
ADR-010: LoRA User Storage and Affinity Routing

Date: 2025-11-14
Status: 🤔 Proposed
Decision Makers: Engineering Team
Related Systems: Job Broker, ComfyUI Workers, Model Cache, Azure Blob Storage


Executive Summary

Enable users to upload and store custom LoRA models in their personal Azure Blob Storage, with intelligent just-in-time downloading and cache-aware job routing. This ADR combines user storage infrastructure with affinity-based job claiming to minimize model download times and improve job execution performance.

Key Capabilities:

  • User-owned LoRA storage in Azure Blob Storage
  • Just-in-time LoRA downloads on worker machines
  • Intelligent LRU + time-based cache eviction (50GB reserved, 7-day TTL)
  • Affinity-based job routing (prefer workers with cached LoRAs)
  • Non-blocking fallback to workers without cached models

North Star Alignment:

  • ✅ Supports predictive model management (Phase 2 goal)
  • ✅ Eliminates first-user wait times for popular LoRAs
  • ✅ Advances toward specialized machine pools
  • ✅ Improves job execution performance through cache-aware routing

Table of Contents

  1. Context
  2. Decision
  3. Architecture Design
  4. Implementation Specification
  5. Consequences
  6. Alternatives Considered
  7. Success Metrics
  8. Implementation Phases

Context

Current State

Existing LoRA Infrastructure:

  • EmProps_Lora_Loader custom node with Azure/AWS/GCS support
  • SQLite model_cache.db tracking model usage with LRU eviction
  • is_ignore flag preventing eviction of system LoRAs
  • Azure Blob Storage handlers for cloud downloads

Current Limitations:

  1. No user-owned LoRA storage capability
  2. All workers download LoRAs independently (no affinity routing)
  3. Redis job matching uses FIFO order (ignores cache state)
  4. First-user wait times for popular LoRAs (2-5 minutes)

Problem Statement

User Storage Problem: Users cannot upload and manage their own LoRA models. Current system only supports shared/system LoRAs baked into containers or downloaded from shared storage.

Performance Problem: When 3 workers can claim a job requiring a LoRA:

  • Worker A has the LoRA cached (ready in <1 second)
  • Worker B doesn't have it cached (5 minute download)
  • Worker C doesn't have it cached (5 minute download)

Current FIFO matching might assign to Worker B or C, causing unnecessary wait times.

Infrastructure Constraints:

  • Ephemeral machines with no shared storage
  • 50GB reserved for LoRA cache per machine
  • Need to balance cache utilization vs. disk space
  • Must work with existing flat_file table for user assets

Decision

We will implement a two-tier LoRA storage system with affinity-based job routing:

Part 1: User Storage Infrastructure

Storage Architecture:

  • Use existing flat_file table for user LoRA metadata
  • Store LoRA files in Azure Blob Storage (user-loras container)
  • Tag flat_file entries with tags=['lora'] for identification
  • Leverage existing Azure handlers and model cache database

Cache Management:

  • Reserve 50GB per machine for LoRA cache
  • LRU eviction when cache fills (existing mechanism)
  • Time-based cleanup after 7 days of inactivity
  • Preserve system LoRAs via is_ignore=true flag
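
A minimal sketch of this cache-management policy, written in TypeScript for illustration (the real mechanism lives in the existing model_cache.db logic on the worker); the 50GB limit and 7-day TTL come from the bullets above, and the entry shape is a simplified assumption:

typescript
interface CacheEntry {
  path: string;
  sizeBytes: number;
  lastUsed: number;   // epoch ms
  isIgnore: boolean;  // system LoRAs are never evicted
}

const CACHE_LIMIT_BYTES = 50 * 1024 ** 3;  // 50GB reserved per machine
const TTL_MS = 7 * 24 * 60 * 60 * 1000;    // 7 days of inactivity

// Decide which entries to evict before downloading incomingBytes of new LoRAs
function selectEvictions(entries: CacheEntry[], incomingBytes: number, now = Date.now()): CacheEntry[] {
  const evictable = entries.filter(e => !e.isIgnore);

  // 1. Time-based cleanup: anything unused for 7+ days goes first
  const expired = evictable.filter(e => now - e.lastUsed > TTL_MS);

  // 2. LRU: if space is still short, evict least-recently-used entries
  const used = entries.reduce((sum, e) => sum + e.sizeBytes, 0);
  let freed = expired.reduce((sum, e) => sum + e.sizeBytes, 0);
  const evictions = [...expired];

  const lruOrder = evictable
    .filter(e => !expired.includes(e))
    .sort((a, b) => a.lastUsed - b.lastUsed);

  for (const entry of lruOrder) {
    if (used - freed + incomingBytes <= CACHE_LIMIT_BYTES) break;
    evictions.push(entry);
    freed += entry.sizeBytes;
  }
  return evictions;
}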

Part 2: Affinity-Based Job Routing

Scoring Algorithm:

  • Workers report cached LoRAs in capabilities
  • Redis function scores each worker-job match
  • Higher score = better match (prefer cached models)
  • Non-blocking: workers without cache can still claim jobs

Scoring Rules:

lua
-- Scoring weights (configurable)
USER_LORA_MATCH_SCORE = 10   -- User's custom LoRA already cached
SHARED_LORA_MATCH_SCORE = 5  -- Shared LoRA already cached
BASE_SCORE = 0               -- No cache match, download required
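
The same weights, expressed as a small TypeScript helper that can serve as a unit-testable reference for the Lua scoring in the implementation section; the requirement and capability shapes mirror the types defined later in this ADR:

typescript
type LoraRequirement =
  | { type: 'user'; flat_file_id: string }
  | { type: 'shared'; filename: string };

interface CachedLoras {
  user_loras?: string[];    // flat_file IDs
  shared_loras?: string[];  // filenames
}

const USER_LORA_MATCH_SCORE = 10;
const SHARED_LORA_MATCH_SCORE = 5;

function affinityScore(required: LoraRequirement[], cached: CachedLoras): number {
  let score = 0;
  for (const lora of required) {
    if (lora.type === 'user' && cached.user_loras?.includes(lora.flat_file_id)) {
      score += USER_LORA_MATCH_SCORE;
    } else if (lora.type === 'shared' && cached.shared_loras?.includes(lora.filename)) {
      score += SHARED_LORA_MATCH_SCORE;
    }
  }
  return score;
}

// Example: a job needing one user LoRA and one shared LoRA scores 15 on a
// worker with both cached, 10 with only the user LoRA cached, 0 otherwise.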

Architecture Design

High-Level Architecture

┌─────────────────┐
│   User Uploads  │
│   LoRA via API  │
└────────┬────────┘

         v
┌─────────────────────────────────────────────────┐
│         flat_file Table (PostgreSQL)            │
│  ┌─────────────────────────────────────────┐   │
│  │ id: uuid                                │   │
│  │ user_id: uuid                           │   │
│  │ url: https://emprops.blob...            │   │
│  │ tags: ['lora']                          │   │
│  │ metadata: {original_name, size_bytes}   │   │
│  └─────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘

                     │ Referenced in workflow
                     v
         ┌───────────────────────┐
         │   Job Requirements    │
         │  {loras: [{           │
         │    type: 'user',      │
         │    flat_file_id: uuid │
         │  }]}                  │
         └───────────┬───────────┘

                     v
         ┌─────────────────────────────────────┐
         │   Redis: findMatchingJob()          │
         │                                     │
         │  1. Extract LoRA requirements       │
         │  2. Score each worker:              │
         │     - User LoRA cached: +10 points  │
         │     - Shared LoRA cached: +5 points │
         │  3. Return highest scoring match    │
         └───────────┬─────────────────────────┘

                     v
         ┌─────────────────────────┐
         │  Worker Claims Job      │
         └───────────┬─────────────┘

                     v
         ┌─────────────────────────────────────┐
         │  EmProps_Lora_Loader Node           │
         │                                     │
         │  1. Check local cache               │
         │  2. If not cached:                  │
         │     - Get flat_file metadata        │
         │     - Download from Azure           │
         │     - Register in model_cache.db    │
         │  3. Load LoRA into ComfyUI          │
         └─────────────────────────────────────┘

                     v
         ┌─────────────────────────┐
         │  model_cache.db         │
         │  (SQLite)               │
         │                         │
         │  - Track usage          │
         │  - LRU eviction         │
         │  - 7-day TTL cleanup    │
         │  - is_ignore protection │
         └─────────────────────────┘

Data Flow

Upload Flow

User → API → flat_file table → Azure Blob Storage

                 └─> Tags: ['lora']
                     Metadata: {original_name, size_bytes}

Job Execution Flow

1. Worker calls Redis: FCALL findMatchingJob({
     worker_id: "worker-123",
     cached_loras: {
       user_loras: ["uuid-1", "uuid-2"],    // flat_file IDs
       shared_loras: ["model-a.safetensors"] // filenames
     }
   })

2. Redis scores matches:
   Job requires: uuid-1 (user LoRA)

   Worker A: Has uuid-1 cached → Score: 10 ✅
   Worker B: No uuid-1 → Score: 0
   Worker C: No uuid-1 → Score: 0

   → Worker A claims job

3. Worker executes:
   - EmProps_Lora_Loader checks cache
   - LoRA already present → immediate load
   - Job starts in <1 second

Download Flow (Cache Miss)

1. EmProps_Lora_Loader: LoRA not in cache
2. Query flat_file table for metadata
3. Download from Azure Blob Storage
4. Register in model_cache.db (is_ignore=false)
5. Load LoRA into ComfyUI
6. Future jobs on this worker get Score: +10

Implementation Specification

1. Database Schema

No schema changes required - use existing flat_file table:

sql
-- Existing flat_file table (no changes needed)
CREATE TABLE flat_file (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  url TEXT NOT NULL,
  tags TEXT[] DEFAULT '{}',
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Query user LoRAs
SELECT * FROM flat_file
WHERE user_id = $1
  AND 'lora' = ANY(tags);

2. API Endpoints

New Endpoints:

typescript
// Upload user LoRA
POST /api/user-loras/upload
Body: multipart/form-data { file: File }
Response: {
  flat_file_id: string;
  url: string;
  size_bytes: number;
  original_name: string;
}

// List user LoRAs
GET /api/user-loras
Query: { user_id: string }
Response: {
  loras: Array<{
    id: string;
    original_name: string;
    size_bytes: number;
    created_at: string;
  }>
}

// Delete user LoRA
DELETE /api/user-loras/:id
Response: { success: boolean }
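
A minimal sketch of the handler behind POST /api/user-loras/upload, assuming Express, multer for multipart parsing, @azure/storage-blob for the user-loras container, and a hypothetical query() helper for the flat_file insert; names and middleware are illustrative, not the final implementation:

typescript
import express from 'express';
import multer from 'multer';
import { randomUUID } from 'crypto';
import { BlobServiceClient } from '@azure/storage-blob';
import { query } from './db'; // hypothetical Postgres helper: query(sql, params)

const app = express();
const upload = multer({ storage: multer.memoryStorage() });
const blobService = BlobServiceClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING!
);
const container = blobService.getContainerClient('user-loras');

// POST /api/user-loras/upload: store the file in Azure, register it in flat_file
app.post('/api/user-loras/upload', upload.single('file'), async (req, res) => {
  const file = req.file;
  if (!file || !file.originalname.endsWith('.safetensors')) {
    return res.status(400).json({ error: 'A .safetensors file is required' });
  }

  const userId = (req as any).user.id; // assumes upstream auth middleware populates req.user
  const flatFileId = randomUUID();
  const blobPath = `${userId}/${flatFileId}.safetensors`;

  // Upload the LoRA bytes to the user-loras container
  await container.getBlockBlobClient(blobPath).uploadData(file.buffer);

  // Register metadata in the existing flat_file table, tagged as a LoRA
  const url = `${container.url}/${blobPath}`;
  await query(
    `INSERT INTO flat_file (id, user_id, url, tags, metadata)
     VALUES ($1, $2, $3, ARRAY['lora'], $4)`,
    [flatFileId, userId, url,
     JSON.stringify({ original_name: file.originalname, size_bytes: file.size, blob_path: blobPath })]
  );

  return res.json({
    flat_file_id: flatFileId,
    url,
    size_bytes: file.size,
    original_name: file.originalname,
  });
});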

3. Worker Changes

File: apps/worker/src/redis-direct-worker-client.ts

typescript
import { ModelCacheDB } from '@emp/comfyui-custom-nodes/db/model_cache';

// New helper function
async function getCachedLoRAs(): Promise<{
  user_loras: string[];
  shared_loras: string[];
}> {
  const model_cache_db = new ModelCacheDB();
  const models = await model_cache_db.get_all_models();

  // Extract cached LoRAs
  const user_loras = models
    .filter(m => m.model_type === 'user_lora')
    .map(m => extractFlatFileIdFromPath(m.path));

  const shared_loras = models
    .filter(m => m.model_type === 'lora' && m.is_ignore === false)
    .map(m => m.filename);

  return { user_loras, shared_loras };
}

// Helper to extract flat_file ID from path
function extractFlatFileIdFromPath(path: string): string {
  // Path format: /workspace/ComfyUI/models/loras/user/{flat_file_id}.safetensors
  const match = path.match(/user\/([a-f0-9-]+)\.safetensors$/);
  return match ? match[1] : '';
}

// Modified: callFindMatchingJob (around line 499)
async callFindMatchingJob(capabilities: WorkerCapabilities) {
  // Add cached LoRAs to capabilities
  const cached_loras = await getCachedLoRAs();

  const enhancedCapabilities = {
    ...capabilities,
    cached_loras
  };

  const result = await this.redis.call(
    'FCALL',
    'findMatchingJob',
    0,
    JSON.stringify(enhancedCapabilities),
    '100' // max_scan
  );

  return result;
}

4. Type Changes

File: packages/core/src/types/worker.ts

typescript
export interface WorkerCapabilities {
  worker_id: string;
  job_service_required_map: string[];
  hardware?: {
    gpu_memory_gb?: number;
    cpu_cores?: number;
    ram_gb?: number;
  };
  models?: Record<string, string[]>;
  customer_access?: {
    isolation?: 'strict' | 'shared';
    allowed_customers?: string[];
    denied_customers?: string[];
  };
  workflow_id?: string;

  // NEW: Cached LoRA tracking
  cached_loras?: {
    user_loras?: string[];    // Array of flat_file IDs
    shared_loras?: string[];  // Array of filenames
  };
}

5. Redis Lua Function Changes

File: packages/core/src/redis-functions/functions/findMatchingJob.lua

lua
-- NEW: Calculate affinity score based on cached LoRAs
local function calculate_affinity_score(worker, job)
  local score = 0

  -- Parse job requirements to extract LoRA requirements
  local requirements = {}
  if job.requirements and job.requirements ~= '' then
    local success, parsed = pcall(cjson.decode, job.requirements)
    if success then
      requirements = parsed
    end
  end

  -- Check if job requires any LoRAs
  local required_loras = requirements.loras or {}
  if #required_loras == 0 then
    return 0  -- No LoRAs required, no affinity bonus
  end

  -- Get worker's cached LoRAs
  local cached_loras = worker.cached_loras or {}
  local user_loras = cached_loras.user_loras or {}
  local shared_loras = cached_loras.shared_loras or {}

  -- Score each required LoRA
  for _, required_lora in ipairs(required_loras) do
    if required_lora.type == 'user' then
      -- Check if worker has this user LoRA cached
      for _, cached_id in ipairs(user_loras) do
        if cached_id == required_lora.flat_file_id then
          score = score + 10  -- USER_LORA_MATCH_SCORE
          break
        end
      end
    elseif required_lora.type == 'shared' then
      -- Check if worker has this shared LoRA cached
      for _, cached_name in ipairs(shared_loras) do
        if cached_name == required_lora.filename then
          score = score + 5  -- SHARED_LORA_MATCH_SCORE
          break
        end
      end
    end
  end

  return score
end

-- MODIFIED: Main function (line 340)
redis.register_function('findMatchingJob', function(keys, args)
  local worker_caps_json = args[1]
  local max_scan = tonumber(args[2]) or 100

  local worker_caps = cjson.decode(worker_caps_json)

  redis.log(redis.LOG_NOTICE, 'Worker ' .. worker_caps.worker_id .. ' requesting job')

  local pending_jobs = redis.call('ZREVRANGE', 'jobs:pending', '0', tostring(max_scan - 1))

  if not pending_jobs or #pending_jobs == 0 then
    return nil
  end

  -- NEW: Track best match
  local best_job_id = nil
  local best_score = -1
  local best_job = nil

  -- Check each job for compatibility AND affinity
  for i = 1, #pending_jobs do
    local job_id = pending_jobs[i]
    local job_data = redis.call('HGETALL', 'job:' .. job_id)

    if job_data and #job_data > 0 then
      local job = hash_to_table(job_data)

      -- Check if worker meets requirements
      if matches_requirements(worker_caps, job) then
        -- Calculate affinity score
        local affinity_score = calculate_affinity_score(worker_caps, job)

        redis.log(redis.LOG_DEBUG, 'Job ' .. job_id .. ' affinity score: ' .. affinity_score)

        -- Track best match
        if affinity_score > best_score then
          best_score = affinity_score
          best_job_id = job_id
          best_job = job
        end
      end
    end
  end

  -- If we found a match, claim the best one
  if best_job_id then
    local removed = redis.call('ZREM', 'jobs:pending', best_job_id)
    if removed == 1 then
      -- Update job status (existing code)
      redis.call('HMSET', 'job:' .. best_job_id,
        'status', 'assigned',
        'worker_id', worker_caps.worker_id,
        'assigned_at', get_iso_timestamp()
      )

      -- Add to worker's active jobs
      redis.call('HSET', 'jobs:active:' .. worker_caps.worker_id, best_job_id, cjson.encode(best_job))

      -- Update worker status
      redis.call('HMSET', 'worker:' .. worker_caps.worker_id,
        'status', 'busy',
        'current_job_id', best_job_id,
        'last_status_change', get_iso_timestamp()
      )

      -- Publish events (existing code)
      redis.call('PUBLISH', 'job_claimed', cjson.encode({
        job_id = best_job_id,
        worker_id = worker_caps.worker_id,
        status = 'claimed',
        affinity_score = best_score,  -- NEW: Include score in event
        timestamp = tonumber(redis.call('TIME')[1]) * 1000
      }))

      redis.log(redis.LOG_NOTICE, 'Worker ' .. worker_caps.worker_id .. ' claimed job ' .. best_job_id .. ' (score: ' .. best_score .. ')')

      return cjson.encode({
        jobId = best_job_id,
        job = best_job
      })
    end
  end

  return nil
end)
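
Deploying the modified library means reloading it into Redis with FUNCTION LOAD REPLACE; a minimal sketch using ioredis, where the file path and library shebang are assumptions about this repo's layout:

typescript
import { readFileSync } from 'fs';
import Redis from 'ioredis';

async function reloadJobMatchingFunctions(redisUrl: string): Promise<void> {
  const redis = new Redis(redisUrl);

  // The library source must begin with a shebang such as: #!lua name=job_matching
  const source = readFileSync(
    'packages/core/src/redis-functions/functions/findMatchingJob.lua',
    'utf-8'
  );

  // REPLACE overwrites the previously loaded version of the library
  await redis.call('FUNCTION', 'LOAD', 'REPLACE', source);
  await redis.quit();
}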

6. ComfyUI Loader Changes

File: packages/comfyui-custom-nodes/emprops_comfy_nodes/nodes/emprops_lora_loader.py

python
# NEW: Support for user LoRAs
def download_from_cloud(self, lora_name, provider=None, bucket=None, flat_file_id=None):
    """Download LoRA from cloud storage if not found locally"""

    # NEW: Handle user LoRA downloads
    if flat_file_id:
        # User LoRA - download to user-specific directory
        local_path = self._get_user_lora_path(flat_file_id)

        # Check cache first
        if os.path.exists(local_path):
            print(f"[EmProps] User LoRA already cached: {local_path}")
            model_cache_db.update_model_usage(local_path)
            return local_path

        # Fetch flat_file metadata from API
        metadata = self._fetch_flat_file_metadata(flat_file_id)
        if not metadata:
            print(f"[EmProps] Failed to fetch metadata for flat_file: {flat_file_id}")
            return None

        # Download from Azure Blob Storage
        handler = AzureHandler(container_name='user-loras')
        success, error = handler.download_file(
            blob_path=metadata['blob_path'],
            local_path=local_path
        )

        if not success:
            print(f"[EmProps] Failed to download user LoRA: {error}")
            return None

        # Register in model cache (is_ignore=False for user LoRAs)
        model_cache_db.register_model(
            path=local_path,
            model_type='user_lora',
            size_bytes=os.path.getsize(local_path),
            is_ignore=False  # User LoRAs can be evicted
        )

        return local_path

    # Existing shared LoRA logic
    # ... (unchanged)

def _get_user_lora_path(self, flat_file_id):
    """Get local path for user LoRA"""
    lora_paths = folder_paths.folder_names_and_paths["loras"][0]
    user_dir = os.path.join(lora_paths[0], 'user')
    os.makedirs(user_dir, exist_ok=True)
    return os.path.join(user_dir, f"{flat_file_id}.safetensors")

def _fetch_flat_file_metadata(self, flat_file_id):
    """Fetch flat_file metadata from API"""
    api_url = os.getenv('EMP_API_URL', 'http://localhost:3001')
    try:
        response = requests.get(f"{api_url}/api/flat-files/{flat_file_id}", timeout=30)
    except requests.RequestException as e:
        print(f"[EmProps] Error fetching flat_file metadata: {e}")
        return None
    if response.status_code == 200:
        return response.json()
    return None

Consequences

Positive Consequences ✅

  1. User Empowerment

    • Users can upload and manage custom LoRAs
    • No dependency on shared model repositories
    • Full control over model lifecycle
  2. Performance Optimization

    • Jobs route to workers with cached models
    • Eliminates redundant downloads across workers
    • Reduces job start latency by 2-5 minutes (when cache hits)
  3. Cache Efficiency

    • Popular LoRAs naturally accumulate high scores
    • LRU + time-based eviction keeps cache fresh
    • 50GB dedicated space prevents disk exhaustion
  4. Non-Blocking Design

    • Workers without cache can still claim jobs
    • System degrades gracefully under load
    • No hard dependencies on cache state
  5. North Star Alignment

    • Advances Phase 2: Model Intelligence goals
    • Supports predictive model placement strategy
    • Foundation for pool-specific model baking

Negative Consequences ❌

  1. Complexity Increase

    • Redis function logic becomes more sophisticated
    • Additional API endpoints for LoRA management
    • Worker needs to query cache state before claiming
  2. Storage Costs

    • Azure Blob Storage costs for user LoRAs
    • 50GB per machine reserved for cache
    • Potential for cache thrashing with diverse workloads
  3. Cold Start Problem

    • First user of a LoRA still experiences download wait
    • Cache warmup period before affinity benefits appear
    • May need pre-warming strategies for popular models
  4. Monitoring Complexity

    • Need to track cache hit rates per LoRA
    • Affinity score distribution monitoring
    • Cache eviction analytics required

Alternatives Considered

Alternative 1: Priority Queue Approach

Design:

  • Two-pass matching: first try workers with cache, then fallback
  • Maintain separate priority queues for jobs

Rejected Because:

  • More complex than scoring approach
  • Harder to extend for multi-LoRA jobs
  • Doesn't handle partial cache matches well

Alternative 2: S3/Shared Storage

Design:

  • Use AWS S3 instead of Azure
  • Shared storage accessible by all workers

Rejected Because:

  • EmProps uses Azure, not AWS
  • Vendor lock-in concerns
  • Azure Blob Storage already integrated

Alternative 3: Pre-baked Container Images

Design:

  • Bake popular LoRAs into container images
  • Different images for different LoRA sets

Rejected Because:

  • Not feasible for user-owned LoRAs
  • Container images become massive (50GB+)
  • Deployment time increases dramatically
  • Note: May complement this approach in Phase 2

Alternative 4: Centralized Cache Service

Design:

  • Single cache service shared across workers
  • Workers fetch from cache service, not Azure

Rejected Because:

  • Single point of failure
  • Network overhead for model transfers
  • Incompatible with ephemeral distributed machines
  • Violates "no shared storage" constraint

Success Metrics

Performance Metrics

| Metric | Baseline | Target | Measurement |
| --- | --- | --- | --- |
| Cache Hit Rate | 0% (no cache) | 60% within 2 weeks | Redis event logs |
| Job Start Latency (cache hit) | 5 min | <10 seconds | OpenTelemetry traces |
| Job Start Latency (cache miss) | 5 min | 5 min (unchanged) | OpenTelemetry traces |
| Affinity Score Distribution | N/A | 70% of jobs score >0 | Redis function logs |
| Cache Eviction Rate | N/A | <10% of downloads | model_cache.db analytics |

User Experience Metrics

| Metric | Baseline | Target | Measurement |
| --- | --- | --- | --- |
| User LoRA Uploads | 0 | 100+ LoRAs in 4 weeks | flat_file table count |
| Workflows Using User LoRAs | 0 | 50+ workflows | Job requirements analysis |
| User-Reported Wait Times | High complaints | <5 complaints/week | Support tickets |

System Health Metrics

| Metric | Target | Measurement |
| --- | --- | --- |
| Disk Space Utilization | 70-90% of 50GB | Machine metrics |
| Cache Thrashing Rate | <5% | Eviction immediately followed by re-download |
| Azure Blob Egress | <100GB/day | Azure billing dashboard |

Implementation Phases

Phase 1: User Storage Infrastructure (Week 1-2)

Goal: Enable users to upload and store LoRAs

Tasks:

  • [ ] Create API endpoints for LoRA upload/list/delete
  • [ ] Implement Azure Blob Storage integration for user LoRAs
  • [ ] Add tags=['lora'] filtering to flat_file queries
  • [ ] Update EmProps Studio UI for LoRA management
  • [ ] Add validation for LoRA file format (.safetensors)
  • [ ] Implement quota limits per user (e.g., 10 LoRAs, 10GB total)
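
The last task above (per-user quotas) could be enforced at upload time; a minimal sketch using the example limits of 10 LoRAs and 10GB per user, with a hypothetical query() helper:

typescript
import { query } from './db'; // hypothetical Postgres helper, as in the upload sketch above

const MAX_LORAS_PER_USER = 10;
const MAX_TOTAL_BYTES = 10 * 1024 ** 3; // 10GB per user

export async function checkLoraQuota(
  userId: string,
  incomingBytes: number
): Promise<{ ok: boolean; reason?: string }> {
  // Count existing LoRAs and their total size from flat_file metadata
  const [row] = (await query(
    `SELECT COUNT(*) AS count,
            COALESCE(SUM((metadata->>'size_bytes')::bigint), 0) AS total_bytes
     FROM flat_file
     WHERE user_id = $1 AND 'lora' = ANY(tags)`,
    [userId]
  )) as Array<{ count: string; total_bytes: string }>;

  if (Number(row.count) >= MAX_LORAS_PER_USER) {
    return { ok: false, reason: `limit of ${MAX_LORAS_PER_USER} LoRAs reached` };
  }
  if (Number(row.total_bytes) + incomingBytes > MAX_TOTAL_BYTES) {
    return { ok: false, reason: '10GB per-user LoRA storage quota exceeded' };
  }
  return { ok: true };
}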

Deliverables:

  • Working API endpoints with Azure upload
  • UI for uploading and managing LoRAs
  • Documentation for LoRA upload process

Testing:

bash
# Upload LoRA
curl -X POST http://localhost:3001/api/user-loras/upload \
  -F "file=@my_lora.safetensors" \
  -H "Authorization: Bearer $TOKEN"

# List user LoRAs
curl http://localhost:3001/api/user-loras?user_id=$USER_ID

Phase 2: Just-in-Time Downloads (Week 2-3)

Goal: Workers download user LoRAs on demand

Tasks:

  • [ ] Modify EmProps_Lora_Loader to support flat_file_id parameter
  • [ ] Implement _fetch_flat_file_metadata() API call
  • [ ] Add user LoRA directory structure (/models/loras/user/)
  • [ ] Update model registration to distinguish user vs. shared LoRAs
  • [ ] Add error handling for failed downloads
  • [ ] Implement retry logic with exponential backoff
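
The retry task above lives in the Python loader, but the policy itself is straightforward; a TypeScript sketch of the intended backoff behaviour, where the attempt count and delays are assumptions:

typescript
// Retry a download up to maxAttempts times with exponential backoff (1s, 2s, 4s, ...)
async function withRetry<T>(
  download: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await download();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = baseDelayMs * 2 ** attempt;
        console.warn(`[EmProps] LoRA download failed (attempt ${attempt + 1}/${maxAttempts}), retrying in ${delay}ms`);
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}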

Deliverables:

  • Working LoRA download from flat_file references
  • Model cache tracking for user LoRAs
  • Error recovery for failed downloads

Testing:

In the ComfyUI workflow JSON, lora_name is left empty for user LoRAs and the LoRA is referenced by flat_file_id instead:

json
{
  "inputs": {
    "lora_name": "",
    "flat_file_id": "550e8400-e29b-41d4-a716-446655440000",
    "strength_model": 1.0,
    "strength_clip": 1.0
  }
}

Phase 3: Cache Management (Week 3-4)

Goal: Implement LRU + time-based eviction

Tasks:

  • [ ] Configure 50GB reserved space setting in model_cache.db
  • [ ] Implement pre-download space check
  • [ ] Add LRU eviction when cache fills
  • [ ] Create background job for 7-day TTL cleanup
  • [ ] Add cache metrics collection
  • [ ] Implement graceful handling of eviction during job execution
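
A minimal sketch of the 7-day TTL cleanup job (run via cron or a PM2 scheduled task, per the deliverables below); it assumes the models table and last_used column shown in the testing step, an is_ignore column matching the flag referenced earlier, and uses better-sqlite3 for illustration:

typescript
import Database from 'better-sqlite3';
import { existsSync, unlinkSync } from 'fs';

const TTL_DAYS = 7;

// Remove user LoRAs unused for 7+ days; system LoRAs (is_ignore=1) are never touched.
// Assumes last_used is stored as epoch seconds.
function cleanupStaleLoras(dbPath = 'model_cache.db'): void {
  const db = new Database(dbPath);
  const cutoff = Math.floor(Date.now() / 1000) - TTL_DAYS * 24 * 60 * 60;

  const stale = db
    .prepare('SELECT path FROM models WHERE is_ignore = 0 AND last_used < ?')
    .all(cutoff) as Array<{ path: string }>;

  const remove = db.prepare('DELETE FROM models WHERE path = ?');
  for (const { path } of stale) {
    if (existsSync(path)) unlinkSync(path);  // delete the cached .safetensors file
    remove.run(path);                        // drop the cache record
  }
  db.close();
}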

Deliverables:

  • Automated cache eviction system
  • Background cleanup job (cron or PM2 scheduled task)
  • Cache metrics dashboard in monitoring UI

Testing:

bash
# Fill cache to capacity
for i in {1..50}; do
  echo "TODO: upload and download LoRA $i to fill the 50GB cache"  # placeholder for real upload/download calls
done

# Verify LRU eviction
sqlite3 model_cache.db "SELECT * FROM models ORDER BY last_used ASC LIMIT 10"

Phase 4: Affinity Routing (Week 4-5)

Goal: Prefer workers with cached LoRAs

Tasks:

  • [ ] Add cached_loras field to WorkerCapabilities interface
  • [ ] Implement getCachedLoRAs() in worker client
  • [ ] Modify Redis Lua function with scoring algorithm
  • [ ] Add affinity score to job claim events (consumed in the sketch after this list)
  • [ ] Update monitoring UI to show affinity scores
  • [ ] Add Redis logs for affinity debugging
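
To consume the affinity score added to job claim events, monitoring code can subscribe to the existing job_claimed channel; a minimal sketch using ioredis, with field names taken from the Lua publish call in the implementation section:

typescript
import Redis from 'ioredis';

// Subscribe to the existing job_claimed channel and log each claim's affinity score
export async function watchAffinityScores(redisUrl: string): Promise<void> {
  const subscriber = new Redis(redisUrl);
  await subscriber.subscribe('job_claimed');

  subscriber.on('message', (_channel, message) => {
    const event = JSON.parse(message) as {
      job_id: string;
      worker_id: string;
      affinity_score: number;
      timestamp: number;
    };
    console.log(
      `job ${event.job_id} claimed by ${event.worker_id} (affinity score: ${event.affinity_score})`
    );
  });
}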

Deliverables:

  • Working affinity-based job routing
  • Affinity score in monitoring dashboard
  • Debug tooling for score analysis

Testing:

bash
# Create job requiring specific LoRA
curl -X POST http://localhost:3001/api/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "service_required": "comfyui",
    "requirements": {
      "loras": [{
        "type": "user",
        "flat_file_id": "550e8400-e29b-41d4-a716-446655440000"
      }]
    }
  }'

# Verify worker with cached LoRA claims job
# Check Redis logs for affinity score: "score: 10"

Phase 5: Monitoring & Analytics (Week 5-6)

Goal: Observe system behavior and optimize

Tasks:

  • [ ] Add cache hit rate metrics to OpenTelemetry (see the sketch after this list)
  • [ ] Create Dash0 dashboard for LoRA analytics
  • [ ] Track affinity score distribution
  • [ ] Monitor cache eviction patterns
  • [ ] Analyze user LoRA upload trends
  • [ ] Identify optimization opportunities
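
For the cache hit rate task flagged above, a minimal sketch using the @opentelemetry/api metrics API; it assumes a meter provider is already configured in the worker, and the metric names are illustrative:

typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('lora-cache');
const cacheHits = meter.createCounter('lora_cache_hits_total', {
  description: 'LoRA requests served from the local cache',
});
const cacheMisses = meter.createCounter('lora_cache_misses_total', {
  description: 'LoRA requests that required a download from Azure',
});

// Call this wherever the worker resolves a LoRA requirement
export function recordLoraCacheLookup(loraType: 'user' | 'shared', hit: boolean): void {
  (hit ? cacheHits : cacheMisses).add(1, { lora_type: loraType });
}

// Cache hit rate = hits / (hits + misses), computed in the Dash0 dashboard query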

Deliverables:

  • Comprehensive LoRA analytics dashboard
  • Weekly report on cache performance
  • Recommendations for Phase 2+ optimizations

Metrics Dashboard:

  • Cache hit rate over time
  • Affinity score distribution histogram
  • Top 10 most popular LoRAs
  • Cache eviction frequency
  • Average job start latency by cache state

Phase 6: Optimization & Polish (Week 6+)

Goal: Refine based on production data

Tasks:

  • [ ] Tune affinity scoring weights based on metrics
  • [ ] Implement pre-warming for popular LoRAs
  • [ ] Optimize cache eviction algorithm
  • [ ] Add user notifications for evicted LoRAs
  • [ ] Document best practices for LoRA usage
  • [ ] Create troubleshooting guides

Deliverables:

  • Production-ready LoRA support
  • User documentation and guides
  • Internal runbooks for operations team

Open Questions

  1. LoRA Quota Management

    • What limits should we enforce? (10 LoRAs/user? 10GB total?)
    • How do we handle quota violations?
  2. Cache Warming Strategy

    • Should we pre-download popular LoRAs on machine startup?
    • How do we identify "popular" LoRAs for warming?
  3. Multi-LoRA Jobs

    • How do we score jobs requiring 3+ LoRAs?
    • Should we use sum of scores or weighted average?
  4. Cross-Pool Behavior

    • How does affinity routing interact with pool separation (Fast Lane / Standard / Heavy)?
    • Should LoRA cache be pool-specific?
  5. Azure Costs

    • What is acceptable egress cost per month?
    • Should we implement CDN or edge caching?


Approval

Decision Makers:

  • [ ] Engineering Lead
  • [ ] Product Manager
  • [ ] DevOps Team

Next Steps:

  1. Review ADR with team
  2. Gather feedback on open questions
  3. Approve or request revisions
  4. Begin Phase 1 implementation

Questions? Post in #architecture or #job-broker Slack channels.
