
Fix stale RunPod jobs stuck in queue after restart/deploy #21

@itzing

Description

## Problem
Jobs can remain stuck in a queued state in Engui after a page reload or server deploy, even though the corresponding RunPod jobs have already completed.

## Root cause
1. Status polling resolves the RunPod endpoint from the current settings/model mapping instead of using the immutable endpoint stored on the job.
2. RunPod 404 responses are always treated as initialization, which hides endpoint mismatches and stale upstream lookups for older jobs.
3. There is no server-side repair/resync path for active jobs that were left behind during a restart/deploy.

## Fix
- Poll RunPod status using the endpoint stored on the job first, with the current mapping only as a fallback.
- Treat 404 as initialization only for fresh jobs; return a distinct stale/not-found state for older jobs.
- Add a server-side resync path for active jobs so that completed RunPod jobs can be finalized after a restart/deploy.

## Acceptance criteria
- Old active jobs are checked against their original endpoint.
- Completed RunPod jobs get finalized locally after a reload/restart.
- Older upstream 404s no longer show forever as initializing.
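
The three fixes could be sketched roughly as below. This is only an illustration: the type and function names (`StoredJob`, `storedEndpointId`, `resolveEndpoint`, `classify`, `planResync`), the local state names, and the two-minute freshness window are all hypothetical and not taken from the Engui codebase. The upstream status strings are the ones RunPod is understood to return and should be checked against its documentation.

```typescript
// Hypothetical shapes for illustration; not Engui's actual types.
interface StoredJob {
  id: string;
  storedEndpointId?: string; // endpoint recorded when the job was submitted
  modelId: string;
  createdAtMs: number;
}

type UpstreamResult =
  | { kind: "ok"; status: string } // upstream returned a status payload
  | { kind: "not_found" };         // upstream returned HTTP 404

type LocalState =
  | "initializing"
  | "queued"
  | "in_progress"
  | "completed"
  | "failed"
  | "stale_not_found";

// Assumed grace period during which a 404 still counts as "initializing".
const FRESH_WINDOW_MS = 2 * 60 * 1000;

// Fix 1: prefer the endpoint stored on the job; use the current mapping only as fallback.
function resolveEndpoint(
  job: StoredJob,
  currentMapping: Record<string, string>,
): string | undefined {
  return job.storedEndpointId ?? currentMapping[job.modelId];
}

// Fix 2: a 404 means "initializing" only for fresh jobs; older jobs get a distinct state.
function classify(job: StoredJob, result: UpstreamResult, nowMs: number): LocalState {
  if (result.kind === "not_found") {
    const ageMs = nowMs - job.createdAtMs;
    return ageMs <= FRESH_WINDOW_MS ? "initializing" : "stale_not_found";
  }
  switch (result.status) {
    case "COMPLETED":
      return "completed";
    case "FAILED":
    case "CANCELLED":
    case "TIMED_OUT":
      return "failed";
    case "IN_QUEUE":
      return "queued";
    default:
      return "in_progress"; // IN_PROGRESS and any unrecognized status
  }
}

// Fix 3: after a restart, decide which endpoint each active job should be polled against.
// Jobs with no resolvable endpoint are returned separately so they can be flagged.
function planResync(
  activeJobs: StoredJob[],
  currentMapping: Record<string, string>,
): { toPoll: { jobId: string; endpointId: string }[]; unresolved: string[] } {
  const toPoll: { jobId: string; endpointId: string }[] = [];
  const unresolved: string[] = [];
  for (const job of activeJobs) {
    const endpointId = resolveEndpoint(job, currentMapping);
    if (endpointId === undefined) {
      unresolved.push(job.id);
    } else {
      toPoll.push({ jobId: job.id, endpointId });
    }
  }
  return { toPoll, unresolved };
}
```

A server-side resync routine would then poll each entry in `toPoll`, pass the response through `classify`, and persist the resulting state, so a job that completed upstream during a deploy is finalized without depending on a client page being open.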
