Design browser-triggered rebuilds as a single out-of-process rebuild supervisor #25

@achandra-rp

Context

We are choosing the simpler v1 contract now:

  • The browser can edit settings.
  • Runtime-only settings apply immediately.
  • Rebuild-requiring settings are saved, but the user runs the rebuild from the CLI.
  • The CLI remains the source of truth for setup, ingestion, and rebuilds.

That is the right boundary for the current homelab release. It keeps the app predictable and avoids turning the web process into an operational control plane.

At the same time, browser-triggered rebuilds are still a reasonable future convenience feature if we build them with a much tighter contract than the current in-process job approach.

Problem

The current in-process rebuild model is the wrong foundation for long-running rebuilds:

  • rebuild execution is tied to the Gunicorn worker lifecycle
  • job state lives in process memory
  • config reloads and rebuild execution can interfere with each other
  • page refresh/navigation can lose rebuild state in the UI
  • long rebuilds interact poorly with worker timeouts and restarts
  • failure details are hard to surface cleanly and durably

This issue is not asking for a generic background job system. That would add too much complexity for a single-user homelab app.

Goal

If we reintroduce browser-triggered rebuilds later, build them as one special rebuild supervisor with a narrow contract:

  • only one rebuild may run at a time
  • rebuild runs in a separate process, not inside the web worker/thread model
  • rebuild state is durable across page refreshes
  • queries continue using the last completed profile/index while rebuild runs
  • rebuilt artifacts become active only after a successful rebuild
  • failed rebuilds leave the previous active artifacts unchanged
  • rebuilds are always explicit; saving settings must not auto-start a rebuild

Proposed Product Contract

Allowed while rebuild runs

  • users may continue to browse and query recommendations
  • queries use the last completed active snapshot of the watch index and taste profile
  • settings may still be edited, but the UI must make it clear whether a newer rebuild is now required after the current one finishes

Not allowed while rebuild runs

  • starting a second rebuild
  • overwriting active artifacts mid-run
  • treating an in-progress rebuild as partially active

User-visible states

  • idle
  • rebuild required
  • rebuilding
  • rebuild succeeded
  • rebuild failed

The UI should remain accurate across refreshes and navigation.
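One way to keep these states consistent between the state file and the UI is a small enum with an explicit transition table. This is only a sketch: RebuildStatus, ALLOWED and can_transition are hypothetical names, and the transition set is one possible reading of the contract above, not something this issue fixes.

```python
from enum import Enum


class RebuildStatus(str, Enum):
    """The five user-visible states, stored as their display strings."""

    IDLE = "idle"
    REQUIRED = "rebuild required"
    REBUILDING = "rebuilding"
    SUCCEEDED = "rebuild succeeded"
    FAILED = "rebuild failed"


# Assumed transitions: a rebuild never starts while another one is running.
ALLOWED = {
    RebuildStatus.IDLE: {RebuildStatus.REQUIRED, RebuildStatus.REBUILDING},
    RebuildStatus.REQUIRED: {RebuildStatus.REBUILDING},
    RebuildStatus.REBUILDING: {RebuildStatus.SUCCEEDED, RebuildStatus.FAILED},
    RebuildStatus.SUCCEEDED: {
        RebuildStatus.IDLE,
        RebuildStatus.REQUIRED,
        RebuildStatus.REBUILDING,
    },
    RebuildStatus.FAILED: {RebuildStatus.REQUIRED, RebuildStatus.REBUILDING},
}


def can_transition(current: RebuildStatus, new: RebuildStatus) -> bool:
    """Return True when moving from current to new is permitted."""
    return new in ALLOWED[current]
```

Because the enum subclasses str, its members serialize directly into a JSON state file.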

Minimal Technical Design

1. Replace in-process rebuild execution

Use subprocess.Popen(...) to launch the CLI rebuild command from the web app instead of calling run_setup() inside the web process.

Examples:

  • profile-only rebuild: ./recommend setup --refresh-profile
  • data + profile rebuild: ./recommend setup --refresh-data

The subprocess should write stdout/stderr to a dedicated rebuild log file.
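A minimal sketch of such a launch, assuming a POSIX host. The function name launch_rebuild, the REBUILD_COMMANDS mapping and its mode keys are hypothetical; only the two command lines come from the examples above.

```python
import subprocess
from pathlib import Path

# Command lines from the examples above; the "profile"/"data" keys are assumed.
REBUILD_COMMANDS = {
    "profile": ["./recommend", "setup", "--refresh-profile"],
    "data": ["./recommend", "setup", "--refresh-data"],
}


def launch_rebuild(command: list[str], log_path: Path) -> subprocess.Popen:
    """Start the rebuild command outside the web worker, logging to a file."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "wb") as log_file:
        # The child gets its own copy of the file descriptor, so the
        # parent's handle can be closed as soon as Popen returns.
        return subprocess.Popen(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # interleave stderr into the same log
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # POSIX: run in a separate session
        )
```

Running the child in its own session is a design choice meant to keep signals aimed at the worker's process group from reaching the rebuild. Whether that is enough under a given Gunicorn configuration would need to be verified.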

2. Persist rebuild state to disk

Store rebuild state in a small JSON file under app-managed local state.

Suggested fields:

  • status
  • mode (profile or data)
  • pid
  • started_at
  • finished_at
  • exit_code
  • log_path
  • error_summary
  • requested_from_config_generation or equivalent stale marker if useful

This state file becomes the source of truth for the UI and survives page refreshes.
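Since the UI may read the file while it is being updated, writes should be atomic. A sketch under the assumption that the state is a plain dict using the fields suggested above; write_state and read_state are hypothetical names, and the idle default for a missing file is an assumed policy.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state(path: Path, state: dict) -> None:
    """Write the state atomically so readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_name)  # do not leave the temporary file behind
        raise


def read_state(path: Path) -> dict:
    """Return the stored state, or an idle default when no file exists yet."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {"status": "idle"}
```

The temporary file is created in the same directory as the target so that os.replace stays a same-filesystem rename.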

3. Enforce single rebuild semantics

Use a lock file or equivalent process-level guard.

Behavior:

  • if a rebuild is already running, POST /rebuild returns the existing rebuild state instead of launching another one
  • the UI should show that a rebuild is already in progress
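One possible guard is a lock file created with exclusive-create semantics, so that two simultaneous requests cannot both succeed. This is a sketch only: try_acquire_lock and release_lock are hypothetical names, and it deliberately does not handle a stale lock left behind by a crashed rebuild, which a real implementation would need to detect (for example by checking the pid recorded in the state file).

```python
import os
from pathlib import Path


def try_acquire_lock(lock_path: Path, owner: str) -> bool:
    """Create the lock file exclusively; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(owner)  # diagnostic only; not used for ownership checks
    return True


def release_lock(lock_path: Path) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

A POST /rebuild handler would call try_acquire_lock first and, on False, return the existing rebuild state instead of launching anything.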

4. Keep runtime reads on the previous active snapshot

Do not mutate the active recommendation context while the rebuild is in progress.

Only after successful completion:

  • invalidate the cached runtime context
  • reload artifacts on the next request

Failure must leave the old active context untouched.
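The invalidate-on-success rule can be sketched as a lazily reloading cache. ActiveContext, on_rebuild_finished and the loader callable are hypothetical; the sketch assumes the loader reads whichever artifacts are currently active on disk and that exit code 0 means success.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class ActiveContext(Generic[T]):
    """Caches the active artifacts and reloads them lazily after invalidation."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._cached: Optional[T] = None

    def get(self) -> T:
        """Return the cached context, loading it on first use."""
        with self._lock:
            if self._cached is None:
                self._cached = self._loader()
            return self._cached

    def invalidate(self) -> None:
        """Drop the cache so the next get() reloads from disk."""
        with self._lock:
            self._cached = None


def on_rebuild_finished(context: ActiveContext, exit_code: int) -> None:
    """Invalidate only on success; a failed rebuild leaves the cache alone."""
    if exit_code == 0:
        context.invalidate()
```

With several Gunicorn workers each process holds its own cache, so every worker would have to notice completion on its own, for instance by comparing a marker read from the durable state file.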

5. Poll durable state, not in-memory job objects

The rebuild status endpoint should read the rebuild state file and render the current rebuild state from durable data.

This must survive:

  • page refresh
  • navigating away and back
  • worker restart
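A status read that depends only on the file and the operating system might look like the sketch below. It is POSIX-only, and pid_is_alive, current_status and the policy of reporting a vanished process as failed are all assumptions rather than part of this proposal.

```python
import json
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """POSIX probe: signal 0 checks for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def current_status(state_path: Path) -> dict:
    """Build the status response from durable data only."""
    try:
        state = json.loads(state_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {"status": "idle"}
    if state.get("status") == "rebuilding" and not pid_is_alive(state["pid"]):
        # Assumed policy: a vanished process with no recorded result failed.
        state = {
            **state,
            "status": "rebuild failed",
            "error_summary": "rebuild process exited without recording a result",
        }
    return state
```

Because nothing here refers to an in-memory job object, a freshly restarted worker returns the same answer as the worker that launched the rebuild. One caveat: a child that has exited but has not yet been reaped by its parent still counts as alive to this probe.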

6. Surface meaningful failure output

The UI should show a concise failure summary derived from the rebuild subprocess output, not just "Process exited with status 1".

The full log can remain in a file for operator inspection.
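A simple heuristic is to take the last few non-empty log lines, since a failing command usually prints its error at the end. The sketch below assumes that heuristic; summarize_failure, its defaults and the fallback messages are hypothetical.

```python
from pathlib import Path


def summarize_failure(log_path: Path, max_lines: int = 3, max_chars: int = 300) -> str:
    """Return the last few non-empty log lines as a short failure summary."""
    try:
        text = log_path.read_text(encoding="utf-8", errors="replace")
    except FileNotFoundError:
        return "Rebuild failed and no log file was found."
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "Rebuild failed without producing any output."
    summary = " | ".join(lines[-max_lines:])
    return summary[-max_chars:]  # keep the tail, where the error usually is
```

The result would be stored in error_summary, while log_path continues to point at the complete log. Reading the whole file is acceptable for small logs; very large logs would call for reading only the tail.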

Suggested Implementation Steps

  1. Remove the current in-process rebuild path from the web layer.
  2. Add a small rebuild-state module responsible for:
    • state file read/write
    • lock acquisition/release
    • process metadata
  3. Implement rebuild process launch via subprocess.
  4. Replace the current in-memory rebuild polling with reads of the durable state file.
  5. Invalidate runtime context only after successful completion.
  6. Preserve and display concise failure messages.
  7. Add tests for the rebuild supervisor contract.

Acceptance Criteria

  • POST /rebuild starts at most one rebuild.
  • Refreshing /settings during rebuild still shows correct progress.
  • Query requests continue working during rebuild and use the last completed active data.
  • A failed rebuild does not break queries or replace active artifacts.
  • A successful rebuild becomes active on the next request after completion.
  • Saving rebuild-requiring settings never auto-starts rebuild work.
  • The UI shows a meaningful failure reason when rebuild fails.

Explicit Non-Goals

Do not turn this into a general-purpose job framework.

Out of scope:

  • multiple concurrent job types
  • cancellation
  • retries
  • queueing multiple rebuilds
  • resumable jobs after reboot
  • real-time log streaming
  • generalized worker orchestration

Why this scope is correct

This keeps the convenience of browser-triggered rebuilds without drifting into SaaS-style control-plane complexity. It fits the product philosophy: single-user, explicit operations, predictable behavior, and low operational burden.

Metadata

Labels: enhancement, priority: medium