Plan: Runner host retention and standby posture #896

@shiny-code-bot

Objective

Design the next runner-host infra posture after Launchplane-owned hygiene is in place: durable retention budgets, drift thresholds, and a recovery/standby strategy for chris-testing or its successor hosts.

Finish Line

Runner-host retention budgets and recovery posture are made explicit once scheduled hygiene evidence is available.

Current Status

State: The small corrected-counter orphan BuildKit cleanup is complete. A temporary independent allowlist was set only for buildx_buildkit_launchplane-ci0_state and buildx_buildkit_verireel-ci0_state. The dry run verified that both volumes were dangling with 0 links, the mutate run removed them, the allowlist variable was then deleted, and a final report-only run is healthy.
Next action: Leave this issue waiting for future scheduled drift evidence or an operator decision about broader retention thresholds/standby design. No immediate cleanup mutation is pending.
Blocked by: No native issue blocker.
Waiting for: Scheduled drift evidence or operator choice on broader retention/standby policy.
Last verified: 2026-05-25.
  • Dry-run run 26407271540 used audit key runner-host-hygiene/2026-05-25/chris-testing-orphan-volume-cleanup-dry-run-1; blockers only mutate_not_requested; targets present, dangling, links 0: buildx_buildkit_launchplane-ci0_state (1,421,000,000 bytes) and buildx_buildkit_verireel-ci0_state (3,060,000,000 bytes).
  • Mutate run 26407394767 used audit key runner-host-hygiene/2026-05-25/chris-testing-orphan-volume-cleanup-mutate-1; status completed; targets absent from post evidence; post report healthy; free disk 401,200,775,168 bytes; Docker reclaimable 46,014,000,000 bytes; orphan BuildKit containers 0; orphan BuildKit volumes 0; warm builders present.
  • The temporary repo variable LAUNCHPLANE_RUNNER_HOST_HYGIENE_ALLOWED_BUILDKIT_STATE_VOLUMES was deleted after mutation.
  • Final steady-state dry-run run 26407633101 used audit key runner-host-hygiene/2026-05-25/chris-testing-post-orphan-cleanup-dry-run-1; report healthy; blockers only mutate_not_requested; free disk 401,630,429,184 bytes; Docker reclaimable 46,014,000,000 bytes; runner workdir 37,519,639,238 bytes; 68 images; 36 volumes; orphan BuildKit containers 0; orphan BuildKit volumes 0; warm builders present: odoo-docker:verify-devtools, odoo-docker:verify-runtime.
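
For a quick sanity check on the figures quoted in the verification record, the arithmetic can be sketched as below. The variable names are illustrative only. The volume sizes are the rounded values from the dry-run evidence, and the pre-mutation free-disk reading is not quoted in this issue, so this only totals the removed volumes and shows the free-disk drift between the mutate post report and the final dry run.

```python
# Sizes reported by dry-run run 26407271540 for the two removed volumes.
removed_volume_bytes = {
    "buildx_buildkit_launchplane-ci0_state": 1_421_000_000,
    "buildx_buildkit_verireel-ci0_state": 3_060_000_000,
}

# Free-disk readings from the mutate post report and the final dry run.
free_after_mutate = 401_200_775_168
free_final_dry_run = 401_630_429_184

total_removed = sum(removed_volume_bytes.values())
free_disk_drift = free_final_dry_run - free_after_mutate

print(f"removed volume bytes: {total_removed:,}")              # 4,481,000,000
print(f"free-disk drift between reports: {free_disk_drift:,}")  # 429,654,016
```

The drift between two reports taken minutes apart is a reminder that free-disk readings move on their own, which matters when picking alert thresholds.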

Scope

  • Define retention budgets for generic BuildKit cache, named BuildKit state volumes, image inventory, runner _work, logs, and warm Odoo builders.
  • Decide alert thresholds for scheduled dry-run reports: disk free bytes, Docker reclaimable bytes, volume growth, orphan BuildKit artifacts, and missing warm builders.
  • Decide whether chris-testing should remain a multi-role host, gain a warm standby, or be split into dedicated runner hosts.
  • Preserve Launchplane ownership of shared runner-host cleanup and prevent product repos from adding broad shared-host Docker prune jobs.
  • Keep future mutation evidence-first, named-target, and allowlisted.
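
As a starting point for the threshold discussion, here is a minimal sketch of how a scheduled dry-run report could be checked against a budget. Everything in it is an assumption for illustration: the `HygieneReport` and `Thresholds` types, the finding names, and especially the threshold numbers, which are placeholders and not decided policy. The report values are taken from the final 2026-05-25 dry run; the previous volume count is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HygieneReport:
    """Fields mirror the metrics quoted in the dry-run evidence (hypothetical shape)."""
    free_disk_bytes: int
    docker_reclaimable_bytes: int
    volume_count: int
    orphan_buildkit_containers: int
    orphan_buildkit_volumes: int
    warm_builders: frozenset


@dataclass(frozen=True)
class Thresholds:
    """Placeholder budget; the real numbers are an open question in this issue."""
    min_free_disk_bytes: int
    max_docker_reclaimable_bytes: int
    max_volume_growth: int
    required_warm_builders: frozenset


def evaluate(report, previous_volume_count, thresholds):
    """Return a list of findings; an empty list means nothing needs attention."""
    findings = []
    if report.free_disk_bytes < thresholds.min_free_disk_bytes:
        findings.append("free_disk_below_minimum")
    if report.docker_reclaimable_bytes > thresholds.max_docker_reclaimable_bytes:
        findings.append("docker_reclaimable_above_maximum")
    if report.volume_count - previous_volume_count > thresholds.max_volume_growth:
        findings.append("volume_growth_above_maximum")
    if report.orphan_buildkit_containers or report.orphan_buildkit_volumes:
        findings.append("orphan_buildkit_artifacts_present")
    missing = thresholds.required_warm_builders - report.warm_builders
    for builder in sorted(missing):
        findings.append(f"warm_builder_missing:{builder}")
    return findings


builders = frozenset({"odoo-docker:verify-devtools", "odoo-docker:verify-runtime"})
final_report = HygieneReport(
    free_disk_bytes=401_630_429_184,
    docker_reclaimable_bytes=46_014_000_000,
    volume_count=36,
    orphan_buildkit_containers=0,
    orphan_buildkit_volumes=0,
    warm_builders=builders,
)
placeholder = Thresholds(
    min_free_disk_bytes=200_000_000_000,           # placeholder, not policy
    max_docker_reclaimable_bytes=100_000_000_000,  # placeholder, not policy
    max_volume_growth=10,                          # placeholder, not policy
    required_warm_builders=builders,
)
findings = evaluate(final_report, previous_volume_count=36, thresholds=placeholder)
print(findings)
```

Returning named findings instead of a single pass/fail flag keeps the report readable as evidence and leaves the act-or-wait decision to the interpretation policy.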

Acceptance Criteria

  • Scheduled dry-run evidence has a documented interpretation policy for when to act.
  • Any broader cleanup mode is scoped, tested, and fails closed like the BuildKit state-volume lane.
  • The chris-testing recovery path is either accepted as rebuild-from-runbook or replaced with a concrete standby/split-host design.
  • Repo-local product cleanup guidance remains clear: product repos may own product-scoped cleanup, not shared runner-host Docker/BuildKit pruning.
  • Operator docs identify which variables, labels, service users, warm builders, and audit records are required for new runner hosts.
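
To make "fails closed" concrete for any broader cleanup mode, the gate could look roughly like the sketch below. This is not the existing hygiene implementation: the `VolumeEvidence` type, the function name, and every blocker string other than `mutate_not_requested` (which appears in the run evidence) are hypothetical. The idea is that mutation is allowed only when the blocker list is empty, and any missing input adds a blocker instead of being inferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeEvidence:
    """Per-volume facts gathered by a dry run (hypothetical shape)."""
    name: str
    present: bool
    dangling: bool
    links: int


def mutation_blockers(requested_targets, allowlist, evidence, mutate_requested):
    """Return every reason mutation must not proceed; empty means allowed."""
    blockers = []
    if not mutate_requested:
        blockers.append("mutate_not_requested")
    if not requested_targets:
        blockers.append("no_targets_named")
    if not allowlist:
        # An unset or empty allowlist never means "allow everything".
        blockers.append("allowlist_empty")
    for name in sorted(requested_targets):
        if name not in allowlist:
            blockers.append(f"not_allowlisted:{name}")
        ev = evidence.get(name)
        if ev is None or not ev.present:
            blockers.append(f"no_evidence:{name}")
            continue
        if not ev.dangling:
            blockers.append(f"not_dangling:{name}")
        if ev.links != 0:
            blockers.append(f"still_linked:{name}")
    return blockers


targets = {
    "buildx_buildkit_launchplane-ci0_state",
    "buildx_buildkit_verireel-ci0_state",
}
evidence = {
    name: VolumeEvidence(name=name, present=True, dangling=True, links=0)
    for name in targets
}

# Dry run: the only blocker is that mutation was not requested.
print(mutation_blockers(targets, targets, evidence, mutate_requested=False))
# Mutate run with matching allowlist and clean evidence: no blockers.
print(mutation_blockers(targets, targets, evidence, mutate_requested=True))
```

Keeping the allowlist as a separate input from the requested targets is what makes it an independent policy check: a typo or an overly broad target list is caught even when the evidence looks clean.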

Relationships

Validation

  • Review at least one scheduled Runner Host Hygiene report after the allowlist hardening in #895 (Require independent BuildKit volume allowlist) has merged.
  • If adding policy, run local focused hygiene tests, CI/Security/CodeQL, and one report-only rehearsal before any mutation.
  • If provisioning a standby/successor host, prove GitHub runner labels, service user, Docker/BuildKit setup, warm builders, Launchplane variables, and hygiene dry-run evidence before cutover.
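
The cutover proof list for a standby or successor host could be tracked with something as small as the sketch below. The item keys are hypothetical names derived from the validation bullet above, and how each proof is actually gathered is out of scope here; the only behavior shown is that an item counts as unproven unless it is explicitly recorded as true.

```python
# Hypothetical keys, one per item in the standby/successor validation bullet.
REQUIRED_PROOFS = (
    "github_runner_labels",
    "service_user",
    "docker_buildkit_setup",
    "warm_builders",
    "launchplane_variables",
    "hygiene_dry_run_evidence",
)


def unproven_items(proofs):
    """Return required items that are missing or not explicitly True."""
    return [item for item in REQUIRED_PROOFS if proofs.get(item) is not True]


# Example: a host where only some proofs have been recorded so far.
recorded = {
    "github_runner_labels": True,
    "service_user": True,
    "docker_buildkit_setup": True,
    "warm_builders": False,
}
remaining = unproven_items(recorded)
print("ready for cutover" if not remaining else f"blocked on: {remaining}")
```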

Decisions

  • Launchplane remains the only owner for shared runner-host hygiene.
  • Product repos should not run broad shared-host Docker prune jobs.
  • Future cleanup should favor named targets and independent policy inputs over inferred or automatic deletion.

Open Questions

  • What disk and Docker reclaimable thresholds should trigger attention or action?
  • Should Odoo warm builders keep their current linked state volumes indefinitely, or should they get explicit budgets?
  • Should chris-testing remain a shared multi-role host, or should Launchplane split ops-gate/hygiene from product CI lanes?
  • Is a warm standby worth the operational overhead now, or is the documented replacement runbook enough?

Labels: plan (Durable planning issue), plan:waiting (Plan is waiting on non-issue evidence or decision)
