
Prune stale cache entries during setup#32

Merged
jdoss merged 1 commit into master from fix/cache-prune-stale-entries
Apr 17, 2026

Conversation

Contributor

@jdoss jdoss commented Apr 17, 2026

Summary

Each time _register_secrets deletes and re-creates a Podman secret, Podman assigns a new hex ID. The old ID's cache entry becomes orphaned — valid ciphertext for a secret that no longer exists. Without pruning, the cache grows without bound across setup runs.

Observed on the test server: 13,376 cache entries for 495 real secrets (~27 setup cycles' worth of accumulation).

Fix: after writing new entries in run_setup, query the current Podman secret list and invalidate any cache key that isn't a current hex ID. This naturally handles orphans from re-registration cycles while preserving HSM lazy-cached entries for secrets that still exist.

Podman API failures during the prune step are non-fatal — log and skip rather than risk dropping valid entries.
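The prune step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `prune_stale_entries`, `current_secret_ids`, and the `podman_list` callable are assumptions standing in for whatever run_setup uses to talk to the Podman API.

```python
import logging

log = logging.getLogger("psi-secrets")

def current_secret_ids(podman_list):
    """Return the set of current Podman secret hex IDs, or None on API failure."""
    try:
        return {entry["ID"] for entry in podman_list()}
    except Exception as exc:
        # Non-fatal by design: log and skip rather than risk dropping valid entries.
        log.warning("skipping cache prune, Podman secret list failed: %s", exc)
        return None

def prune_stale_entries(cache, podman_list):
    """Invalidate any cache key that is not a current secret ID; return count pruned."""
    current = current_secret_ids(podman_list)
    if current is None:
        return 0  # never drop entries when we can't see the real secret list
    stale = [key for key in cache if key not in current]
    for key in stale:
        del cache[key]
    return len(stale)
```

Note that keys for HSM lazy-cached secrets survive as long as the secret itself still exists, since their hex IDs appear in the Podman list.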

Test plan

  • pytest tests/test_setup.py — 3 new tests covering the happy path (prune orphans), API-failure safety (no invalidate calls), and no-op clean case.
  • ruff check / ty check — clean.
  • Deploy to test server; expect cache size to drop from ~13k to ~495 entries after one setup run.
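The three test shapes listed above might look something like this sketch. The inline `prune_stale_entries` helper is a hypothetical stand-in for the project's actual prune code, not what lives in tests/test_setup.py.

```python
def prune_stale_entries(cache, list_ids):
    """Stand-in prune helper: drop keys not in the current ID list."""
    try:
        current = set(list_ids())
    except RuntimeError:
        return  # API failure: keep everything
    for key in [k for k in cache if k not in current]:
        del cache[key]

def test_prunes_orphans():
    cache = {"old": b"x", "new": b"y"}
    prune_stale_entries(cache, lambda: ["new"])
    assert cache == {"new": b"y"}

def test_api_failure_keeps_entries():
    cache = {"old": b"x"}
    def boom(): raise RuntimeError("podman down")
    prune_stale_entries(cache, boom)
    assert cache == {"old": b"x"}  # no invalidation on failure

def test_noop_when_clean():
    cache = {"a": b"x"}
    prune_stale_entries(cache, lambda: ["a"])
    assert cache == {"a": b"x"}
```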

@jdoss jdoss merged commit b081545 into master Apr 17, 2026
2 checks passed
jdoss added a commit that referenced this pull request Apr 17, 2026
Serve and setup each hold their own in-memory dict of the on-disk
cache. Without coordination, serve's cache.save() on a lookup miss
overwrites setup's freshly pruned state with serve's older dict —
resurrecting the stale entries the prune step in PR #32 just removed.
Observed on the test server: cache grew back to 15k+ entries within
minutes of setup pruning it down to ~500.

Drop the cache.save() call on the serve cache-miss path. Values still
populate the in-memory dict (cache.set), so subsequent lookups for the
same secret in the same process still hit, and tests asserting on
in-memory presence still pass. The disk file is owned exclusively by
setup, which prunes on every run.

Tradeoff: a secret lazily cached during serve runtime (e.g. an
HSM-stored secret outside config.workloads) is lost on serve restart
and will miss the cache on its next lookup. Acceptable — the cache's
purpose is surviving provider outages for workload secrets, and those
are always populated by setup.
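The serve-side miss path after this change can be sketched as below. The `Cache` class and `lookup` function are illustrative assumptions, not the project's actual API; the point is only that a miss populates memory via `set` and never writes the disk file.

```python
class Cache:
    """Minimal in-memory cache; disk persistence is owned exclusively by setup."""

    def __init__(self):
        self._entries: dict[str, bytes] = {}

    def get(self, key: str):
        return self._entries.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._entries[key] = value  # in-memory only; no save() on this path

def lookup(cache: Cache, secret_id: str, fetch) -> bytes:
    value = cache.get(secret_id)
    if value is None:
        value = fetch(secret_id)     # go to the provider on a miss
        cache.set(secret_id, value)  # populate memory, but do NOT persist
    return value
```

Because the miss still calls `cache.set`, a second lookup for the same secret in the same process hits memory, which is why tests asserting on in-memory presence keep passing.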
jdoss added a commit that referenced this pull request Apr 17, 2026
Every time psi-{provider}-refresh.timer fires, setup re-registers
secrets via delete+create through the Podman API, which assigns fresh
hex IDs. Setup writes those new IDs to the on-disk cache file and the
prune step from PR #32 drops the old entries. But serve holds the
OLD cache in memory from its last startup and never picks up the new
file state — so every lookup after the first refresh goes straight to
the provider, and the cache does no work until an operator manually
restarts psi-secrets.

Observed on the test server: 1554 secret lookups over 30 minutes, zero
cache hits. All source=provider. The refresh timer had fired 7
minutes earlier and silently broke the cache.

Add a second ExecStart to the refresh wrapper that runs systemctl
try-restart psi-secrets.service after setup completes. try-restart is
a no-op if serve is not currently active, so this is safe on hosts
that have intentionally stopped psi-secrets.

There is a brief (~30s on HSM) lookup-fails-to-cache window during
the serve restart, but this happens at most once per
cache.refresh_interval (default 1h) instead of never.
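The wrapper change might look like the fragment below. The unit name and binary path are assumptions for illustration; the real wrapper is psi-{provider}-refresh.service with whatever setup command it already runs. Multiple ExecStart lines require Type=oneshot, and systemctl try-restart exits successfully without starting the service if it is inactive.

```ini
[Service]
Type=oneshot
ExecStart=/usr/bin/psi-secrets setup
# Restart serve so it picks up the freshly pruned cache and new hex IDs.
# try-restart is a no-op if psi-secrets.service is intentionally stopped.
ExecStart=/usr/bin/systemctl try-restart psi-secrets.service
```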