10 changes: 10 additions & 0 deletions README.md
@@ -220,6 +220,16 @@ See the [secret cache reference](docs/secret-cache.md) for the threat model, env
format, deployment walkthroughs (native TPM, container TPM, container HSM), and
troubleshooting.

**Why `psi serve` is busy even when containers are stable:** Podman re-resolves
every `Secret=` reference in a container's quadlet each time that container's
healthcheck fires, not just at container start. That is one lookup per secret
per container per healthcheck cycle. With ~50 containers at the default
`HealthInterval=30s`, the PSI socket sees a constant ~15+ lookups per second
(which works out to roughly nine secrets per container). This
is upstream Podman behavior, not a PSI bug. The cache serves these hits from an
in-memory dict in under a millisecond, so the cost is trivial. Cache hits are
logged at `DEBUG` so the journal is not flooded; cache misses and provider
errors stay at `INFO` / `WARNING`.

### Workloads

Each workload specifies which provider handles its secrets:
20 changes: 20 additions & 0 deletions docs/secret-cache.md
@@ -513,3 +513,23 @@ container will be killed and restarted in a loop. Confirm the issue with
`sudo journalctl -u psi-secrets.service` and raise `HealthStartPeriod` in
the generated quadlet (or use the TPM backend, which has much lower
startup latency).

### `psi serve` is handling 10+ lookups/sec when all containers are stable

This is expected and is not a PSI issue. Podman re-resolves every `Secret=`
reference in a container's quadlet on every `podman healthcheck run` call,
not just at container start. With the default `HealthInterval=30s`, each
container with N secrets generates N lookups per 30 seconds for its entire
lifetime. Across a fleet this reaches double-digit lookups per second.
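
The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not part of PSI: the function name `lookups_per_second` is hypothetical, the fleet size of 50 containers is the README's example figure, and nine secrets per container is an assumed value chosen to match that example.

```python
def lookups_per_second(
    containers: int, secrets_per_container: int, health_interval_s: float
) -> float:
    """Steady-state lookup rate if every healthcheck re-resolves every secret."""
    if health_interval_s <= 0:
        raise ValueError("health_interval_s must be positive")
    # Each container triggers one lookup per secret per healthcheck cycle.
    return containers * secrets_per_container / health_interval_s


if __name__ == "__main__":
    # Assumed fleet: 50 containers, 9 secrets each, HealthInterval=30s.
    rate = lookups_per_second(containers=50, secrets_per_container=9, health_interval_s=30)
    print(f"{rate:.1f} lookups/sec")
```

Under this model the rate scales linearly with container count and secret count and inversely with the interval, so raising `HealthInterval` is the only knob on the Podman side that lowers it.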

The secret value is already resolved into the container's environment at
start time and never changes, so the re-resolution is wasted work, but that
is Podman's behavior and not something PSI can opt out of. The cache serves
every one of these hits from an in-memory dict in under a millisecond, so
the throughput cost is negligible.

To keep the journal quiet, `psi serve` logs cache hits at `DEBUG` level
(off by default). Cache misses, provider fetches, and errors still log at
`INFO` / `WARNING` / `ERROR`. To see the hit rate, either run with
`--log-level=DEBUG`, or export `LOGURU_LEVEL=DEBUG` into the serve unit and
grep the journal for `"event": "secret.lookup"`.
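
The effect of moving cache hits to `DEBUG` can be illustrated with a small level-filtering sketch. It uses Python's standard `logging` module purely for illustration; PSI's own logger setup is not shown in this change, and the helper name `emitted_lines`, the logger names, and the message strings are all hypothetical.

```python
import io
import logging


def emitted_lines(level: int) -> list[str]:
    """Log one cache hit (DEBUG) and one provider fetch (INFO); return what was emitted."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    log = logging.getLogger(f"psi_sketch_{level}")  # hypothetical logger name
    log.setLevel(level)
    log.propagate = False  # keep the sketch isolated from the root logger
    log.addHandler(handler)

    log.debug("lookup source=cache")    # cache hit: DEBUG after this change
    log.info("lookup source=provider")  # cache miss served by the provider: INFO

    log.removeHandler(handler)
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    print(emitted_lines(logging.INFO))   # cache hit is filtered out
    print(emitted_lines(logging.DEBUG))  # both records are emitted
```

At the default level only the provider fetch reaches the output, which is why the journal stays quiet under a steady stream of cache hits.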
2 changes: 1 addition & 1 deletion psi/serve.py
@@ -229,7 +229,7 @@ def _handle_lookup(self, secret_id: str) -> None:
         cached = cache.get(secret_id)
         if cached is not None:
             self._respond(200, cached)
-            audit.bind(outcome="success", source="cache").info("lookup")
+            audit.bind(outcome="success", source="cache").debug("lookup")
             return
 
         try: