From b5e4af92331bdda3756b742ea4bee24cdbe82566 Mon Sep 17 00:00:00 2001 From: Joe Doss Date: Fri, 17 Apr 2026 00:32:21 -0500 Subject: [PATCH] Drop cache-hit lookups to DEBUG, document the healthcheck amplifier MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Podman re-resolves every Secret= reference in a container's quadlet on every `podman healthcheck run` call, not just at container start. With a default HealthInterval=30s across ~50 containers, this generates a constant ~15+ lookups/sec against the PSI socket. The cache serves these from an in-memory dict in under a millisecond — the throughput cost is negligible — but logging every hit at INFO floods the journal with thousands of entries per minute. Move the cache-hit log line to DEBUG. Cache misses, provider fetches, and errors stay at INFO / WARNING / ERROR so anything interesting still surfaces. Document the behavior in the README cache section and add a troubleshooting entry in docs/secret-cache.md explaining that this is upstream Podman behavior, not a PSI regression. --- README.md | 10 ++++++++++ docs/secret-cache.md | 20 ++++++++++++++++++++ psi/serve.py | 2 +- 3 files changed, 31 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8c15918..84a8d69 100644 --- a/README.md +++ b/README.md @@ -220,6 +220,16 @@ See the [secret cache reference](docs/secret-cache.md) for the threat model, env format, deployment walkthroughs (native TPM, container TPM, container HSM), and troubleshooting. +**Why `psi serve` is busy even when containers are stable:** Podman re-resolves +every `Secret=` reference in a container's quadlet each time that container's +healthcheck fires — not just at container start. With ~50 containers and a +default `HealthInterval=30s`, this generates a constant ~15+ lookups per second +against the PSI socket, one per secret per container per healthcheck cycle. This +is upstream Podman behavior, not a PSI bug. The cache serves these hits from an +in-memory dict in under a millisecond, so the cost is trivial — cache hits are +logged at `DEBUG` so the journal is not flooded. Cache misses and provider +errors stay at `INFO` / `WARNING`. + ### Workloads Each workload specifies which provider handles its secrets: diff --git a/docs/secret-cache.md b/docs/secret-cache.md index 79d1dc1..50b5c4a 100644 --- a/docs/secret-cache.md +++ b/docs/secret-cache.md @@ -513,3 +513,23 @@ container will be killed and restarted in a loop. Confirm the issue with `sudo journalctl -u psi-secrets.service` and raise `HealthStartPeriod` in the generated quadlet (or use the TPM backend, which has much lower startup latency). + +### `psi serve` is handling 10+ lookups/sec when all containers are stable + +This is expected and is not a PSI issue. Podman re-resolves every `Secret=` +reference in a container's quadlet on every `podman healthcheck run` call — +not just at container start. With the default `HealthInterval=30s`, each +container with N secrets generates N lookups per 30 seconds for its entire +lifetime. Across a fleet this reaches double-digit lookups per second. + +The secret value is already resolved into the container's environment at +start time and never changes, so the re-resolution is wasted work, but that +is Podman's behavior and not something PSI can opt out of. The cache serves +every one of these hits from an in-memory dict in under a millisecond, so +the throughput cost is negligible. + +To keep the journal quiet, `psi serve` logs cache hits at `DEBUG` level +(off by default). Cache misses, provider fetches, and errors still log at +`INFO` / `WARNING` / `ERROR`. If you want to see the hit-rate, either run +with `--log-level=DEBUG` or grep for `"event": "secret.lookup"` in the +journal with `LOGURU_LEVEL=DEBUG` exported into the serve unit. diff --git a/psi/serve.py b/psi/serve.py index 0bab46b..a315a7b 100644 --- a/psi/serve.py +++ b/psi/serve.py @@ -229,7 +229,7 @@ def _handle_lookup(self, secret_id: str) -> None: cached = cache.get(secret_id) if cached is not None: self._respond(200, cached) - audit.bind(outcome="success", source="cache").info("lookup") + audit.bind(outcome="success", source="cache").debug("lookup") return try: