From b5e4af92331bdda3756b742ea4bee24cdbe82566 Mon Sep 17 00:00:00 2001
From: Joe Doss <joe@quickvm.com>
Date: Fri, 17 Apr 2026 00:32:21 -0500
Subject: [PATCH] Drop cache-hit lookups to DEBUG, document the healthcheck
 amplifier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Podman re-resolves every Secret= reference in a container's quadlet on
every `podman healthcheck run` call, not just at container start.
With a default HealthInterval=30s across ~50 containers, this generates
a constant ~15+ lookups/sec against the PSI socket. The cache serves
these from an in-memory dict in under a millisecond — the throughput
cost is negligible — but logging every hit at INFO floods the journal
with thousands of entries per minute.

Move the cache-hit log line to DEBUG. Cache misses, provider fetches,
and errors stay at INFO / WARNING / ERROR so anything interesting still
surfaces.

Document the behavior in the README cache section and add a
troubleshooting entry in docs/secret-cache.md explaining that this is
upstream Podman behavior, not a PSI regression.
---
 README.md            | 10 ++++++++++
 docs/secret-cache.md | 20 ++++++++++++++++++++
 psi/serve.py         |  2 +-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8c15918..84a8d69 100644
--- a/README.md
+++ b/README.md
@@ -220,6 +220,16 @@ See the [secret cache reference](docs/secret-cache.md) for the threat model, env
 format, deployment walkthroughs (native TPM, container TPM, container HSM), and
 troubleshooting.
 
+**Why `psi serve` is busy even when containers are stable:** Podman re-resolves
+every `Secret=` reference in a container's quadlet each time that container's
+healthcheck fires — not just at container start. With ~50 containers and a
+default `HealthInterval=30s`, this generates a constant ~15+ lookups per second
+against the PSI socket, one per secret per container per healthcheck cycle. This
+is upstream Podman behavior, not a PSI bug. The cache serves these hits from an
+in-memory dict in under a millisecond, so the cost is trivial — cache hits are
+logged at `DEBUG` so the journal is not flooded. Cache misses and provider
+errors stay at `INFO` / `WARNING`.
+
 ### Workloads
 
 Each workload specifies which provider handles its secrets:
diff --git a/docs/secret-cache.md b/docs/secret-cache.md
index 79d1dc1..50b5c4a 100644
--- a/docs/secret-cache.md
+++ b/docs/secret-cache.md
@@ -513,3 +513,23 @@ container will be killed and restarted in a loop. Confirm the issue with
 `sudo journalctl -u psi-secrets.service` and raise `HealthStartPeriod` in
 the generated quadlet (or use the TPM backend, which has much lower
 startup latency).
+
+### `psi serve` is handling 10+ lookups/sec when all containers are stable
+
+This is expected and is not a PSI issue. Podman re-resolves every `Secret=`
+reference in a container's quadlet on every `podman healthcheck run` call —
+not just at container start. With the default `HealthInterval=30s`, each
+container with N secrets generates N lookups per 30 seconds for its entire
+lifetime. Across a fleet this reaches double-digit lookups per second.
+
+The secret value is already resolved into the container's environment at
+start time and never changes, so the re-resolution is wasted work, but that
+is Podman's behavior and not something PSI can opt out of. The cache serves
+every one of these hits from an in-memory dict in under a millisecond, so
+the throughput cost is negligible.
+
+To keep the journal quiet, `psi serve` logs cache hits at `DEBUG` level
+(off by default). Cache misses, provider fetches, and errors still log at
+`INFO` / `WARNING` / `ERROR`. If you want to see the hit-rate, either run
+with `--log-level=DEBUG` or grep for `"event": "secret.lookup"` in the
+journal with `LOGURU_LEVEL=DEBUG` exported into the serve unit.
diff --git a/psi/serve.py b/psi/serve.py
index 0bab46b..a315a7b 100644
--- a/psi/serve.py
+++ b/psi/serve.py
@@ -229,7 +229,7 @@ def _handle_lookup(self, secret_id: str) -> None:
                 cached = cache.get(secret_id)
                 if cached is not None:
                     self._respond(200, cached)
-                    audit.bind(outcome="success", source="cache").info("lookup")
+                    audit.bind(outcome="success", source="cache").debug("lookup")
                     return
 
             try: