P0-7: instant-api terminationGracePeriodSeconds=35 + preStop (MR-P0-7) #15
Merged
…MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new Prometheus alerts tied to the worker repo's PASS 3 enhanced reasons + PASS 6 stuck-build counters:
- OrphanSweepNoDBRowReap (CRITICAL, 1h): a k8s namespace had no backing deployments row (the P0-3 atomic-provision symptom). Pages on first occurrence over 1h.
- OrphanSweepStuckBuildSpike (WARNING, 15m): >5 stuck-build flips in 15m means the kaniko/GHCR build pipeline is degraded for many customers at once.
- OrphanSweepReapFailureRate (WARNING, 30m): the reconciler detected orphans it cannot reap (k8s/DB write failure sustained).

The counters land in worker master commit 7d2ff0d; the alerts go live once the deploy lands + scrape picks them up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnosis
The api Deployment in k8s/app.yaml was missing two pieces needed for graceful shutdown:
- terminationGracePeriodSeconds: defaulted to 30s, which collides with the api's drain budget (preStop sleep 5 + readinessDrainGrace 3 + ShutdownWithTimeout 25 = 33s of in-process work). Kubelet was SIGKILLing mid-drain.
- preStop lifecycle hook: without it, the kubelet sends SIGTERM immediately on pod termination. The LB doesn't refresh Service endpoints until the readinessProbe fails on the next tick, so new traffic kept landing on a pod that was about to stop accepting connections.
Diff Summary
k8s/app.yaml:
- terminationGracePeriodSeconds: 35 on the api pod spec (budget: preStop 5s + readinessDrainGrace 3s + shutdownTimeout 25s + safety 2s).
- lifecycle.preStop.exec.command: ["/bin/sh", "-c", "sleep 5"] on the api container. This gives the kubelet a window to observe the api's /readyz 503 flip (via hooks.Readyz.MarkDraining in the api repo's companion PR) and update Service endpoints before SIGTERM is delivered.
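For orientation, the two changes could sit in the Deployment roughly as follows. This is a minimal sketch, not the actual diff: the Deployment name and namespace are taken from the verify plan, while the labels, the container name api, and the image are placeholders assumed for illustration.

```yaml
# Sketch only: fields marked "assumed" are not taken from the PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: instant-api            # from deploy/instant-api in the verify plan
  namespace: instant
spec:
  selector:
    matchLabels:
      app: instant-api         # assumed label
  template:
    metadata:
      labels:
        app: instant-api       # assumed label
    spec:
      # Budget: preStop 5s + readinessDrainGrace 3s + shutdownTimeout 25s + safety 2s
      terminationGracePeriodSeconds: 35
      containers:
        - name: api            # assumed container name
          image: example.invalid/instant-api:placeholder   # assumed image
          lifecycle:
            preStop:
              exec:
                # Delays SIGTERM by 5s; the sleep counts against the 35s grace period.
                command: ["/bin/sh", "-c", "sleep 5"]
```

Note that the exec hook needs /bin/sh and sleep to exist in the container image.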
Required Companion PR
api repo: ship/p0-7-graceful-shutdown-readiness-2026-05-20 adds MarkDraining to /readyz + wires hooks.Readyz.MarkDraining() into the SIGTERM handler. Must land together: this manifest change alone widens the grace period but doesn't flip /readyz to 503, so the LB still routes new traffic to a draining pod.
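The 503 flip only changes routing if a readinessProbe polls /readyz. A probe along these lines would do that. It is a sketch under assumptions, since the PR does not show the probe: only the /readyz path comes from the description, and the port and every timing value are hypothetical.

```yaml
# Fragment of the api container spec (sketch).
# Only the /readyz path comes from the PR; port and timings are assumed.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080              # assumed port
  periodSeconds: 1          # assumed: probe once per second
  timeoutSeconds: 1         # assumed
  failureThreshold: 1       # assumed: a single failed probe marks the pod unready
```

With values like these, the kubelet would mark the pod unready roughly one second after the flip, which would fit inside the 3s readinessDrainGrace if that is the interval the api waits after MarkDraining. Slower probe settings would likely need a longer drain grace.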
Live Verify Plan (post-merge)
1. kubectl apply -f k8s/app.yaml (or whichever path is canonical for infra).
2. kubectl rollout restart deploy/instant-api -n instant
3. kubectl describe pod mid-roll shows preStop running, then probe failing, then container exit.
4. kubectl get events -n instant --sort-by='.lastTimestamp' | tail -20 shows no 'FailedKillPod' / 'killed before terminationGracePeriod' events.
🤖 Generated with Claude Code