…ung pods
Without a livenessProbe, zoekt-webserver pods that crash silently remain in an unhealthy state indefinitely. The probe acts as a backup to the in-process watchdog: with failureThreshold=10 and period=60s it fires after 600s, which is longer than the watchdog's 9x60s (540s) detection window, so the watchdog gets the first chance to act.
ref incident INC-484 https://sourcegraph.slack.com/archives/C0APJUXBG4R
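For readers without the diff open, the probe described above might look roughly like the fragment below. This is a sketch, not the actual change: only `failureThreshold` and `periodSeconds` come from the description, while the `/healthz` path, the `http` port name, and `timeoutSeconds` are assumptions borrowed from the 2018 probe mentioned later in this thread.

```yaml
# Sketch only: a container-level livenessProbe for zoekt-webserver.
livenessProbe:
  httpGet:
    path: /healthz        # assumed; not confirmed by this PR's description
    port: http            # assumed named port (6070 in the 2018 definition)
  periodSeconds: 60       # from the description: period=60s
  failureThreshold: 10    # from the description: 10 x 60s = 600s before restart
  timeoutSeconds: 5       # assumed; pick a value that tolerates slow responses
```

With these values the kubelet restarts the container only after ten consecutive failed checks, about 10 minutes, which stays above the watchdog's 540s window.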
michaellzc
left a comment
can you quickly validate this on a cloud instance?
Tested diff and applied against https://sourcegraph.slack.com/archives/C05DWT4ANHH/p1774644755678949
keegancsmith
left a comment
so if the pod is hung, it will take 10 minutes before we kill it? That seems alright.
FYI I kinda remember us intentionally not doing this back in the day. But it likely had to do with hard-to-control stuff with on-prem + Kubernetes being more immature, which is why we extended the watchdog stuff. But yeah, I think the recent incident caused the Go runtime to kinda crash, so we need this check to not be internal.
Here is the history I managed to find in the infrastructure repo!
Yes, a liveness probe was set for zoekt-webserver, but it was short-lived. The K8s definitions lived under kubernetes/ in this repo (as Go code using a kubegen library).
- Added on Feb 2, 2018 by Nick Snyder in commit 82c2aa37 — an HTTP liveness probe hitting /healthz on port http (6070) with 5s timeout and 5s initial delay.
- Removed 4 days later on Feb 6, 2018 by Beyang Liu in commit d3749e67 with the message: "indexed-search: remove liveness check (it can erroneously kill zoekt-webserver)".
So it existed for about 4 days before being pulled because it was causing false-positive failures that killed the pod.
https://ampcode.com/threads/T-019d3db6-b71c-739f-855d-aca839ba9170


zoekt-webserver was unhealthy for a long time
ref https://linear.app/sourcegraph/issue/PLAT-509/incident-indexed-search-pods-were-unhealthy-for-long-time PLAT-509