Add JSON output, fix --stuck DaemonSet crashloop #1

Open
allreduce wants to merge 1 commit into masoncl:main from allreduce:json-output

allreduce commented Apr 25, 2026

Summary

  • Add --json / -j flag for JSONL output (one line per stuck task with timestamp, comm, pid, state, wait_seconds, cpu, kstack)
  • Fix --stuck unconditionally overriding --interval, --count, and --waiting with hardcoded values. It now sets them only as defaults when the caller hasn't provided explicit values
  • Fix --stuck early-exit behavior: break on the first result found only when count is bounded, not when running as a long-lived daemon
  • Minimize ringbuf allocations when not profiling (64MB+192MB → 4KB+4KB)
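
To make the JSONL format concrete, here is a minimal Python sketch of one record per stuck task. Only the field names come from the summary above; the function name, the field types, and the sample values are assumptions for illustration and are not taken from the rwalker source.

```python
import json
import time


def format_stuck_task(comm, pid, state, wait_seconds, cpu, kstack, timestamp=None):
    """Serialize one stuck task as a single JSONL line (hypothetical helper)."""
    record = {
        # Assumed to be epoch seconds; the PR does not state the timestamp format.
        "timestamp": time.time() if timestamp is None else timestamp,
        "comm": comm,
        "pid": pid,
        "state": state,
        "wait_seconds": wait_seconds,
        "cpu": cpu,
        # Assumed to be a list of kernel stack frame names.
        "kstack": list(kstack),
    }
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return json.dumps(record)


if __name__ == "__main__":
    # Sample values are made up for the demonstration.
    print(format_stuck_task("kworker/3:1", 4242, "D", 12.5, 3,
                            ["io_schedule", "wait_on_page_bit"]))
```

Keeping each record on its own line means a log collector can parse the output line by line without buffering a whole document, which is presumably why human-readable headers are suppressed in JSON mode.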

Problem

The DaemonSet passes --stuck --interval 10 --waiting 2, but --stuck hard-overrides interval to 2 and count to 5. rwalker therefore exits cleanly after ~10s (5 loops × 2s), Kubernetes restarts it, and eventually all 1900+ pods enter CrashLoopBackOff.
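
The "defaults, not overrides" behavior can be sketched in Python with argparse. This is an illustration under assumptions, not the rwalker implementation: the parser layout and the `parse_args` helper are hypothetical, and only the interval and count fallbacks (2 and 5) are taken from the description above. The hardcoded --waiting value is not stated in this PR, so the sketch passes --waiting through unchanged.

```python
import argparse

# Values that --stuck previously forced, per the description above.
# Here they are only fallbacks for options the caller left unset.
STUCK_DEFAULTS = {"interval": 2, "count": 5}


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="rwalker-sketch")
    parser.add_argument("--stuck", action="store_true")
    # default=None distinguishes "not provided" from an explicit value such as 0.
    parser.add_argument("--interval", type=int, default=None)
    parser.add_argument("--count", type=int, default=None)
    parser.add_argument("--waiting", type=int, default=None)
    args = parser.parse_args(argv)
    if args.stuck:
        for name, value in STUCK_DEFAULTS.items():
            if getattr(args, name) is None:
                setattr(args, name, value)
    return args


if __name__ == "__main__":
    # Hypothetical daemon-style invocation: explicit values survive --stuck.
    print(parse_args(["--stuck", "--interval", "10", "--count", "0", "--waiting", "2"]))
    # Bare --stuck still gets the old one-shot behavior (5 loops at 2s each).
    print(parse_args(["--stuck"]))
```

The check uses `is None` rather than truthiness so that an explicit `--count 0` is preserved instead of being replaced by the fallback.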

Add --json / -j flag for JSONL output (one line per stuck task with
timestamp, comm, pid, state, wait_seconds, cpu, kstack).  Suppress
human-readable noise (headers, "no tasks" messages) in JSON mode.

Fix --stuck mode unconditionally overriding --interval, --count, and
--waiting with hardcoded values.  Now only sets them as defaults when
the caller hasn't provided explicit values.  This was causing the
DaemonSet deployment to exit after ~10s (5 loops × 2s) despite passing
--interval 10, leading to CrashLoopBackOff across 1900+ pods.

Also fix --stuck early exit: break on the first result found only
when count is bounded (one-shot mode), not when running as a daemon
with --count 0.
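
A rough Python sketch of that loop condition, assuming count 0 means "run until stopped" as the paragraph above implies. The `run_stuck_scan` function and its `scan_once` callback are hypothetical names used for illustration only.

```python
import time


def run_stuck_scan(scan_once, interval, count, sleep=time.sleep):
    """Scan repeatedly; return the number of scans performed.

    count > 0 is one-shot mode: stop at the first hit or after `count` scans.
    count == 0 is daemon mode: keep scanning even after a hit.
    """
    bounded = count > 0
    loops = 0
    while True:
        found = scan_once()
        loops += 1
        # Early exit applies only when the run is bounded.
        if bounded and (found or loops >= count):
            break
        sleep(interval)
    return loops


if __name__ == "__main__":
    results = iter([[], ["stuck-task"], []])
    # One-shot mode stops on the second scan, where a task is found.
    print(run_stuck_scan(lambda: next(results), interval=0, count=5))
```

In daemon mode the function never returns on its own, so a supervisor such as Kubernetes sees a long-lived process instead of a clean exit followed by a restart.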

Minimize ringbuf allocations when not profiling to reduce per-pod
memory footprint (64MB+192MB → 4KB+4KB for stuck-task scanning).

Signed-off-by: Jenya Lee <jenya@thinkingmachines.ai>