A CLI tool to simulate failure conditions on EKS worker nodes for testing the EKS Node Monitoring Agent (NMA).

The NMA continuously monitors node health across five categories and can trigger automatic node repairs. This CLI helps you test the NMA's detection capabilities by simulating various failure conditions.
```bash
# Clone and build
git clone https://github.com/jicowan/hma-cli.git
cd hma-cli
make build

# Or install directly
go install github.com/jicowan/hma-cli/cmd/hma-cli@latest

# Build Linux binaries
make build-linux
# Creates: hma-cli-linux-amd64, hma-cli-linux-arm64
```

Usage:

```bash
hma-cli --node <node-name> <category> <failure-type> [flags]
```

**Important**: The `--node` flag is required for all simulations. The CLI creates a privileged pod on the target node to execute commands.
| Flag | Description |
|---|---|
| `--node` | **Required.** Target node name (creates privileged pod automatically) |
| `--keep-alive` | Keep pod alive for duration (e.g., `30m`). Required for process-based simulations. |
| `--kubeconfig` | Path to kubeconfig (default: `~/.kube/config`) |
| `--dry-run` | Show what would happen without executing |
| `--force` | Skip confirmation prompts |
| `--cleanup` | Revert simulation. Needed for: `pid-exhaustion`, `interface-down` |
```bash
# Create zombie processes (threshold: >= 20)
# Requires --keep-alive to prevent processes from being killed on pod exit
hma-cli --node <node-name> kernel zombies --count 25 --keep-alive 30m --force
```
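Under the hood, a zombie is just a dead child whose parent never called `wait()`. You can create one by hand to see what the agent counts; this is an illustrative sketch, not how hma-cli does it:

```shell
# Count processes currently in state Z by scanning /proc.
count_zombies() {
  grep -l '^State:.*(zombie)' /proc/[0-9]*/status 2>/dev/null | wc -l
}

before=$(count_zombies)
# `true` exits immediately; after `exec` the shell's process image is
# `sleep`, which never calls wait(), so the dead child lingers as a
# zombie until the sleep ends and init reaps it.
sh -c 'true & exec sleep 3' &
sleep 1   # give the child time to exit
after=$(count_zombies)
echo "zombies before=$before after=$after"
```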
```bash
# Exhaust PIDs (threshold: > 70% of MAX(pid_max, threads-max))
# Requires --keep-alive to keep sleep processes alive
hma-cli --node <node-name> kernel pid-exhaustion --keep-alive 30m --force
```
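The PID-usage threshold can be checked by hand on a node. A sketch using the limit formula stated above (`max_limit` is an illustrative helper, not hma-cli code):

```shell
# The NMA-style limit is MAX(pid_max, threads-max); flag usage > 70%.
max_limit() {
  if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

pid_max=$(cat /proc/sys/kernel/pid_max)
threads_max=$(cat /proc/sys/kernel/threads-max)
limit=$(max_limit "$pid_max" "$threads_max")
in_use=$(ls /proc | grep -c '^[0-9]')   # numeric /proc entries = live PIDs
pct=$((in_use * 100 / limit))
echo "PIDs in use: $in_use / $limit ($pct%)"
if [ "$pct" -gt 70 ]; then
  echo "over 70%: the NMA would flag this node"
fi
```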
```bash
# Inject kernel log patterns (dmesg injection, no --keep-alive needed)
hma-cli --node <node-name> kernel kernel-bug --force   # Creates Warning event
hma-cli --node <node-name> kernel soft-lockup --force  # Creates Warning event

# Exhaust PIDs to cause kubelet fork failures - triggers KernelReady=False
# WARNING: This may make the node unrecoverable and require node replacement!
hma-cli --node <node-name> kernel fork-oom --force
```

> **Warning**: The `fork-oom` simulation exhausts node PIDs and may make the node unrecoverable. The node may need to be deleted and replaced after running this simulation. Only use on nodes you can afford to lose.
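To confirm a dmesg injection landed, you can grep the kernel ring buffer for the patterns this README says the NMA watches (`BUG:`, `soft lockup`). A sketch; the agent's real pattern set may be broader:

```shell
# Return success when a log line matches an NMA-watched kernel pattern.
matches_nma_pattern() {
  printf '%s\n' "$1" | grep -qE 'BUG:|soft lockup'
}

# Scan the live kernel ring buffer (usually needs root):
dmesg 2>/dev/null | grep -E 'BUG:|soft lockup' | tail -n 5
```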
```bash
# Kill IPAMD repeatedly (NMA requires 5 restarts to trigger condition)
# Requires --keep-alive to allow background kill loop to run
hma-cli --node <node-name> networking ipamd-down --keep-alive 10m --force

# Bring down secondary ENI (auto-detects eth1/ens6)
# Note: May not trigger NMA condition change in all configurations
hma-cli --node <node-name> networking interface-down --force
```

```bash
# Create I/O delay process (NMA checks every 10 minutes)
# Requires --keep-alive for at least 15 minutes
hma-cli --node <node-name> storage io-delay --keep-alive 15m --force
```

```bash
# Kill kubelet process repeatedly to increment NRestarts counter
# NMA threshold: NRestarts > 3 AND increasing
# Requires --keep-alive to complete all kills
hma-cli --node <node-name> runtime systemd-restarts --keep-alive 10m --force
```

```bash
# Inject NVIDIA XID error (requires DCGM installed)
# Fatal codes: 13, 31, 48, 63, 64, 74, 79, 94, 95, 119, 120, 121, 140
hma-cli --node <node-name> accelerator xid-error --code 79 --force

# Inject AWS Neuron errors (dmesg injection)
hma-cli --node <node-name> accelerator neuron-sram-error --force
hma-cli --node <node-name> accelerator neuron-hbm-error --force
hma-cli --node <node-name> accelerator neuron-nc-error --force
hma-cli --node <node-name> accelerator neuron-dma-error --force
```

Create a NodeDiagnostic CR to collect logs from a node. The CLI auto-generates a presigned S3 PUT URL.
```bash
# Create NodeDiagnostic CR (auto-generates presigned URL)
hma-cli diagnose --node <node-name> --bucket my-logs-bucket

# Wait for completion
hma-cli diagnose --node <node-name> --bucket my-logs-bucket --wait

# Check status of existing NodeDiagnostic
hma-cli diagnose --node <node-name> --status

# Create, wait, then delete
hma-cli diagnose --node <node-name> --bucket my-logs-bucket --wait --delete
```

The logs are uploaded to: `s3://<bucket>/<timestamp>/<node-name>/logs.tar.gz`

After completion, download with:

```bash
aws s3 cp s3://my-logs-bucket/2026-03-17T15-30-00Z/ip-10-0-1-123.ec2.internal/logs.tar.gz ./logs.tar.gz
```

Many simulations run in the foreground and require `--keep-alive` to keep the node-shell pod running. The simulation runs continuously until the keep-alive duration expires or you press Ctrl+C.

Some simulations modify persistent system state that survives pod deletion. Use `--cleanup` to revert these changes.
| Simulation | Needs `--keep-alive` | Needs `--cleanup` | Notes |
|---|---|---|---|
| `zombies` | Yes (30m) | No | Processes die with pod |
| `pid-exhaustion` | Yes (30m) | Yes | Lowered pid_max persists |
| `io-delay` | Yes (15m) | No | Worker dies with pod |
| `systemd-restarts` | Yes (10m) | No | Kubelet auto-restarts |
| `fork-oom` | No | No | Node may be unrecoverable |
| `kernel-bug` | No | No | Dmesg injection is instant |
| `soft-lockup` | No | No | Dmesg injection is instant |
| `ipamd-down` | Yes (10m) | No | NMA needs 5 restarts (MinOccurrences) |
| `interface-down` | No | Yes | Auto-detects eth1/ens6 |
| `neuron-*` | No | No | Dmesg can't be cleaned |
List available simulations:

```bash
hma-cli list
```

See what a simulation would do without executing:

```bash
hma-cli --node <node-name> kernel zombies --dry-run
```

Cleanup is only needed for simulations that modify persistent system state:
```bash
# Restore pid_max and threads-max after pid-exhaustion
hma-cli --node <node-name> kernel pid-exhaustion --cleanup --force

# Bring interface back up after interface-down
hma-cli --node <node-name> networking interface-down --cleanup --force
```

For other simulations, the pod exit handles cleanup automatically.
After running a simulation, verify NMA detection:
```bash
# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type | test("Kernel|Network|Storage|Runtime|Accelerated"))'
```
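Condition changes can take a while to appear, so a small polling loop is handy. A sketch (`wait_for_condition` is an illustrative helper, not part of hma-cli; `KernelReady` is one of the NMA conditions mentioned in this README):

```shell
# Poll until the given node condition's status leaves "True", or time out.
# Usage: wait_for_condition <node> <condition-type> [timeout-seconds]
wait_for_condition() {
  node=$1; cond=$2; timeout=${3:-600}; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    status=$(kubectl get node "$node" \
      -o jsonpath="{.status.conditions[?(@.type=='$cond')].status}")
    if [ -n "$status" ] && [ "$status" != "True" ]; then
      echo "$cond is now $status after ${waited}s"
      return 0
    fi
    sleep 15
    waited=$((waited + 15))
  done
  echo "timed out after ${timeout}s: $cond still True"
  return 1
}

# Example: wait_for_condition <node-name> KernelReady 900
```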
```bash
# Check node events (for Warning-level detections)
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
```
```bash
# Check NMA logs on the node
NMA_POD=$(kubectl get pods -n kube-system -o wide | grep eks-node-monitoring | grep <node-name> | awk '{print $1}')
kubectl logs -n kube-system $NMA_POD --tail=100
```

The NMA has two detection levels:
| Level | Effect | Examples |
|---|---|---|
| CONDITION | Changes node condition to `False` | `ipamd-down`, `interface-down`, `neuron-*` |
| EVENT | Creates Warning event only (condition stays `True`) | `zombies`, `kernel-bug`, `soft-lockup` |
- Go 1.21+ (for building)
- kubectl access to EKS cluster
- Cluster must have NMA installed
- For GPU simulations: DCGM must be installed on GPU nodes
```bash
# Run tests
make test

# Run tests with coverage
make test-coverage

# Format code
make fmt

# Run linter
make lint
```

When using `--node`, the CLI:
- Creates a privileged pod on the target node
- Uses `nsenter` to enter the node's namespaces
- Executes simulation commands
- Keeps the pod alive if `--keep-alive` is specified
- Cleans up the pod when done
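A pod with this level of host access looks roughly like the following. This is a minimal sketch for orientation only; the name, image, and sleep duration are placeholders, not what hma-cli actually creates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-shell-sketch        # placeholder, not hma-cli's pod name
spec:
  nodeName: <node-name>          # pin the pod to the target node
  hostPID: true                  # lets nsenter target the host's PID 1
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox             # placeholder image
      command: ["nsenter", "--target", "1", "--mount", "--uts",
                "--ipc", "--net", "--pid", "--", "sleep", "1800"]
      securityContext:
        privileged: true
```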
The NMA monitors:
- **Kernel**: Zombie count (>= 20), PID usage (> 70%), dmesg patterns (`BUG:`, `soft lockup`)
- **Networking**: IPAMD process, interface state
- **Storage**: Per-process I/O delay from `/proc/[PID]/stat` (> 10s)
- **Runtime**: systemd NRestarts counter via dbus (> 3 and increasing)
- **Accelerator**: NVIDIA XID errors via DCGM, Neuron errors via dmesg
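The per-process I/O delay above comes from procfs. A sketch of reading it by hand, based on proc(5): field 42 of `/proc/<pid>/stat` is `delayacct_blkio_ticks`, in clock ticks (field numbering assumes the process name contains no spaces; this is not the NMA's code):

```shell
# Print a process's cumulative block-I/O delay in whole seconds.
io_delay_seconds() {
  ticks=$(awk '{print $42}' "/proc/$1/stat")
  hz=$(getconf CLK_TCK)       # clock ticks per second
  echo $((ticks / hz))
}

io_delay_seconds $$   # this shell's own I/O delay
```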
Apache 2.0