A CLI tool to simulate failure conditions on EKS worker nodes for testing the EKS Node Monitoring Agent (NMA).

The NMA continuously monitors node health across five categories and can trigger automatic node repairs. This CLI helps you test the NMA's detection capabilities by simulating various failure conditions.
```bash
# Clone and build
git clone https://github.com/jicowan/hma-cli.git
cd hma-cli
make build

# Or install directly
go install github.com/jicowan/hma-cli/cmd/hma-cli@latest

# Build Linux binaries
make build-linux
# Creates: hma-cli-linux-amd64, hma-cli-linux-arm64
```

Usage:

```bash
hma-cli --node <node-name> <category> <failure-type> [flags]
```

**Important**: The `--node` flag is required for all simulations. The CLI creates a privileged pod on the target node to execute commands.
| Flag | Description |
|---|---|
| `--node` | **Required.** Target node name (creates privileged pod automatically) |
| `--keep-alive` | Keep pod alive for duration (e.g., `30m`). Required for process-based simulations. |
| `--kubeconfig` | Path to kubeconfig (default: `~/.kube/config`) |
| `--dry-run` | Show what would happen without executing |
| `--force` | Skip confirmation prompts |
| `--cleanup` | Revert simulation. Needed for: `pid-exhaustion`, `interface-down` |
```bash
# Create zombie processes (threshold: >= 20)
# Requires --keep-alive to prevent processes from being killed on pod exit
hma-cli --node <node-name> kernel zombies --count 25 --keep-alive 30m --force
```
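Under the hood, a zombie is just a dead child whose parent never called `wait()`. You can create one by hand to see what the agent counts; this is an illustrative sketch, not how hma-cli does it:

```shell
# Count processes currently in state Z by scanning /proc.
count_zombies() {
  grep -l '^State:.*(zombie)' /proc/[0-9]*/status 2>/dev/null | wc -l
}

before=$(count_zombies)
# `true` exits immediately; after `exec` the shell's process image is
# `sleep`, which never calls wait(), so the dead child lingers as a
# zombie until the sleep ends and init reaps it.
sh -c 'true & exec sleep 3' &
sleep 1   # give the child time to exit
after=$(count_zombies)
echo "zombies before=$before after=$after"
```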
```bash
# Exhaust PIDs (threshold: > 70% of MAX(pid_max, threads-max))
# Requires --keep-alive to keep sleep processes alive
hma-cli --node <node-name> kernel pid-exhaustion --keep-alive 30m --force
```
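The PID-usage threshold can be checked by hand on a node. A sketch using the limit formula stated above (`max_limit` is an illustrative helper, not hma-cli code):

```shell
# The NMA-style limit is MAX(pid_max, threads-max); flag usage > 70%.
max_limit() {
  if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

pid_max=$(cat /proc/sys/kernel/pid_max)
threads_max=$(cat /proc/sys/kernel/threads-max)
limit=$(max_limit "$pid_max" "$threads_max")
in_use=$(ls /proc | grep -c '^[0-9]')   # numeric /proc entries = live PIDs
pct=$((in_use * 100 / limit))
echo "PIDs in use: $in_use / $limit ($pct%)"
if [ "$pct" -gt 70 ]; then
  echo "over 70%: the NMA would flag this node"
fi
```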
```bash
# Inject kernel log patterns (dmesg injection, no --keep-alive needed)
hma-cli --node <node-name> kernel kernel-bug --force   # Creates Warning event
hma-cli --node <node-name> kernel soft-lockup --force  # Creates Warning event

# Exhaust PIDs to cause kubelet fork failures - triggers KernelReady=False
# WARNING: This may make the node unrecoverable and require node replacement!
hma-cli --node <node-name> kernel fork-oom --force
```

> **Warning**: The `fork-oom` simulation exhausts node PIDs and may make the node unrecoverable. The node may need to be deleted and replaced after running this simulation. Only use on nodes you can afford to lose.
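To confirm a dmesg injection landed, you can grep the kernel ring buffer for the patterns this README says the NMA watches (`BUG:`, `soft lockup`). A sketch; the agent's real pattern set may be broader:

```shell
# Return success when a log line matches an NMA-watched kernel pattern.
matches_nma_pattern() {
  printf '%s\n' "$1" | grep -qE 'BUG:|soft lockup'
}

# Scan the live kernel ring buffer (usually needs root):
dmesg 2>/dev/null | grep -E 'BUG:|soft lockup' | tail -n 5
```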
```bash
# Kill IPAMD repeatedly (NMA requires 5 restarts to trigger condition)
# Requires --keep-alive to allow background kill loop to run
hma-cli --node <node-name> networking ipamd-down --keep-alive 10m --force

# Bring down secondary ENI (auto-detects eth1/ens6)
# Note: May not trigger NMA condition change in all configurations
hma-cli --node <node-name> networking interface-down --force
```

```bash
# Create I/O delay process (NMA checks every 10 minutes)
# Requires --keep-alive for at least 15 minutes
hma-cli --node <node-name> storage io-delay --keep-alive 15m --force
```

```bash
# Kill kubelet process repeatedly to increment NRestarts counter
# NMA threshold: NRestarts > 3 AND increasing
# Requires --keep-alive to complete all kills
hma-cli --node <node-name> runtime systemd-restarts --keep-alive 10m --force
```

```bash
# Inject NVIDIA XID error (requires DCGM installed)
# Fatal codes: 13, 31, 48, 63, 64, 74, 79, 94, 95, 119, 120, 121, 140
hma-cli --node <node-name> accelerator xid-error --code 79 --force

# Inject AWS Neuron errors (dmesg injection)
hma-cli --node <node-name> accelerator neuron-sram-error --force
hma-cli --node <node-name> accelerator neuron-hbm-error --force
hma-cli --node <node-name> accelerator neuron-nc-error --force
hma-cli --node <node-name> accelerator neuron-dma-error --force
```

Create a NodeDiagnostic CR to collect logs from a node. The CLI auto-generates a presigned S3 PUT URL.
```bash
# Create NodeDiagnostic CR (auto-generates presigned URL)
hma-cli diagnose --node <node-name> --bucket my-logs-bucket

# Wait for completion
hma-cli diagnose --node <node-name> --bucket my-logs-bucket --wait

# Check status of existing NodeDiagnostic
hma-cli diagnose --node <node-name> --status

# Create, wait, then delete
hma-cli diagnose --node <node-name> --bucket my-logs-bucket --wait --delete
```

The logs are uploaded to: `s3://<bucket>/<timestamp>/<node-name>/logs.tar.gz`

After completion, download with:

```bash
aws s3 cp s3://my-logs-bucket/2026-03-17T15-30-00Z/ip-10-0-1-123.ec2.internal/logs.tar.gz ./logs.tar.gz
```

Many simulations run in the foreground and require `--keep-alive` to keep the node-shell pod running. The simulation runs continuously until the keep-alive duration expires or you press Ctrl+C.

Some simulations modify persistent system state that survives pod deletion. Use `--cleanup` to revert these changes.
| Simulation | Needs `--keep-alive` | Needs `--cleanup` | Notes |
|---|---|---|---|
| `zombies` | Yes (30m) | No | Processes die with pod |
| `pid-exhaustion` | Yes (30m) | Yes | Lowered pid_max persists |
| `io-delay` | Yes (15m) | No | Worker dies with pod |
| `systemd-restarts` | Yes (10m) | No | Kubelet auto-restarts |
| `fork-oom` | No | No | Node may be unrecoverable |
| `kernel-bug` | No | No | Dmesg injection is instant |
| `soft-lockup` | No | No | Dmesg injection is instant |
| `ipamd-down` | Yes (10m) | No | NMA needs 5 restarts (MinOccurrences) |
| `interface-down` | No | Yes | Auto-detects eth1/ens6 |
| `neuron-*` | No | No | Dmesg can't be cleaned |
List available simulations:

```bash
hma-cli list
```

See what a simulation would do without executing:

```bash
hma-cli --node <node-name> kernel zombies --dry-run
```

Cleanup is only needed for simulations that modify persistent system state:
```bash
# Restore pid_max and threads-max after pid-exhaustion
hma-cli --node <node-name> kernel pid-exhaustion --cleanup --force

# Bring interface back up after interface-down
hma-cli --node <node-name> networking interface-down --cleanup --force
```

For other simulations, the pod exit handles cleanup automatically.
After running a simulation, verify NMA detection:
```bash
# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type | test("Kernel|Network|Storage|Runtime|Accelerated"))'
```
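Condition changes can take a while to appear, so a small polling loop is handy. A sketch (`wait_for_condition` is an illustrative helper, not part of hma-cli; `KernelReady` is one of the NMA conditions mentioned in this README):

```shell
# Poll until the given node condition's status leaves "True", or time out.
# Usage: wait_for_condition <node> <condition-type> [timeout-seconds]
wait_for_condition() {
  node=$1; cond=$2; timeout=${3:-600}; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    status=$(kubectl get node "$node" \
      -o jsonpath="{.status.conditions[?(@.type=='$cond')].status}")
    if [ -n "$status" ] && [ "$status" != "True" ]; then
      echo "$cond is now $status after ${waited}s"
      return 0
    fi
    sleep 15
    waited=$((waited + 15))
  done
  echo "timed out after ${timeout}s: $cond still True"
  return 1
}

# Example: wait_for_condition <node-name> KernelReady 900
```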
```bash
# Check node events (for Warning-level detections)
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
```
```bash
# Check NMA logs on the node
NMA_POD=$(kubectl get pods -n kube-system -o wide | grep eks-node-monitoring | grep <node-name> | awk '{print $1}')
kubectl logs -n kube-system $NMA_POD --tail=100
```

The NMA has two detection levels:
| Level | Effect | Examples |
|---|---|---|
| CONDITION | Changes node condition to `False` | `ipamd-down`, `interface-down`, `neuron-*` |
| EVENT | Creates Warning event only (condition stays `True`) | `zombies`, `kernel-bug`, `soft-lockup` |
- Go 1.21+ (for building)
- kubectl access to EKS cluster
- Cluster must have NMA installed
- For GPU simulations: DCGM must be installed on GPU nodes
```bash
# Run tests
make test

# Run tests with coverage
make test-coverage

# Format code
make fmt

# Run linter
make lint
```

When using `--node`, the CLI:
- Creates a privileged pod on the target node
- Uses `nsenter` to enter the node's namespaces
- Executes simulation commands
- Keeps the pod alive if `--keep-alive` is specified
- Cleans up the pod when done
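A pod with this level of host access looks roughly like the following. This is a minimal sketch for orientation only; the name, image, and sleep duration are placeholders, not what hma-cli actually creates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-shell-sketch        # placeholder, not hma-cli's pod name
spec:
  nodeName: <node-name>          # pin the pod to the target node
  hostPID: true                  # lets nsenter target the host's PID 1
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox             # placeholder image
      command: ["nsenter", "--target", "1", "--mount", "--uts",
                "--ipc", "--net", "--pid", "--", "sleep", "1800"]
      securityContext:
        privileged: true
```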
The NMA monitors:
- **Kernel**: Zombie count (>= 20), PID usage (> 70%), dmesg patterns (`BUG:`, `soft lockup`)
- **Networking**: IPAMD process, interface state
- **Storage**: Per-process I/O delay from `/proc/[PID]/stat` (> 10s)
- **Runtime**: systemd NRestarts counter via dbus (> 3 and increasing)
- **Accelerator**: NVIDIA XID errors via DCGM, Neuron errors via dmesg
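The per-process I/O delay above comes from procfs. A sketch of reading it by hand, based on proc(5): field 42 of `/proc/<pid>/stat` is `delayacct_blkio_ticks`, in clock ticks (field numbering assumes the process name contains no spaces; this is not the NMA's code):

```shell
# Print a process's cumulative block-I/O delay in whole seconds.
io_delay_seconds() {
  ticks=$(awk '{print $42}' "/proc/$1/stat")
  hz=$(getconf CLK_TCK)       # clock ticks per second
  echo $((ticks / hz))
}

io_delay_seconds $$   # this shell's own I/O delay
```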
Apache 2.0