25 changes: 25 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1917,6 +1917,31 @@ kimik2.5-int4-h200-vllm:
    search-space:
      - { tp: 8, conc-start: 4, conc-end: 64 }

minimaxm2.5-fp4-b200-vllm:
  image: vllm/vllm-openai:v0.18.0
  model: nvidia/MiniMax-M2.5-NVFP4
  model-prefix: minimaxm2.5
  runner: b200
  precision: fp4
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }
        - { tp: 2, ep: 2, conc-start: 4, conc-end: 64 }
    - isl: 1024
      osl: 8192
      search-space:
        - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }
        - { tp: 2, ep: 2, conc-start: 4, conc-end: 64 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }
        - { tp: 2, ep: 2, conc-start: 4, conc-end: 64 }

kimik2.5-fp4-b200-vllm:
  image: vllm/vllm-openai:v0.17.0
  model: nvidia/Kimi-K2.5-NVFP4
77 changes: 77 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp4_b200.sh
@@ -0,0 +1,77 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  MAX_MODEL_LEN \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME

Check failure on line 13 in benchmarks/single_node/minimaxm2.5_fp4_b200.sh

Claude / Claude Code Review

EP_SIZE missing from check_env_vars

The new `minimaxm2.5_fp4_b200.sh` script uses `EP_SIZE` to conditionally enable expert parallelism (line 30) but omits it from the `check_env_vars` validation call (lines 5-13). If `EP_SIZE` is not injected by the workflow, expert parallelism is silently disabled—producing incorrect, suboptimal benchmark results with no error. Add `EP_SIZE` to `check_env_vars` to match every other MiniMax benchmark script (fp8_b200, fp8_h100, fp8_h200, fp8_mi355x, fp8_mi325x).
Comment on lines +5 to +13

Extended reasoning...

What the bug is and how it manifests

The script benchmarks/single_node/minimaxm2.5_fp4_b200.sh references EP_SIZE on line 30 via ${EP_SIZE:-1} to decide whether to pass --enable-expert-parallel to the vLLM server. However, EP_SIZE is not listed in the check_env_vars call at lines 5–13. The check_env_vars function is the repository's standard mechanism for early-failing when a required environment variable is missing—if it's not listed there, a missing value is never caught.

The specific code path that triggers it

All six search-space entries in nvidia-master.yaml for minimaxm2.5-fp4-b200-vllm specify ep: 4 or ep: 2. The CI harness maps ep: to the EP_SIZE environment variable (via benchmark-tmpl.yml EP_SIZE: ${{ inputs.ep }}). If that injection is absent—due to a workflow misconfiguration, a manual execution, or a future refactor that omits the variable—the ${EP_SIZE:-1} default silently evaluates to 1, the if condition is false, and --enable-expert-parallel is never passed to vLLM. The benchmark then runs in standard TP-only mode instead of the intended TP+EP mode.

Why existing code doesn't prevent it

The check_env_vars guard exists precisely to catch this class of silent misconfiguration. By omitting EP_SIZE from it, the script gives up the only defense against a missing value. Compare the otherwise identical minimaxm2.5_fp8_b200.sh, which lists EP_SIZE as a required variable and uses $EP_SIZE (no default) in the same conditional—meaning a missing value there would both fail the pre-flight check and error at the [ $EP_SIZE -gt 1 ] comparison.

Impact

MiniMax-M2.5 is a large MoE model where expert parallelism (EP) is a key performance lever. Running without EP when EP=4 or EP=2 is expected would significantly degrade benchmark throughput and latency, producing results that understate the hardware's actual capability. Because there is no error message or warning, these incorrect results would silently enter the benchmark record and potentially mislead performance comparisons.

How to fix it

Add EP_SIZE to the check_env_vars call in minimaxm2.5_fp4_b200.sh, and consider removing the :-1 default from ${EP_SIZE:-1} so the script behaves identically to the FP8 variant—failing loudly when EP_SIZE is absent rather than silently degrading.
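A minimal sketch of the proposed fix. The `check_env_vars` implementation in `benchmark_lib.sh` is not shown in this diff, so the function below is an assumed stand-in (fail on any unset or empty variable), and the variable values are illustrative only:

```shell
#!/usr/bin/env bash
# Stand-in for check_env_vars from benchmark_lib.sh (assumed behavior:
# report and fail when any named variable is unset or empty).
check_env_vars() {
  local missing=0
  for var in "$@"; do
    if [ -z "${!var:-}" ]; then
      echo "ERROR: required env var $var is not set" >&2
      missing=1
    fi
  done
  return $missing
}

# Fixed pre-flight list: EP_SIZE is now required, so a forgotten
# injection fails loudly here instead of silently disabling EP.
MODEL=nvidia/MiniMax-M2.5-NVFP4 TP=4 CONC=8 ISL=1024 OSL=1024
MAX_MODEL_LEN=4096 RANDOM_RANGE_RATIO=1.0 RESULT_FILENAME=out.json
unset EP_SIZE
if check_env_vars MODEL TP CONC ISL OSL MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO RESULT_FILENAME EP_SIZE; then
  STATUS="pre-flight passed"
else
  STATUS="pre-flight failed"
fi
echo "$STATUS"   # "pre-flight failed" — EP_SIZE is unset
```

With `EP_SIZE` in the list and the `:-1` default removed from the conditional, the FP4 script would behave the same as the FP8 variant: a missing value aborts before the server starts.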

Step-by-step proof

  1. A developer runs the script manually (or the workflow drops the EP_SIZE injection): EP_SIZE is unset.
  2. check_env_vars is called with MODEL TP CONC ISL OSL MAX_MODEL_LEN RANDOM_RANGE_RATIO RESULT_FILENAME—it exits successfully because EP_SIZE is not listed.
  3. Execution reaches line 30: `if [ "${EP_SIZE:-1}" -gt 1 ]` evaluates as `if [ "1" -gt 1 ]`, which is false.
  4. EP is set to a single space (effectively empty), so --enable-expert-parallel is never appended to the vLLM serve command.
  5. vLLM starts in TP-only mode; the benchmark runs and records results that reflect no expert parallelism, contradicting the intended ep:4/ep:2 configuration—with zero error output.
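The silent default in step 3 can be reproduced in isolation; the conditional below is copied verbatim from the script:

```shell
#!/usr/bin/env bash
# Conditional copied from minimaxm2.5_fp4_b200.sh, run with EP_SIZE
# unset to show the silent TP-only fallback described above.
unset EP_SIZE
if [ "${EP_SIZE:-1}" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi
echo "EP flag: [${EP}]"   # prints "EP flag: [ ]" — expert parallelism off
```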


if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

export TORCH_CUDA_ARCH_LIST="10.0"
export PYTHONNOUSERSITE=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# EP support: conditionally enable expert parallel based on EP_SIZE env var
if [ "${EP_SIZE:-1}" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
  --tensor-parallel-size=$TP \
  --gpu-memory-utilization 0.90 \
  --max-model-len $MAX_MODEL_LEN \
  --max-num-seqs $CONC \
  --no-enable-prefix-caching \
  --compilation_config.pass_config.fuse_allreduce_rms true \
  --trust-remote-code${EP} > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts $(( CONC * 10 )) \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/ \
  --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
  append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1,3 +1,11 @@
- config-keys:
    - minimaxm2.5-fp4-b200-vllm
  description:
    - "Add MiniMax M2.5 NVFP4 single-node B200 vLLM benchmark (TP4, TP2)"
    - "Uses vllm/vllm-openai:v0.18.0 image with --no-enable-prefix-caching"
    - "Concurrency 4-64, all three seq-len configs (1k1k, 1k8k, 8k1k)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Check warning on line 7 in perf-changelog.yaml

Claude / Claude Code Review

perf-changelog pr-link uses XXX placeholder

The new `minimaxm2.5-fp4-b200-vllm` changelog entry uses a placeholder `XXX` in the `pr-link` field instead of the actual PR number (`952`). Update `pr-link` to `https://github.com/SemiAnalysisAI/InferenceX/pull/952` to ensure changelog traceability.

Extended reasoning...

The perf-changelog.yaml entry added in this PR contains pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (line 7), which is a placeholder that was never replaced with the actual PR number.

The specific code path is straightforward: the diff shows the new entry prepended to perf-changelog.yaml with the XXX placeholder, and the PR itself is #952 as confirmed by the PR description ("Closes #951") and the URL of this review.

Existing code does not prevent this because perf-changelog.yaml is a hand-maintained YAML file with no automated validation to catch placeholder values. Other entries in the file (e.g., glm5-fp8-mi355x-atom, minimaxm2.5-fp8-h200-vllm, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang) also contain XXX, confirming this is a known recurring oversight rather than intentional behavior.

The impact is limited to changelog traceability: anyone reviewing the changelog to understand when or why minimaxm2.5-fp4-b200-vllm was introduced will follow a broken link. The benchmark itself functions correctly; only the audit trail is broken.

Step-by-step proof:

  1. Open perf-changelog.yaml and look at the first entry (lines 1–8).
  2. The pr-link field reads https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  3. Navigating to that URL returns a 404 or incorrect page since XXX is not a valid PR number.
  4. The correct URL is https://github.com/SemiAnalysisAI/InferenceX/pull/952, which is the PR being reviewed here.

Fix: replace XXX with 952 on line 7 of perf-changelog.yaml.
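One way to close the validation gap noted above is a pre-merge guard that rejects placeholder links. This is a hypothetical check, not part of the repository today; the function name is invented for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical CI guard (not in the repo): reject changelog entries
# whose pr-link still carries the XXX placeholder.
check_changelog_links() {
  local changelog=$1
  if grep -n 'pull/XXX' "$changelog"; then
    echo "ERROR: placeholder pr-link found in $changelog" >&2
    return 1
  fi
  echo "changelog pr-links OK"
}
```

Wired into the existing check suite, a guard like this would catch the recurring oversight across all hand-edited perf-changelog.yaml entries rather than relying on reviewers to spot it.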


- config-keys:
    - dsr1-fp8-b200-dynamo-trt
    - dsr1-fp8-h200-dynamo-trt