test: add a diagnostic script for prefix caching nan #1987
Conversation
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 Walkthrough

Documentation updated to include a new diagnostic section for prefix caching NaN logprobs validation. A new Python script is added to tools/model_diagnostics/ that reproduces and validates prefix caching behavior in vLLM, including multi-iteration generation with prefix cache reuse and NaN logprob detection.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 inconclusive
Actionable comments posted: 3
🧹 Nitpick comments (2)
tools/model_diagnostics/5.prefix_caching_nan.py (2)
75-84: Unconditional `break` silently under-counts NaNs if `logprobs > 1`.

The `break` at line 84 exits after inspecting only the first token-id entry per step, regardless of whether a NaN was found. With `logprobs=1` and `temperature=0.0` (greedy), each step's dict has exactly one entry, so this is functionally correct today. However, vLLM returns up to `logprobs+1` elements per step, meaning if `logprobs` is ever bumped above 1, the counter would silently undercount NaN occurrences (at most 1 per step). Removing the `break` makes the intent clear and future-proof:

♻️ Proposed fix
```diff
 for _tid, lp_obj in step.items():
     lp = lp_obj.logprob if hasattr(lp_obj, "logprob") else lp_obj
     if isinstance(lp, float) and math.isnan(lp):
         nan_count += 1
-    break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/model_diagnostics/5.prefix_caching_nan.py` around lines 75 - 84, The loop over out2.logprobs currently contains an unconditional break after inspecting the first token-id entry, which causes under-counting NaNs when a step contains multiple entries; remove the break so the inner loop over step.items() examines every lp_obj (keep existing hasattr(lp_obj, "logprob") check and NaN detection for lp) so nan_count increments for every NaN in all token-id entries rather than just the first one per step.
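As a quick standalone check of the fixed logic, here is a sketch against a synthetic logprobs structure mirroring vLLM's shape (a list of per-step dicts mapping token id to an object carrying `.logprob`); the `Logprob` stand-in below is a stub, not vLLM's class:

```python
import math
from types import SimpleNamespace

Logprob = SimpleNamespace  # stub standing in for vLLM's per-token logprob object

# One step whose *second* entry is NaN, as can occur with logprobs > 1.
steps = [
    {101: Logprob(logprob=-0.25), 102: Logprob(logprob=float("nan"))},
]

nan_count = 0
for step in steps:
    for _tid, lp_obj in step.items():  # no break: every entry is inspected
        lp = lp_obj.logprob if hasattr(lp_obj, "logprob") else lp_obj
        if isinstance(lp, float) and math.isnan(lp):
            nan_count += 1

print(nan_count)  # 1; with the unconditional break, only token 101 would
                  # be inspected and the NaN would be missed entirely
```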
35-41: Consider consolidating under `if __name__ == "__main__":` and moving imports to the top.

All module-level logic (argparse, LLM instantiation, generation) runs unconditionally on import. A `__main__` guard is the standard protection against accidental execution when scripts are discovered by tooling. Additionally, the `vllm` import (lines 40–41) appears mid-file after `parse_args()` — while this speeds up `--help`, it diverges from PEP 8 and can surprise readers.

♻️ Suggested structure
```diff
 import argparse
 import math
+from vllm import LLM, SamplingParams
+import vllm

 MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
 ...
-parser = argparse.ArgumentParser()
-parser.add_argument("--model", type=str, default=MODEL)
-parser.add_argument("--tp", type=int, default=TP)
-args = parser.parse_args()
-
-import vllm
-from vllm import LLM, SamplingParams
-
-print(f"vLLM version: {vllm.__version__}")
-...
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", type=str, default=MODEL)
+    parser.add_argument("--tp", type=int, default=TP)
+    args = parser.parse_args()
+
+    print(f"vLLM version: {vllm.__version__}")
+    ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/model_diagnostics/5.prefix_caching_nan.py` around lines 35 - 41, Move all runtime logic (argparse setup and calls using parser/args, LLM instantiation, and generation) under a guarded block: wrap code that calls parser.parse_args(), creates the vllm LLM and SamplingParams, and runs generation inside if __name__ == "__main__":. Also relocate imports (import vllm and from vllm import LLM, SamplingParams) to the top of the file with other imports to follow PEP8; keep only lightweight module-level constants like MODEL and TP outside the guard and ensure no heavy side-effect code runs at import time.
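In general terms, the suggested structure boils down to the following sketch; the `main()` function and the short generation at the end are illustrative, not the script's actual workload:

```python
import argparse

import vllm
from vllm import LLM, SamplingParams

# Lightweight constants may stay at module scope.
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
TP = 2


def main() -> None:
    # All side-effecting logic lives here, so importing the module is safe.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default=MODEL)
    parser.add_argument("--tp", type=int, default=TP)
    args = parser.parse_args()

    print(f"vLLM version: {vllm.__version__}")
    llm = LLM(model=args.model, tensor_parallel_size=args.tp)
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16)
    out = llm.generate(["Hello"], sampling_params)[0].outputs[0]
    print(out.text)


if __name__ == "__main__":
    main()
```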
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/adding-new-models.md`:
- Around line 330-344: The two fenced code blocks under "Expected pass output
(vLLM 0.13.0)" and "Expected fail output (vLLM 0.14.0)" are missing language
specifiers causing markdownlint MD040; update both fences to include a language
(e.g., use ```text) so the pass-output and fail-output blocks explicitly start
with ```text and end with ``` to satisfy MD040 and preserve formatting.
In `@tools/model_diagnostics/5.prefix_caching_nan.py`:
- Line 87: The print statement currently uses an unnecessary f-string: locate
the call print(f"\n Sample logprobs from iteration 2:") and remove the leading
"f" so it becomes a plain string literal; this eliminates the Ruff F541 spurious
f-string warning without changing behavior.
- Around line 29-75: The module defines several module-level mutable bindings
(parser, args, numbers, prompt, llm, sampling_params, out1, out2, nan_count)
which violate the global naming guideline; wrap all runtime code that creates or
mutates these symbols inside an if __name__ == "__main__": block so they become
local to main (keep MODEL, TP, MAX_TOKENS, MAX_MODEL_LEN, COUNT_UP_TO and
imports at module scope), e.g., move creation of
argparse.ArgumentParser()/parser, args = parser.parse_args(), numbers, prompt
construction, LLM() instantiation, SamplingParams(), the two generate calls that
produce out1/out2, and nan_count into that guard; alternatively if you must keep
any of them global, rename using upper snake_case with the G_ prefix (e.g.,
G_PARSER, G_LLM) to satisfy the guideline.
---
Nitpick comments:
In `@tools/model_diagnostics/5.prefix_caching_nan.py`:
- Around line 75-84: The loop over out2.logprobs currently contains an
unconditional break after inspecting the first token-id entry, which causes
under-counting NaNs when a step contains multiple entries; remove the break so
the inner loop over step.items() examines every lp_obj (keep existing
hasattr(lp_obj, "logprob") check and NaN detection for lp) so nan_count
increments for every NaN in all token-id entries rather than just the first one
per step.
- Around line 35-41: Move all runtime logic (argparse setup and calls using
parser/args, LLM instantiation, and generation) under a guarded block: wrap code
that calls parser.parse_args(), creates the vllm LLM and SamplingParams, and
runs generation inside if __name__ == "__main__":. Also relocate imports (import
vllm and from vllm import LLM, SamplingParams) to the top of the file with other
imports to follow PEP8; keep only lightweight module-level constants like MODEL
and TP outside the guard and ensure no heavy side-effect code runs at import
time.
Expected pass output (vLLM 0.13.0):

```
Iteration 1 — prompt length: 13990 chars
  tokens: 2048, finish_reason: length
  text (first 100): '3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 '

Iteration 2 — prompt length: 16038 chars
  tokens: 2048, finish_reason: length
  text (first 100): '1 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 343'

[nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16] ALL GOOD!
```

Expected fail output (vLLM 0.14.0):
Add language specifiers to the two output code blocks (markdownlint MD040).
Both the pass-output and fail-output fenced code blocks are missing a language identifier, triggering MD040.
🐛 Proposed fix
````diff
 Expected pass output (vLLM 0.13.0):
-```
+```text
 Iteration 1 — prompt length: 13990 chars
 ...
 Expected fail output (vLLM 0.14.0):
-```
+```text
 Iteration 1 — prompt length: 13990 chars
 ...
````

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 331-331: Fenced code blocks should have a language specified (MD040, fenced-code-language)
[warning] 344-344: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/adding-new-models.md` around lines 330 - 344, The two fenced code blocks
under "Expected pass output (vLLM 0.13.0)" and "Expected fail output (vLLM
0.14.0)" are missing language specifiers causing markdownlint MD040; update both
fences to include a language (e.g., use ```text) so the pass-output and
fail-output blocks explicitly start with ```text and end with ``` to satisfy
MD040 and preserve formatting.
```python
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
TP = 2
MAX_TOKENS = 2048
MAX_MODEL_LEN = 32768
COUNT_UP_TO = 3000

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default=MODEL)
parser.add_argument("--tp", type=int, default=TP)
args = parser.parse_args()

import vllm
from vllm import LLM, SamplingParams

print(f"vLLM version: {vllm.__version__}")

numbers = " ".join(str(i) for i in range(1, COUNT_UP_TO + 1))
prompt = (
    "You are a counting assistant. Output ONLY numbers separated by spaces.\n\n"
    f"User: Continue counting: {numbers} "
)

llm = LLM(
    model=args.model,
    tensor_parallel_size=args.tp,
    enable_prefix_caching=True,
    max_model_len=MAX_MODEL_LEN,
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=MAX_TOKENS, logprobs=1)

# Iteration 1: initial generation (builds the prefix cache)
print(f"\nIteration 1 — prompt length: {len(prompt)} chars")
out1 = llm.generate([prompt], sampling_params)[0].outputs[0]
print(f"  tokens: {len(out1.token_ids)}, finish_reason: {out1.finish_reason}")
print(f"  text (first 100): {out1.text[:100]!r}")

# Iteration 2: extend prompt with prior output (triggers prefix cache reuse)
prompt += out1.text
print(f"\nIteration 2 — prompt length: {len(prompt)} chars")
out2 = llm.generate([prompt], sampling_params)[0].outputs[0]
print(f"  tokens: {len(out2.token_ids)}, finish_reason: {out2.finish_reason}")
print(f"  text (first 100): {out2.text[:100]!r}")

# Check for NaN logprobs
nan_count = 0
```
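The hunk above ends at `nan_count = 0`; judging from the proposed-fix diff in the first nitpick, the check presumably continues along these lines (a reconstruction from the review fragments, not the verbatim file):

```python
# Reconstructed from the review fragments above, not copied from the file.
for step in out2.logprobs:
    for _tid, lp_obj in step.items():
        lp = lp_obj.logprob if hasattr(lp_obj, "logprob") else lp_obj
        if isinstance(lp, float) and math.isnan(lp):
            nan_count += 1
        break  # the unconditional break the nitpick proposes removing
```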
Module-level mutable variables are missing the G_ prefix required by coding guidelines.
parser, args, numbers, prompt, llm, sampling_params, out1, out2, and nan_count are all module-scope mutable bindings. As per coding guidelines, global variables should use upper snake_case with a G_ prefix (e.g., G_PARSER, G_LLM). The constants MODEL, TP, etc. are already correctly named.
The cleanest resolution is the if __name__ == "__main__": refactor suggested above — that would demote all of these to local variables, eliminating the naming concern entirely.
As per coding guidelines: "Use upper snake_case with G prefix for global variables, e.g., G_MY_GLOBAL"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/model_diagnostics/5.prefix_caching_nan.py` around lines 29 - 75, The
module defines several module-level mutable bindings (parser, args, numbers,
prompt, llm, sampling_params, out1, out2, nan_count) which violate the global
naming guideline; wrap all runtime code that creates or mutates these symbols
inside an if __name__ == "__main__": block so they become local to main (keep
MODEL, TP, MAX_TOKENS, MAX_MODEL_LEN, COUNT_UP_TO and imports at module scope),
e.g., move creation of argparse.ArgumentParser()/parser, args =
parser.parse_args(), numbers, prompt construction, LLM() instantiation,
SamplingParams(), the two generate calls that produce out1/out2, and nan_count
into that guard; alternatively if you must keep any of them global, rename using
upper snake_case with the G_ prefix (e.g., G_PARSER, G_LLM) to satisfy the
guideline.
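For illustration, the `G_`-prefix alternative mentioned at the end would look roughly like this (a sketch of the naming convention only, not a suggested refactor):

```python
import argparse

MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"  # constants keep plain upper snake_case
TP = 2

# Mutable module-level bindings carry the G_ prefix per the guideline.
G_PARSER = argparse.ArgumentParser()
G_PARSER.add_argument("--model", type=str, default=MODEL)
G_PARSER.add_argument("--tp", type=int, default=TP)
G_ARGS = G_PARSER.parse_args()
```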
```python
        break

if nan_count > 0:
    print(f"\n  Sample logprobs from iteration 2:")
```
Remove spurious f prefix — f-string with no placeholders (Ruff F541).
🐛 Proposed fix
```diff
-    print(f"\n  Sample logprobs from iteration 2:")
+    print("\n  Sample logprobs from iteration 2:")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    print("\n  Sample logprobs from iteration 2:")
```
🧰 Tools

🪛 Ruff (0.15.1)

[error] 87-87: f-string without any placeholders. Remove extraneous `f` prefix (F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/model_diagnostics/5.prefix_caching_nan.py` at line 87, The print
statement currently uses an unnecessary f-string: locate the call print(f"\n
Sample logprobs from iteration 2:") and remove the leading "f" so it becomes a
plain string literal; this eliminates the Ruff F541 spurious f-string warning
without changing behavior.
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
```
# Add a code snippet demonstrating how to use this
```
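A plausible invocation, inferred from the script's `--model` and `--tp` flags rather than stated in the PR description: `python tools/model_diagnostics/5.prefix_caching_nan.py --tp 2`.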
Before your PR is "Ready for review"

Pre checks:
Additional Information
Summary by CodeRabbit

Documentation
- Added a diagnostic section covering prefix caching NaN logprobs validation.

New Features
- Added a model-diagnostics script that reproduces prefix caching behavior in vLLM and checks generated logprobs for NaNs.