Develop/logprobs opt #791

Open
zongyuanwu wants to merge 4 commits into main from develop/logprobs_opt

Conversation

@zongyuanwu
Collaborator

fix logprobs custom render

@zongyuanwu zongyuanwu requested a review from LLLLKKKK as a code owner March 17, 2026 08:22
Copilot AI review requested due to automatic review settings March 17, 2026 08:22
Contributor

Copilot AI left a comment

Pull request overview

Fixes incorrect logprobs rendering by ensuring the selected token’s probability is used and that probability tensors are propagated through the RPC response path.

Changes:

  • Use selected_id (instead of output_id) when computing the chosen token’s logprob in custom OpenAI renderers.
  • Populate all_probs in the flattened RPC response so downstream logprobs rendering has the required data.
  • Add CLAUDE.md with build/test/architecture guidance for repository tooling.
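For reference, the first change above (indexing with selected_id rather than output_id) amounts to looking the sampled token up in the per-step distribution. A minimal sketch, assuming all_probs is a 1-D probability row over the vocabulary; this is illustrative, not the renderer's actual code:

```python
import math

import torch

def token_logprob(all_probs: torch.Tensor, selected_id: int) -> float:
    """Log-probability of the token that was actually sampled at this step.

    The bug being fixed was indexing with the positional output_id instead of
    the sampled token id (selected_id), which returned the wrong probability.
    """
    prob = all_probs[selected_id].item()
    return math.log(prob) if prob > 0 else float("-inf")
```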

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
rtp_llm/openai/renderers/custom_renderer.py Fixes logprob lookup to use the selected token index.
rtp_llm/cpp/model_rpc/QueryConverter.cc Adds all_probs stacking into the response payload for logprobs support.
CLAUDE.md Adds developer documentation for build/test workflow and architecture.

Comment thread rtp_llm/cpp/model_rpc/QueryConverter.cc Outdated
Comment on lines +462 to +463
stackBuffersToTensorPB(
    flatten_output->mutable_all_probs(), source_outputs, [](const auto& r) { return r.aux_info.all_probs; });
Comment thread CLAUDE.md
bazelisk test //rtp_llm/cpp/core/test/... --config=cuda12

# Run with ASAN
bazelisk test //rtp_llm/cpp/core/test:... --config=cuda12 --config=asan
Copilot AI review requested due to automatic review settings March 17, 2026 13:38
Contributor

Copilot AI left a comment

Pull request overview

This pull request refactors “logprobs” support across the Python + C++ stack to return top‑k logprobs and token ids instead of a full-vocab all_probs tensor, plumbing the new fields through configs, RPC/proto, sampling, and OpenAI renderers.

Changes:

  • Add top_logprobs to generation config and propagate it through Python endpoint, C++ endpoint, and RPC proto/client conversion.
  • Replace all_probs usage with topk_logprobs + topk_token_ids in OpenAI rendering and in C++ stream aux info.
  • Update CUDA sampling to compute top‑k logprobs/token ids on GPU and pass them through normal engine streaming.
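As a rough torch-level illustration of the new contract (the real computation happens in the CUDA sampler, not in Python), the top-k outputs replace the full [batch, vocab] all_probs tensor with two [batch, k] tensors; the names and shapes here are assumptions based on the summary above:

```python
import torch

def topk_logprobs_from_logits(logits: torch.Tensor, k: int):
    """Sketch of the new output contract: [batch, vocab] logits in,
    two [batch, k] tensors (logprobs and token ids) out, instead of a
    full [batch, vocab] probability tensor."""
    logprobs = torch.log_softmax(logits, dim=-1)
    topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
    return topk_logprobs, topk_token_ids
```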

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.

File Description
rtp_llm/utils/base_model_datatypes.py Adds topk_logprobs/topk_token_ids to GenerateOutput and deprecates all_probs.
rtp_llm/openai/renderers/custom_renderer.py Switches OpenAI logprob rendering to consume top‑k tensors; updates stream/sync plumbing signatures.
rtp_llm/openai/openai_endpoint.py Sets GenerateConfig.top_logprobs when logprobs is requested.
rtp_llm/cpp/test/SamplerTest.cc Updates SamplerInputs construction for new top‑k fields.
rtp_llm/cpp/normal_engine/speculative/MtpExecutor.cc Adjusts fake stream update struct initialization for updated StreamUpdateInfo.
rtp_llm/cpp/normal_engine/NormalGenerateStream.cc Stores top‑k tensors into aux info when return_all_probs is enabled.
rtp_llm/cpp/normal_engine/NormalEngine.cc Updates fake stream StreamUpdateInfo initialization for new struct layout.
rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc Allocates/dispatches top‑k buffers instead of full all_probs.
rtp_llm/cpp/models/Sampler.cc Threads new optional top‑k output buffers into device sampling calls.
rtp_llm/cpp/models/SampleInfos.h Extends SamplerInputs/SamplerOutput with top‑k buffers and top_logprobs.
rtp_llm/cpp/model_rpc/QueryConverter.cc Adds top‑k tensors to RPC response flattening and top_logprobs to config conversion.
rtp_llm/cpp/model_rpc/proto/model_rpc_service.proto Adds top_logprobs to config PB and top‑k tensors to output PB.
rtp_llm/cpp/model_rpc/model_rpc_client.py Reads top‑k tensors from RPC outputs and sets them on Python GenerateOutput.
rtp_llm/cpp/engine_base/stream/StreamGroups.h Computes max top_logprobs across streams to size batch allocations.
rtp_llm/cpp/engine_base/stream/GenerateTypes.h Replaces AuxInfo.all_probs with topk_logprobs/topk_token_ids.
rtp_llm/cpp/engine_base/stream/GenerateStream.h Updates StreamUpdateInfo to carry top‑k buffers; extends speculative output struct.
rtp_llm/cpp/engine_base/stream/GenerateStream.cc Adjusts speculative update call to match new StreamUpdateInfo layout.
rtp_llm/cpp/engine_base/stream/GenerateConfig.h Adds top_logprobs to C++ generate config and JSON serialization.
rtp_llm/cpp/devices/OpData.h Extends GreedyParams with output top‑k buffers.
rtp_llm/cpp/devices/cuda_impl/CudaSampleOp.cc Computes renormalized probs, top‑k logprobs/token ids on GPU, and copies outputs.
rtp_llm/cpp/api_server/openai/OpenaiEndpoint.cc Populates GenerateConfig.top_logprobs from request.
rtp_llm/cpp/api_server/openai/ChatRender.cc Passes top‑k tensors to Python renderer calls instead of all_probs.
rtp_llm/config/generate_config.py Adds top_logprobs to Python GenerateConfig; formatting/validator tweaks.
CLAUDE.md Adds repository build/test/architecture guidance documentation.

for i in range(topk_logprobs.shape[0]):
    if topk_token_ids[i].item() == selected_id:
        selected_logprob = topk_logprobs[i].item()
        break
for i in range(topk_logprobs_tensor.shape[0]):
    if topk_token_ids_tensor[i].item() == selected_id:
        selected_logprob = topk_logprobs_tensor[i].item()
        break
Comment on lines 1190 to 1194
def _flush_buffer_sync(
    self,
    buffer_list: List[StreamStatusSync],
    input_len_list,
    output_len_list,
auto& top_k = params.top_k;
if (std::all_of(top_k.data<uint32_t>(), top_k.data<uint32_t>() + batch_size, [&](auto t) { return t == 1; })
&& !params.output_all_probs.has_value()) {
&& !params.output_topk_logprobs.has_value() && !params.cum_log_probs.has_value()) {
auto output_topk_logprobs_t = Buffer2torchTensor(*params.output_topk_logprobs, false);
auto output_topk_token_ids_t = Buffer2torchTensor(*params.output_topk_token_ids, false);
int k = (int)params.output_topk_logprobs->get().shape()[1];
auto [topk_vals, topk_ids] = torch::topk(renorm_probs_t, k, -1, true, true);
Comment on lines +124 to +125
if (!update_info.topk_logprobs) {
    throw std::runtime_error("topk_logprobs is null while generate_config return_all_probs is true");
Comment on lines +561 to +565
if (return_all_probs && sampler_output.topk_logprobs) {
    topk_logprobs = sampler_output.topk_logprobs->slice(batch_idx_out, next_batch_size, false);
    topk_logprobs->updateParent(sampler_output.topk_logprobs);
    topk_token_ids = sampler_output.topk_token_ids->slice(batch_idx_out, next_batch_size, false);
    topk_token_ids->updateParent(sampler_output.topk_token_ids);
Comment on lines +915 to +917
OptionalBufferRef output_topk_logprobs;
OptionalBufferRef output_topk_token_ids;
OptionalBufferRef output_all_probs; // used by speculative decoding
}
if (req.logprobs.has_value()) {
    config.return_all_probs = req.logprobs.value();
    config.top_logprobs = req.top_logprobs.value_or(1);
@LLLLKKKK
Collaborator

🤖 AI Code Review

Overview

Replaces the prepare_cuda_graph method of the FlashInferTRTLLM family of Impl classes: instead of the Python-level prepare() + tensor.copy_() pattern, a Triton kernel now computes directly on the GPU, cutting Python overhead and CPU-GPU synchronization on the CUDA graph replay path. The tests are also refactored, and a pre-existing test-method name collision is fixed.

Strengths

  • The core idea is clear and the Triton kernel design is sound: pid == 0 handles the small seq_lens computation while all pids process the large block_id -> kv_offset conversion in parallel
  • Test coverage is thorough, including basic functionality, large batches, boundary conditions, and multi-block scenarios
  • Fixes the pre-existing bug where the duplicate definition of test_flashinfer_trtllm_prefill_fp8 let the decode fp8 test overwrite the prefill fp8 test

Suggested improvements

P1 - Important

  1. Register pressure of BLOCK_SIZE in _compute_cg_grid for large N: when N is large (e.g. 2048), tl.arange(0, 2048) inside a single block puts notable pressure on registers. Add an assert or document the expected upper bound on N.

  2. The semantic assumption behind sequence_lengths_plus_1_d should be documented: the decode branch of FlashInferTRTLLMSpecDecodeImpl.prepare_cuda_graph relies on sequence_lengths_plus_1_d being equivalent to sequence_lengths + 1. Add a comment stating this dependency.

  3. N/M in _DecodeCGParams / _PrefillCGParams are fixed at CUDA graph replay time: this matches CUDA graph's fixed-shape constraint, but a comment noting the limitation would help.

P2 - Suggestions

  1. Implicit tensor sharing: _cg.kv_cache_offset and self.rope_params.kv_cache_offset are the same tensor, and _cg.seq_lens / cu_kv_seqlens are the same tensors as those in self.fmha_params. Writes from the Triton kernel directly affect downstream behavior. Add comments documenting these implicit sharing relationships.

Conclusion

A high-quality performance optimization PR with no blocking issues. LGTM; adding the comments above would reduce future maintenance risk.

@LLLLKKKK
Collaborator

🤖 AI Code Review (corrected — the previous comment accidentally pasted content from another PR; apologies)

Overview

Optimizes the logprobs feature: instead of shipping a full vocab-size probability vector, only the top-k results (topk_logprobs + topk_token_ids) are transferred, which substantially reduces memory and bandwidth overhead. The change covers the full data path (sampling → stream → RPC → rendering).

Strengths

  • The optimization direction is right: from O(vocab_size) down to O(top_k)
  • Keeps the all_probs path that speculative decoding needs
  • Adds a smoke test covering the basic functionality

Suggested improvements

P1 - Important

  1. logprob is -inf when the selected token is not in the top-k: when the sampling top_k (e.g. 50) is larger than the requested top_logprobs (e.g. 3), the sampled token may fall outside the returned top 3, so selected_logprob becomes -inf, which does not match the OpenAI API spec. Consider additionally returning selected_token_logprob (see the sketch after this list).
  2. return_all_probs=true + top_logprobs=0 causes a runtime crash: setting return_all_probs=true through a non-OpenAI path without setting top_logprobs leaves the buffer unallocated, but downstream code throws.
  3. top_logprobs lacks an upper-bound check: callers can pass arbitrarily large values; limit it to 0-20 (per the OpenAI spec).
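A minimal sketch of the selected_token_logprob idea from item 1, assuming the sampler still holds the full log-probability rows when it samples; the extra output field and the shapes are assumptions, not code from this PR:

```python
import torch

def sample_outputs_with_selected_logprob(logprobs: torch.Tensor,
                                         selected_ids: torch.Tensor,
                                         k: int):
    """logprobs: [batch, vocab] log-softmax rows; selected_ids: [batch] int64 sampled ids.

    Returns the top-k tensors plus the sampled token's own logprob, so the
    renderer never has to fall back to -inf when the sampled token lies
    outside the reported top-k."""
    topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
    selected_token_logprob = logprobs.gather(-1, selected_ids.unsqueeze(-1)).squeeze(-1)
    return topk_logprobs, topk_token_ids, selected_token_logprob
```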

P2 - Suggestions

  • MtpBatchStreamProcessor has not been adapted to topk, so logprobs are not returned under speculative decoding
  • StreamUpdateInfo uses positional initialization (11-13 nullptrs), which is very error-prone; prefer named initialization

Conclusion

A performance optimization headed in the right direction. Fix the -inf selected-token logprob issue and the boundary conditions before merging.

@LLLLKKKK
Collaborator

LLLLKKKK commented Apr 9, 2026

🤖 AI Code Review — PR #791
Head SHA: f0420b41 | Verdict: P2

Summary

Replaces full vocab-size all_probs tensor [batch_size, vocab_size] with compact top-k representation (topk_logprobs + topk_token_ids, each [batch_size, k]). Significant memory/bandwidth optimization across the full stack: config, sampler, CUDA sampling, stream update, RPC proto, OpenAI rendering.

Findings

  1. [P2] Speculative decoding all_probs path may be stale — all_probs is kept for speculative decoding, but gatherSamplerInput no longer allocates it in the normal path. Verify speculative decoding tests still pass.

  2. [P2] top_logprobs=0 with return_all_probs=true silently produces no logprobs — In gatherSamplerInput, condition is if (return_all_probs && max_top_logprobs > 0). If top_logprobs defaults to 0, no buffers are allocated. The Python OpenAI endpoint covers this (top_logprobs or 1), but other entry points (direct gRPC) may not. Consider defaulting to 1 when return_all_probs is true.

  3. [P2] selected_logprob defaults to -inf when selected token not in top-k — In custom_renderer.py, if the sampled token isn't in top-k, logprob stays -inf. Acceptable per OpenAI spec but differs from previous behavior.

  4. [P2] Unrelated CLAUDE.md included in diff — Should be a separate PR.

Generated by AI Code Review

@LLLLKKKK
Collaborator

AI Code Review — PR #791

Overview

Refactors the logprobs return path: instead of a full probability tensor on the Python side (all_probs, shape [batch_size, vocab_size]), the C++ Sampler now computes top-k logprobs (shape [batch_size, k]) directly on the GPU, reducing GPU→CPU data transfer.

Issues

  1. top_logprobs > vocab_size crashes torch::topk [P1]

    In CudaSampleOp.cc, torch::topk(renorm_probs_t, k) throws a runtime error when k exceeds vocab_size. In NormalBatchStreamProcessor.cc, max_top_logprobs is taken straight from the user request with no min clamp against vocab_size.

    Suggestion: clamp with k = std::min(k, vocab_size) in gatherSamplerInput, or validate during request parsing (the OpenAI API spec caps top_logprobs at 20).

  2. StreamUpdateInfo aggregate initialization is fragile [P2]

    StreamUpdateInfo uses positional initialization; after this PR splits the fields, several call sites must pad nullptrs by hand. Prefer designated initializers.

  3. The Python-side selected_logprob lookup uses a for loop [P2]

    custom_renderer.py finds the selected token's logprob with a Python for loop; a tensor operation would do (see the sketch after this list).

  4. _generate_log_probs and _generate_log_probs_sync duplicate the same logic [P2]

    The async and sync logprobs-generation paths are nearly identical; extract a shared helper.
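A possible tensor-based replacement for the loop mentioned in item 3, assuming 1-D [k] tensors per generated token; a sketch, not the PR's implementation:

```python
import torch

def lookup_selected_logprob(topk_logprobs: torch.Tensor,
                            topk_token_ids: torch.Tensor,
                            selected_id: int) -> float:
    """Vectorized lookup of the sampled token's logprob within the top-k,
    replacing the per-element Python loop in custom_renderer.py."""
    mask = topk_token_ids == selected_id
    if mask.any():
        return topk_logprobs[mask][0].item()
    return float("-inf")
```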

Minor issues [P3]

  • Formatting changes are mixed in with functional changes
  • The "add claude.md" commit is unrelated to the PR's purpose
  • The PR description is too brief
  • return_all_probs=true with top_logprobs=0 can trigger a runtime_error

Highlights

  • Top-k computation is pushed down to the GPU, reducing data transfer
  • The full pipeline is covered end to end
  • The backward-compatible design for speculative decoding is sound
  • A smoke test is added

Conclusion

There is a P1 issue (missing top_logprobs bounds check), so merging is not recommended yet. Once fixed, the PR can proceed to CI validation.

@LLLLKKKK
Collaborator

AI Code Review — PR #791

Summary: P0/0 · P1/1 · P2/2 · P3/1

Review status: review_blocking

P1 Blocking

  • return_all_probs=true with default top_logprobs=0 causes runtime crash @ rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc:334
    • suggestion: Either force top_logprobs >= 1 when return_all_probs is true (e.g. in GenerateConfig validation or at the allocation site: if (return_all_probs) { max_top_logprobs = std::max(max_top_logprobs, 1); }), or relax the throw in NormalGenerateStream to skip topk output when top_logprobs == 0. The OpenAI endpoints already set top_logprobs >= 1, but direct GenerateConfig users (gRPC, internal callers) hit this crash path.
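One possible shape for this guard, sketched as a Python-side validation helper; the function name, the cap of 20, and the vocab_size clamp are assumptions pulled together from this and earlier comments, not code from the PR:

```python
def clamp_top_logprobs(return_all_probs: bool, top_logprobs: int,
                       vocab_size: int, max_allowed: int = 20) -> int:
    """Force at least one top-k slot when logprobs are requested, and cap the
    value: the OpenAI API allows at most 20, and torch::topk requires k <= vocab_size."""
    if return_all_probs:
        top_logprobs = max(top_logprobs, 1)
    return min(top_logprobs, max_allowed, vocab_size)
```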

P2 Important

  • Speculative decoding all_probs buffer no longer allocated for logprobs requests @ rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc:334
    • suggestion: Verify that speculative decoding paths (MTP, Eagle3) never co-occur with return_all_probs=true, or add back the all_probs allocation when speculative decoding is active. If speculative decoding has its own allocation path, document that assumption.
  • topk_logprobs stores log of renormalized probs, not log of softmax probs @ rtp_llm/cpp/devices/cuda_impl/CudaSampleOp.cc:207
    • suggestion: Consider computing topk from the original softmax probs (probs_t) rather than renorm_probs_t. The renormalized distribution zeros out tokens outside the top-k/top-p window, inflating the reported probabilities of surviving tokens. Use torch::topk(probs_t.log(), k, -1, true, true) for OpenAI-compatible logprobs.
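A small torch sketch of why this matters: renormalizing over the sampling window before taking logs inflates the surviving tokens' reported probabilities relative to the model's softmax (illustrative numbers only, not the kernel's code):

```python
import torch

probs = torch.softmax(torch.tensor([[2.0, 1.0, 0.5, -3.0]]), dim=-1)

# Zero out everything outside a top-2 window and renormalize, as a stand-in
# for the top-k/top-p renormalization done before sampling.
renorm = probs.clone()
renorm[:, 2:] = 0
renorm = renorm / renorm.sum(dim=-1, keepdim=True)

print(torch.topk(probs.log(), 2, dim=-1).values)   # logprobs of the true softmax
print(torch.topk(renorm.log(), 2, dim=-1).values)  # inflated after renormalization
```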

P3 Notes

  • selected_logprob defaults to -inf when selected token not in topk @ rtp_llm/openai/renderers/custom_renderer.py:530
    • suggestion: This is an inherent limitation of the topk-only approach. Consider documenting this behavior, or ensure top_logprobs is always large enough to include the sampled token (which it should be for greedy/top-k but may not be for pure top-p).

Strengths

  • Significant memory optimization: replaces full vocab-size all_probs buffer (batch x vocab_size floats) with compact topk buffers (batch x k), reducing GPU memory pressure for logprobs requests
  • Clean separation of topk logprobs from speculative decoding all_probs, making the two use cases independent
  • Consistent plumbing across the full stack: proto, gRPC client, C++ engine, Python renderer all updated coherently
  • Moves topk computation to GPU (torch::topk in CudaSampleOp) instead of doing it on CPU in the Python renderer
