Develop/logprobs opt #791

Open
zongyuanwu wants to merge 4 commits into main from develop/logprobs_opt

Conversation

@zongyuanwu
Collaborator

fix logprobs custom render

@zongyuanwu zongyuanwu requested a review from LLLLKKKK as a code owner March 17, 2026 08:22
Copilot AI review requested due to automatic review settings March 17, 2026 08:22
Contributor

Copilot AI left a comment

Pull request overview

Fixes incorrect logprobs rendering by ensuring the selected token’s probability is used and that probability tensors are propagated through the RPC response path.

Changes:

  • Use selected_id (instead of output_id) when computing the chosen token’s logprob in custom OpenAI renderers.
  • Populate all_probs in the flattened RPC response so downstream logprobs rendering has the required data.
  • Add CLAUDE.md with build/test/architecture guidance for repository tooling.
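For reference, the first change above (indexing with selected_id rather than output_id) amounts to looking the sampled token up in the per-step distribution. A minimal sketch, assuming all_probs is a 1-D probability row over the vocabulary; this is illustrative, not the renderer's actual code:

```python
import math

import torch

def token_logprob(all_probs: torch.Tensor, selected_id: int) -> float:
    """Log-probability of the token that was actually sampled at this step.

    The bug being fixed was indexing with the positional output_id instead of
    the sampled token id (selected_id), which returned the wrong probability.
    """
    prob = all_probs[selected_id].item()
    return math.log(prob) if prob > 0 else float("-inf")
```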

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
rtp_llm/openai/renderers/custom_renderer.py Fixes logprob lookup to use the selected token index.
rtp_llm/cpp/model_rpc/QueryConverter.cc Adds all_probs stacking into the response payload for logprobs support.
CLAUDE.md Adds developer documentation for build/test workflow and architecture.

Comment thread rtp_llm/cpp/model_rpc/QueryConverter.cc Outdated
Comment on lines +462 to +463
stackBuffersToTensorPB(
    flatten_output->mutable_all_probs(), source_outputs, [](const auto& r) { return r.aux_info.all_probs; });
Comment thread CLAUDE.md
bazelisk test //rtp_llm/cpp/core/test/... --config=cuda12

# Run with ASAN
bazelisk test //rtp_llm/cpp/core/test:... --config=cuda12 --config=asan
Copilot AI review requested due to automatic review settings March 17, 2026 13:38
Contributor

Copilot AI left a comment

Pull request overview

This pull request refactors “logprobs” support across the Python + C++ stack to return top‑k logprobs and token ids instead of a full-vocab all_probs tensor, plumbing the new fields through configs, RPC/proto, sampling, and OpenAI renderers.

Changes:

  • Add top_logprobs to generation config and propagate it through Python endpoint, C++ endpoint, and RPC proto/client conversion.
  • Replace all_probs usage with topk_logprobs + topk_token_ids in OpenAI rendering and in C++ stream aux info.
  • Update CUDA sampling to compute top‑k logprobs/token ids on GPU and pass them through normal engine streaming.
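As a rough torch-level illustration of the new contract (the real computation happens in the CUDA sampler, not in Python), the top-k outputs replace the full [batch, vocab] all_probs tensor with two [batch, k] tensors; the names and shapes here are assumptions based on the summary above:

```python
import torch

def topk_logprobs_from_logits(logits: torch.Tensor, k: int):
    """Sketch of the new output contract: [batch, vocab] logits in,
    two [batch, k] tensors (logprobs and token ids) out, instead of a
    full [batch, vocab] probability tensor."""
    logprobs = torch.log_softmax(logits, dim=-1)
    topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
    return topk_logprobs, topk_token_ids
```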

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.

File Description
rtp_llm/utils/base_model_datatypes.py Adds topk_logprobs/topk_token_ids to GenerateOutput and deprecates all_probs.
rtp_llm/openai/renderers/custom_renderer.py Switches OpenAI logprob rendering to consume top‑k tensors; updates stream/sync plumbing signatures.
rtp_llm/openai/openai_endpoint.py Sets GenerateConfig.top_logprobs when logprobs is requested.
rtp_llm/cpp/test/SamplerTest.cc Updates SamplerInputs construction for new top‑k fields.
rtp_llm/cpp/normal_engine/speculative/MtpExecutor.cc Adjusts fake stream update struct initialization for updated StreamUpdateInfo.
rtp_llm/cpp/normal_engine/NormalGenerateStream.cc Stores top‑k tensors into aux info when return_all_probs is enabled.
rtp_llm/cpp/normal_engine/NormalEngine.cc Updates fake stream StreamUpdateInfo initialization for new struct layout.
rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc Allocates/dispatches top‑k buffers instead of full all_probs.
rtp_llm/cpp/models/Sampler.cc Threads new optional top‑k output buffers into device sampling calls.
rtp_llm/cpp/models/SampleInfos.h Extends SamplerInputs/SamplerOutput with top‑k buffers and top_logprobs.
rtp_llm/cpp/model_rpc/QueryConverter.cc Adds top‑k tensors to RPC response flattening and top_logprobs to config conversion.
rtp_llm/cpp/model_rpc/proto/model_rpc_service.proto Adds top_logprobs to config PB and top‑k tensors to output PB.
rtp_llm/cpp/model_rpc/model_rpc_client.py Reads top‑k tensors from RPC outputs and sets them on Python GenerateOutput.
rtp_llm/cpp/engine_base/stream/StreamGroups.h Computes max top_logprobs across streams to size batch allocations.
rtp_llm/cpp/engine_base/stream/GenerateTypes.h Replaces AuxInfo.all_probs with topk_logprobs/topk_token_ids.
rtp_llm/cpp/engine_base/stream/GenerateStream.h Updates StreamUpdateInfo to carry top‑k buffers; extends speculative output struct.
rtp_llm/cpp/engine_base/stream/GenerateStream.cc Adjusts speculative update call to match new StreamUpdateInfo layout.
rtp_llm/cpp/engine_base/stream/GenerateConfig.h Adds top_logprobs to C++ generate config and JSON serialization.
rtp_llm/cpp/devices/OpData.h Extends GreedyParams with output top‑k buffers.
rtp_llm/cpp/devices/cuda_impl/CudaSampleOp.cc Computes renormalized probs, top‑k logprobs/token ids on GPU, and copies outputs.
rtp_llm/cpp/api_server/openai/OpenaiEndpoint.cc Populates GenerateConfig.top_logprobs from request.
rtp_llm/cpp/api_server/openai/ChatRender.cc Passes top‑k tensors to Python renderer calls instead of all_probs.
rtp_llm/config/generate_config.py Adds top_logprobs to Python GenerateConfig; formatting/validator tweaks.
CLAUDE.md Adds repository build/test/architecture guidance documentation.

for i in range(topk_logprobs.shape[0]):
    if topk_token_ids[i].item() == selected_id:
        selected_logprob = topk_logprobs[i].item()
        break
for i in range(topk_logprobs_tensor.shape[0]):
    if topk_token_ids_tensor[i].item() == selected_id:
        selected_logprob = topk_logprobs_tensor[i].item()
        break
Comment on lines 1190 to 1194
def _flush_buffer_sync(
    self,
    buffer_list: List[StreamStatusSync],
    input_len_list,
    output_len_list,
auto& top_k = params.top_k;
if (std::all_of(top_k.data<uint32_t>(), top_k.data<uint32_t>() + batch_size, [&](auto t) { return t == 1; })
&& !params.output_all_probs.has_value()) {
&& !params.output_topk_logprobs.has_value() && !params.cum_log_probs.has_value()) {
auto output_topk_logprobs_t = Buffer2torchTensor(*params.output_topk_logprobs, false);
auto output_topk_token_ids_t = Buffer2torchTensor(*params.output_topk_token_ids, false);
int k = (int)params.output_topk_logprobs->get().shape()[1];
auto [topk_vals, topk_ids] = torch::topk(renorm_probs_t, k, -1, true, true);
Comment on lines +124 to +125
if (!update_info.topk_logprobs) {
    throw std::runtime_error("topk_logprobs is null while generate_config return_all_probs is true");
Comment on lines +561 to +565
if (return_all_probs && sampler_output.topk_logprobs) {
    topk_logprobs = sampler_output.topk_logprobs->slice(batch_idx_out, next_batch_size, false);
    topk_logprobs->updateParent(sampler_output.topk_logprobs);
    topk_token_ids = sampler_output.topk_token_ids->slice(batch_idx_out, next_batch_size, false);
    topk_token_ids->updateParent(sampler_output.topk_token_ids);
Comment on lines +915 to +917
OptionalBufferRef output_topk_logprobs;
OptionalBufferRef output_topk_token_ids;
OptionalBufferRef output_all_probs; // used by speculative decoding
}
if (req.logprobs.has_value()) {
    config.return_all_probs = req.logprobs.value();
    config.top_logprobs = req.top_logprobs.value_or(1);
@LLLLKKKK
Collaborator

🤖 AI Code Review

Overview

Replaces the prepare_cuda_graph method of the FlashInferTRTLLM family of Impl classes: instead of the Python-level prepare() + tensor.copy_() pattern, a Triton kernel now computes directly on the GPU, cutting Python overhead and CPU-GPU synchronization on the CUDA graph replay path. The tests are also refactored, and a pre-existing test-method name collision is fixed.

Strengths

  • The core idea is clear and the Triton kernel design is sound: pid == 0 handles the small seq_lens computation while all pids process the large block_id -> kv_offset conversion in parallel
  • Test coverage is thorough, including basic functionality, large batches, boundary conditions, and multi-block scenarios
  • Fixes the pre-existing bug where the duplicate definition of test_flashinfer_trtllm_prefill_fp8 let the decode fp8 test overwrite the prefill fp8 test

Suggested improvements

P1 - Important

  1. Register pressure of BLOCK_SIZE in _compute_cg_grid for large N: when N is large (e.g. 2048), tl.arange(0, 2048) inside a single block puts notable pressure on registers. Add an assert or document the expected upper bound on N.

  2. The semantic assumption behind sequence_lengths_plus_1_d should be documented: the decode branch of FlashInferTRTLLMSpecDecodeImpl.prepare_cuda_graph relies on sequence_lengths_plus_1_d being equivalent to sequence_lengths + 1. Add a comment stating this dependency.

  3. N/M in _DecodeCGParams / _PrefillCGParams are fixed at CUDA graph replay time: this matches CUDA graph's fixed-shape constraint, but a comment noting the limitation would help.

P2 - Suggestions

  1. Implicit tensor sharing: _cg.kv_cache_offset and self.rope_params.kv_cache_offset are the same tensor, and _cg.seq_lens / cu_kv_seqlens are the same tensors as those in self.fmha_params. Writes from the Triton kernel directly affect downstream behavior. Add comments documenting these implicit sharing relationships.

Conclusion

A high-quality performance optimization PR with no blocking issues. LGTM; adding the comments above would reduce future maintenance risk.

@LLLLKKKK
Collaborator

🤖 AI Code Review (corrected — the previous comment accidentally pasted content from another PR; apologies)

Overview

Optimizes the logprobs feature: instead of shipping a full vocab-size probability vector, only the top-k results (topk_logprobs + topk_token_ids) are transferred, which substantially reduces memory and bandwidth overhead. The change covers the full data path (sampling → stream → RPC → rendering).

Strengths

  • The optimization direction is right: from O(vocab_size) down to O(top_k)
  • Keeps the all_probs path that speculative decoding needs
  • Adds a smoke test covering the basic functionality

Suggested improvements

P1 - Important

  1. logprob is -inf when the selected token is not in the top-k: when the sampling top_k (e.g. 50) is larger than the requested top_logprobs (e.g. 3), the sampled token may fall outside the returned top 3, so selected_logprob becomes -inf, which does not match the OpenAI API spec. Consider additionally returning selected_token_logprob (see the sketch after this list).
  2. return_all_probs=true + top_logprobs=0 causes a runtime crash: setting return_all_probs=true through a non-OpenAI path without setting top_logprobs leaves the buffer unallocated, but downstream code throws.
  3. top_logprobs lacks an upper-bound check: callers can pass arbitrarily large values; limit it to 0-20 (per the OpenAI spec).
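A minimal sketch of the selected_token_logprob idea from item 1, assuming the sampler still holds the full log-probability rows when it samples; the extra output field and the shapes are assumptions, not code from this PR:

```python
import torch

def sample_outputs_with_selected_logprob(logprobs: torch.Tensor,
                                         selected_ids: torch.Tensor,
                                         k: int):
    """logprobs: [batch, vocab] log-softmax rows; selected_ids: [batch] int64 sampled ids.

    Returns the top-k tensors plus the sampled token's own logprob, so the
    renderer never has to fall back to -inf when the sampled token lies
    outside the reported top-k."""
    topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
    selected_token_logprob = logprobs.gather(-1, selected_ids.unsqueeze(-1)).squeeze(-1)
    return topk_logprobs, topk_token_ids, selected_token_logprob
```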

P2 - Suggestions

  • MtpBatchStreamProcessor has not been adapted to topk, so logprobs are not returned under speculative decoding
  • StreamUpdateInfo uses positional initialization (11-13 nullptrs), which is very error-prone; prefer named initialization

Conclusion

A performance optimization headed in the right direction. Fix the -inf selected-token logprob issue and the boundary conditions before merging.

@LLLLKKKK
Collaborator

LLLLKKKK commented Apr 9, 2026

🤖 AI Code Review — PR #791
Head SHA: f0420b41 | Verdict: P2

Summary

Replaces full vocab-size all_probs tensor [batch_size, vocab_size] with compact top-k representation (topk_logprobs + topk_token_ids, each [batch_size, k]). Significant memory/bandwidth optimization across the full stack: config, sampler, CUDA sampling, stream update, RPC proto, OpenAI rendering.

Findings

  1. [P2] Speculative decoding all_probs path may be stale — all_probs is kept for speculative decoding, but gatherSamplerInput no longer allocates it in the normal path. Verify speculative decoding tests still pass.

  2. [P2] top_logprobs=0 with return_all_probs=true silently produces no logprobs — In gatherSamplerInput, condition is if (return_all_probs && max_top_logprobs > 0). If top_logprobs defaults to 0, no buffers are allocated. The Python OpenAI endpoint covers this (top_logprobs or 1), but other entry points (direct gRPC) may not. Consider defaulting to 1 when return_all_probs is true.

  3. [P2] selected_logprob defaults to -inf when selected token not in top-k — In custom_renderer.py, if the sampled token isn't in top-k, logprob stays -inf. Acceptable per OpenAI spec but differs from previous behavior.

  4. [P2] Unrelated CLAUDE.md included in diff — Should be a separate PR.

Generated by AI Code Review

@LLLLKKKK
Collaborator

AI Code Review — PR #791

Overview

Refactors the logprobs return path: instead of a full probability tensor on the Python side (all_probs, shape [batch_size, vocab_size]), the C++ Sampler now computes top-k logprobs (shape [batch_size, k]) directly on the GPU, reducing GPU→CPU data transfer.

Issues

  1. top_logprobs > vocab_size crashes torch::topk [P1]

    In CudaSampleOp.cc, torch::topk(renorm_probs_t, k) throws a runtime error when k exceeds vocab_size. In NormalBatchStreamProcessor.cc, max_top_logprobs is taken straight from the user request with no min clamp against vocab_size.

    Suggestion: clamp with k = std::min(k, vocab_size) in gatherSamplerInput, or validate during request parsing (the OpenAI API spec caps top_logprobs at 20).

  2. StreamUpdateInfo aggregate initialization is fragile [P2]

    StreamUpdateInfo uses positional initialization; after this PR splits the fields, several call sites must pad nullptrs by hand. Prefer designated initializers.

  3. The Python-side selected_logprob lookup uses a for loop [P2]

    custom_renderer.py finds the selected token's logprob with a Python for loop; a tensor operation would do (see the sketch after this list).

  4. _generate_log_probs and _generate_log_probs_sync duplicate the same logic [P2]

    The async and sync logprobs-generation paths are nearly identical; extract a shared helper.
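A possible tensor-based replacement for the loop mentioned in item 3, assuming 1-D [k] tensors per generated token; a sketch, not the PR's implementation:

```python
import torch

def lookup_selected_logprob(topk_logprobs: torch.Tensor,
                            topk_token_ids: torch.Tensor,
                            selected_id: int) -> float:
    """Vectorized lookup of the sampled token's logprob within the top-k,
    replacing the per-element Python loop in custom_renderer.py."""
    mask = topk_token_ids == selected_id
    if mask.any():
        return topk_logprobs[mask][0].item()
    return float("-inf")
```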

Minor issues [P3]

  • Formatting changes are mixed in with functional changes
  • The "add claude.md" commit is unrelated to the PR's purpose
  • The PR description is too brief
  • return_all_probs=true with top_logprobs=0 can trigger a runtime_error

Highlights

  • Top-k computation is pushed down to the GPU, reducing data transfer
  • The full pipeline is covered end to end
  • The backward-compatible design for speculative decoding is sound
  • A smoke test is added

Conclusion

There is a P1 issue (missing top_logprobs bounds check), so merging is not recommended yet. Once fixed, the PR can proceed to CI validation.

@LLLLKKKK
Collaborator

AI Code Review — PR #791

Summary: P0/0 · P1/1 · P2/2 · P3/1

Review status: review_blocking

P1 Blocking

  • return_all_probs=true with default top_logprobs=0 causes runtime crash @ rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc:334
    • suggestion: Either force top_logprobs >= 1 when return_all_probs is true (e.g. in GenerateConfig validation or at the allocation site: if (return_all_probs) { max_top_logprobs = std::max(max_top_logprobs, 1); }), or relax the throw in NormalGenerateStream to skip topk output when top_logprobs == 0. The OpenAI endpoints already set top_logprobs >= 1, but direct GenerateConfig users (gRPC, internal callers) hit this crash path.
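One possible shape for this guard, sketched as a Python-side validation helper; the function name, the cap of 20, and the vocab_size clamp are assumptions pulled together from this and earlier comments, not code from the PR:

```python
def clamp_top_logprobs(return_all_probs: bool, top_logprobs: int,
                       vocab_size: int, max_allowed: int = 20) -> int:
    """Force at least one top-k slot when logprobs are requested, and cap the
    value: the OpenAI API allows at most 20, and torch::topk requires k <= vocab_size."""
    if return_all_probs:
        top_logprobs = max(top_logprobs, 1)
    return min(top_logprobs, max_allowed, vocab_size)
```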

P2 Important

  • Speculative decoding all_probs buffer no longer allocated for logprobs requests @ rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc:334
    • suggestion: Verify that speculative decoding paths (MTP, Eagle3) never co-occur with return_all_probs=true, or add back the all_probs allocation when speculative decoding is active. If speculative decoding has its own allocation path, document that assumption.
  • topk_logprobs stores log of renormalized probs, not log of softmax probs @ rtp_llm/cpp/devices/cuda_impl/CudaSampleOp.cc:207
    • suggestion: Consider computing topk from the original softmax probs (probs_t) rather than renorm_probs_t. The renormalized distribution zeros out tokens outside the top-k/top-p window, inflating the reported probabilities of surviving tokens. Use torch::topk(probs_t.log(), k, -1, true, true) for OpenAI-compatible logprobs.
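A small torch sketch of why this matters: renormalizing over the sampling window before taking logs inflates the surviving tokens' reported probabilities relative to the model's softmax (illustrative numbers only, not the kernel's code):

```python
import torch

probs = torch.softmax(torch.tensor([[2.0, 1.0, 0.5, -3.0]]), dim=-1)

# Zero out everything outside a top-2 window and renormalize, as a stand-in
# for the top-k/top-p renormalization done before sampling.
renorm = probs.clone()
renorm[:, 2:] = 0
renorm = renorm / renorm.sum(dim=-1, keepdim=True)

print(torch.topk(probs.log(), 2, dim=-1).values)   # logprobs of the true softmax
print(torch.topk(renorm.log(), 2, dim=-1).values)  # inflated after renormalization
```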

P3 Notes

  • selected_logprob defaults to -inf when selected token not in topk @ rtp_llm/openai/renderers/custom_renderer.py:530
    • suggestion: This is an inherent limitation of the topk-only approach. Consider documenting this behavior, or ensure top_logprobs is always large enough to include the sampled token (which it should be for greedy/top-k but may not be for pure top-p).

Strengths

  • Significant memory optimization: replaces full vocab-size all_probs buffer (batch x vocab_size floats) with compact topk buffers (batch x k), reducing GPU memory pressure for logprobs requests
  • Clean separation of topk logprobs from speculative decoding all_probs, making the two use cases independent
  • Consistent plumbing across the full stack: proto, gRPC client, C++ engine, Python renderer all updated coherently
  • Moves topk computation to GPU (torch::topk in CudaSampleOp) instead of doing it on CPU in the Python renderer
