Develop/logprobs opt #791
Conversation
Pull request overview
Fixes incorrect logprobs rendering by ensuring the selected token’s probability is used and that probability tensors are propagated through the RPC response path.
Changes:
- Use `selected_id` (instead of `output_id`) when computing the chosen token's `logprob` in custom OpenAI renderers (see the sketch after this list).
- Populate `all_probs` in the flattened RPC response so downstream logprobs rendering has the required data.
- Add `CLAUDE.md` with build/test/architecture guidance for repository tooling.
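A minimal sketch of the indexing fix described in the first bullet, assuming a toy single-step probability tensor (variable names mirror the summary above; shapes and values are illustrative, not the renderer's actual code):

```python
import math

import torch

all_probs = torch.tensor([0.1, 0.7, 0.2])  # per-step probabilities over a toy 3-token vocab
selected_id = 1                            # id of the token that was actually sampled
output_id = 0                              # index of the output in the batch: the wrong lookup key

selected_logprob = math.log(all_probs[selected_id].item())  # correct: log(0.7)
buggy_logprob = math.log(all_probs[output_id].item())       # pre-fix behavior: log(0.1)
```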
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| rtp_llm/openai/renderers/custom_renderer.py | Fixes logprob lookup to use the selected token index. |
| rtp_llm/cpp/model_rpc/QueryConverter.cc | Adds all_probs stacking into the response payload for logprobs support. |
| CLAUDE.md | Adds developer documentation for build/test workflow and architecture. |
```cpp
stackBuffersToTensorPB(
    flatten_output->mutable_all_probs(), source_outputs, [](const auto& r) { return r.aux_info.all_probs; });
```
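A rough Python analogue of the stacking above, assuming dict-shaped outputs (the real helper reads `aux_info.all_probs` from each C++ output and serializes the stack into a protobuf tensor; the key name and shapes here are illustrative):

```python
import torch

source_outputs = [{"all_probs": torch.rand(1, 4)} for _ in range(2)]  # one entry per stream output
stacked = torch.stack([o["all_probs"] for o in source_outputs])       # [2, 1, 4], written to the response
```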
```bash
bazelisk test //rtp_llm/cpp/core/test/... --config=cuda12

# Run with ASAN
bazelisk test //rtp_llm/cpp/core/test/... --config=cuda12 --config=asan
```
Pull request overview
This pull request refactors “logprobs” support across the Python + C++ stack to return top‑k logprobs and token ids instead of a full-vocab all_probs tensor, plumbing the new fields through configs, RPC/proto, sampling, and OpenAI renderers.
Changes:
- Add `top_logprobs` to the generation config and propagate it through the Python endpoint, C++ endpoint, and RPC proto/client conversion.
- Replace `all_probs` usage with `topk_logprobs` + `topk_token_ids` in OpenAI rendering and in C++ stream aux info.
- Update CUDA sampling to compute top-k logprobs/token ids on GPU and pass them through normal engine streaming (see the sketch after this list).
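A hedged PyTorch sketch of that top-k computation (the PR does this in C++/CUDA inside `CudaSampleOp.cc`; tensor names and the vocab size below are illustrative):

```python
import torch

logits = torch.randn(2, 32_000)        # [batch_size, vocab_size]
top_logprobs = 5                       # from GenerateConfig.top_logprobs

probs = torch.softmax(logits, dim=-1)                            # renormalized probabilities
topk_vals, topk_token_ids = torch.topk(probs, top_logprobs, -1)  # [batch_size, k] each
topk_logprobs = topk_vals.log()                                  # carried through stream aux info
```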
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| rtp_llm/utils/base_model_datatypes.py | Adds topk_logprobs/topk_token_ids to GenerateOutput and deprecates all_probs. |
| rtp_llm/openai/renderers/custom_renderer.py | Switches OpenAI logprob rendering to consume top‑k tensors; updates stream/sync plumbing signatures. |
| rtp_llm/openai/openai_endpoint.py | Sets GenerateConfig.top_logprobs when logprobs is requested. |
| rtp_llm/cpp/test/SamplerTest.cc | Updates SamplerInputs construction for new top‑k fields. |
| rtp_llm/cpp/normal_engine/speculative/MtpExecutor.cc | Adjusts fake stream update struct initialization for updated StreamUpdateInfo. |
| rtp_llm/cpp/normal_engine/NormalGenerateStream.cc | Stores top‑k tensors into aux info when return_all_probs is enabled. |
| rtp_llm/cpp/normal_engine/NormalEngine.cc | Updates fake stream StreamUpdateInfo initialization for new struct layout. |
| rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc | Allocates/dispatches top‑k buffers instead of full all_probs. |
| rtp_llm/cpp/models/Sampler.cc | Threads new optional top‑k output buffers into device sampling calls. |
| rtp_llm/cpp/models/SampleInfos.h | Extends SamplerInputs/SamplerOutput with top‑k buffers and top_logprobs. |
| rtp_llm/cpp/model_rpc/QueryConverter.cc | Adds top‑k tensors to RPC response flattening and top_logprobs to config conversion. |
| rtp_llm/cpp/model_rpc/proto/model_rpc_service.proto | Adds top_logprobs to config PB and top‑k tensors to output PB. |
| rtp_llm/cpp/model_rpc/model_rpc_client.py | Reads top‑k tensors from RPC outputs and sets them on Python GenerateOutput. |
| rtp_llm/cpp/engine_base/stream/StreamGroups.h | Computes max top_logprobs across streams to size batch allocations. |
| rtp_llm/cpp/engine_base/stream/GenerateTypes.h | Replaces AuxInfo.all_probs with topk_logprobs/topk_token_ids. |
| rtp_llm/cpp/engine_base/stream/GenerateStream.h | Updates StreamUpdateInfo to carry top‑k buffers; extends speculative output struct. |
| rtp_llm/cpp/engine_base/stream/GenerateStream.cc | Adjusts speculative update call to match new StreamUpdateInfo layout. |
| rtp_llm/cpp/engine_base/stream/GenerateConfig.h | Adds top_logprobs to C++ generate config and JSON serialization. |
| rtp_llm/cpp/devices/OpData.h | Extends GreedyParams with output top‑k buffers. |
| rtp_llm/cpp/devices/cuda_impl/CudaSampleOp.cc | Computes renormalized probs, top‑k logprobs/token ids on GPU, and copies outputs. |
| rtp_llm/cpp/api_server/openai/OpenaiEndpoint.cc | Populates GenerateConfig.top_logprobs from request. |
| rtp_llm/cpp/api_server/openai/ChatRender.cc | Passes top‑k tensors to Python renderer calls instead of all_probs. |
| rtp_llm/config/generate_config.py | Adds top_logprobs to Python GenerateConfig; formatting/validator tweaks. |
| CLAUDE.md | Adds repository build/test/architecture guidance documentation. |
```python
for i in range(topk_logprobs.shape[0]):
    if topk_token_ids[i].item() == selected_id:
        selected_logprob = topk_logprobs[i].item()
        break
```
```python
for i in range(topk_logprobs_tensor.shape[0]):
    if topk_token_ids_tensor[i].item() == selected_id:
        selected_logprob = topk_logprobs_tensor[i].item()
        break
```
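Both loops above perform the same linear scan, and if `selected_id` is not among the top-k entries, `selected_logprob` is never assigned (the `-inf` edge case raised later in this thread). A hedged vectorized alternative with an explicit fallback, not the PR's code:

```python
import torch

topk_logprobs = torch.tensor([-0.4, -1.5, -2.9])  # illustrative top-k values
topk_token_ids = torch.tensor([42, 7, 99])
selected_id = 7

matches = (topk_token_ids == selected_id).nonzero(as_tuple=True)[0]
selected_logprob = topk_logprobs[matches[0]].item() if matches.numel() else float("-inf")
```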
```python
def _flush_buffer_sync(
    self,
    buffer_list: List[StreamStatusSync],
    input_len_list,
    output_len_list,
```
```diff
  auto& top_k = params.top_k;
  if (std::all_of(top_k.data<uint32_t>(), top_k.data<uint32_t>() + batch_size, [&](auto t) { return t == 1; })
-     && !params.output_all_probs.has_value()) {
+     && !params.output_topk_logprobs.has_value() && !params.cum_log_probs.has_value()) {
```
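In plain terms, the updated guard only takes the greedy fast path when every request is greedy and no probability-derived outputs were requested. A minimal Python paraphrase (function and parameter names are hypothetical):

```python
def can_use_greedy_fast_path(top_k, output_topk_logprobs, cum_log_probs):
    # Greedy decoding (top_k == 1 everywhere) can pick tokens by argmax and skip
    # the softmax, but only when no probability-derived outputs are needed.
    return all(t == 1 for t in top_k) and output_topk_logprobs is None and cum_log_probs is None
```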
```cpp
auto output_topk_logprobs_t  = Buffer2torchTensor(*params.output_topk_logprobs, false);
auto output_topk_token_ids_t = Buffer2torchTensor(*params.output_topk_token_ids, false);
int  k = (int)params.output_topk_logprobs->get().shape()[1];
auto [topk_vals, topk_ids] = torch::topk(renorm_probs_t, k, -1, true, true);
```
```cpp
if (!update_info.topk_logprobs) {
    throw std::runtime_error("topk_logprobs is null while generate_config return_all_probs is true");
```
```cpp
if (return_all_probs && sampler_output.topk_logprobs) {
    topk_logprobs = sampler_output.topk_logprobs->slice(batch_idx_out, next_batch_size, false);
    topk_logprobs->updateParent(sampler_output.topk_logprobs);
    topk_token_ids = sampler_output.topk_token_ids->slice(batch_idx_out, next_batch_size, false);
    topk_token_ids->updateParent(sampler_output.topk_token_ids);
```
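A hedged PyTorch analogue of that slicing: each stream takes its rows out of the batched top-k tensors produced by the sampler (shapes and offsets below are illustrative):

```python
import torch

batch_topk_logprobs = torch.randn(8, 5)  # [total_batch_size, k] from the sampler
batch_idx_out, next_batch_size = 2, 3    # this stream owns rows [2, 5)
stream_topk_logprobs = batch_topk_logprobs[batch_idx_out : batch_idx_out + next_batch_size]
```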
```cpp
OptionalBufferRef output_topk_logprobs;
OptionalBufferRef output_topk_token_ids;
OptionalBufferRef output_all_probs;  // used by speculative decoding
```
```cpp
}
if (req.logprobs.has_value()) {
    config.return_all_probs = req.logprobs.value();
    config.top_logprobs     = req.top_logprobs.value_or(1);
```
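A sketch of the request-to-config mapping above in Python, mirroring the `value_or(1)` semantics (the dataclasses are illustrative stand-ins for the C++ types, not the project's actual classes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatRequest:                 # illustrative stand-in
    logprobs: Optional[bool] = None
    top_logprobs: Optional[int] = None


@dataclass
class GenerateConfig:              # illustrative stand-in
    return_all_probs: bool = False
    top_logprobs: int = 0


def apply_logprobs(req: ChatRequest, config: GenerateConfig) -> None:
    if req.logprobs is not None:
        config.return_all_probs = req.logprobs
        # mirrors req.top_logprobs.value_or(1): default to 1 only when unset
        config.top_logprobs = req.top_logprobs if req.top_logprobs is not None else 1
```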
🤖 AI Code Review

Overview: Refactors the `FlashInfer`/`TRTLLM` series of Impl classes' …

Strengths: …

Suggested improvements:

P1 - Important: …

P2 - Suggestions: …

Summary: A high-quality performance-optimization PR with no blocking issues. LGTM; recommend adding the comments above to reduce future maintenance risk.
🤖 AI Code Review (corrected version; the previous comment mistakenly pasted content from another PR, apologies)

Overview: Optimizes the logprobs feature from transmitting the full vocab-size probability vector to transmitting only the top-k results (…

Strengths: …

Suggested improvements:

P1 - Important: …

P2 - Suggestions: …

Summary: A performance optimization in the right direction. Recommend fixing the issue where the selected token's logprob can be `-inf`, along with the boundary conditions, before merging (see the payload sketch below).
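A back-of-the-envelope comparison of the payload sizes behind this review's "right direction" verdict (assumed numbers: a 32k-entry vocab, fp32 probabilities, `top_logprobs = 5`, int32 token ids):

```python
vocab_size, k = 32_000, 5
fp32_bytes = int32_bytes = 4

per_token_before = vocab_size * fp32_bytes        # 128,000 bytes of all_probs per token
per_token_after = k * (fp32_bytes + int32_bytes)  # 40 bytes of top-k logprobs + token ids
print(per_token_before // per_token_after)        # ~3200x less data per generated token
```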
🤖 AI Code Review — PR #791

Summary: Replaces full vocab-size …

Findings: …

Generated by AI Code Review
AI Code Review — PR #791

Overview: Changes the logprobs return path from a full-probability tensor on the Python side (…

Issues: …

Minor issues [P3]: …

Highlights: …

Conclusion: A P1 issue exists (…
AI Code Review — PR #791

Summary: P0/0 · P1/1 · P2/2 · P3/1
Review status: review_blocking

P1 Blocking: …

P2 Important: …

P3 Notes: …

Strengths: …
fix logprobs custom render