fix(grpc_client): honor ResponsesRequest sampling params in vLLM/SGLang/TRT-LLM#1106
key4ng wants to merge 10 commits into lightseekorg:main from a fork branch
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough
The PR maps ResponsesRequest sampling fields (top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty) through to gRPC.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 3 passed
Code Review
This pull request updates the gRPC clients for SGLang, TRT-LLM, and vLLM to map sampling parameters such as top_k, min_p, and penalties directly from the ResponsesRequest, replacing previous hardcoded defaults. Corresponding unit tests were added for each client. Feedback indicates that the stop field is currently omitted from the mapping logic and that internal comments incorrectly claim the Responses API lacks stop sequence support.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/grpc_client/src/vllm_engine.rs`:
- Around line 929-953: Add a disabled-case assertion to the existing
test_responses_sampling_params_are_passed_through to lock in the clamp behavior:
create a ResponsesRequest with top_k set to a negative value (e.g., -1), call
VllmEngineClient::build_grpc_sampling_params_from_responses(&request, None) and
assert that the returned params.top_k equals 0 to ensure negative top_k is
clamped to zero; place this check alongside the other assertions in the same
test so the behavior is verified.
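The clamp this suggestion wants locked in can be sketched as follows; `clamp_top_k` is a hypothetical stand-in for the logic inside `VllmEngineClient::build_grpc_sampling_params_from_responses`, whose exact shape may differ:

```rust
// Hypothetical stand-in for the vLLM builder's top_k handling: the proto
// field is u32 with 0 meaning "disabled", so negative sentinels from
// ResponsesRequest must clamp to 0.
fn clamp_top_k(request_top_k: i32) -> u32 {
    request_top_k.max(0) as u32
}

fn main() {
    // The disabled-case assertion the review asks for: -1 clamps to 0.
    assert_eq!(clamp_top_k(-1), 0);
    // Positive values pass through unchanged.
    assert_eq!(clamp_top_k(40), 40);
}
```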
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 74bb011a-0040-4a67-b736-bde1e51634be
📒 Files selected for processing (3)
- crates/grpc_client/src/sglang_scheduler.rs
- crates/grpc_client/src/trtllm_service.rs
- crates/grpc_client/src/vllm_engine.rs
CatherineSue
left a comment
Can we add an e2e test for this?
Hi @key4ng, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4e2a8dbe0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/pr-test-rust.yml:
- Around line 499-503: The workflow condition only checks github.event_name !=
'pull_request', so pull_request_target bypasses the detect-changes gating; update
the conditional that gates running e2e-1gpu-completions to also treat
'pull_request_target' like 'pull_request' (e.g., check github.event_name !=
'pull_request' && github.event_name != 'pull_request_target' or invert to
explicitly allow only non-PR events), ensuring
needs.detect-changes.outputs.completions is consulted; target the same
conditional block referencing github.event_name and
needs.detect-changes.outputs.completions and the e2e-1gpu-completions job.
In `@model_gateway/src/service_discovery.rs`:
- Around line 641-643: When enqueuing Job::RemoveWorker do not use
expected_revision: None; instead read the currently installed worker's revision
and set expected_revision to that value so stale remove jobs become no-ops if
the worker has been replaced. Locate the places creating Job::RemoveWorker (the
snippet with expected_revision: None and the other occurrences around the other
removal sites) and replace expected_revision: None with the current revision
fetched from the registry (e.g., get the WorkerEntry or Worker struct for
old_url and use its revision field), ensuring you handle the case where the
worker is missing by leaving expected_revision as None only if no installed
revision exists.
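A minimal sketch of the suggested guard, assuming a simplified `WorkerEntry` with a `revision` field (the real types in service_discovery.rs will differ):

```rust
// Simplified stand-in for the registry's installed-worker entry.
struct WorkerEntry {
    revision: u64,
}

// Read the currently installed worker's revision so a stale RemoveWorker job
// becomes a no-op if the worker has since been replaced; None only when no
// installed revision exists.
fn expected_revision_for(installed: Option<&WorkerEntry>) -> Option<u64> {
    installed.map(|entry| entry.revision)
}

fn main() {
    let current = WorkerEntry { revision: 7 };
    assert_eq!(expected_revision_for(Some(&current)), Some(7));
    assert_eq!(expected_revision_for(None), None);
}
```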
In `@model_gateway/src/worker/manager.rs`:
- Around line 403-407: The removal job enqueues the concrete DP-aware worker URL
via worker.url(), which for DP workers like "...@rank" prevents
find_workers_by_url() (which matches "{url}@") from finding them; normalize
DP-aware URLs to their base form before creating RemovalCandidate: detect and
strip the "@<suffix>" DP qualifier from worker.url() (use the same normalization
where removal logic at the other occurrence is applied) and push
RemovalCandidate with the base URL and expected_revision so prefix-matching in
find_workers_by_url() will succeed; apply the same change to the other similar
push site referenced in the diff.
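The normalization described above can be sketched as follows; `normalize_dp_url` is illustrative and assumes the only `@` in a DP-aware worker URL is the rank separator:

```rust
// Hypothetical normalization for DP-aware worker URLs of the form
// "<base>@<rank>"; assumes the only '@' is the DP rank separator.
fn normalize_dp_url(url: &str) -> &str {
    match url.rsplit_once('@') {
        Some((base, _rank)) => base,
        None => url,
    }
}

fn main() {
    // A DP-aware URL is reduced to its base so "{url}@" prefix matching
    // in find_workers_by_url() can succeed.
    assert_eq!(normalize_dp_url("http://worker-1:8000@3"), "http://worker-1:8000");
    // Plain URLs are returned unchanged.
    assert_eq!(normalize_dp_url("http://worker-1:8000"), "http://worker-1:8000");
}
```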
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d2b5bd57-ccc1-4dd3-995d-244a417e1dca
📒 Files selected for processing (23)
- .github/workflows/pr-test-rust.yml
- bindings/golang/src/policy.rs
- crates/grpc_client/src/sglang_scheduler.rs
- crates/grpc_client/src/trtllm_service.rs
- crates/grpc_client/src/vllm_engine.rs
- e2e_test/completions/__init__.py
- e2e_test/completions/test_basic.py
- model_gateway/src/observability/metrics_ws/collectors.rs
- model_gateway/src/policies/manual.rs
- model_gateway/src/policies/mod.rs
- model_gateway/src/routers/http/pd_router.rs
- model_gateway/src/server.rs
- model_gateway/src/service_discovery.rs
- model_gateway/src/worker/builder.rs
- model_gateway/src/worker/circuit_breaker.rs
- model_gateway/src/worker/manager.rs
- model_gateway/src/worker/registry.rs
- model_gateway/src/worker/service.rs
- model_gateway/src/worker/worker.rs
- model_gateway/src/workflow/job_queue.rs
- model_gateway/src/workflow/steps/local/find_workers_to_remove.rs
- model_gateway/src/workflow/steps/local/mod.rs
- model_gateway/src/workflow/steps/local/update_worker_properties.rs
Force-pushed from efdc5c7 to 4bea74b (compare)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/grpc_client/src/sglang_scheduler.rs`:
- Around line 997-1022: Add a negative-sentinel assertion to the existing
test_responses_sampling_params_are_passed_through: call
SglangSchedulerClient::build_grpc_sampling_params_from_responses with a
ResponsesRequest where top_k is set to -1 (and other fields as needed or
default), then assert that the produced params.top_k equals -1 to verify the
disabled-sentinel behavior; reference the test function name and
build_grpc_sampling_params_from_responses so you add this extra case in the same
test (or a tiny new test) to lock the backend contract for the sentinel value.
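The sentinel contract this comment wants locked in can be sketched as follows; `sglang_top_k` is a hypothetical stand-in for the top_k handling inside `SglangSchedulerClient::build_grpc_sampling_params_from_responses`:

```rust
// Hypothetical stand-in for the SGLang builder's top_k handling: the proto
// field is i32 with -1 as the "disabled" sentinel, matching the
// ResponsesRequest default, so the value passes through unchanged.
fn sglang_top_k(request_top_k: i32) -> i32 {
    request_top_k
}

fn main() {
    // The sentinel assertion the review asks for: -1 is preserved.
    assert_eq!(sglang_top_k(-1), -1);
    assert_eq!(sglang_top_k(40), 40);
}
```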
In `@crates/grpc_client/src/trtllm_service.rs`:
- Around line 1002-1025: Extend the test for
TrtllmServiceClient::build_sampling_config_from_responses
(test_responses_sampling_config_is_passed_through) to include a negative-top_k
scenario: construct a ResponsesRequest with top_k set to a negative value and
call build_sampling_config_from_responses, then assert that the resulting
cfg.top_k is Some(0) to lock in the disabled-edge behavior (top_k < 0 ->
Some(0)). Ensure you reference the same function and request struct names
(ResponsesRequest and TrtllmServiceClient::build_sampling_config_from_responses)
when adding the new assertion or sub-case in the test.
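The edge behavior to lock in can be sketched as follows; `trtllm_top_k` is a hypothetical stand-in for the top_k handling inside `TrtllmServiceClient::build_sampling_config_from_responses`:

```rust
// Hypothetical stand-in for the TRT-LLM builder's top_k handling: the proto
// field is Option<i32> with 0 meaning "disabled", so negative request
// values clamp to Some(0).
fn trtllm_top_k(request_top_k: i32) -> Option<i32> {
    Some(request_top_k.max(0))
}

fn main() {
    // The disabled-edge behavior: top_k < 0 -> Some(0).
    assert_eq!(trtllm_top_k(-1), Some(0));
    assert_eq!(trtllm_top_k(40), Some(40));
}
```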
In `@e2e_test/responses/test_sampling_params.py`:
- Around line 20-47: In test_all_sampling_params_accepted, add assertions that
the sampling parameters sent in extra_body are echoed back in the response so
propagation is validated; specifically assert that resp contains top_k == 40,
min_p == 0.05, frequency_penalty == 0.5, presence_penalty == 0.3 and
repetition_penalty == 1.1 (access them from the response payload/metadata where
echoed fields live, e.g., resp.extra or resp.<field> depending on the response
shape) in addition to the existing temperature/top_p/max_output_tokens checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 58c3af9b-a860-4d68-acaf-8cc5707ed7a1
📒 Files selected for processing (4)
- crates/grpc_client/src/sglang_scheduler.rs
- crates/grpc_client/src/trtllm_service.rs
- crates/grpc_client/src/vllm_engine.rs
- e2e_test/responses/test_sampling_params.py
```python
def test_all_sampling_params_accepted(self, model, api_client):
    """Non-default sampling params produce a valid response with correct echo-back."""
    resp = api_client.responses.create(
        model=model,
        input="What is 2+2?",
        temperature=0.5,
        top_p=0.9,
        max_output_tokens=32,
        extra_body={
            "top_k": 40,
            "min_p": 0.05,
            "frequency_penalty": 0.5,
            "presence_penalty": 0.3,
            "repetition_penalty": 1.1,
        },
    )
    assert resp.status == "completed"
    assert resp.error is None
    assert len(resp.output_text) > 0

    assert resp.temperature == 0.5
    assert resp.top_p == 0.9
    assert resp.max_output_tokens == 32

    assert resp.usage is not None
    assert resp.usage.output_tokens > 0
    assert resp.usage.output_tokens <= 32
```
🧹 Nitpick | 🔵 Trivial
Strengthen assertions for extra_body sampling fields.
This test sends top_k, min_p, and penalty fields but only asserts echoed temperature/top_p/max_output_tokens. If these fields are present in the response payload, assert them too to validate true end-to-end propagation.
The ResponsesResponse proto only echoes temperature, top_p, and max_output_tokens — the extension fields (top_k, min_p, penalties) are not in the response schema. We can't assert what isn't returned. The unit tests cover the actual builder pass-through; the E2E test confirms the full pipeline doesn't reject them.
@key4ng, that makes sense — if ResponsesResponse doesn't echo those extension fields back, there's nothing to assert at the E2E layer for them. The division of responsibility is clear: unit tests validate builder pass-through, and the E2E test confirms the pipeline accepts the parameters without error. I'll drop this suggestion.
✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1106
File: e2e_test/responses/test_sampling_params.py:20-47
Timestamp: 2026-04-15T04:44:14.643Z
Learning: In repo lightseekorg/smg, `ResponsesResponse` proto only echoes back `temperature`, `top_p`, and `max_output_tokens`. Extension sampling fields (`top_k`, `min_p`, `frequency_penalty`, `presence_penalty`, `repetition_penalty`) are **not** part of the response schema and are therefore not present in `resp.*` in E2E tests (e.g., `e2e_test/responses/test_sampling_params.py`). Do not flag the absence of assertions for these fields in E2E response-sampling tests; builder pass-through for these fields is validated by unit tests in `crates/grpc_client/src/` instead.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 812
File: e2e_test/embeddings/test_correctness.py:223-223
Timestamp: 2026-03-20T00:00:47.235Z
Learning: In `lightseekorg/smg` PR `#812`, `TestEmbeddingCorrectness` in `e2e_test/embeddings/test_correctness.py` uses the `api_client` fixture without class-level parametrization by design. SmgClient coverage for embedding correctness (HF parity assertions in `test_semantic_similarity` and `test_relevance_scores`) is intentionally opt-in and deferred until SmgClient embedding compatibility is verified. Do not flag the absence of SmgClient parametrization in this test class as a gap.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1122
File: e2e_test/chat_completions/test_validation.py:365-387
Timestamp: 2026-04-14T17:59:46.231Z
Learning: In `lightseekorg/smg` PR `#1122`, `e2e_test/chat_completions/test_validation.py` `TestEosTokenStripping.test_no_stop_trim_with_skip_special_false`: When `no_stop_trim=True` AND `skip_special_tokens=False`, EOS tokens (e.g. `<|eom_id|>`, `<|eot_id|>`, `<|end_of_text|>`) are expected to be **visible** in the decoded output. The test correctly asserts `has_eos == True`. The EOS hidden-stop stripping only applies when `no_stop_trim` is not set; with `no_stop_trim=True`, the stop token is not trimmed, and with `skip_special_tokens=False` it decodes to text. Do not flag this assertion as inverted.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.
Learnt from: ai-jz
Repo: lightseekorg/smg PR: 0
File: :0-0
Timestamp: 2026-04-12T20:47:06.274Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py`: `stop_token_ids` from generation-config defaults should NOT be merged (set-union) with the client's `stop_token_ids` in the servicer. The servicer passes through the client's `stop_token_ids` as-is; the vLLM engine handles generation-config stop behavior internally via `update_from_generation_config()` in `sampling_params.py`, consistent with the HTTP Responses path (`vllm/entrypoints/openai/responses/serving.py`).
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:611-612
Timestamp: 2026-03-04T04:43:55.458Z
Learning: In `lightseekorg/smg` PR `#606`, `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py`: `STOP_SEQUENCE_TRIMMED = True` by default (Harmony/gpt-oss trims stop sequences). In `test_stop_sequences` (non-streaming), both `is_vllm()` and `is_sglang()` override this to `False` because neither backend trims stop sequences in non-streaming mode. In `test_stop_sequences_stream` (streaming), only `is_vllm()` overrides to `False`. This asymmetry is intentional. Do not flag `is_sglang()` in the non-streaming condition as incorrect.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:135-145
Timestamp: 2026-04-04T00:29:40.626Z
Learning: In repo lightseekorg/smg, the gRPC client Channel builders for all backends (SGLang, vLLM, TRT-LLM, and MLX in crates/grpc_client/src/) intentionally omit `.connect_timeout()`. Adding a connect timeout uniformly across all four clients is deferred to a follow-up PR. Do not flag the missing connect_timeout on any individual backend client as a blocking issue.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 964
File: model_gateway/src/routers/grpc/pd_router.rs:233-240
Timestamp: 2026-03-28T04:56:07.632Z
Learning: In repo lightseekorg/smg, `GrpcPDRouter` (model_gateway/src/routers/grpc/pd_router.rs) intentionally passes `&self.retry_config` (the router-level default) to `RetryExecutor::execute_response_with_retry` for ALL endpoints (generate, chat, messages, completion). Per-model retry overrides via `WorkerRegistry::get_retry_config` are NOT applied in the PD router — that pattern is used only in `GrpcRouter` (regular mode). Adding per-model retry support to individual PD endpoints in isolation would be inconsistent; any such change must cover all PD endpoints in a dedicated follow-up PR. Do not flag the missing per-model retry lookup in any single PD router endpoint as a bug.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 556
File: e2e_test/fixtures/hooks.py:239-245
Timestamp: 2026-02-27T19:34:12.191Z
Learning: When implementing or reviewing affinity tracking in e2e tests, prefer deriving backend grouping from aggregated backends specified by pytest parametrize markers rather than per-item backends derived from callspec.params. This mirrors the worker pre-launch pattern and yields a conservative GPU affinity grouping (all variants share a single group) in cases where parametrizations mix local and cloud variants, reducing risk of missed grouping. Apply this guideline to Python e2e test hooks and fixtures files under e2e_test/ (not just a single file).
Signed-off-by: key4ng <rukeyang@gmail.com>
…lder Signed-off-by: key4ng <rukeyang@gmail.com>
…ilder Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
ResponsesRequest exposes a stop field (Option<StringOrArray>), but the vLLM, SGLang, and TRT-LLM Responses builders hardcoded stop: vec![] with a misleading comment claiming the Responses API lacks stop sequence support. Pass the field through, matching the Chat Completions path. Also adds a negative-top_k clamp assertion to the vLLM test per code review. Signed-off-by: key4ng <rukeyang@gmail.com>
Revert stop-sequence pass-through (deferred to follow-up). Add negative top_k clamp test per CodeRabbit review. Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
…T-LLM Signed-off-by: key4ng <rukeyang@gmail.com>
- Use pytest.approx for top_p float comparison (f32 round-trip precision)
- Guard against usage.output_tokens being None
- Increase max_output_tokens to 256 so reasoning models have room for text
- Assert on resp.output (non-empty) instead of resp.output_text (can be empty when reasoning consumes all tokens)

Signed-off-by: key4ng <rukeyang@gmail.com>
Force-pushed from 4540de6 to 501715d (compare)
accidentally create pr from fork repo

new pr: #1161
Description

Problem

The ResponsesRequest builders for the vLLM, SGLang, and TRT-LLM gRPC clients hardcoded top_k, min_p, repetition_penalty, frequency_penalty, and presence_penalty to defaults instead of reading them from the request. Harmony /v1/responses callers that set these sampling parameters received a successful response while the backend silently decoded with default sampling behavior.

The comments claimed ResponsesRequest didn't expose these fields — it does (crates/protocols/src/responses.rs:732-757).

Solution

Replace the hardcoded values in all three builders with the actual request fields, matching each backend's proto conventions:

- vLLM (u32 top_k, 0 = disabled): request.top_k.max(0) as u32
- SGLang (i32 top_k, -1 = disabled, matches the ResponsesRequest default): pass-through
- TRT-LLM (Option<i32> top_k, 0 = disabled per proto comment): Some(request.top_k.max(0)), mirroring the existing build_sampling_config_from_chat pattern

Option-typed penalty fields use .unwrap_or(0.0) for vLLM/SGLang's non-optional proto fields, and pass the Option straight through for TRT-LLM.
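The Option handling for the penalty fields can be sketched as follows; `PenaltyFields` and `to_required` are hypothetical stand-ins for the real ResponsesRequest fields and proto builders:

```rust
// Hypothetical request fragment; the real fields live on ResponsesRequest.
struct PenaltyFields {
    frequency_penalty: Option<f32>,
    presence_penalty: Option<f32>,
}

// vLLM/SGLang protos use non-optional f32 penalty fields, so a missing value
// falls back to 0.0; TRT-LLM's optional fields would take the Option as-is.
fn to_required(penalty: Option<f32>) -> f32 {
    penalty.unwrap_or(0.0)
}

fn main() {
    let req = PenaltyFields {
        frequency_penalty: Some(0.5),
        presence_penalty: None,
    };
    assert_eq!(to_required(req.frequency_penalty), 0.5);
    assert_eq!(to_required(req.presence_penalty), 0.0);
}
```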
Changes

- crates/grpc_client/src/vllm_engine.rs — build_grpc_sampling_params_from_responses: pass through all five params + unit test
- crates/grpc_client/src/sglang_scheduler.rs — same, with SGLang proto sentinels + unit test
- crates/grpc_client/src/trtllm_service.rs — build_sampling_config_from_responses: wrap in Some(...) / pass Option through + unit test

Test Plan

Per-builder unit tests set non-default values on a ResponsesRequest and assert the produced proto carries them through.

Checklist

- cargo +nightly fmt passes
- cargo clippy --all-targets --all-features -- -D warnings passes

Summary by CodeRabbit

Release Notes

- Bug Fixes
- Tests