fix(grpc_client): honor ResponsesRequest sampling params in vLLM/SGLang/TRT-LLM#1106
key4ng wants to merge 10 commits into lightseekorg:main from a fork branch
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough
The PR maps ResponsesRequest sampling fields (top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty) through to gRPC.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 3 passed
Code Review
This pull request updates the gRPC clients for SGLang, TRT-LLM, and vLLM to map sampling parameters such as top_k, min_p, and penalties directly from the ResponsesRequest, replacing previous hardcoded defaults. Corresponding unit tests were added for each client. Feedback indicates that the stop field is currently omitted from the mapping logic and that internal comments incorrectly claim the Responses API lacks stop sequence support.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/grpc_client/src/vllm_engine.rs`:
- Around line 929-953: Add a disabled-case assertion to the existing
test_responses_sampling_params_are_passed_through to lock in the clamp behavior:
create a ResponsesRequest with top_k set to a negative value (e.g., -1), call
VllmEngineClient::build_grpc_sampling_params_from_responses(&request, None) and
assert that the returned params.top_k equals 0 to ensure negative top_k is
clamped to zero; place this check alongside the other assertions in the same
test so the behavior is verified.
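The clamp this suggestion wants locked in can be sketched as follows; `clamp_top_k` is a hypothetical stand-in for the logic inside `VllmEngineClient::build_grpc_sampling_params_from_responses`, whose exact shape may differ:

```rust
// Hypothetical stand-in for the vLLM builder's top_k handling: the proto
// field is u32 with 0 meaning "disabled", so negative sentinels from
// ResponsesRequest must clamp to 0.
fn clamp_top_k(request_top_k: i32) -> u32 {
    request_top_k.max(0) as u32
}

fn main() {
    // The disabled-case assertion the review asks for: -1 clamps to 0.
    assert_eq!(clamp_top_k(-1), 0);
    // Positive values pass through unchanged.
    assert_eq!(clamp_top_k(40), 40);
}
```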
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 74bb011a-0040-4a67-b736-bde1e51634be
📒 Files selected for processing (3)
- crates/grpc_client/src/sglang_scheduler.rs
- crates/grpc_client/src/trtllm_service.rs
- crates/grpc_client/src/vllm_engine.rs
CatherineSue
left a comment
Can we add an e2e test for this?
Hi @key4ng, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4e2a8dbe0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/pr-test-rust.yml:
- Around line 499-503: The workflow condition only checks github.event_name !=
'pull_request', so pull_request_target bypasses the detect-changes gating; update
the conditional that gates running e2e-1gpu-completions to also treat
'pull_request_target' like 'pull_request' (e.g., check github.event_name !=
'pull_request' && github.event_name != 'pull_request_target' or invert to
explicitly allow only non-PR events), ensuring
needs.detect-changes.outputs.completions is consulted; target the same
conditional block referencing github.event_name and
needs.detect-changes.outputs.completions and the e2e-1gpu-completions job.
In `@model_gateway/src/service_discovery.rs`:
- Around line 641-643: When enqueuing Job::RemoveWorker do not use
expected_revision: None; instead read the currently installed worker's revision
and set expected_revision to that value so stale remove jobs become no-ops if
the worker has been replaced. Locate the places creating Job::RemoveWorker (the
snippet with expected_revision: None and the other occurrences around the other
removal sites) and replace expected_revision: None with the current revision
fetched from the registry (e.g., get the WorkerEntry or Worker struct for
old_url and use its revision field), ensuring you handle the case where the
worker is missing by leaving expected_revision as None only if no installed
revision exists.
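A minimal sketch of the suggested guard, assuming a simplified `WorkerEntry` with a `revision` field (the real types in service_discovery.rs will differ):

```rust
// Simplified stand-in for the registry's installed-worker entry.
struct WorkerEntry {
    revision: u64,
}

// Read the currently installed worker's revision so a stale RemoveWorker job
// becomes a no-op if the worker has since been replaced; None only when no
// installed revision exists.
fn expected_revision_for(installed: Option<&WorkerEntry>) -> Option<u64> {
    installed.map(|entry| entry.revision)
}

fn main() {
    let current = WorkerEntry { revision: 7 };
    assert_eq!(expected_revision_for(Some(&current)), Some(7));
    assert_eq!(expected_revision_for(None), None);
}
```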
In `@model_gateway/src/worker/manager.rs`:
- Around line 403-407: The removal job enqueues the concrete DP-aware worker URL
via worker.url(), which for DP workers like "...@rank" prevents
find_workers_by_url() (which matches "{url}@") from finding them; normalize
DP-aware URLs to their base form before creating RemovalCandidate: detect and
strip the "@<suffix>" DP qualifier from worker.url() (use the same normalization
where removal logic at the other occurrence is applied) and push
RemovalCandidate with the base URL and expected_revision so prefix-matching in
find_workers_by_url() will succeed; apply the same change to the other similar
push site referenced in the diff.
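The normalization described above can be sketched as follows; `normalize_dp_url` is illustrative and assumes the only `@` in a DP-aware worker URL is the rank separator:

```rust
// Hypothetical normalization for DP-aware worker URLs of the form
// "<base>@<rank>"; assumes the only '@' is the DP rank separator.
fn normalize_dp_url(url: &str) -> &str {
    match url.rsplit_once('@') {
        Some((base, _rank)) => base,
        None => url,
    }
}

fn main() {
    // A DP-aware URL is reduced to its base so "{url}@" prefix matching
    // in find_workers_by_url() can succeed.
    assert_eq!(normalize_dp_url("http://worker-1:8000@3"), "http://worker-1:8000");
    // Plain URLs are returned unchanged.
    assert_eq!(normalize_dp_url("http://worker-1:8000"), "http://worker-1:8000");
}
```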
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d2b5bd57-ccc1-4dd3-995d-244a417e1dca
📒 Files selected for processing (23)
- .github/workflows/pr-test-rust.yml
- bindings/golang/src/policy.rs
- crates/grpc_client/src/sglang_scheduler.rs
- crates/grpc_client/src/trtllm_service.rs
- crates/grpc_client/src/vllm_engine.rs
- e2e_test/completions/__init__.py
- e2e_test/completions/test_basic.py
- model_gateway/src/observability/metrics_ws/collectors.rs
- model_gateway/src/policies/manual.rs
- model_gateway/src/policies/mod.rs
- model_gateway/src/routers/http/pd_router.rs
- model_gateway/src/server.rs
- model_gateway/src/service_discovery.rs
- model_gateway/src/worker/builder.rs
- model_gateway/src/worker/circuit_breaker.rs
- model_gateway/src/worker/manager.rs
- model_gateway/src/worker/registry.rs
- model_gateway/src/worker/service.rs
- model_gateway/src/worker/worker.rs
- model_gateway/src/workflow/job_queue.rs
- model_gateway/src/workflow/steps/local/find_workers_to_remove.rs
- model_gateway/src/workflow/steps/local/mod.rs
- model_gateway/src/workflow/steps/local/update_worker_properties.rs
Force-pushed from efdc5c7 to 4bea74b (compare)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/grpc_client/src/sglang_scheduler.rs`:
- Around line 997-1022: Add a negative-sentinel assertion to the existing
test_responses_sampling_params_are_passed_through: call
SglangSchedulerClient::build_grpc_sampling_params_from_responses with a
ResponsesRequest where top_k is set to -1 (and other fields as needed or
default), then assert that the produced params.top_k equals -1 to verify the
disabled-sentinel behavior; reference the test function name and
build_grpc_sampling_params_from_responses so you add this extra case in the same
test (or a tiny new test) to lock the backend contract for the sentinel value.
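The sentinel contract this comment wants locked in can be sketched as follows; `sglang_top_k` is a hypothetical stand-in for the top_k handling inside `SglangSchedulerClient::build_grpc_sampling_params_from_responses`:

```rust
// Hypothetical stand-in for the SGLang builder's top_k handling: the proto
// field is i32 with -1 as the "disabled" sentinel, matching the
// ResponsesRequest default, so the value passes through unchanged.
fn sglang_top_k(request_top_k: i32) -> i32 {
    request_top_k
}

fn main() {
    // The sentinel assertion the review asks for: -1 is preserved.
    assert_eq!(sglang_top_k(-1), -1);
    assert_eq!(sglang_top_k(40), 40);
}
```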
In `@crates/grpc_client/src/trtllm_service.rs`:
- Around line 1002-1025: Extend the test for
TrtllmServiceClient::build_sampling_config_from_responses
(test_responses_sampling_config_is_passed_through) to include a negative-top_k
scenario: construct a ResponsesRequest with top_k set to a negative value and
call build_sampling_config_from_responses, then assert that the resulting
cfg.top_k is Some(0) to lock in the disabled-edge behavior (top_k < 0 ->
Some(0)). Ensure you reference the same function and request struct names
(ResponsesRequest and TrtllmServiceClient::build_sampling_config_from_responses)
when adding the new assertion or sub-case in the test.
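The edge behavior to lock in can be sketched as follows; `trtllm_top_k` is a hypothetical stand-in for the top_k handling inside `TrtllmServiceClient::build_sampling_config_from_responses`:

```rust
// Hypothetical stand-in for the TRT-LLM builder's top_k handling: the proto
// field is Option<i32> with 0 meaning "disabled", so negative request
// values clamp to Some(0).
fn trtllm_top_k(request_top_k: i32) -> Option<i32> {
    Some(request_top_k.max(0))
}

fn main() {
    // The disabled-edge behavior: top_k < 0 -> Some(0).
    assert_eq!(trtllm_top_k(-1), Some(0));
    assert_eq!(trtllm_top_k(40), Some(40));
}
```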
In `@e2e_test/responses/test_sampling_params.py`:
- Around line 20-47: In test_all_sampling_params_accepted, add assertions that
the sampling parameters sent in extra_body are echoed back in the response so
propagation is validated; specifically assert that resp contains top_k == 40,
min_p == 0.05, frequency_penalty == 0.5, presence_penalty == 0.3 and
repetition_penalty == 1.1 (access them from the response payload/metadata where
echoed fields live, e.g., resp.extra or resp.<field> depending on the response
shape) in addition to the existing temperature/top_p/max_output_tokens checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 58c3af9b-a860-4d68-acaf-8cc5707ed7a1
📒 Files selected for processing (4)
- crates/grpc_client/src/sglang_scheduler.rs
- crates/grpc_client/src/trtllm_service.rs
- crates/grpc_client/src/vllm_engine.rs
- e2e_test/responses/test_sampling_params.py
```python
def test_all_sampling_params_accepted(self, model, api_client):
    """Non-default sampling params produce a valid response with correct echo-back."""
    resp = api_client.responses.create(
        model=model,
        input="What is 2+2?",
        temperature=0.5,
        top_p=0.9,
        max_output_tokens=32,
        extra_body={
            "top_k": 40,
            "min_p": 0.05,
            "frequency_penalty": 0.5,
            "presence_penalty": 0.3,
            "repetition_penalty": 1.1,
        },
    )
    assert resp.status == "completed"
    assert resp.error is None
    assert len(resp.output_text) > 0

    assert resp.temperature == 0.5
    assert resp.top_p == 0.9
    assert resp.max_output_tokens == 32

    assert resp.usage is not None
    assert resp.usage.output_tokens > 0
    assert resp.usage.output_tokens <= 32
```
🧹 Nitpick | 🔵 Trivial
Strengthen assertions for extra_body sampling fields.
This test sends top_k, min_p, and penalty fields but only asserts echoed temperature/top_p/max_output_tokens. If these fields are present in the response payload, assert them too to validate true end-to-end propagation.
The ResponsesResponse proto only echoes temperature, top_p, and max_output_tokens — the extension fields (top_k, min_p, penalties) are not in the response schema. We can't assert what isn't returned. The unit tests cover the actual builder pass-through; the E2E test confirms the full pipeline doesn't reject them.
@key4ng, that makes sense — if ResponsesResponse doesn't echo those extension fields back, there's nothing to assert at the E2E layer for them. The division of responsibility is clear: unit tests validate builder pass-through, and the E2E test confirms the pipeline accepts the parameters without error. I'll drop this suggestion.
✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1106
File: e2e_test/responses/test_sampling_params.py:20-47
Timestamp: 2026-04-15T04:44:14.643Z
Learning: In repo lightseekorg/smg, `ResponsesResponse` proto only echoes back `temperature`, `top_p`, and `max_output_tokens`. Extension sampling fields (`top_k`, `min_p`, `frequency_penalty`, `presence_penalty`, `repetition_penalty`) are **not** part of the response schema and are therefore not present in `resp.*` in E2E tests (e.g., `e2e_test/responses/test_sampling_params.py`). Do not flag the absence of assertions for these fields in E2E response-sampling tests; builder pass-through for these fields is validated by unit tests in `crates/grpc_client/src/` instead.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 812
File: e2e_test/embeddings/test_correctness.py:223-223
Timestamp: 2026-03-20T00:00:47.235Z
Learning: In `lightseekorg/smg` PR `#812`, `TestEmbeddingCorrectness` in `e2e_test/embeddings/test_correctness.py` uses the `api_client` fixture without class-level parametrization by design. SmgClient coverage for embedding correctness (HF parity assertions in `test_semantic_similarity` and `test_relevance_scores`) is intentionally opt-in and deferred until SmgClient embedding compatibility is verified. Do not flag the absence of SmgClient parametrization in this test class as a gap.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1122
File: e2e_test/chat_completions/test_validation.py:365-387
Timestamp: 2026-04-14T17:59:46.231Z
Learning: In `lightseekorg/smg` PR `#1122`, `e2e_test/chat_completions/test_validation.py` `TestEosTokenStripping.test_no_stop_trim_with_skip_special_false`: When `no_stop_trim=True` AND `skip_special_tokens=False`, EOS tokens (e.g. `<|eom_id|>`, `<|eot_id|>`, `<|end_of_text|>`) are expected to be **visible** in the decoded output. The test correctly asserts `has_eos == True`. The EOS hidden-stop stripping only applies when `no_stop_trim` is not set; with `no_stop_trim=True`, the stop token is not trimmed, and with `skip_special_tokens=False` it decodes to text. Do not flag this assertion as inverted.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.
Learnt from: ai-jz
Repo: lightseekorg/smg PR: 0
File: :0-0
Timestamp: 2026-04-12T20:47:06.274Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py`: `stop_token_ids` from generation-config defaults should NOT be merged (set-union) with the client's `stop_token_ids` in the servicer. The servicer passes through the client's `stop_token_ids` as-is; the vLLM engine handles generation-config stop behavior internally via `update_from_generation_config()` in `sampling_params.py`, consistent with the HTTP Responses path (`vllm/entrypoints/openai/responses/serving.py`).
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:611-612
Timestamp: 2026-03-04T04:43:55.458Z
Learning: In `lightseekorg/smg` PR `#606`, `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py`: `STOP_SEQUENCE_TRIMMED = True` by default (Harmony/gpt-oss trims stop sequences). In `test_stop_sequences` (non-streaming), both `is_vllm()` and `is_sglang()` override this to `False` because neither backend trims stop sequences in non-streaming mode. In `test_stop_sequences_stream` (streaming), only `is_vllm()` overrides to `False`. This asymmetry is intentional. Do not flag `is_sglang()` in the non-streaming condition as incorrect.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:135-145
Timestamp: 2026-04-04T00:29:40.626Z
Learning: In repo lightseekorg/smg, the gRPC client Channel builders for all backends (SGLang, vLLM, TRT-LLM, and MLX in crates/grpc_client/src/) intentionally omit `.connect_timeout()`. Adding a connect timeout uniformly across all four clients is deferred to a follow-up PR. Do not flag the missing connect_timeout on any individual backend client as a blocking issue.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 964
File: model_gateway/src/routers/grpc/pd_router.rs:233-240
Timestamp: 2026-03-28T04:56:07.632Z
Learning: In repo lightseekorg/smg, `GrpcPDRouter` (model_gateway/src/routers/grpc/pd_router.rs) intentionally passes `&self.retry_config` (the router-level default) to `RetryExecutor::execute_response_with_retry` for ALL endpoints (generate, chat, messages, completion). Per-model retry overrides via `WorkerRegistry::get_retry_config` are NOT applied in the PD router — that pattern is used only in `GrpcRouter` (regular mode). Adding per-model retry support to individual PD endpoints in isolation would be inconsistent; any such change must cover all PD endpoints in a dedicated follow-up PR. Do not flag the missing per-model retry lookup in any single PD router endpoint as a bug.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 556
File: e2e_test/fixtures/hooks.py:239-245
Timestamp: 2026-02-27T19:34:12.191Z
Learning: When implementing or reviewing affinity tracking in e2e tests, prefer deriving backend grouping from aggregated backends specified by pytest parametrize markers rather than per-item backends derived from callspec.params. This mirrors the worker pre-launch pattern and yields a conservative GPU affinity grouping (all variants share a single group) in cases where parametrizations mix local and cloud variants, reducing risk of missed grouping. Apply this guideline to Python e2e test hooks and fixtures files under e2e_test/ (not just a single file).
Signed-off-by: key4ng <rukeyang@gmail.com>
…lder Signed-off-by: key4ng <rukeyang@gmail.com>
…ilder Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
ResponsesRequest exposes a stop field (Option<StringOrArray>), but the vLLM, SGLang, and TRT-LLM Responses builders hardcoded stop: vec![] with a misleading comment claiming the Responses API lacks stop sequence support. Pass the field through, matching the Chat Completions path. Also adds a negative-top_k clamp assertion to the vLLM test per code review. Signed-off-by: key4ng <rukeyang@gmail.com>
Revert stop-sequence pass-through (deferred to follow-up). Add negative top_k clamp test per CodeRabbit review. Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
…T-LLM Signed-off-by: key4ng <rukeyang@gmail.com>
- Use pytest.approx for top_p float comparison (f32 round-trip precision)
- Guard against usage.output_tokens being None
- Increase max_output_tokens to 256 so reasoning models have room for text
- Assert on resp.output (non-empty) instead of resp.output_text (can be empty when reasoning consumes all tokens)

Signed-off-by: key4ng <rukeyang@gmail.com>
Force-pushed from 4540de6 to 501715d (compare)
accidentally create pr from fork repo

new pr: #1161
Description

Problem

The ResponsesRequest builders for the vLLM, SGLang, and TRT-LLM gRPC clients hardcoded top_k, min_p, repetition_penalty, frequency_penalty, and presence_penalty to defaults instead of reading them from the request. Harmony /v1/responses callers that set these sampling parameters received a successful response while the backend silently decoded with default sampling behavior.

The comments claimed ResponsesRequest didn't expose these fields — it does (crates/protocols/src/responses.rs:732-757).

Solution

Replace the hardcoded values in all three builders with the actual request fields, matching each backend's proto conventions:

- vLLM (u32 top_k, 0 = disabled): request.top_k.max(0) as u32
- SGLang (i32 top_k, -1 = disabled, matches the ResponsesRequest default): pass-through
- TRT-LLM (Option<i32> top_k, 0 = disabled per proto comment): Some(request.top_k.max(0)), mirroring the existing build_sampling_config_from_chat pattern

Option-typed penalty fields use .unwrap_or(0.0) for vLLM/SGLang's non-optional proto fields, and pass the Option straight through for TRT-LLM.
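The Option handling for the penalty fields can be sketched as follows; `PenaltyFields` and `to_required` are hypothetical stand-ins for the real ResponsesRequest fields and proto builders:

```rust
// Hypothetical request fragment; the real fields live on ResponsesRequest.
struct PenaltyFields {
    frequency_penalty: Option<f32>,
    presence_penalty: Option<f32>,
}

// vLLM/SGLang protos use non-optional f32 penalty fields, so a missing value
// falls back to 0.0; TRT-LLM's optional fields would take the Option as-is.
fn to_required(penalty: Option<f32>) -> f32 {
    penalty.unwrap_or(0.0)
}

fn main() {
    let req = PenaltyFields {
        frequency_penalty: Some(0.5),
        presence_penalty: None,
    };
    assert_eq!(to_required(req.frequency_penalty), 0.5);
    assert_eq!(to_required(req.presence_penalty), 0.0);
}
```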
Changes

- crates/grpc_client/src/vllm_engine.rs — build_grpc_sampling_params_from_responses: pass through all five params + unit test
- crates/grpc_client/src/sglang_scheduler.rs — same, with SGLang proto sentinels + unit test
- crates/grpc_client/src/trtllm_service.rs — build_sampling_config_from_responses: wrap in Some(...) / pass Option through + unit test

Test Plan

Per-builder unit tests set non-default values on a ResponsesRequest and assert the produced proto carries them through.

Checklist

- cargo +nightly fmt passes
- cargo clippy --all-targets --all-features -- -D warnings passes

Summary by CodeRabbit

Release Notes

- Bug Fixes
- Tests