fix(grpc_client): honor ResponsesRequest sampling params in vLLM/SGLang/TRT-LLM#1106

Closed
key4ng wants to merge 10 commits into lightseekorg:main from key4ng:fix/responses-sampling-params-hardcoded

Conversation


@key4ng key4ng commented Apr 13, 2026

Description

Problem

The ResponsesRequest builders for vLLM, SGLang, and TRT-LLM gRPC clients hardcoded top_k, min_p, repetition_penalty, frequency_penalty, and presence_penalty to defaults instead of reading them from the request. Harmony /v1/responses callers that set these sampling parameters received a successful response while the backend silently decoded with default sampling behavior.

The comments claimed ResponsesRequest didn't expose these fields — it does (crates/protocols/src/responses.rs:732-757).

Solution

Replace the hardcoded values in all three builders with the actual request fields, matching each backend's proto conventions:

  • vLLM (u32 top_k, 0 = disabled): request.top_k.max(0) as u32
  • SGLang (i32 top_k, -1 = disabled, matches ResponsesRequest default): pass-through
  • TRT-LLM (Option<i32> top_k, 0 = disabled per proto comment): Some(request.top_k.max(0)), mirroring the existing build_sampling_config_from_chat pattern

Option-typed penalty fields use .unwrap_or(0.0) for vLLM/SGLang's non-optional proto fields, and pass Option straight through for TRT-LLM.

Changes

  • crates/grpc_client/src/vllm_engine.rs — build_grpc_sampling_params_from_responses: pass through all five params + unit test
  • crates/grpc_client/src/sglang_scheduler.rs — same, with SGLang proto sentinels + unit test
  • crates/grpc_client/src/trtllm_service.rs — build_sampling_config_from_responses: wrap in Some(...) / pass Option through + unit test

Test Plan

Per-builder unit tests set non-default values on a ResponsesRequest and assert the produced proto carries them through.

cargo test -p smg-grpc-client
# test result: ok. 49 passed; 0 failed

cargo clippy -p smg-grpc-client --all-targets --all-features -- -D warnings
# clean

cargo +nightly fmt --all -- --check
# clean
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Sampling parameters (top_k, min_p, repetition_penalty, frequency_penalty, presence_penalty) are now passed through from requests to inference backends, fixing prior defaults and clamping behavior where applicable.
  • Tests

    • Added unit and end-to-end tests verifying sampling parameters are reflected in generated and streaming responses, including determinism checks for temperature=0.

@github-actions github-actions bot added the grpc (gRPC client and router changes) label Apr 13, 2026

coderabbitai bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


The PR maps ResponsesRequest sampling fields (top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty) through to gRPC SamplingParams/SamplingConfig for SGLang, vLLM, and TRT-LLM backends and adds unit and e2e tests validating the pass-through.

Changes

  • gRPC backend mapping — crates/grpc_client/src/sglang_scheduler.rs, crates/grpc_client/src/vllm_engine.rs, crates/grpc_client/src/trtllm_service.rs: replace hardcoded/default sampling values with direct mapping from ResponsesRequest into proto::SamplingParams/SamplingConfig; clamp/convert top_k, pass through min_p and repetition_penalty, use unwrap_or(0.0) for optional penalties; preserve empty stop vectors (note: stop not yet passed). Unit tests added per file.
  • End-to-end tests — e2e_test/responses/test_sampling_params.py: new e2e test module with _SamplingParamsBase verifying sampling params are accepted/echoed, determinism at temperature=0, and streaming behavior; two concrete test classes target local and multi-engine (harmony) configurations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • slin1237
  • XinyueZhang369

Poem

🐰 I hopped through code to make params clear,
top_k and penalties now travel near.
Tests watch the path, both unit and end,
So requests and backends agree, my friend. 🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)
  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title accurately and concisely summarizes the main change: fixing gRPC client builders to pass through ResponsesRequest sampling parameters in vLLM, SGLang, and TRT-LLM backends.
  • Docstring Coverage — Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Comment @coderabbitai help to get the list of available commands and usage tips.

slin1237 previously approved these changes Apr 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the gRPC clients for SGLang, TRT-LLM, and vLLM to map sampling parameters such as top_k, min_p, and penalties directly from the ResponsesRequest, replacing previous hardcoded defaults. Corresponding unit tests were added for each client. Feedback indicates that the stop field is currently omitted from the mapping logic and that internal comments incorrectly claim the Responses API lacks stop sequence support.

Comment thread crates/grpc_client/src/sglang_scheduler.rs
Comment thread crates/grpc_client/src/vllm_engine.rs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/src/vllm_engine.rs`:
- Around line 929-953: Add a disabled-case assertion to the existing
test_responses_sampling_params_are_passed_through to lock in the clamp behavior:
create a ResponsesRequest with top_k set to a negative value (e.g., -1), call
VllmEngineClient::build_grpc_sampling_params_from_responses(&request, None) and
assert that the returned params.top_k equals 0 to ensure negative top_k is
clamped to zero; place this check alongside the other assertions in the same
test so the behavior is verified.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74bb011a-0040-4a67-b736-bde1e51634be

📥 Commits

Reviewing files that changed from the base of the PR and between d954c14 and cda9294.

📒 Files selected for processing (3)
  • crates/grpc_client/src/sglang_scheduler.rs
  • crates/grpc_client/src/trtllm_service.rs
  • crates/grpc_client/src/vllm_engine.rs

Comment thread crates/grpc_client/src/vllm_engine.rs

@CatherineSue CatherineSue left a comment


Can we add an e2e test for this?

@slin1237 slin1237 self-requested a review April 14, 2026 02:44
@slin1237 slin1237 dismissed their stale review April 14, 2026 02:45

pending e2e tests

@github-actions github-actions bot added the ci (CI/CD configuration changes), tests (Test changes), and model-gateway (Model gateway crate changes) labels Apr 15, 2026

mergify bot commented Apr 15, 2026

Hi @key4ng, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase (PR has merge conflicts that need to be resolved) label Apr 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4e2a8dbe0


Comment thread model_gateway/src/worker/manager.rs
Comment thread .github/workflows/pr-test-rust.yml

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/pr-test-rust.yml:
- Around line 499-503: The workflow condition treats only github.event_name !=
'pull_request' so pull_request_target bypasses the detect-changes gating; update
the conditional that gates running e2e-1gpu-completions to also treat
'pull_request_target' like 'pull_request' (e.g., check github.event_name !=
'pull_request' && github.event_name != 'pull_request_target' or invert to
explicitly allow only non-PR events), ensuring
needs.detect-changes.outputs.completions is consulted; target the same
conditional block referencing github.event_name and
needs.detect-changes.outputs.completions and the e2e-1gpu-completions job.

In `@model_gateway/src/service_discovery.rs`:
- Around line 641-643: When enqueuing Job::RemoveWorker do not use
expected_revision: None; instead read the currently installed worker's revision
and set expected_revision to that value so stale remove jobs become no-ops if
the worker has been replaced. Locate the places creating Job::RemoveWorker (the
snippet with expected_revision: None and the other occurrences around the other
removal sites) and replace expected_revision: None with the current revision
fetched from the registry (e.g., get the WorkerEntry or Worker struct for
old_url and use its revision field), ensuring you handle the case where the
worker is missing by leaving expected_revision as None only if no installed
revision exists.

In `@model_gateway/src/worker/manager.rs`:
- Around line 403-407: The removal job enqueues the concrete DP-aware worker URL
via worker.url(), which for DP workers like "...@rank" prevents
find_workers_by_url() (which matches "{url}@") from finding them; normalize
DP-aware URLs to their base form before creating RemovalCandidate: detect and
strip the "@<suffix>" DP qualifier from worker.url() (use the same normalization
where removal logic at the other occurrence is applied) and push
RemovalCandidate with the base URL and expected_revision so prefix-matching in
find_workers_by_url() will succeed; apply the same change to the other similar
push site referenced in the diff.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d2b5bd57-ccc1-4dd3-995d-244a417e1dca

📥 Commits

Reviewing files that changed from the base of the PR and between 201eb3d and c4e2a8d.

📒 Files selected for processing (23)
  • .github/workflows/pr-test-rust.yml
  • bindings/golang/src/policy.rs
  • crates/grpc_client/src/sglang_scheduler.rs
  • crates/grpc_client/src/trtllm_service.rs
  • crates/grpc_client/src/vllm_engine.rs
  • e2e_test/completions/__init__.py
  • e2e_test/completions/test_basic.py
  • model_gateway/src/observability/metrics_ws/collectors.rs
  • model_gateway/src/policies/manual.rs
  • model_gateway/src/policies/mod.rs
  • model_gateway/src/routers/http/pd_router.rs
  • model_gateway/src/server.rs
  • model_gateway/src/service_discovery.rs
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/circuit_breaker.rs
  • model_gateway/src/worker/manager.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/service.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/job_queue.rs
  • model_gateway/src/workflow/steps/local/find_workers_to_remove.rs
  • model_gateway/src/workflow/steps/local/mod.rs
  • model_gateway/src/workflow/steps/local/update_worker_properties.rs

Comment thread .github/workflows/pr-test-rust.yml
Comment thread model_gateway/src/service_discovery.rs
Comment thread model_gateway/src/worker/manager.rs
@key4ng key4ng force-pushed the fix/responses-sampling-params-hardcoded branch from efdc5c7 to 4bea74b Compare April 15, 2026 04:28
@github-actions github-actions bot removed the ci (CI/CD configuration changes) and model-gateway (Model gateway crate changes) labels Apr 15, 2026
@mergify mergify bot removed the needs-rebase (PR has merge conflicts that need to be resolved) label Apr 15, 2026

key4ng commented Apr 15, 2026

Note: the stop sequence pass-through flagged by @gemini-code-assist is intentionally deferred to a follow-up PR. This PR scopes to the five sampling parameters (top_k, min_p, repetition_penalty, frequency_penalty, presence_penalty). The stop field is a separate fix with its own test surface.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/src/sglang_scheduler.rs`:
- Around line 997-1022: Add a negative-sentinel assertion to the existing
test_responses_sampling_params_are_passed_through: call
SglangSchedulerClient::build_grpc_sampling_params_from_responses with a
ResponsesRequest where top_k is set to -1 (and other fields as needed or
default), then assert that the produced params.top_k equals -1 to verify the
disabled-sentinel behavior; reference the test function name and
build_grpc_sampling_params_from_responses so you add this extra case in the same
test (or a tiny new test) to lock the backend contract for the sentinel value.

In `@crates/grpc_client/src/trtllm_service.rs`:
- Around line 1002-1025: Extend the test for
TrtllmServiceClient::build_sampling_config_from_responses
(test_responses_sampling_config_is_passed_through) to include a negative-top_k
scenario: construct a ResponsesRequest with top_k set to a negative value and
call build_sampling_config_from_responses, then assert that the resulting
cfg.top_k is Some(0) to lock in the disabled-edge behavior (top_k < 0 ->
Some(0)). Ensure you reference the same function and request struct names
(ResponsesRequest and TrtllmServiceClient::build_sampling_config_from_responses)
when adding the new assertion or sub-case in the test.

In `@e2e_test/responses/test_sampling_params.py`:
- Around line 20-47: In test_all_sampling_params_accepted, add assertions that
the sampling parameters sent in extra_body are echoed back in the response so
propagation is validated; specifically assert that resp contains top_k == 40,
min_p == 0.05, frequency_penalty == 0.5, presence_penalty == 0.3 and
repetition_penalty == 1.1 (access them from the response payload/metadata where
echoed fields live, e.g., resp.extra or resp.<field> depending on the response
shape) in addition to the existing temperature/top_p/max_output_tokens checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 58c3af9b-a860-4d68-acaf-8cc5707ed7a1

📥 Commits

Reviewing files that changed from the base of the PR and between c4e2a8d and 4bea74b.

📒 Files selected for processing (4)
  • crates/grpc_client/src/sglang_scheduler.rs
  • crates/grpc_client/src/trtllm_service.rs
  • crates/grpc_client/src/vllm_engine.rs
  • e2e_test/responses/test_sampling_params.py

Comment thread crates/grpc_client/src/sglang_scheduler.rs
Comment thread crates/grpc_client/src/trtllm_service.rs
Comment on lines +20 to +47
def test_all_sampling_params_accepted(self, model, api_client):
    """Non-default sampling params produce a valid response with correct echo-back."""
    resp = api_client.responses.create(
        model=model,
        input="What is 2+2?",
        temperature=0.5,
        top_p=0.9,
        max_output_tokens=32,
        extra_body={
            "top_k": 40,
            "min_p": 0.05,
            "frequency_penalty": 0.5,
            "presence_penalty": 0.3,
            "repetition_penalty": 1.1,
        },
    )
    assert resp.status == "completed"
    assert resp.error is None
    assert len(resp.output_text) > 0

    assert resp.temperature == 0.5
    assert resp.top_p == 0.9
    assert resp.max_output_tokens == 32

    assert resp.usage is not None
    assert resp.usage.output_tokens > 0
    assert resp.usage.output_tokens <= 32


@coderabbitai coderabbitai bot Apr 15, 2026


🧹 Nitpick | 🔵 Trivial

Strengthen assertions for extra_body sampling fields.

This test sends top_k, min_p, and penalty fields but only asserts echoed temperature/top_p/max_output_tokens. If these fields are present in the response payload, assert them too to validate true end-to-end propagation.


key4ng (Collaborator, Author) replied:

The ResponsesResponse proto only echoes temperature, top_p, and max_output_tokens — the extension fields (top_k, min_p, penalties) are not in the response schema. We can't assert what isn't returned. The unit tests cover the actual builder pass-through; the E2E test confirms the full pipeline doesn't reject them.

coderabbitai bot replied:

@key4ng, that makes sense — if ResponsesResponse doesn't echo those extension fields back, there's nothing to assert at the E2E layer for them. The division of responsibility is clear: unit tests validate builder pass-through, and the E2E test confirms the pipeline accepts the parameters without error. I'll drop this suggestion.


✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1106
File: e2e_test/responses/test_sampling_params.py:20-47
Timestamp: 2026-04-15T04:44:14.643Z
Learning: In repo lightseekorg/smg, `ResponsesResponse` proto only echoes back `temperature`, `top_p`, and `max_output_tokens`. Extension sampling fields (`top_k`, `min_p`, `frequency_penalty`, `presence_penalty`, `repetition_penalty`) are **not** part of the response schema and are therefore not present in `resp.*` in E2E tests (e.g., `e2e_test/responses/test_sampling_params.py`). Do not flag the absence of assertions for these fields in E2E response-sampling tests; builder pass-through for these fields is validated by unit tests in `crates/grpc_client/src/` instead.


key4ng added 6 commits April 15, 2026 15:51
ResponsesRequest exposes a stop field (Option<StringOrArray>), but the vLLM,
SGLang, and TRT-LLM Responses builders hardcoded stop: vec![] with a
misleading comment claiming the Responses API lacks stop sequence support.
Pass the field through, matching the Chat Completions path.

Also adds a negative-top_k clamp assertion to the vLLM test per code review.

Signed-off-by: key4ng <rukeyang@gmail.com>
Revert stop-sequence pass-through (deferred to follow-up).
Add negative top_k clamp test per CodeRabbit review.

Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
…T-LLM

Signed-off-by: key4ng <rukeyang@gmail.com>
- Use pytest.approx for top_p float comparison (f32 round-trip precision)
- Guard against usage.output_tokens being None
- Increase max_output_tokens to 256 so reasoning models have room for text
- Assert on resp.output (non-empty) instead of resp.output_text (can be
  empty when reasoning consumes all tokens)

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng key4ng force-pushed the fix/responses-sampling-params-hardcoded branch from 4540de6 to 501715d Compare April 15, 2026 22:51

key4ng commented Apr 15, 2026

Accidentally created this PR from a fork repo.

@key4ng key4ng closed this Apr 15, 2026

key4ng commented Apr 15, 2026

new pr: #1161


Labels

  • grpc — gRPC client and router changes
  • tests — Test changes


3 participants