
feat(grpc): support vLLM --generation-config in gRPC mode#1103

Closed
ai-jz wants to merge 9 commits into lightseekorg:main from ai-jz:feat/grpc-generation-config

Conversation

Collaborator

@ai-jz ai-jz commented Apr 12, 2026

Motivation

vLLM's --generation-config flag (default: auto) reads the model's generation_config.json — a model artifact shipped alongside weights and chat template — and applies the model author's recommended sampling parameters (temperature, top_p, top_k, etc.) when the API request doesn't specify them.

This is already supported in the HTTP serving path. This PR extends the same support to the gRPC path, so the same --generation-config flag produces identical sampling behavior regardless of whether the request reaches vLLM via HTTP or gRPC. All three modes are supported: auto, vllm, and custom path.

Description

Problem

The gRPC path does not yet support --generation-config. The HTTP path reads the model's generation_config.json and applies defaults (e.g., temperature=0.6 for Qwen3/Llama-3) for unset request fields. The gRPC path uses vLLM's neutral defaults (temperature=1.0, top_p=1.0) regardless of the flag value.

Root cause spans three layers:

  1. Proto: top_p, top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty are plain proto3 fields — the zero value is indistinguishable from "not set"
  2. Rust client: unwrap_or(default) fills neutral values before sending, losing the "user didn't set this" signal
  3. Python servicer: never calls model_config.get_diff_sampling_param(), hardcodes fallbacks
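
Layer 2 is the subtlest of the three. A toy Python stand-in (not the actual Rust code) shows how an unwrap_or-style fallback erases the unset signal, while passing the option through preserves it; the neutral value here is illustrative:

```python
from typing import Optional

NEUTRAL_TOP_P = 1.0  # illustrative engine-neutral default

def pre_patch_top_p(user_value: Optional[float]) -> float:
    """unwrap_or(1.0) behavior: an omitted value is replaced before it
    reaches the wire, so the servicer sees an explicit 1.0 either way."""
    return user_value if user_value is not None else NEUTRAL_TOP_P

def post_patch_top_p(user_value: Optional[float]) -> Optional[float]:
    """Pass the option through: None survives to the proto optional
    field, where HasField() reports it as unset."""
    return user_value
```

With the pre-patch builder, an omitted value and an explicit 1.0 are indistinguishable downstream; with the post-patch builder, None reaches the servicer intact.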

Solution

Propagate the "unset" signal from client to servicer across all three layers:

  1. Proto: make the 6 fields optional — enables HasField() detection
  2. Rust client: pass Option through instead of unwrap_or(default) in all 5 sampling param builders
  3. Python servicer: call get_diff_sampling_param() at init (same as HTTP path), use HasField() fallbacks for all fields
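
The three steps above can be sketched end to end in Python. This is a minimal stand-in, not the servicer code: dict membership plays the role of proto3 HasField(), model_defaults plays the role of the dict returned by get_diff_sampling_param(), and the neutral values are illustrative:

```python
# Engine-neutral fallbacks used when neither the request nor the model's
# generation_config.json supplies a value (illustrative numbers).
NEUTRAL = {"temperature": 1.0, "top_p": 1.0, "top_k": 0, "min_p": 0.0}

def merge_sampling_params(request: dict, model_defaults: dict) -> dict:
    """Request value wins when the field is present on the wire; otherwise
    fall back to the model default, then the neutral engine default.
    `field in request` stands in for proto HasField()."""
    return {
        field: request[field] if field in request
        else model_defaults.get(field, neutral)
        for field, neutral in NEUTRAL.items()
    }
```

An explicit zero survives the merge because presence, not value, decides the branch.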

Context-length enforcement (max_tokens clamping) and Harmony stop_token_ids injection are handled by vLLM's engine internally (input_processor.py), so the servicer does not duplicate that logic.

Changes

File / What
vllm_engine.proto: 6 sampling fields → optional (enables HasField detection)
vllm_engine.rs: all 5 builders pass Option through; 3 new + 2 updated unit tests
servicer.py: load generation-config defaults at init, apply via HasField fallbacks

Design Decisions

Why server-side defaults (in servicer) over client-side merge (in Rust gateway):

The engine process is authoritative for default resolution — it owns generation_config.json parsing, --generation-config mode selection, and model-type-aware logic. vLLM exposes all of this through get_diff_sampling_param() which the servicer reuses directly.

Making the proto fields optional also ensures omission semantics live in the gRPC request contract itself — which benefits any future non-gateway gRPC clients.

SGLang support will be addressed in a separate PR.

Other notes:

  • Proto optional is wire-compatible — rolling upgrade safe. Old clients always set fields explicitly, so HasField() = true and behavior is unchanged.
  • Context-length enforcement is left to vLLM's engine (input_processor.py) — not duplicated in the servicer.

Test Plan

Existing tests (regression):

Suite / Count / Result
cargo test -p smg-grpc-client: 48/48 pass (2 existing tests updated to match Option types)
e2e test_openai_server.py (openai-grpc): 17/17 pass, no regressions

New tests (added in this PR):

Suite / Count / Result
Rust proto optional semantics (3 new tests): 3/3 pass

Checklist
  • Proto change is wire-compatible (optional addition)
  • All 5 Rust sampling param builders updated
  • Existing Rust tests updated for Option types
  • New Rust tests for optional semantics
  • Python servicer matches HTTP path logic
  • Full existing e2e suite passes (no regressions)
  • Rolling upgrade safe (old client → new servicer)

Summary by CodeRabbit

Release Notes

  • New Features
    • Sampling parameters are now optional in requests—users can omit them instead of specifying explicit values.
    • When sampling parameters are not provided, the system now applies model-specific default values for smarter generation behavior out of the box.

@github-actions github-actions Bot added the "grpc" (gRPC client and router changes) and "tests" (Test changes) labels on Apr 12, 2026

coderabbitai Bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e29f8324-01f8-4755-958a-f49a11bb660f

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee8b93 and c70ef85.

📒 Files selected for processing (1)
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

📝 Walkthrough

Walkthrough

The PR modifies sampling parameter handling across proto definitions, Rust client code, and Python servicer to distinguish between unset and explicitly-set values. Proto fields are marked optional, client code propagates Option values instead of unwrapping with defaults, and servicer now applies model default sampling parameters when proto fields are unset.

Changes

Cohort / File(s) Summary
Proto Definition
crates/grpc_client/proto/vllm_engine.proto
Made top_p, top_k, min_p, frequency_penalty, presence_penalty, and repetition_penalty fields optional in SamplingParams message to distinguish unset from explicitly set values.
Rust Client
crates/grpc_client/src/vllm_engine.rs
Updated sampling parameter builders for ChatCompletion, Completion, and Generate to propagate Option values directly to proto instead of unwrapping with defaults; removed hardcoded defaults in plain sampling builder and added test assertions for None handling.
Python Servicer
grpc_servicer/smg_grpc_servicer/vllm/servicer.py
Extended _sampling_params_from_proto to accept model default sampling parameters and apply them when proto fields are unset; updated Generate to load and pass model defaults into parameter construction.

Sequence Diagram

sequenceDiagram
    actor Client
    participant gRPC
    participant Servicer
    participant ModelConfig
    
    Client->>gRPC: Send SamplingParams (some fields unset)
    gRPC->>Servicer: Receive proto message with optional fields
    Servicer->>ModelConfig: Load model default sampling params
    ModelConfig-->>Servicer: Return defaults dict
    Servicer->>Servicer: For each sampling field:<br/>if unset in proto → use default<br/>else → use proto value
    Servicer-->>Client: Apply merged params to generation

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • slin1237
  • njhill
  • key4ng

Poem

🐰 Oh what joy, these params now optional!
No more forced defaults, it's devotional—
Unset stays None, explicit stays true,
Model defaults flourish when needed by you!

🚥 Pre-merge checks: ✅ 3 passed
Description check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
Title check: ✅ Passed. The PR title clearly and specifically describes the main change: enabling --generation-config support in gRPC mode, which is the core objective of this pull request.
Docstring coverage: ✅ Passed. Docstring coverage is 87.50%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

mergify Bot commented Apr 12, 2026

Hi @ai-jz, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@ai-jz ai-jz force-pushed the feat/grpc-generation-config branch from 7fb8355 to 55ec145 (April 12, 2026 02:49)
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the gRPC interface to support model-specific generation configuration defaults by making sampling parameters optional in the protobuf definition and the Rust client. The Python servicer now correctly loads and applies these defaults when fields are omitted in the request, ensuring parity with the HTTP interface. Feedback was provided regarding the Rust client to use a safer numeric conversion for the top_k field to handle potential overflows or negative values.

Comment thread crates/grpc_client/src/vllm_engine.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7fb8355f2b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +584 to +588
top_k=params.top_k if params.HasField("top_k")
else defaults.get("top_k", 0),
min_p=params.min_p if params.HasField("min_p")
else defaults.get("min_p", 0.0),
frequency_penalty=params.frequency_penalty if params.HasField("frequency_penalty")


P1 Badge Preserve zero-valued sampling fields from legacy clients

This fallback now interprets missing sampling fields as "unset" and pulls model defaults, but proto3 legacy clients (built before these fields were optional) omit scalar default values (0/0.0) on the wire. In a mixed-version rollout, explicit zero-valued settings like top_k=0, min_p=0, or zero penalties from older clients will arrive as absent and be replaced by generation-config defaults, changing generation behavior unexpectedly.

Useful? React with 👍 / 👎.

Comment on lines +570 to +574
if max_model_len is not None and input_length > 0:
model_max = max_model_len - input_length
candidates = [v for v in (model_max, raw_max_tokens, override_max_tokens) if v is not None]
max_tokens = min(candidates) if candidates else None
else:


P2 Badge Apply override max_tokens when prompt is text input

override_max_tokens is only applied inside the input_length > 0 branch, so requests using the text input oneof (or empty tokenized input) skip that branch and ignore the configured override cap. In those cases max_tokens falls back to raw_max_tokens only, which can bypass override_generation_config.max_new_tokens limits.

Useful? React with 👍 / 👎.

Collaborator Author

ai-jz commented Apr 12, 2026

Review Response

Addressed findings from Codex and CodeRabbit reviews in e73a892:

Addressed

1. [HIGH] Safe top_k conversion (vllm_engine.rs:586)
Fixed — changed request.top_k to request.top_k.and_then(|v| u32::try_from(v).ok()) in the Messages API builder. This safely drops out-of-range values instead of casting, consistent with the other builders that use .max(0) as u32.

2. [P2] override_max_tokens skipped for text input (servicer.py:574)
Fixed — the else branch (text input / empty tokenized) now applies override_max_tokens as a cap, not just raw_max_tokens.
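
The clamping behavior after this fix can be sketched as follows. The helper name and argument shapes are hypothetical (the real logic lives in servicer.py), but the ordering mirrors the description: the override cap applies on both branches, while the remaining-context cap applies only when a tokenized prompt length is known:

```python
def clamp_max_tokens(raw_max_tokens, override_max_tokens,
                     max_model_len, input_length):
    """Return the tightest applicable cap on max_tokens, or None if
    nothing constrains it. (Hypothetical helper sketching the fix.)"""
    candidates = [v for v in (raw_max_tokens, override_max_tokens)
                  if v is not None]
    if max_model_len is not None and input_length > 0:
        # Remaining-context cap only when a tokenized prompt length exists.
        candidates.append(max_model_len - input_length)
    return min(candidates) if candidates else None
```

For a text-input request (input_length of 0), the override cap still binds instead of being silently skipped.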

Acknowledged (no change needed)

3. [P1] Legacy client zero-value concern (servicer.py:588)
This describes a theoretical risk during rolling upgrade where old proto3 clients omit zero-valued scalars on the wire. In practice, old SMG Rust clients always send explicit values via unwrap_or(1.0) / unwrap_or(0) — they never construct Some(0.0) for top_p or Some(0.0) for repetition_penalty. So HasField() returns true for all fields from old clients, and model defaults are never applied. The behavior is identical to pre-patch. See the rolling-upgrade note under "Other notes" in the PR description.
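
That argument can be simulated directly. In this sketch (plain Python stand-ins, not the real client or servicer), the old client's unwrap_or behavior makes every field present on the wire, so a presence-aware merge returns exactly what the old client sent and the model defaults never fire:

```python
def old_client_wire_params(top_p=None, top_k=None):
    """Pre-patch Rust client behavior: unwrap_or() substitutes neutral
    values for unset options, so every field is explicitly set."""
    return {"top_p": top_p if top_p is not None else 1.0,  # unwrap_or(1.0)
            "top_k": top_k if top_k is not None else 0}    # unwrap_or(0)

def servicer_merge(request, model_defaults):
    """New servicer sketch: presence (stand-in for HasField()) decides
    whether the model default applies."""
    return {f: request[f] if f in request else model_defaults.get(f)
            for f in ("top_p", "top_k")}
```

With an old client, the merged result equals the wire params verbatim, matching pre-patch behavior even when model defaults exist.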

ai-jz added 2 commits April 11, 2026 19:58
Support all three --generation-config modes (auto, vllm, custom path)
in the vLLM gRPC serving path, achieving parity with the HTTP path.

Previously, the gRPC servicer ignored the model generation_config.json
and hardcoded neutral defaults (temperature=1.0, top_p=1.0, etc.). Now
it reads the same defaults as the HTTP path via get_diff_sampling_param()
and applies them when request fields are unset.

Three-layer change:
- Proto: make top_p, top_k, min_p, frequency_penalty, presence_penalty,
  repetition_penalty optional so HasField can detect unset fields
- Rust client: pass Option through instead of unwrap_or(default) in
  all 5 vLLM sampling param builders, preserving the unset signal
- Python servicer: load model defaults at init, use HasField for all
  fields, clamp max_tokens to remaining context, merge Harmony
  stop_token_ids for gpt_oss models

Tests:
- 3 new + 2 updated Rust unit tests for proto optional semantics
- 4 new e2e tests verifying generation_config defaults in gRPC mode
- All 48 Rust tests pass, 17/17 existing e2e openai-grpc tests pass

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
…or text input

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@ai-jz ai-jz force-pushed the feat/grpc-generation-config branch from e73a892 to 78072f3 (April 12, 2026 02:58)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/chat_completions/test_generation_config.py`:
- Around line 30-38: The test test_default_temperature_applied currently unpacks
client from setup_backend but should use the api_client fixture directly; remove
extracting the client from "_, model_path, client, _ = setup_backend" and
instead use the api_client parameter as the client variable, keeping only
whatever value you still need from setup_backend (e.g., model_path) by adjusting
the unpack to exclude client (or access only model_path), and update any
references to "client" in the test to use "api_client".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 313f1c2f-3415-4515-9770-90dd6fc7646e

📥 Commits

Reviewing files that changed from the base of the PR and between ae14a9c and 55ec145.

📒 Files selected for processing (4)
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/grpc_client/src/vllm_engine.rs
  • e2e_test/chat_completions/test_generation_config.py
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

Comment thread e2e_test/chat_completions/test_generation_config.py Outdated
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@github-actions github-actions Bot added the ci CI/CD configuration changes label Apr 12, 2026
@ai-jz ai-jz force-pushed the feat/grpc-generation-config branch from 0916c74 to e8df77c (April 12, 2026 03:28)
@github-actions github-actions Bot removed the ci CI/CD configuration changes label Apr 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8df77ca9a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

params.max_tokens if params.HasField("max_tokens")
else defaults.get("max_tokens")
)
if max_model_len is not None and input_length > 0:


P2 Badge Clamp max_tokens for zero-length prompts

Include max_model_len in the max_tokens clamp even when input_length == 0. The current guard (input_length > 0) skips the model-length cap for empty tokenized prompts and all text oneof prompts (which pass input_length=0), so raw_max_tokens/override_max_tokens can exceed context length and produce avoidable vLLM validation failures instead of being bounded to remaining capacity.

Useful? React with 👍 / 👎.

- Fix AttributeError when override_generation_config is None: use
  (getattr(..., None) or {}) to coalesce missing and None cases
- Defensive copy of get_diff_sampling_param() to prevent shared-state
  mutation of ModelConfig internals
- Add ValueError guard when input_length exceeds max_model_len
- Clamp max_tokens to context length for text input prompts
- Fix dead fallback: return None when candidates list is empty
- Fix clippy: remove useless u32::try_from(u32) in Messages builder
- Apply ruff format to servicer.py
- Clarify test_default_temperature_applied is a smoke test

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2065fec386

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread grpc_servicer/smg_grpc_servicer/vllm/servicer.py Outdated
optional float top_p = 2;
optional uint32 top_k = 3;
optional float min_p = 4;
optional float frequency_penalty = 5;
Member


I don't recommend changing the types of the sampling parameters here. Making them optional means they default to 0, but the backend engine's default for these sampling parameters could be 1.0.

Collaborator Author


Proto3 optional is specifically what lets us distinguish "not set" from "explicitly 0" — without it, an unset float is silently 0.0 on the wire, and the servicer can't tell if the client meant zero or just didn't specify.

With optional, HasField("top_p") returns False when the field wasn't set, so the servicer falls through to the model's generation_config.json default. When it returns True, the value is used as-is (even if it's 0). The proto-level default is still 0, but the servicer never sees it because it checks HasField() first.

This matches how the HTTP path works — JSON naturally distinguishes a missing key from "top_p": 0. The optional keyword gives the gRPC path the same capability.
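
The HTTP-path analogy can be made concrete with a small sketch (the default value here is illustrative): JSON distinguishes a missing key from an explicit zero, which is exactly the distinction optional + HasField() restores for gRPC:

```python
import json

MODEL_DEFAULT_TOP_P = 0.95  # illustrative generation_config.json value

def effective_top_p(request_body: str) -> float:
    """Missing key -> model default; explicit value (even 0) -> honored.
    Mirrors what HasField() enables on the gRPC side."""
    request = json.loads(request_body)
    return request["top_p"] if "top_p" in request else MODEL_DEFAULT_TOP_P
```

A body of `{}` gets the model default, while `{"top_p": 0}` keeps the explicit zero.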

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
class TestGenerationConfigDefaults:
"""Verify --generation-config defaults apply in gRPC mode."""

def test_default_temperature_applied(self, setup_backend, api_client):
Member


I don't think this test is necessary as it doesn't necessarily test the sampling parameters override.

Collaborator Author


Agreed — removed the test file in 60a723b. The core change is in proto + Rust client + servicer; tests can be added later when there's a way to observe the effective sampling params through the API.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/src/vllm_engine.rs`:
- Around line 917-957: The tests only construct proto::SamplingParams directly
and omit frequency_penalty/presence_penalty/repetition_penalty and thus don't
exercise the builder/conversion paths that regressed; update the tests
(test_sampling_params_optional_preserves_user_values and
test_sampling_params_unset_fields_distinguishable_from_zero) to (1) include
frequency_penalty, presence_penalty, and repetition_penalty in both the user-set
and zero cases, and (2) create and assert equality/inequality by passing those
proto::SamplingParams through the same builder/conversion path used in
production (the SamplingParams builder/converter that consumes
proto::SamplingParams) rather than only using the raw proto struct, so the
translation logic is exercised. Ensure assertions still check Some(value) vs
None and Some(0) distinctions after conversion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd57ef16-dcd2-4d5f-98c0-1f4f7d2884e9

📥 Commits

Reviewing files that changed from the base of the PR and between 2065fec and 335b813.

📒 Files selected for processing (1)
  • crates/grpc_client/src/vllm_engine.rs

Comment on lines +917 to +957
#[test]
fn test_sampling_params_optional_preserves_user_values() {
// When user sets values, they should be preserved as Some(value)
let params = proto::SamplingParams {
temperature: Some(0.6),
top_p: Some(0.95),
top_k: Some(20),
min_p: Some(0.1),
frequency_penalty: Some(0.5),
presence_penalty: Some(0.3),
repetition_penalty: Some(1.2),
max_tokens: Some(100),
..Default::default()
};
assert_eq!(params.temperature, Some(0.6));
assert_eq!(params.top_p, Some(0.95));
assert_eq!(params.top_k, Some(20));
assert_eq!(params.min_p, Some(0.1));
assert_eq!(params.frequency_penalty, Some(0.5));
assert_eq!(params.presence_penalty, Some(0.3));
assert_eq!(params.repetition_penalty, Some(1.2));
assert_eq!(params.max_tokens, Some(100));
}

#[test]
fn test_sampling_params_unset_fields_distinguishable_from_zero() {
// Unset (None) must be different from explicit zero — this is
// the core invariant for --generation-config support
let unset = proto::SamplingParams::default();
let zero = proto::SamplingParams {
temperature: Some(0.0),
top_p: Some(0.0),
top_k: Some(0),
min_p: Some(0.0),
..Default::default()
};
assert_ne!(unset.temperature, zero.temperature);
assert_ne!(unset.top_p, zero.top_p);
assert_ne!(unset.top_k, zero.top_k);
assert_ne!(unset.min_p, zero.min_p);
}

@coderabbitai coderabbitai Bot Apr 12, 2026


🧹 Nitpick | 🔵 Trivial

Exercise the builder paths in the regression tests.

This block proves the generated proto type can represent None vs Some(0), but it never hits the translation paths that regressed in this PR, and it omits frequency_penalty, presence_penalty, and repetition_penalty. A future unwrap_or(...) reintroduction in one of the builders would still leave this suite green.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/grpc_client/src/vllm_engine.rs` around lines 917 - 957, The tests only
construct proto::SamplingParams directly and omit
frequency_penalty/presence_penalty/repetition_penalty and thus don't exercise
the builder/conversion paths that regressed; update the tests
(test_sampling_params_optional_preserves_user_values and
test_sampling_params_unset_fields_distinguishable_from_zero) to (1) include
frequency_penalty, presence_penalty, and repetition_penalty in both the user-set
and zero cases, and (2) create and assert equality/inequality by passing those
proto::SamplingParams through the same builder/conversion path used in
production (the SamplingParams builder/converter that consumes
proto::SamplingParams) rather than only using the raw proto struct, so the
translation logic is exercised. Ensure assertions still check Some(value) vs
None and Some(0) distinctions after conversion.

Collaborator Author


Fair point — the test validates proto-level semantics (None vs Some(0)) but does not exercise the builder translation paths. The existing 48 Rust unit tests and 17 e2e openai-grpc tests cover those paths; this test was specifically for the new optional field behavior. Adding builder-path coverage is a good follow-up.



@ai-jz Understood — the distinction makes sense. The two new tests are purposefully scoped to proto-layer invariants (confirming None is representable and distinguishable from Some(0) after the optional keyword change), which is orthogonal to exercising the builder translation logic already covered by the broader unit and e2e suite. Happy to track the builder-path regression tests as a follow-up if that's the plan.


🧠 Learnings used
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: multimodal/tests/vision_golden_tests.rs:576-578
Timestamp: 2026-02-21T02:39:51.670Z
Learning: Repo lightseekorg/smg — For PR `#489` (clippy/lint-only), do not replace unwrap() with expect(...) in test files when a file-/crate-level #![expect(clippy::unwrap_used)] is present (e.g., multimodal/tests/vision_golden_tests.rs). Such stylistic swaps are out of scope.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: multimodal/tests/vision_golden_tests.rs:402-404
Timestamp: 2026-02-21T02:39:23.481Z
Learning: Repo lightseekorg/smg — For clippy-only/lint-enforcement PRs (e.g., PR `#489`), do not replace unwrap() with expect() across tests/benches when a crate-level `#![expect(clippy::unwrap_used)]` is present. Such per-call swaps are treated as out-of-scope stylistic changes. Example: multimodal/tests/vision_golden_tests.rs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 812
File: e2e_test/embeddings/test_correctness.py:223-223
Timestamp: 2026-03-20T00:00:47.235Z
Learning: In `lightseekorg/smg` PR `#812`, `TestEmbeddingCorrectness` in `e2e_test/embeddings/test_correctness.py` uses the `api_client` fixture without class-level parametrization by design. SmgClient coverage for embedding correctness (HF parity assertions in `test_semantic_similarity` and `test_relevance_scores`) is intentionally opt-in and deferred until SmgClient embedding compatibility is verified. Do not flag the absence of SmgClient parametrization in this test class as a gap.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:251-282
Timestamp: 2026-03-11T23:26:31.100Z
Learning: In repo lightseekorg/smg, the Gemini-specific header path (apply_provider_headers emitting x-goog-api-key for googleapis.com URLs, and extract_gemini_auth_header) is intentionally not covered by the mock-server integration tests in model_gateway/tests/api/interactions_api_test.rs (PR `#723`). Coverage against a real Gemini backend URL will be added in a follow-up e2e test PR. Do not flag the absence of this test coverage as a gap in PR `#723` or related PRs.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.

Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1030
File: crates/tool_parser/src/parsers/deepseek32.rs:352-400
Timestamp: 2026-04-06T20:50:55.263Z
Learning: In `crates/tool_parser/src/parsers/deepseek32.rs` (and other parsers in `crates/tool_parser/src/parsers/`), `prev_tool_call_arr` entries are initialized with `"arguments": {}` (a JSON object), not a string. Code that retrieves the previous arguments via `.as_str()` will get `None` for this initial value. The correct pattern is to handle `prev_args = None` explicitly (e.g., treat it as an empty string and emit from `sent_len`), rather than changing the initialization to `"arguments": ""`. This is a deliberate cross-parser convention — do not flag the `{}` initialization as a bug.
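The convention above can be sketched language-agnostically. This minimal Python sketch mirrors the described pattern only; the real parsers are Rust, and `entry`, `sent_len`, and `new_args_text` are illustrative names, not actual code:

```python
# Illustrative sketch of the cross-parser convention: prev_tool_call_arr entries
# start with "arguments" as a JSON object ({}), so a string lookup yields None.
entry = {"name": "get_weather", "arguments": {}}  # initial value is an object

prev = entry["arguments"]
prev_args = prev if isinstance(prev, str) else None  # analogue of Rust .as_str()
if prev_args is None:
    prev_args = ""  # correct pattern: treat the initial {} as an empty string

sent_len = len(prev_args)  # nothing has been emitted yet
new_args_text = '{"city": "Paris"}'
delta = new_args_text[sent_len:]  # emit from sent_len onward
```

The point is that the initialization stays `{}`; the consumer handles the `None` lookup explicitly rather than changing the initial value to `""`.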

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1006
File: crates/tool_parser/src/parsers/deepseek31.rs:162-181
Timestamp: 2026-04-01T04:14:46.469Z
Learning: In repo lightseekorg/smg, `crates/tool_parser/src/parsers/deepseek31.rs` (and the analogous V3 parser `crates/tool_parser/src/parsers/deepseek.rs` lines 186-197): The `parse_incremental` method does NOT split off a plain-text prefix before the first tool marker within the same chunk. This is intentional because the gRPC streaming path delivers tokens individually, so normal text content and tool-call markers always arrive in separate chunks — a prefix and a tool marker will never coexist in the same chunk. Do not flag the absence of a within-chunk prefix-split as a bug; the `test_deepseek31_streaming_text_before_tools` test covers the realistic multi-chunk case.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 588
File: model_gateway/src/routers/grpc/multimodal.rs:453-514
Timestamp: 2026-03-03T18:03:45.713Z
Learning: In repo lightseekorg/smg, backend assembly functions in model_gateway/src/routers/grpc/multimodal.rs (e.g., assemble_sglang, assemble_vllm, assemble_trtllm) are tested via E2E tests rather than unit tests, as unit tests for these functions are not considered worthwhile.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/routers/grpc/client.rs:189-194
Timestamp: 2026-02-21T02:32:13.396Z
Learning: Repo: lightseekorg/smg PR: 489
File: model_gateway/src/routers/grpc/client.rs
Context: GrpcClient::generate fallback match arm
Learning: The catch-all panic for mismatched client/request types in GrpcClient::generate is intentional to enforce pipeline invariants; do not convert this to a returned error in PR `#489` or similar lint-only changes.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.
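As a rough illustration of the normalization step described above, a minimal Python sketch (the real `normalize_grpc_keys()` is Rust code in `discover_metadata.rs`; the mapping table here is assumed for illustration, not copied from the repo):

```python
# Hypothetical long-form -> short-form key table for vLLM/TRT-LLM labels.
LONG_TO_SHORT = {
    "tensor_parallel_size": "tp_size",
    "pipeline_parallel_size": "pp_size",
}

def normalize_grpc_keys(labels: dict) -> dict:
    # SGLang already emits short-form keys, so unmapped keys pass through as-is.
    return {LONG_TO_SHORT.get(key, key): value for key, value in labels.items()}

normalized = normalize_grpc_keys({"tensor_parallel_size": "2", "model_type": "llama"})
```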

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1031
File: model_gateway/src/routers/grpc/regular/streaming.rs:1579-1600
Timestamp: 2026-04-03T05:31:33.707Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/regular/streaming.rs` (`process_messages_streaming_chunks`): The reasoning parser is gated on explicit user opt-in (`ThinkingConfig::Enabled`) via `separate_reasoning = matches!(&original_request.thinking, Some(ThinkingConfig::Enabled { .. }))`, NOT on `thinking_override` (which only controls whether `mark_reasoning_started()` is called on the parser, not whether the parser runs at all). Chat API uses the `separate_reasoning` field on `ChatCompletionRequest` instead. Do not suggest removing or replacing the `ThinkingConfig::Enabled` gate in the Messages path.

Learnt from: TingtingZhou7
Repo: lightseekorg/smg PR: 1057
File: model_gateway/src/routers/openai/mcp/tool_loop.rs:856-885
Timestamp: 2026-04-08T00:08:05.944Z
Learning: In repo lightseekorg/smg, `sanitize_builtin_tool_arguments` in `model_gateway/src/routers/openai/mcp/tool_loop.rs` intentionally drops image-generation options (size, quality, background, output_format, compression) when handling `ResponseFormat::ImageGenerationCall`, keeping only `model` (hardcoded `IMAGE_MODEL`) and `revised_prompt`. This is a deliberate scoped decision for the initial image-generation tool integration; per-option overrides/defaults are planned for a follow-up PR. Do not flag the truncation as a bug or request preservation of extra fields.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 399
File: protocols/src/interactions.rs:505-509
Timestamp: 2026-02-19T03:08:50.192Z
Learning: In code reviews for Rust projects using the validator crate (v0.20.0), ensure that custom validation functions for numeric primitive types (e.g., f32, i32, u32, i16, etc.) accept the value by value, not by reference. Example: fn validate(value: f32) { ... }. The validator derive macro has a hardcoded list of numeric types that are passed by value, while all other types are passed by reference. Apply this guideline whenever validating numeric fields to align with the derive macro behavior.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/core/token_bucket.rs:58-63
Timestamp: 2026-02-21T02:30:51.443Z
Learning: For lint-only/Clippy enforcement PRs in this repository, avoid introducing behavioral changes (e.g., new input validation or logic changes). Treat such PRs as non-functional changes and plan a separate follow-up issue/PR for hardening or behavior changes. This applies broadly to Rust files across the repo; during review, focus on lint/style corrections and clearly note any intentional exceptions. 

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: protocols/src/responses.rs:928-931
Timestamp: 2026-02-21T02:36:00.882Z
Learning: In Rust code across the repository, use the marker INVARIANT: to document assumptions in safe code. Reserve SAFETY: for explaining why unsafe blocks are sound. This improves clarity of invariants and safety reasoning. Example reference: protocols/src/responses.rs near validate_tool_choice_with_tools().

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: mesh/src/sync.rs:83-83
Timestamp: 2026-02-21T02:37:01.416Z
Learning: General Rust formatting rule: format! with implicit captures only supports simple identifiers, not full expressions like {state.model_id}. For cases where you want to interpolate a field or expression, bind the value first and interpolate the binding, e.g., let model_id = &state.model_id; and then use format!("policy:{}", model_id). In the specific file mesh/src/sync.rs, prefer format!("policy:{}", state.model_id) or bind to a local variable if you need named interpolation, to keep clarity and avoid unintended captures.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 807
File: model_gateway/src/middleware.rs:61-81
Timestamp: 2026-03-18T21:32:00.041Z
Learning: In Rust code using the http crate, HeaderMap::get() is effectively case-insensitive because HeaderName normalizes keys to lowercase on insertion and lookup. Do not require or perform explicit .to_lowercase() before HeaderMap::get() calls. Mark as not a concern for case-sensitivity in lookups; only consider normalization when inserting or comparing via HeaderName, not in lookups.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 867
File: tui/src/app.rs:798-813
Timestamp: 2026-03-22T20:13:55.778Z
Learning: In this repo (lightseekorg/smg), treat the workspace `Cargo.toml`’s `package.rust-version` (MSRV) as the source of truth (e.g., `rust-version = "1.85"`). When reviewing Rust changes, do not flag usage of Rust language/library features that were stabilized on or before the MSRV (e.g., `Option::is_none_or`, stabilized in 1.82, is compatible with an MSRV of 1.85). Always verify the MSRV from the workspace `Cargo.toml` rather than relying on issue templates.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:135-145
Timestamp: 2026-04-04T00:29:38.039Z
Learning: In this repo (lightseekorg/smg), the gRPC Channel builder implementations for the backend clients (SGLang, vLLM, TRT-LLM, and MLX) intentionally omit `.connect_timeout()`. When reviewing code changes in `crates/grpc_client/src/`, do not treat a missing `.connect_timeout()` in these backend client channel builders as a blocking issue; it is planned for a uniform addition in a follow-up PR.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:61-66
Timestamp: 2026-04-04T00:29:37.400Z
Learning: In lightseekorg/smg’s gRPC backend clients that use the `AbortOnDropStream` pattern (e.g., vLLM, SGLang, TRT-LLM, MLX), `poll_next` is intentionally allowed to return `Poll::Ready(None)` on normal EOF without auto-flipping the completed flag. Do not review-fail for the absence of an EOF-driven `mark_completed()`/completed-flag update in these `AbortOnDropStream` implementations. The stream consumer is responsible for calling `mark_completed()` when the stream ends normally; any spurious/cheap idempotent `Abort` RPC that happens when `mark_completed()` wasn’t called before drop on natural EOF is an accepted tradeoff, and should not be treated as a bug.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:250-256
Timestamp: 2026-04-04T00:29:40.508Z
Learning: In lightseekorg/smg’s gRPC backend implementation under crates/grpc_client/src/, tool-calling enforcement is router-side: SMG’s router-side ToolParser parses tool calls from the backend’s text output. Therefore, MLX (and other gRPC backend) builder functions should NOT be required to reject/guard tool-related fields (e.g., tools, tool_choice) at the backend builder boundary just because they are missing. Only reject tool-related content when there’s an explicit, backend-specific structural constraint to enforce at the builder boundary (e.g., response_format / structural_tag or other required schema/structure invariants). When reviewing, do not flag the absence of tool-call guards in MLX/gRPC backend builders as missing enforcement.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 335b8130cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +573 to +577
if input_length > 0:
    model_max = max_model_len - input_length
    if model_max <= 0:
        raise ValueError(
            f"Input length ({input_length}) exceeds max model length ({max_model_len})"

P2: Respect prompt truncation before enforcing context-length guard

The new input_length > 0 guard rejects requests before generation using the raw tokenized prompt length, but truncation is applied later via tokenization_kwargs (truncate_prompt_tokens). For requests where len(input_ids) > max_model_len but truncate_prompt_tokens would bring the prompt under the limit, this now raises INVALID_ARGUMENT even though vLLM could have served the request after truncation. This introduces a regression for direct gRPC callers using prompt truncation and breaks parity with the HTTP flow that computes limits from the effective prompt length.


Collaborator Author

@ai-jz ai-jz Apr 12, 2026


The HTTP path's get_max_tokens() has the same guard — it checks raw input_length before truncation is applied by the engine. Made a minor edit in 4ee8b93 to align the guard condition (< instead of <=) and error message to match the HTTP path exactly.

The test doesn't verify actual sampling parameter values, only that
requests succeed. Remove it to keep the test suite lean — the core
change is in proto + Rust client + servicer.

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@github-actions github-actions Bot removed the tests Test changes label Apr 12, 2026
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ee8b936ec


Comment on lines +583 to +585
candidates = [
    v for v in (model_max, raw_max_tokens, override_max_tokens) if v is not None
]

P2: Clamp max_tokens against platform output limits

The new max_tokens clamp only considers model_max, request/default max_tokens, and override_max_tokens, but it omits vLLM’s platform-specific cap (current_platform.get_max_output_tokens(...)) that HTTP uses in get_max_tokens. On platforms that enforce a lower output-token ceiling than context-based model_max, requests with omitted/large max_tokens can now be forwarded with an oversized value and fail later with avoidable validation/runtime errors instead of being bounded up front.


Collaborator Author


current_platform.get_max_output_tokens() is an internal vLLM entrypoint API — the gRPC servicer runs outside that context and does not have access to it. The engine applies platform limits internally after receiving the request, same as it does for all other entrypoints.

# Harmony model stop tokens — same as chat_completion/serving.py
if mc.hf_config.model_type == "gpt_oss":
    from vllm.entrypoints.openai.parser.harmony_utils import (
        get_stop_tokens_for_assistant_actions,
Member


This is injected in harmony pipeline. No need to inject here

Collaborator Author


Got it — removed in b534cc2.

logger.info("VllmEngineServicer initialized")

# Load model's default sampling params (respects --generation-config)
mc = async_llm.model_config
Member


Can we use full variable names here?

Collaborator Author


Done — renamed to model_config in b534cc2.

mc = async_llm.model_config
self.default_sampling_params: dict = dict(mc.get_diff_sampling_param())

# max_tokens override — same logic as chat_completion/serving.py
Member


Please include a direct URL link to the specific commit.

Collaborator Author


Added — see serving.py#L141 in b534cc2.


Returns:
    vLLM SamplingParams with detokenize=False and structured_outputs
    vLLM SamplingParams with model defaults applied for unset fields
Member

@CatherineSue CatherineSue Apr 12, 2026


The phrase "model defaults" here is misleading. What if the user passes --generation-config vllm or an overridden config?

Collaborator Author


Renamed to "generation-config defaults" in b534cc2; get_diff_sampling_param() covers all three --generation-config modes.
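A minimal sketch of the presence-based fallback this thread discusses, assuming proto3 `optional` fields (so `HasField()` works) and a defaults dict from `get_diff_sampling_param()`. `FakeParams` and `resolve` are illustrative stand-ins, not the PR's actual code:

```python
# Stand-in for the proto message: optional fields report presence via HasField().
class FakeParams:
    def __init__(self, **set_fields):
        self._fields = set_fields

    def HasField(self, name):
        return name in self._fields

    def __getattr__(self, name):
        return self._fields[name]

def resolve(params, defaults, field, neutral):
    """Request value if set, else generation-config default, else vLLM neutral."""
    if params.HasField(field):
        return getattr(params, field)
    return defaults.get(field, neutral)

defaults = {"temperature": 0.6, "top_p": 0.95}  # e.g. from get_diff_sampling_param()
params = FakeParams(top_p=0.8)

temperature = resolve(params, defaults, "temperature", 1.0)  # unset -> default 0.6
top_p = resolve(params, defaults, "top_p", 1.0)              # set -> request 0.8
top_k = resolve(params, defaults, "top_k", -1)               # no default -> neutral -1
```

The same three-way resolution applies regardless of which --generation-config mode produced the defaults dict.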

    params.max_tokens if params.HasField("max_tokens") else defaults.get("max_tokens")
)
if max_model_len is not None:
    if input_length > 0:
Member


Does this come from the vLLM logic? If so, please put the source link.

And it is quite surprising to me that this is not handled in AsyncLLM already. It should be. This is pure backend logic and should not be in the frontend servicer.py.

Collaborator Author


You're right — removed the entire clamping block in b534cc2. The engine handles this in input_processor.py via AsyncLLM.add_request() → process_inputs().

…logic

- Remove Harmony stop_token_ids block (gpt_oss model type check)
- Rename 'mc' to 'model_config' for clarity in __init__
- Add vLLM HTTP path source link comment
- Rename 'model defaults' to 'generation-config defaults' in comment
- Remove max_tokens clamping logic (max_model_len, input_length,
  override_max_tokens) — let vLLM handle context-length enforcement

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@github-actions github-actions Bot added the ci CI/CD configuration changes label Apr 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81822bd7ed


Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +53 to +55
- uses: actions/checkout@v6
  with:
    fetch-depth: 0

P2: Checkout pull-request ref for issue_comment reviews

When this workflow is triggered by issue_comment, actions/checkout without an explicit ref checks out the repository default branch rather than the commented PR branch. That means the Claude run can review the wrong code (or an empty main...HEAD diff) when someone invokes @claude from a PR thread. Set the checkout ref for comment-triggered runs (for example, refs/pull/${{ github.event.issue.number }}/head) so the review analyzes the intended changes.


Collaborator Author


This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

Collaborator Author


This file was included by mistake and has been removed from this PR.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
grpc_servicer/smg_grpc_servicer/vllm/servicer.py (1)

495-500: ⚠️ Potential issue | 🟠 Major

Don't merge stop_token_ids in the servicer.

This union changes request semantics because stop_token_ids is still a repeated proto field with no presence bit: omitted and explicitly empty look identical here, so callers lose any way to clear generation-config stop IDs. It also reintroduces the frontend-side stop-token injection that was already called out earlier.

Suggested fix
-        # Merge stop_token_ids: request + model defaults
         stop_token_ids = list(params.stop_token_ids) if params.stop_token_ids else None
-        default_stop_ids = defaults.get("stop_token_ids")
-        if default_stop_ids:
-            stop_token_ids = list(set(stop_token_ids or []) | set(default_stop_ids))
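The fallback-only semantics suggested above can be sketched as follows (illustrative names, not the exact servicer code): honor the request's stop_token_ids as-is, and only fall back to the generation-config default when the request has none.

```python
# Repeated proto fields have no presence bit, so an unset list and an
# explicitly empty list look identical; fall back, never union.
def resolve_stop_token_ids(request_ids, defaults):
    if request_ids:
        return list(request_ids)  # request wins when provided
    default_ids = defaults.get("stop_token_ids")
    return list(default_ids) if default_ids else None
```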
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 495 - 500, The
current code merges request stop_token_ids with model defaults (using
params.stop_token_ids and defaults.get("stop_token_ids")), which changes caller
semantics; stop_token_ids is a repeated proto without presence so we must not
union them. Change the logic in servicer.py around stop_token_ids so you do NOT
take the set-union: instead honor the request value as-is (use
params.stop_token_ids directly when provided) and only fall back to
defaults.get("stop_token_ids") when the request did not provide any value,
avoiding combining the two lists; update references to stop_token_ids,
params.stop_token_ids, and defaults.get("stop_token_ids") accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/fork-ci-review.yml:
- Around line 53-55: The workflow currently uses actions/checkout@v6 (the two
occurrences of actions/checkout@v6) which is a floating tag contradicting the
repo's SHA-pinning policy; replace each actions/checkout@v6 with the exact
commit SHA pin used for other actions (e.g., actions/checkout@<commit-sha>) or
remove the SHA-pinning claim in the workflow header; update both occurrences of
actions/checkout so they match the pinned format used for
anthropics/claude-code-action and openai/codex-action.
- Around line 44-51: The "Validate secret" step currently exits 0 and does not
prevent downstream steps from running when ANTHROPIC_API_KEY is missing; change
the validation to emit a step output (e.g., set a has_key/has_anthropic output
via GITHUB_OUTPUT) indicating whether the secret exists, and then add if:
conditionals on the subsequent "checkout" and "Claude Code review" steps to run
only when that output is true; apply the same pattern to the codex-review job's
validation step and its dependent steps (use consistent output names like
has_key or has_codex_key and reference them in if: conditions).
- Around line 21-22: The workflow currently triggers on issue_comment and will
run for plain issues and default-branch refs; update the job condition (where
the workflow checks before running) to require github.event.issue.pull_request
!= null so it only runs for PR comments and add a secondary guard ensuring
github.event_name == 'issue_comment' && contains(github.event.comment.body,
'@claude') if applicable; also change the actions/checkout step (the checkout
job using actions/checkout) to pass an explicit ref:
github.event.issue.pull_request.head.ref or github.event.pull_request.head.sha
(use the correct event field in your job) so the job checks out the PR head
instead of the default branch.

---

Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 495-500: The current code merges request stop_token_ids with model
defaults (using params.stop_token_ids and defaults.get("stop_token_ids")), which
changes caller semantics; stop_token_ids is a repeated proto without presence so
we must not union them. Change the logic in servicer.py around stop_token_ids so
you do NOT take the set-union: instead honor the request value as-is (use
params.stop_token_ids directly when provided) and only fall back to
defaults.get("stop_token_ids") when the request did not provide any value,
avoiding combining the two lists; update references to stop_token_ids,
params.stop_token_ids, and defaults.get("stop_token_ids") accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ed101e87-d959-449d-86c6-c498ffd200d3

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee8b93 and 81822bd.

📒 Files selected for processing (2)
  • .github/workflows/fork-ci-review.yml
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +21 to +22
issue_comment:
  types: [created]

@coderabbitai coderabbitai Bot Apr 12, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -type f -name "fork-ci-review.yml" | head -20

Repository: lightseekorg/smg

Length of output: 230


🏁 Script executed:

fd -t f "fork-ci-review.yml"

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

fd -t f "fork" .github/workflows/

Repository: lightseekorg/smg

Length of output: 96


🏁 Script executed:

cat -n .github/workflows/fork-ci-review.yml

Repository: lightseekorg/smg

Length of output: 5504


@claude comment runs are not scoped to pull requests—they can execute on plain issues or use the default branch.

The issue_comment trigger at line 21-22 fires on both pull requests and issues. Without a check for github.event.issue.pull_request != null in the job condition (line 35), a comment on a plain issue can trigger the job. Additionally, actions/checkout at line 53 lacks an explicit ref parameter, so it defaults to the repository's default branch instead of the PR head—even on valid PR comments.

Suggested fix
   claude-review:
     if: >-
       github.actor == 'ai-jz' && (
         github.event_name == 'pull_request' ||
-        (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude'))
+        (
+          github.event_name == 'issue_comment' &&
+          github.event.issue.pull_request != null &&
+          contains(github.event.comment.body, '@claude')
+        )
       )
@@
       - uses: actions/checkout@v6
         with:
           fetch-depth: 0
+          ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || format('refs/pull/{0}/head', github.event.issue.number) }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/fork-ci-review.yml around lines 21 - 22, The workflow
currently triggers on issue_comment and will run for plain issues and
default-branch refs; update the job condition (where the workflow checks before
running) to require github.event.issue.pull_request != null so it only runs for
PR comments and add a secondary guard ensuring github.event_name ==
'issue_comment' && contains(github.event.comment.body, '@claude') if applicable;
also change the actions/checkout step (the checkout job using actions/checkout)
to pass an explicit ref: github.event.issue.pull_request.head.ref or
github.event.pull_request.head.sha (use the correct event field in your job) so
the job checks out the PR head instead of the default branch.

Collaborator Author

This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

coderabbitai Bot

@ai-jz, understood — thanks for the clarification! Glad the findings were useful; feel free to apply them when revisiting that workflow on the fork. No further action needed here.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:42:14.058Z
Learning: In lightseekorg/smg, the self-hosted k8s-runner-gpu and k8s-runner-cpu runners appear to have a pre-checked-out workspace, which allows local composite action references (e.g., uses: ./.github/actions/run-e2e) to resolve without an explicit actions/checkout step. Per standard GitHub Actions behavior, checkout IS required before local `./` composite action references; this repo is an exception due to its persistent self-hosted runner workspace.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 851
File: docs/_audit/reliability-worker.md:216-228
Timestamp: 2026-03-21T06:14:03.701Z
Learning: In repo lightseekorg/smg, files under `docs/_audit/` (e.g., `docs/_audit/reliability-worker.md`) are internal audit checklists used during the review process and are not shipped as user-facing documentation. Do not flag markdown formatting nitpicks (e.g., missing code block language specifiers, lint warnings) in these files — formatting polish is out of scope for them.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Collaborator Author

This file was included by mistake and has been removed from this PR.

Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +44 to +51
```yaml
      - name: Validate secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
            exit 0
          fi
```
@coderabbitai coderabbitai Bot Apr 12, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
fd -t f fork-ci-review.yml
```

Repository: lightseekorg/smg

Length of output: 42

🏁 Script executed:

```shell
ls -la .github/workflows/ | grep -i fork
```

Repository: lightseekorg/smg

Length of output: 128

🏁 Script executed:

```shell
cat -n .github/workflows/fork-ci-review.yml
```

Repository: lightseekorg/smg

Length of output: 5504


Missing secrets do not skip either job; validation steps need output-based conditionals.

The validation steps exit successfully (exit 0), allowing subsequent steps to execute with empty API keys. In GitHub Actions, a step exiting with code 0 is successful; it does not prevent the next step from running. This causes the Claude Code review and Codex review actions to fail with missing credentials instead of cleanly skipping.

Add step outputs to the validation steps and apply if: conditionals to the checkout and review steps to actually skip them when secrets are missing:

Suggested fix
```diff
      - name: Validate secret
+       id: anthropic_secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
+            echo "present=false" >> "$GITHUB_OUTPUT"
             exit 0
           fi
+          echo "present=true" >> "$GITHUB_OUTPUT"

       - uses: actions/checkout@v6
+        if: steps.anthropic_secret.outputs.present == 'true'
         with:
           fetch-depth: 0

       - name: Claude Code review
+        if: steps.anthropic_secret.outputs.present == 'true'
         uses: anthropics/claude-code-action@c7c8889b30499b4e46f4c32b892e43cd364bc2fe  # v1
```

Apply the same pattern to the codex-review job (lines 89–96).
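Why `exit 0` alone cannot skip anything can be shown outside Actions. The sketch below is a local simulation under stated assumptions: it fakes the `$GITHUB_OUTPUT` file with `mktemp` and an empty `KEY`, and shows the step script succeeding while recording `present=false` for a later `if:` guard to consume.

```shell
# Local simulation (not a real runner): fake the $GITHUB_OUTPUT file.
GITHUB_OUTPUT="$(mktemp)"
KEY=""   # simulate a missing ANTHROPIC_API_KEY secret

# Same shape as the suggested validation step: it always exits 0,
# so by itself it cannot stop later steps from running.
if [ -z "$KEY" ]; then
  echo "::warning::ANTHROPIC_API_KEY not set"
  echo "present=false" >> "$GITHUB_OUTPUT"
else
  echo "present=true" >> "$GITHUB_OUTPUT"
fi
status=$?

echo "step exit status: $status"   # 0 -> the step is "successful"
cat "$GITHUB_OUTPUT"               # present=false -> what an if: guard would read
```

In a real workflow, a downstream step would then gate on `steps.<id>.outputs.present == 'true'`; that conditional, not the exit code, is what makes the skip happen.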

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

```yaml
      - name: Validate secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
            exit 0
          fi
```

After:

```yaml
      - name: Validate secret
        id: anthropic_secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
            echo "present=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "present=true" >> "$GITHUB_OUTPUT"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/fork-ci-review.yml around lines 44 - 51, The "Validate
secret" step currently exits 0 and does not prevent downstream steps from
running when ANTHROPIC_API_KEY is missing; change the validation to emit a step
output (e.g., set a has_key/has_anthropic output via GITHUB_OUTPUT) indicating
whether the secret exists, and then add if: conditionals on the subsequent
"checkout" and "Claude Code review" steps to run only when that output is true;
apply the same pattern to the codex-review job's validation step and its
dependent steps (use consistent output names like has_key or has_codex_key and
reference them in if: conditions).

Collaborator Author

This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

coderabbitai Bot

@ai-jz, understood — thanks for the clarification. Since the file has been dropped from this PR entirely, the findings are moot here. No action needed on your end for this review; glad the findings are useful for hardening the fork workflow separately.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 851
File: docs/_audit/reliability-worker.md:216-228
Timestamp: 2026-03-21T06:14:03.701Z
Learning: In repo lightseekorg/smg, files under `docs/_audit/` (e.g., `docs/_audit/reliability-worker.md`) are internal audit checklists used during the review process and are not shipped as user-facing documentation. Do not flag markdown formatting nitpicks (e.g., missing code block language specifiers, lint warnings) in these files — formatting polish is out of scope for them.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/actions/run-e2e/action.yml:65-83
Timestamp: 2026-03-04T20:39:43.892Z
Learning: Repo lightseekorg/smg uses custom/internal action versions that differ from public GitHub marketplace releases. Specifically: `actions/checkoutv6`, `actions/download-artifactv8`, and `actions/upload-artifactv7` are all valid and intentional throughout this repo's workflows (e.g., pr-test-rust.yml and composite actions under .github/actions/). Do not flag these versions as non-existent or incorrect.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Collaborator Author

This file was included by mistake and has been removed from this PR.

Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +53 to +55
```yaml
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0
```
@coderabbitai coderabbitai Bot Apr 12, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
cat -n .github/workflows/fork-ci-review.yml | head -120
```

Repository: lightseekorg/smg

Length of output: 5079


Pin actions/checkout to an exact commit SHA or remove the security claim.

Line 6 asserts "Actions pinned to SHA (not floating tags) to prevent supply chain attacks," but actions/checkout@v6 on lines 53 and 98 uses a floating tag. Other actions in this file (anthropics/claude-code-action and openai/codex-action) are correctly pinned to commit SHAs. Either pin checkout to a specific commit hash like the others, or remove the SHA-pinning security claim from the header.
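As a sketch, the pinned form would read as below; `<full-40-char-sha>` is deliberately a placeholder, since the correct value is whatever commit the `v6` tag currently resolves to in the actions/checkout repository:

```yaml
      # Pin to the exact commit behind the tag; keep the tag as a comment.
      - uses: actions/checkout@<full-40-char-sha>  # v6
        with:
          fetch-depth: 0
```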

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/fork-ci-review.yml around lines 53 - 55, The workflow
currently uses actions/checkout@v6 (the two occurrences of actions/checkout@v6)
which is a floating tag contradicting the repo's SHA-pinning policy; replace
each actions/checkout@v6 with the exact commit SHA pin used for other actions
(e.g., actions/checkout@<commit-sha>) or remove the SHA-pinning claim in the
workflow header; update both occurrences of actions/checkout so they match the
pinned format used for anthropics/claude-code-action and openai/codex-action.

Collaborator Author

This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

coderabbitai Bot

@ai-jz, understood — thanks for the clarification! Makes sense to keep fork-only CI infrastructure out of the upstream PR. Good to hear the SHA-pinning findings will be addressed on the fork side.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/actions/run-e2e/action.yml:65-83
Timestamp: 2026-03-04T20:39:43.892Z
Learning: Repo lightseekorg/smg uses custom/internal action versions that differ from public GitHub marketplace releases. Specifically: `actions/checkoutv6`, `actions/download-artifactv8`, and `actions/upload-artifactv7` are all valid and intentional throughout this repo's workflows (e.g., pr-test-rust.yml and composite actions under .github/actions/). Do not flag these versions as non-existent or incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:42:14.058Z
Learning: In lightseekorg/smg, the self-hosted k8s-runner-gpu and k8s-runner-cpu runners appear to have a pre-checked-out workspace, which allows local composite action references (e.g., uses: ./.github/actions/run-e2e) to resolve without an explicit actions/checkout step. Per standard GitHub Actions behavior, checkout IS required before local `./` composite action references; this repo is an exception due to its persistent self-hosted runner workspace.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 851
File: docs/_audit/reliability-worker.md:216-228
Timestamp: 2026-03-21T06:14:03.701Z
Learning: In repo lightseekorg/smg, files under `docs/_audit/` (e.g., `docs/_audit/reliability-worker.md`) are internal audit checklists used during the review process and are not shipped as user-facing documentation. Do not flag markdown formatting nitpicks (e.g., missing code block language specifiers, lint warnings) in these files — formatting polish is out of scope for them.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Collaborator Author

This file was included by mistake and has been removed from this PR.

@ai-jz, confirmed — thanks for removing it!


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:611-612
Timestamp: 2026-03-04T04:43:55.458Z
Learning: In `lightseekorg/smg` PR `#606`, `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py`: `STOP_SEQUENCE_TRIMMED = True` by default (Harmony/gpt-oss trims stop sequences). In `test_stop_sequences` (non-streaming), both `is_vllm()` and `is_sglang()` override this to `False` because neither backend trims stop sequences in non-streaming mode. In `test_stop_sequences_stream` (streaming), only `is_vllm()` overrides to `False`. This asymmetry is intentional. Do not flag `is_sglang()` in the non-streaming condition as incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/actions/run-e2e/action.yml:65-83
Timestamp: 2026-03-04T20:39:43.892Z
Learning: Repo lightseekorg/smg uses custom/internal action versions that differ from public GitHub marketplace releases. Specifically: `actions/checkoutv6`, `actions/download-artifactv8`, and `actions/upload-artifactv7` are all valid and intentional throughout this repo's workflows (e.g., pr-test-rust.yml and composite actions under .github/actions/). Do not flag these versions as non-existent or incorrect.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/core/token_bucket.rs:58-63
Timestamp: 2026-02-21T02:30:58.753Z
Learning: For lint-only/Clippy enforcement PRs in this repo (e.g., PR `#489`), avoid introducing behavioral changes (like new input validation). Prefer a separate, follow-up issue/PR for such hardening.

@ai-jz force-pushed the feat/grpc-generation-config branch from 81822bd to b534cc2 on April 12, 2026 at 17:48
@github-actions Bot removed the ci (CI/CD configuration changes) label on Apr 12, 2026
@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b534cc23c1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

temperature=params.temperature
if params.HasField("temperature")
else defaults.get("temperature", 1.0),
top_p=params.top_p if params.HasField("top_p") else defaults.get("top_p", 1.0),
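Spelled out as a self-contained sketch (`resolve_sampling` and `FakeParams` are illustrative stand-ins, not the servicer's actual code), the zero-vs-unset distinction that the `optional` proto fields enable looks like this:

```python
def resolve_sampling(params, defaults):
    """Prefer an explicitly set request field; otherwise fall back to the
    model's generation-config default, then to vLLM's neutral default."""
    return {
        "temperature": (
            params.temperature
            if params.HasField("temperature")
            else defaults.get("temperature", 1.0)
        ),
        "top_p": (
            params.top_p
            if params.HasField("top_p")
            else defaults.get("top_p", 1.0)
        ),
    }


class FakeParams:
    """Mimics a proto message with `optional` fields: HasField() is true
    only for fields the caller actually set (hypothetical stand-in)."""

    def __init__(self, **set_fields):
        self._set = set_fields

    def HasField(self, name):
        return name in self._set

    def __getattr__(self, name):
        return self._set.get(name, 0.0)
```

With this, an explicit `temperature=0.0` survives intact, while an unset field picks up the generation-config default — exactly the signal that plain proto3 scalars lose.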
P1: Require updated proto before calling new HasField checks

These new presence checks rely on top_p/top_k/... being proto3 optional, but smg-grpc-servicer still allows older smg-grpc-proto installs (e.g. 0.4.6) where those fields are plain scalars; in that environment params.HasField("top_p") raises ValueError and generation requests fail at runtime. This commit introduces that hard dependency, so the servicer release needs a minimum proto-version bump (or a compatibility guard) before this code is safe to deploy.
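If a compatibility guard were wanted, one option (a sketch, relying on the Python protobuf behavior that `HasField()` raises `ValueError` for scalar fields without presence; `LegacyMsg` is a hypothetical stand-in for a message generated from the old proto) would be:

```python
def has_field(msg, name: str) -> bool:
    """Presence check that degrades gracefully on older generated protos
    where the field is a plain proto3 scalar (no `optional` keyword)."""
    try:
        return msg.HasField(name)
    except ValueError:
        # Pre-`optional` proto: presence is not tracked, so the best we can
        # do is treat the type's zero value as "not set".
        value = getattr(msg, name)
        return value != type(value)()


class LegacyMsg:
    """Stand-in for a message generated from the old proto (hypothetical)."""

    top_p = 0.0

    def HasField(self, name):
        raise ValueError(f"Field {name} does not have presence.")
```

The fallback branch reintroduces the zero-vs-unset ambiguity, so it only makes sense as a transitional shim; a minimum proto-version bump is the cleaner fix.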

Useful? React with 👍 / 👎.

Collaborator Author

The proto definition and servicer are in the same repo and released together — there's no independent versioning between them. A new servicer always ships with the matching proto.

@ai-jz
Collaborator Author

ai-jz commented Apr 12, 2026

@CatherineSue Addressed all 5 comments in b534cc2:

  1. Removed Harmony stop_token_ids block — handled in pipeline
  2. Renamed mc → model_config
  3. Added vLLM HTTP path source link
  4. Renamed "model defaults" → "generation-config defaults"
  5. Removed max_tokens clamping — engine handles this in input_processor.py

Also updated PR description to reflect the simplified scope.

Note: fork-ci-review.yml was included by mistake in an earlier commit — it's a fork-only workflow for local iteration before submitting to this PR. It has been removed. Sorry for the confusion.

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@ai-jz
Collaborator Author

ai-jz commented Apr 12, 2026

Addressed the stop_token_ids merging concern flagged by @coderabbitai — removed the set-union logic in c70ef85. The servicer now passes through the client's stop_token_ids as-is. The engine handles generation-config stop behavior internally via update_from_generation_config() in sampling_params.py, consistent with the HTTP Responses path.
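The pass-through can be illustrated with a minimal sketch (the function and parameter names here are hypothetical; the real merge with generation-config stop tokens happens inside vLLM's `SamplingParams.update_from_generation_config()`):

```python
def build_stop_kwargs(client_stop_token_ids, generation_config_defaults):
    """Forward the client's stop_token_ids unchanged; do NOT set-union them
    with generation-config defaults here -- the engine applies those itself."""
    kwargs = {}
    if client_stop_token_ids:
        kwargs["stop_token_ids"] = list(client_stop_token_ids)
    # generation_config_defaults.get("stop_token_ids") is deliberately ignored:
    # merging here would duplicate the engine-side behavior.
    return kwargs
```

Keeping the servicer out of the merge means gRPC and HTTP requests hit the identical engine-side code path for stop handling.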

@coderabbitai re-review

@coderabbitai

coderabbitai Bot commented Apr 12, 2026

@ai-jz Thanks for addressing the stop_token_ids concern — deferring to update_from_generation_config() in sampling_params.py is the correct approach and keeps the servicer aligned with the HTTP Responses path. Triggering a re-review now.


✏️ Learnings added
Learnt from: ai-jz
URL: https://github.com/lightseekorg/smg/pull/1103

Timestamp: 2026-04-12T20:47:06.274Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py`: `stop_token_ids` from generation-config defaults should NOT be merged (set-union) with the client's `stop_token_ids` in the servicer. The servicer passes through the client's `stop_token_ids` as-is; the vLLM engine handles generation-config stop behavior internally via `update_from_generation_config()` in `sampling_params.py`, consistent with the HTTP Responses path (`vllm/entrypoints/openai/responses/serving.py`).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 807
File: model_gateway/src/routers/openai/responses/streaming.rs:821-855
Timestamp: 2026-03-18T21:57:03.433Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/responses/streaming.rs: The early `return` statements inside `handle_streaming_with_tool_interception` (on `tx` send failures in `forward_streaming_event`, `send_mcp_list_tools_events`, and the `is_in_progress`/`mcp_list_tools_sent` branches) are pre-existing behavior that predates PR `#807`. They cause the persistence phase (final response / conversation-backed storage writes at the end of the tool loop) to be skipped when the client disconnects mid-stream with `store=true` or a conversation-backed request. This is a known pre-existing gap in the MCP streaming path, not a regression introduced by the storage context header changes in PR `#807`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 799
File: model_gateway/src/core/resilience.rs:42-49
Timestamp: 2026-03-18T04:52:09.084Z
Learning: In repo lightseekorg/smg, file model_gateway/src/core/resilience.rs: `RetryConfig` (from `model_gateway/src/config/types`) does NOT have a `retryable_status_codes` field. The `retryable_status_codes` field only exists on `ResilienceUpdate` (from `openai_protocol::worker`). The `DEFAULT_RETRYABLE_STATUS_CODES` constant ([408, 429, 500, 502, 503, 504]) IS the canonical router default and matches `is_retryable_status()` in `model_gateway/src/core/retry.rs` exactly. The fallback from `overrides.retryable_status_codes` to `DEFAULT_RETRYABLE_STATUS_CODES` in `resolve_resilience()` is correct — do not flag it as bypassing a router-level default, and do not suggest adding a `base_retryable_status_codes` parameter to `resolve_resilience()`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1031
File: model_gateway/src/routers/grpc/regular/streaming.rs:1579-1600
Timestamp: 2026-04-03T05:31:33.707Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/regular/streaming.rs` (`process_messages_streaming_chunks`): The reasoning parser is gated on explicit user opt-in (`ThinkingConfig::Enabled`) via `separate_reasoning = matches!(&original_request.thinking, Some(ThinkingConfig::Enabled { .. }))`, NOT on `thinking_override` (which only controls whether `mark_reasoning_started()` is called on the parser, not whether the parser runs at all). Chat API uses the `separate_reasoning` field on `ChatCompletionRequest` instead. Do not suggest removing or replacing the `ThinkingConfig::Enabled` gate in the Messages path.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ai-jz
Collaborator Author

ai-jz commented Apr 12, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


Labels

ci-approved, grpc (gRPC client and router changes)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants