
feat(grpc): support vLLM --generation-config in gRPC mode#1103

Closed
ai-jz wants to merge 9 commits into lightseekorg:main from ai-jz:feat/grpc-generation-config

Conversation

Collaborator

@ai-jz ai-jz commented Apr 12, 2026

Motivation

vLLM's --generation-config flag (default: auto) reads the model's generation_config.json — a model artifact shipped alongside weights and chat template — and applies the model author's recommended sampling parameters (temperature, top_p, top_k, etc.) when the API request doesn't specify them.

This is already supported in the HTTP serving path. This PR extends the same support to the gRPC path, so the same --generation-config flag produces identical sampling behavior regardless of whether the request reaches vLLM via HTTP or gRPC. All three modes are supported: auto, vllm, and custom path.

Description

Problem

The gRPC path does not yet support --generation-config. The HTTP path reads the model's generation_config.json and applies defaults (e.g., temperature=0.6 for Qwen3/Llama-3) for unset request fields. The gRPC path uses vLLM's neutral defaults (temperature=1.0, top_p=1.0) regardless of the flag value.

Root cause spans three layers:

  1. Proto: top_p, top_k, min_p, frequency_penalty, presence_penalty, repetition_penalty are plain proto3 fields — the zero value is indistinguishable from "not set"
  2. Rust client: unwrap_or(default) fills neutral values before sending, losing the "user didn't set this" signal
  3. Python servicer: never calls model_config.get_diff_sampling_param(), hardcodes fallbacks
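
Layer 2 is the subtlest of the three. A toy Python stand-in (not the actual Rust code) shows how an unwrap_or-style fallback erases the unset signal, while passing the option through preserves it; the neutral value here is illustrative:

```python
from typing import Optional

NEUTRAL_TOP_P = 1.0  # illustrative engine-neutral default

def pre_patch_top_p(user_value: Optional[float]) -> float:
    """unwrap_or(1.0) behavior: an omitted value is replaced before it
    reaches the wire, so the servicer sees an explicit 1.0 either way."""
    return user_value if user_value is not None else NEUTRAL_TOP_P

def post_patch_top_p(user_value: Optional[float]) -> Optional[float]:
    """Pass the option through: None survives to the proto optional
    field, where HasField() reports it as unset."""
    return user_value
```

With the pre-patch builder, an omitted value and an explicit 1.0 are indistinguishable downstream; with the post-patch builder, None reaches the servicer intact.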

Solution

Propagate the "unset" signal from client to servicer across all three layers:

  1. Proto: make the 6 fields optional — enables HasField() detection
  2. Rust client: pass Option through instead of unwrap_or(default) in all 5 sampling param builders
  3. Python servicer: call get_diff_sampling_param() at init (same as HTTP path), use HasField() fallbacks for all fields
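
The three steps above can be sketched end to end in Python. This is a minimal stand-in, not the servicer code: dict membership plays the role of proto3 HasField(), model_defaults plays the role of the dict returned by get_diff_sampling_param(), and the neutral values are illustrative:

```python
# Engine-neutral fallbacks used when neither the request nor the model's
# generation_config.json supplies a value (illustrative numbers).
NEUTRAL = {"temperature": 1.0, "top_p": 1.0, "top_k": 0, "min_p": 0.0}

def merge_sampling_params(request: dict, model_defaults: dict) -> dict:
    """Request value wins when the field is present on the wire; otherwise
    fall back to the model default, then the neutral engine default.
    `field in request` stands in for proto HasField()."""
    return {
        field: request[field] if field in request
        else model_defaults.get(field, neutral)
        for field, neutral in NEUTRAL.items()
    }
```

An explicit zero survives the merge because presence, not value, decides the branch.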

Context-length enforcement (max_tokens clamping) and Harmony stop_token_ids injection are handled by vLLM's engine internally (input_processor.py), so the servicer does not duplicate that logic.

Changes

File / What
vllm_engine.proto: 6 sampling fields → optional (enables HasField detection)
vllm_engine.rs: all 5 builders pass Option through; 3 new + 2 updated unit tests
servicer.py: load generation-config defaults at init, apply via HasField fallbacks

Design Decisions

Why server-side defaults (in servicer) over client-side merge (in Rust gateway):

The engine process is authoritative for default resolution — it owns generation_config.json parsing, --generation-config mode selection, and model-type-aware logic. vLLM exposes all of this through get_diff_sampling_param() which the servicer reuses directly.

Making the proto fields optional also ensures omission semantics live in the gRPC request contract itself — which benefits any future non-gateway gRPC clients.

SGLang support will be addressed in a separate PR.

Other notes:

  • Proto optional is wire-compatible — rolling upgrade safe. Old clients always set fields explicitly, so HasField() = true and behavior is unchanged.
  • Context-length enforcement is left to vLLM's engine (input_processor.py) — not duplicated in the servicer.

Test Plan

Existing tests (regression):

Suite / Count / Result
cargo test -p smg-grpc-client: 48/48 pass (2 existing tests updated to match Option types)
e2e test_openai_server.py (openai-grpc): 17/17 pass, no regressions

New tests (added in this PR):

Suite / Count / Result
Rust proto optional semantics (3 new tests): 3/3 pass

Checklist
  • Proto change is wire-compatible (optional addition)
  • All 5 Rust sampling param builders updated
  • Existing Rust tests updated for Option types
  • New Rust tests for optional semantics
  • Python servicer matches HTTP path logic
  • Full existing e2e suite passes (no regressions)
  • Rolling upgrade safe (old client → new servicer)

Summary by CodeRabbit

Release Notes

  • New Features
    • Sampling parameters are now optional in requests—users can omit them instead of specifying explicit values.
    • When sampling parameters are not provided, the system now applies model-specific default values for smarter generation behavior out of the box.

@github-actions github-actions Bot added the "grpc" (gRPC client and router changes) and "tests" (Test changes) labels on Apr 12, 2026

coderabbitai Bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e29f8324-01f8-4755-958a-f49a11bb660f

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee8b93 and c70ef85.

📒 Files selected for processing (1)
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

📝 Walkthrough

Walkthrough

The PR modifies sampling parameter handling across proto definitions, Rust client code, and Python servicer to distinguish between unset and explicitly-set values. Proto fields are marked optional, client code propagates Option values instead of unwrapping with defaults, and servicer now applies model default sampling parameters when proto fields are unset.

Changes

Cohort / File(s) Summary
Proto Definition
crates/grpc_client/proto/vllm_engine.proto
Made top_p, top_k, min_p, frequency_penalty, presence_penalty, and repetition_penalty fields optional in SamplingParams message to distinguish unset from explicitly set values.
Rust Client
crates/grpc_client/src/vllm_engine.rs
Updated sampling parameter builders for ChatCompletion, Completion, and Generate to propagate Option values directly to proto instead of unwrapping with defaults; removed hardcoded defaults in plain sampling builder and added test assertions for None handling.
Python Servicer
grpc_servicer/smg_grpc_servicer/vllm/servicer.py
Extended _sampling_params_from_proto to accept model default sampling parameters and apply them when proto fields are unset; updated Generate to load and pass model defaults into parameter construction.

Sequence Diagram

sequenceDiagram
    actor Client
    participant gRPC
    participant Servicer
    participant ModelConfig
    
    Client->>gRPC: Send SamplingParams (some fields unset)
    gRPC->>Servicer: Receive proto message with optional fields
    Servicer->>ModelConfig: Load model default sampling params
    ModelConfig-->>Servicer: Return defaults dict
    Servicer->>Servicer: For each sampling field:<br/>if unset in proto → use default<br/>else → use proto value
    Servicer-->>Client: Apply merged params to generation

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • slin1237
  • njhill
  • key4ng

Poem

🐰 Oh what joy, these params now optional!
No more forced defaults, it's devotional—
Unset stays None, explicit stays true,
Model defaults flourish when needed by you!

🚥 Pre-merge checks: ✅ 3 passed
Description check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
Title check: ✅ Passed. The PR title clearly and specifically describes the main change: enabling --generation-config support in gRPC mode, which is the core objective of this pull request.
Docstring coverage: ✅ Passed. Docstring coverage is 87.50%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

mergify Bot commented Apr 12, 2026

Hi @ai-jz, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@ai-jz ai-jz force-pushed the feat/grpc-generation-config branch from 7fb8355 to 55ec145 (April 12, 2026 02:49)
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the gRPC interface to support model-specific generation configuration defaults by making sampling parameters optional in the protobuf definition and the Rust client. The Python servicer now correctly loads and applies these defaults when fields are omitted in the request, ensuring parity with the HTTP interface. Feedback was provided regarding the Rust client to use a safer numeric conversion for the top_k field to handle potential overflows or negative values.

Comment thread crates/grpc_client/src/vllm_engine.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7fb8355f2b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +584 to +588
top_k=params.top_k if params.HasField("top_k")
else defaults.get("top_k", 0),
min_p=params.min_p if params.HasField("min_p")
else defaults.get("min_p", 0.0),
frequency_penalty=params.frequency_penalty if params.HasField("frequency_penalty")


P1 Badge Preserve zero-valued sampling fields from legacy clients

This fallback now interprets missing sampling fields as "unset" and pulls model defaults, but proto3 legacy clients (built before these fields were optional) omit scalar default values (0/0.0) on the wire. In a mixed-version rollout, explicit zero-valued settings like top_k=0, min_p=0, or zero penalties from older clients will arrive as absent and be replaced by generation-config defaults, changing generation behavior unexpectedly.

Useful? React with 👍 / 👎.

Comment on lines +570 to +574
if max_model_len is not None and input_length > 0:
model_max = max_model_len - input_length
candidates = [v for v in (model_max, raw_max_tokens, override_max_tokens) if v is not None]
max_tokens = min(candidates) if candidates else None
else:


P2 Badge Apply override max_tokens when prompt is text input

override_max_tokens is only applied inside the input_length > 0 branch, so requests using the text input oneof (or empty tokenized input) skip that branch and ignore the configured override cap. In those cases max_tokens falls back to raw_max_tokens only, which can bypass override_generation_config.max_new_tokens limits.

Useful? React with 👍 / 👎.

Collaborator Author

ai-jz commented Apr 12, 2026

Review Response

Addressed findings from Codex and CodeRabbit reviews in e73a892:

Addressed

1. [HIGH] Safe top_k conversion (vllm_engine.rs:586)
Fixed — changed request.top_k to request.top_k.and_then(|v| u32::try_from(v).ok()) in the Messages API builder. This safely drops out-of-range values instead of casting, consistent with the other builders that use .max(0) as u32.

2. [P2] override_max_tokens skipped for text input (servicer.py:574)
Fixed — the else branch (text input / empty tokenized) now applies override_max_tokens as a cap, not just raw_max_tokens.
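
The clamping behavior after this fix can be sketched as follows. The helper name and argument shapes are hypothetical (the real logic lives in servicer.py), but the ordering mirrors the description: the override cap applies on both branches, while the remaining-context cap applies only when a tokenized prompt length is known:

```python
def clamp_max_tokens(raw_max_tokens, override_max_tokens,
                     max_model_len, input_length):
    """Return the tightest applicable cap on max_tokens, or None if
    nothing constrains it. (Hypothetical helper sketching the fix.)"""
    candidates = [v for v in (raw_max_tokens, override_max_tokens)
                  if v is not None]
    if max_model_len is not None and input_length > 0:
        # Remaining-context cap only when a tokenized prompt length exists.
        candidates.append(max_model_len - input_length)
    return min(candidates) if candidates else None
```

For a text-input request (input_length of 0), the override cap still binds instead of being silently skipped.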

Acknowledged (no change needed)

3. [P1] Legacy client zero-value concern (servicer.py:588)
This describes a theoretical risk during rolling upgrade where old proto3 clients omit zero-valued scalars on the wire. In practice, old SMG Rust clients always send explicit values via unwrap_or(1.0) / unwrap_or(0) — they never construct Some(0.0) for top_p or Some(0.0) for repetition_penalty. So HasField() returns true for all fields from old clients, and model defaults are never applied. The behavior is identical to pre-patch. See the rolling-upgrade note under "Other notes" in the PR description.
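
That argument can be simulated directly. In this sketch (plain Python stand-ins, not the real client or servicer), the old client's unwrap_or behavior makes every field present on the wire, so a presence-aware merge returns exactly what the old client sent and the model defaults never fire:

```python
def old_client_wire_params(top_p=None, top_k=None):
    """Pre-patch Rust client behavior: unwrap_or() substitutes neutral
    values for unset options, so every field is explicitly set."""
    return {"top_p": top_p if top_p is not None else 1.0,  # unwrap_or(1.0)
            "top_k": top_k if top_k is not None else 0}    # unwrap_or(0)

def servicer_merge(request, model_defaults):
    """New servicer sketch: presence (stand-in for HasField()) decides
    whether the model default applies."""
    return {f: request[f] if f in request else model_defaults.get(f)
            for f in ("top_p", "top_k")}
```

With an old client, the merged result equals the wire params verbatim, matching pre-patch behavior even when model defaults exist.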

ai-jz added 2 commits April 11, 2026 19:58
Support all three --generation-config modes (auto, vllm, custom path)
in the vLLM gRPC serving path, achieving parity with the HTTP path.

Previously, the gRPC servicer ignored the model generation_config.json
and hardcoded neutral defaults (temperature=1.0, top_p=1.0, etc.). Now
it reads the same defaults as the HTTP path via get_diff_sampling_param()
and applies them when request fields are unset.

Three-layer change:
- Proto: make top_p, top_k, min_p, frequency_penalty, presence_penalty,
  repetition_penalty optional so HasField can detect unset fields
- Rust client: pass Option through instead of unwrap_or(default) in
  all 5 vLLM sampling param builders, preserving the unset signal
- Python servicer: load model defaults at init, use HasField for all
  fields, clamp max_tokens to remaining context, merge Harmony
  stop_token_ids for gpt_oss models

Tests:
- 3 new + 2 updated Rust unit tests for proto optional semantics
- 4 new e2e tests verifying generation_config defaults in gRPC mode
- All 48 Rust tests pass, 17/17 existing e2e openai-grpc tests pass

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
…or text input

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@ai-jz ai-jz force-pushed the feat/grpc-generation-config branch from e73a892 to 78072f3 (April 12, 2026 02:58)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/chat_completions/test_generation_config.py`:
- Around line 30-38: The test test_default_temperature_applied currently unpacks
client from setup_backend but should use the api_client fixture directly; remove
extracting the client from "_, model_path, client, _ = setup_backend" and
instead use the api_client parameter as the client variable, keeping only
whatever value you still need from setup_backend (e.g., model_path) by adjusting
the unpack to exclude client (or access only model_path), and update any
references to "client" in the test to use "api_client".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 313f1c2f-3415-4515-9770-90dd6fc7646e

📥 Commits

Reviewing files that changed from the base of the PR and between ae14a9c and 55ec145.

📒 Files selected for processing (4)
  • crates/grpc_client/proto/vllm_engine.proto
  • crates/grpc_client/src/vllm_engine.rs
  • e2e_test/chat_completions/test_generation_config.py
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

Comment thread e2e_test/chat_completions/test_generation_config.py Outdated
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@github-actions github-actions Bot added the ci CI/CD configuration changes label Apr 12, 2026
@ai-jz ai-jz force-pushed the feat/grpc-generation-config branch from 0916c74 to e8df77c (April 12, 2026 03:28)
@github-actions github-actions Bot removed the ci CI/CD configuration changes label Apr 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8df77ca9a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

params.max_tokens if params.HasField("max_tokens")
else defaults.get("max_tokens")
)
if max_model_len is not None and input_length > 0:


P2 Badge Clamp max_tokens for zero-length prompts

Include max_model_len in the max_tokens clamp even when input_length == 0. The current guard (input_length > 0) skips the model-length cap for empty tokenized prompts and all text oneof prompts (which pass input_length=0), so raw_max_tokens/override_max_tokens can exceed context length and produce avoidable vLLM validation failures instead of being bounded to remaining capacity.

Useful? React with 👍 / 👎.

- Fix AttributeError when override_generation_config is None: use
  (getattr(..., None) or {}) to coalesce missing and None cases
- Defensive copy of get_diff_sampling_param() to prevent shared-state
  mutation of ModelConfig internals
- Add ValueError guard when input_length exceeds max_model_len
- Clamp max_tokens to context length for text input prompts
- Fix dead fallback: return None when candidates list is empty
- Fix clippy: remove useless u32::try_from(u32) in Messages builder
- Apply ruff format to servicer.py
- Clarify test_default_temperature_applied is a smoke test

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2065fec386

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread grpc_servicer/smg_grpc_servicer/vllm/servicer.py Outdated
optional float top_p = 2;
optional uint32 top_k = 3;
optional float min_p = 4;
optional float frequency_penalty = 5;
Member


I don't recommend changing the types of the sampling parameters here. Making them optional means they default to 0, but the backend engine's default for these sampling parameters could be 1.0.

Collaborator Author


Proto3 optional is specifically what lets us distinguish "not set" from "explicitly 0" — without it, an unset float is silently 0.0 on the wire, and the servicer can't tell if the client meant zero or just didn't specify.

With optional, HasField("top_p") returns False when the field wasn't set, so the servicer falls through to the model's generation_config.json default. When it returns True, the value is used as-is (even if it's 0). The proto-level default is still 0, but the servicer never sees it because it checks HasField() first.

This matches how the HTTP path works — JSON naturally distinguishes a missing key from "top_p": 0. The optional keyword gives the gRPC path the same capability.
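
The HTTP-path analogy can be made concrete with a small sketch (the default value here is illustrative): JSON distinguishes a missing key from an explicit zero, which is exactly the distinction optional + HasField() restores for gRPC:

```python
import json

MODEL_DEFAULT_TOP_P = 0.95  # illustrative generation_config.json value

def effective_top_p(request_body: str) -> float:
    """Missing key -> model default; explicit value (even 0) -> honored.
    Mirrors what HasField() enables on the gRPC side."""
    request = json.loads(request_body)
    return request["top_p"] if "top_p" in request else MODEL_DEFAULT_TOP_P
```

A body of `{}` gets the model default, while `{"top_p": 0}` keeps the explicit zero.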

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
class TestGenerationConfigDefaults:
"""Verify --generation-config defaults apply in gRPC mode."""

def test_default_temperature_applied(self, setup_backend, api_client):
Member


I don't think this test is necessary as it doesn't necessarily test the sampling parameters override.

Collaborator Author


Agreed — removed the test file in 60a723b. The core change is in proto + Rust client + servicer; tests can be added later when there's a way to observe the effective sampling params through the API.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/src/vllm_engine.rs`:
- Around line 917-957: The tests only construct proto::SamplingParams directly
and omit frequency_penalty/presence_penalty/repetition_penalty and thus don't
exercise the builder/conversion paths that regressed; update the tests
(test_sampling_params_optional_preserves_user_values and
test_sampling_params_unset_fields_distinguishable_from_zero) to (1) include
frequency_penalty, presence_penalty, and repetition_penalty in both the user-set
and zero cases, and (2) create and assert equality/inequality by passing those
proto::SamplingParams through the same builder/conversion path used in
production (the SamplingParams builder/converter that consumes
proto::SamplingParams) rather than only using the raw proto struct, so the
translation logic is exercised. Ensure assertions still check Some(value) vs
None and Some(0) distinctions after conversion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd57ef16-dcd2-4d5f-98c0-1f4f7d2884e9

📥 Commits

Reviewing files that changed from the base of the PR and between 2065fec and 335b813.

📒 Files selected for processing (1)
  • crates/grpc_client/src/vllm_engine.rs

Comment on lines +917 to +957
#[test]
fn test_sampling_params_optional_preserves_user_values() {
// When user sets values, they should be preserved as Some(value)
let params = proto::SamplingParams {
temperature: Some(0.6),
top_p: Some(0.95),
top_k: Some(20),
min_p: Some(0.1),
frequency_penalty: Some(0.5),
presence_penalty: Some(0.3),
repetition_penalty: Some(1.2),
max_tokens: Some(100),
..Default::default()
};
assert_eq!(params.temperature, Some(0.6));
assert_eq!(params.top_p, Some(0.95));
assert_eq!(params.top_k, Some(20));
assert_eq!(params.min_p, Some(0.1));
assert_eq!(params.frequency_penalty, Some(0.5));
assert_eq!(params.presence_penalty, Some(0.3));
assert_eq!(params.repetition_penalty, Some(1.2));
assert_eq!(params.max_tokens, Some(100));
}

#[test]
fn test_sampling_params_unset_fields_distinguishable_from_zero() {
// Unset (None) must be different from explicit zero — this is
// the core invariant for --generation-config support
let unset = proto::SamplingParams::default();
let zero = proto::SamplingParams {
temperature: Some(0.0),
top_p: Some(0.0),
top_k: Some(0),
min_p: Some(0.0),
..Default::default()
};
assert_ne!(unset.temperature, zero.temperature);
assert_ne!(unset.top_p, zero.top_p);
assert_ne!(unset.top_k, zero.top_k);
assert_ne!(unset.min_p, zero.min_p);
}

@coderabbitai coderabbitai Bot Apr 12, 2026


🧹 Nitpick | 🔵 Trivial

Exercise the builder paths in the regression tests.

This block proves the generated proto type can represent None vs Some(0), but it never hits the translation paths that regressed in this PR, and it omits frequency_penalty, presence_penalty, and repetition_penalty. A future unwrap_or(...) reintroduction in one of the builders would still leave this suite green.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/grpc_client/src/vllm_engine.rs` around lines 917 - 957, The tests only
construct proto::SamplingParams directly and omit
frequency_penalty/presence_penalty/repetition_penalty and thus don't exercise
the builder/conversion paths that regressed; update the tests
(test_sampling_params_optional_preserves_user_values and
test_sampling_params_unset_fields_distinguishable_from_zero) to (1) include
frequency_penalty, presence_penalty, and repetition_penalty in both the user-set
and zero cases, and (2) create and assert equality/inequality by passing those
proto::SamplingParams through the same builder/conversion path used in
production (the SamplingParams builder/converter that consumes
proto::SamplingParams) rather than only using the raw proto struct, so the
translation logic is exercised. Ensure assertions still check Some(value) vs
None and Some(0) distinctions after conversion.

Collaborator Author


Fair point — the test validates proto-level semantics (None vs Some(0)) but does not exercise the builder translation paths. The existing 48 Rust unit tests and 17 e2e openai-grpc tests cover those paths; this test was specifically for the new optional field behavior. Adding builder-path coverage is a good follow-up.



@ai-jz Understood — the distinction makes sense. The two new tests are purposefully scoped to proto-layer invariants (confirming None is representable and distinguishable from Some(0) after the optional keyword change), which is orthogonal to exercising the builder translation logic already covered by the broader unit and e2e suite. Happy to track the builder-path regression tests as a follow-up if that's the plan.


🧠 Learnings used
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: multimodal/tests/vision_golden_tests.rs:576-578
Timestamp: 2026-02-21T02:39:51.670Z
Learning: Repo lightseekorg/smg — For PR `#489` (clippy/lint-only), do not replace unwrap() with expect(...) in test files when a file-/crate-level #![expect(clippy::unwrap_used)] is present (e.g., multimodal/tests/vision_golden_tests.rs). Such stylistic swaps are out of scope.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: multimodal/tests/vision_golden_tests.rs:402-404
Timestamp: 2026-02-21T02:39:23.481Z
Learning: Repo lightseekorg/smg — For clippy-only/lint-enforcement PRs (e.g., PR `#489`), do not replace unwrap() with expect() across tests/benches when a crate-level `#![expect(clippy::unwrap_used)]` is present. Such per-call swaps are treated as out-of-scope stylistic changes. Example: multimodal/tests/vision_golden_tests.rs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 812
File: e2e_test/embeddings/test_correctness.py:223-223
Timestamp: 2026-03-20T00:00:47.235Z
Learning: In `lightseekorg/smg` PR `#812`, `TestEmbeddingCorrectness` in `e2e_test/embeddings/test_correctness.py` uses the `api_client` fixture without class-level parametrization by design. SmgClient coverage for embedding correctness (HF parity assertions in `test_semantic_similarity` and `test_relevance_scores`) is intentionally opt-in and deferred until SmgClient embedding compatibility is verified. Do not flag the absence of SmgClient parametrization in this test class as a gap.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:251-282
Timestamp: 2026-03-11T23:26:31.100Z
Learning: In repo lightseekorg/smg, the Gemini-specific header path (apply_provider_headers emitting x-goog-api-key for googleapis.com URLs, and extract_gemini_auth_header) is intentionally not covered by the mock-server integration tests in model_gateway/tests/api/interactions_api_test.rs (PR `#723`). Coverage against a real Gemini backend URL will be added in a follow-up e2e test PR. Do not flag the absence of this test coverage as a gap in PR `#723` or related PRs.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.

Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1030
File: crates/tool_parser/src/parsers/deepseek32.rs:352-400
Timestamp: 2026-04-06T20:50:55.263Z
Learning: In `crates/tool_parser/src/parsers/deepseek32.rs` (and other parsers in `crates/tool_parser/src/parsers/`), `prev_tool_call_arr` entries are initialized with `"arguments": {}` (a JSON object), not a string. Code that retrieves the previous arguments via `.as_str()` will get `None` for this initial value. The correct pattern is to handle `prev_args = None` explicitly (e.g., treat it as an empty string and emit from `sent_len`), rather than changing the initialization to `"arguments": ""`. This is a deliberate cross-parser convention — do not flag the `{}` initialization as a bug.
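The convention above can be sketched language-agnostically. This minimal Python sketch mirrors the described pattern only; the real parsers are Rust, and `entry`, `sent_len`, and `new_args_text` are illustrative names, not actual code:

```python
# Illustrative sketch of the cross-parser convention: prev_tool_call_arr entries
# start with "arguments" as a JSON object ({}), so a string lookup yields None.
entry = {"name": "get_weather", "arguments": {}}  # initial value is an object

prev = entry["arguments"]
prev_args = prev if isinstance(prev, str) else None  # analogue of Rust .as_str()
if prev_args is None:
    prev_args = ""  # correct pattern: treat the initial {} as an empty string

sent_len = len(prev_args)  # nothing has been emitted yet
new_args_text = '{"city": "Paris"}'
delta = new_args_text[sent_len:]  # emit from sent_len onward
```

The point is that the initialization stays `{}`; the consumer handles the `None` lookup explicitly rather than changing the initial value to `""`.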

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1006
File: crates/tool_parser/src/parsers/deepseek31.rs:162-181
Timestamp: 2026-04-01T04:14:46.469Z
Learning: In repo lightseekorg/smg, `crates/tool_parser/src/parsers/deepseek31.rs` (and the analogous V3 parser `crates/tool_parser/src/parsers/deepseek.rs` lines 186-197): The `parse_incremental` method does NOT split off a plain-text prefix before the first tool marker within the same chunk. This is intentional because the gRPC streaming path delivers tokens individually, so normal text content and tool-call markers always arrive in separate chunks — a prefix and a tool marker will never coexist in the same chunk. Do not flag the absence of a within-chunk prefix-split as a bug; the `test_deepseek31_streaming_text_before_tools` test covers the realistic multi-chunk case.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 588
File: model_gateway/src/routers/grpc/multimodal.rs:453-514
Timestamp: 2026-03-03T18:03:45.713Z
Learning: In repo lightseekorg/smg, backend assembly functions in model_gateway/src/routers/grpc/multimodal.rs (e.g., assemble_sglang, assemble_vllm, assemble_trtllm) are tested via E2E tests rather than unit tests, as unit tests for these functions are not considered worthwhile.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/routers/grpc/client.rs:189-194
Timestamp: 2026-02-21T02:32:13.396Z
Learning: Repo: lightseekorg/smg PR: 489
File: model_gateway/src/routers/grpc/client.rs
Context: GrpcClient::generate fallback match arm
Learning: The catch-all panic for mismatched client/request types in GrpcClient::generate is intentional to enforce pipeline invariants; do not convert this to a returned error in PR `#489` or similar lint-only changes.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.
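As a rough illustration of the normalization step described above, a minimal Python sketch (the real `normalize_grpc_keys()` is Rust code in `discover_metadata.rs`; the mapping table here is assumed for illustration, not copied from the repo):

```python
# Hypothetical long-form -> short-form key table for vLLM/TRT-LLM labels.
LONG_TO_SHORT = {
    "tensor_parallel_size": "tp_size",
    "pipeline_parallel_size": "pp_size",
}

def normalize_grpc_keys(labels: dict) -> dict:
    # SGLang already emits short-form keys, so unmapped keys pass through as-is.
    return {LONG_TO_SHORT.get(key, key): value for key, value in labels.items()}

normalized = normalize_grpc_keys({"tensor_parallel_size": "2", "model_type": "llama"})
```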

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1031
File: model_gateway/src/routers/grpc/regular/streaming.rs:1579-1600
Timestamp: 2026-04-03T05:31:33.707Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/regular/streaming.rs` (`process_messages_streaming_chunks`): The reasoning parser is gated on explicit user opt-in (`ThinkingConfig::Enabled`) via `separate_reasoning = matches!(&original_request.thinking, Some(ThinkingConfig::Enabled { .. }))`, NOT on `thinking_override` (which only controls whether `mark_reasoning_started()` is called on the parser, not whether the parser runs at all). Chat API uses the `separate_reasoning` field on `ChatCompletionRequest` instead. Do not suggest removing or replacing the `ThinkingConfig::Enabled` gate in the Messages path.

Learnt from: TingtingZhou7
Repo: lightseekorg/smg PR: 1057
File: model_gateway/src/routers/openai/mcp/tool_loop.rs:856-885
Timestamp: 2026-04-08T00:08:05.944Z
Learning: In repo lightseekorg/smg, `sanitize_builtin_tool_arguments` in `model_gateway/src/routers/openai/mcp/tool_loop.rs` intentionally drops image-generation options (size, quality, background, output_format, compression) when handling `ResponseFormat::ImageGenerationCall`, keeping only `model` (hardcoded `IMAGE_MODEL`) and `revised_prompt`. This is a deliberate scoped decision for the initial image-generation tool integration; per-option overrides/defaults are planned for a follow-up PR. Do not flag the truncation as a bug or request preservation of extra fields.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 399
File: protocols/src/interactions.rs:505-509
Timestamp: 2026-02-19T03:08:50.192Z
Learning: In code reviews for Rust projects using the validator crate (v0.20.0), ensure that custom validation functions for numeric primitive types (e.g., f32, i32, u32, i16, etc.) accept the value by value, not by reference. Example: fn validate(value: f32) { ... }. The validator derive macro has a hardcoded list of numeric types that are passed by value, while all other types are passed by reference. Apply this guideline whenever validating numeric fields to align with the derive macro behavior.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/core/token_bucket.rs:58-63
Timestamp: 2026-02-21T02:30:51.443Z
Learning: For lint-only/Clippy enforcement PRs in this repository, avoid introducing behavioral changes (e.g., new input validation or logic changes). Treat such PRs as non-functional changes and plan a separate follow-up issue/PR for hardening or behavior changes. This applies broadly to Rust files across the repo; during review, focus on lint/style corrections and clearly note any intentional exceptions. 

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: protocols/src/responses.rs:928-931
Timestamp: 2026-02-21T02:36:00.882Z
Learning: In Rust code across the repository, use the marker INVARIANT: to document assumptions in safe code. Reserve SAFETY: for explaining why unsafe blocks are sound. This improves clarity of invariants and safety reasoning. Example reference: protocols/src/responses.rs near validate_tool_choice_with_tools().

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: mesh/src/sync.rs:83-83
Timestamp: 2026-02-21T02:37:01.416Z
Learning: General Rust formatting rule: format! with implicit captures only supports simple identifiers, not full expressions like {state.model_id}. For cases where you want to interpolate a field or expression, bind the value first and interpolate the binding, e.g., let model_id = &state.model_id; and then use format!("policy:{}", model_id). In the specific file mesh/src/sync.rs, prefer format!("policy:{}", state.model_id) or bind to a local variable if you need named interpolation, to keep clarity and avoid unintended captures.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 807
File: model_gateway/src/middleware.rs:61-81
Timestamp: 2026-03-18T21:32:00.041Z
Learning: In Rust code using the http crate, HeaderMap::get() is effectively case-insensitive because HeaderName normalizes keys to lowercase on insertion and lookup. Do not require or perform explicit .to_lowercase() before HeaderMap::get() calls. Mark as not a concern for case-sensitivity in lookups; only consider normalization when inserting or comparing via HeaderName, not in lookups.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 867
File: tui/src/app.rs:798-813
Timestamp: 2026-03-22T20:13:55.778Z
Learning: In this repo (lightseekorg/smg), treat the workspace `Cargo.toml`’s `package.rust-version` (MSRV) as the source of truth (e.g., `rust-version = "1.85"`). When reviewing Rust changes, do not flag usage of Rust language/library features that were stabilized on or before the MSRV (e.g., `Option::is_none_or`, stabilized in 1.82, is compatible with an MSRV of 1.85). Always verify the MSRV from the workspace `Cargo.toml` rather than relying on issue templates.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:135-145
Timestamp: 2026-04-04T00:29:38.039Z
Learning: In this repo (lightseekorg/smg), the gRPC Channel builder implementations for the backend clients (SGLang, vLLM, TRT-LLM, and MLX) intentionally omit `.connect_timeout()`. When reviewing code changes in `crates/grpc_client/src/`, do not treat a missing `.connect_timeout()` in these backend client channel builders as a blocking issue; it is planned for a uniform addition in a follow-up PR.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:61-66
Timestamp: 2026-04-04T00:29:37.400Z
Learning: In lightseekorg/smg’s gRPC backend clients that use the `AbortOnDropStream` pattern (e.g., vLLM, SGLang, TRT-LLM, MLX), `poll_next` is intentionally allowed to return `Poll::Ready(None)` on normal EOF without auto-flipping the completed flag. Do not review-fail for the absence of an EOF-driven `mark_completed()`/completed-flag update in these `AbortOnDropStream` implementations. The stream consumer is responsible for calling `mark_completed()` when the stream ends normally; any spurious/cheap idempotent `Abort` RPC that happens when `mark_completed()` wasn’t called before drop on natural EOF is an accepted tradeoff, and should not be treated as a bug.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: crates/grpc_client/src/mlx_engine.rs:250-256
Timestamp: 2026-04-04T00:29:40.508Z
Learning: In lightseekorg/smg’s gRPC backend implementation under crates/grpc_client/src/, tool-calling enforcement is router-side: SMG’s router-side ToolParser parses tool calls from the backend’s text output. Therefore, MLX (and other gRPC backend) builder functions should NOT be required to reject/guard tool-related fields (e.g., tools, tool_choice) at the backend builder boundary just because they are missing. Only reject tool-related content when there’s an explicit, backend-specific structural constraint to enforce at the builder boundary (e.g., response_format / structural_tag or other required schema/structure invariants). When reviewing, do not flag the absence of tool-call guards in MLX/gRPC backend builders as missing enforcement.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 335b8130cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +573 to +577
if input_length > 0:
    model_max = max_model_len - input_length
    if model_max <= 0:
        raise ValueError(
            f"Input length ({input_length}) exceeds max model length ({max_model_len})"

P2: Respect prompt truncation before enforcing context-length guard

The new input_length > 0 guard rejects requests before generation using the raw tokenized prompt length, but truncation is applied later via tokenization_kwargs (truncate_prompt_tokens). For requests where len(input_ids) > max_model_len but truncate_prompt_tokens would bring the prompt under the limit, this now raises INVALID_ARGUMENT even though vLLM could have served the request after truncation. This introduces a regression for direct gRPC callers using prompt truncation and breaks parity with the HTTP flow that computes limits from the effective prompt length.


Collaborator Author

@ai-jz ai-jz Apr 12, 2026


The HTTP path's get_max_tokens() has the same guard — it checks raw input_length before truncation is applied by the engine. Made a minor edit in 4ee8b93 to align the guard condition (< instead of <=) and error message to match the HTTP path exactly.

The test doesn't verify actual sampling parameter values, only that
requests succeed. Remove it to keep the test suite lean — the core
change is in proto + Rust client + servicer.

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@github-actions github-actions Bot removed the tests Test changes label Apr 12, 2026
Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ee8b936ec


Comment on lines +583 to +585
candidates = [
    v for v in (model_max, raw_max_tokens, override_max_tokens) if v is not None
]

P2: Clamp max_tokens against platform output limits

The new max_tokens clamp only considers model_max, request/default max_tokens, and override_max_tokens, but it omits vLLM’s platform-specific cap (current_platform.get_max_output_tokens(...)) that HTTP uses in get_max_tokens. On platforms that enforce a lower output-token ceiling than context-based model_max, requests with omitted/large max_tokens can now be forwarded with an oversized value and fail later with avoidable validation/runtime errors instead of being bounded up front.


Collaborator Author


current_platform.get_max_output_tokens() is an internal vLLM entrypoint API — the gRPC servicer runs outside that context and does not have access to it. The engine applies platform limits internally after receiving the request, same as it does for all other entrypoints.

# Harmony model stop tokens — same as chat_completion/serving.py
if mc.hf_config.model_type == "gpt_oss":
    from vllm.entrypoints.openai.parser.harmony_utils import (
        get_stop_tokens_for_assistant_actions,
Member


This is injected in harmony pipeline. No need to inject here

Collaborator Author


Got it — removed in b534cc2.

logger.info("VllmEngineServicer initialized")

# Load model's default sampling params (respects --generation-config)
mc = async_llm.model_config
Member


Can we use full variable names here?

Collaborator Author


Done — renamed to model_config in b534cc2.

mc = async_llm.model_config
self.default_sampling_params: dict = dict(mc.get_diff_sampling_param())

# max_tokens override — same logic as chat_completion/serving.py
Member


Please include a direct URL link to the specific commit.

Collaborator Author


Added — see serving.py#L141 in b534cc2.


Returns:
    vLLM SamplingParams with detokenize=False and structured_outputs
    vLLM SamplingParams with model defaults applied for unset fields
Member

@CatherineSue CatherineSue Apr 12, 2026


The phrase "model defaults" here is misleading. What if the user passes --generation-config vllm or an overridden config?

Collaborator Author


Renamed to "generation-config defaults" in b534cc2; get_diff_sampling_param() covers all three --generation-config modes.
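A minimal sketch of the presence-based fallback this thread discusses, assuming proto3 `optional` fields (so `HasField()` works) and a defaults dict from `get_diff_sampling_param()`. `FakeParams` and `resolve` are illustrative stand-ins, not the PR's actual code:

```python
# Stand-in for the proto message: optional fields report presence via HasField().
class FakeParams:
    def __init__(self, **set_fields):
        self._fields = set_fields

    def HasField(self, name):
        return name in self._fields

    def __getattr__(self, name):
        return self._fields[name]

def resolve(params, defaults, field, neutral):
    """Request value if set, else generation-config default, else vLLM neutral."""
    if params.HasField(field):
        return getattr(params, field)
    return defaults.get(field, neutral)

defaults = {"temperature": 0.6, "top_p": 0.95}  # e.g. from get_diff_sampling_param()
params = FakeParams(top_p=0.8)

temperature = resolve(params, defaults, "temperature", 1.0)  # unset -> default 0.6
top_p = resolve(params, defaults, "top_p", 1.0)              # set -> request 0.8
top_k = resolve(params, defaults, "top_k", -1)               # no default -> neutral -1
```

The same three-way resolution applies regardless of which --generation-config mode produced the defaults dict.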

    params.max_tokens if params.HasField("max_tokens") else defaults.get("max_tokens")
)
if max_model_len is not None:
    if input_length > 0:
Member


Does this come from the vLLM logic? If so, please put the source link.

And it is quite surprising to me that this is not handled in AsyncLLM already. It should be. This is pure backend logic and should not be in the frontend servicer.py.

Collaborator Author


You're right — removed the entire clamping block in b534cc2. The engine handles this in input_processor.py via AsyncLLM.add_request() → process_inputs().

…logic

- Remove Harmony stop_token_ids block (gpt_oss model type check)
- Rename 'mc' to 'model_config' for clarity in __init__
- Add vLLM HTTP path source link comment
- Rename 'model defaults' to 'generation-config defaults' in comment
- Remove max_tokens clamping logic (max_model_len, input_length,
  override_max_tokens) — let vLLM handle context-length enforcement

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@github-actions github-actions Bot added the ci CI/CD configuration changes label Apr 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81822bd7ed


Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +53 to +55
- uses: actions/checkout@v6
  with:
    fetch-depth: 0

P2: Checkout pull-request ref for issue_comment reviews

When this workflow is triggered by issue_comment, actions/checkout without an explicit ref checks out the repository default branch rather than the commented PR branch. That means the Claude run can review the wrong code (or an empty main...HEAD diff) when someone invokes @claude from a PR thread. Set the checkout ref for comment-triggered runs (for example, refs/pull/${{ github.event.issue.number }}/head) so the review analyzes the intended changes.


Collaborator Author


This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

Collaborator Author


This file was included by mistake and has been removed from this PR.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
grpc_servicer/smg_grpc_servicer/vllm/servicer.py (1)

495-500: ⚠️ Potential issue | 🟠 Major

Don't merge stop_token_ids in the servicer.

This union changes request semantics because stop_token_ids is still a repeated proto field with no presence bit: omitted and explicitly empty look identical here, so callers lose any way to clear generation-config stop IDs. It also reintroduces the frontend-side stop-token injection that was already called out earlier.

Suggested fix
-        # Merge stop_token_ids: request + model defaults
         stop_token_ids = list(params.stop_token_ids) if params.stop_token_ids else None
-        default_stop_ids = defaults.get("stop_token_ids")
-        if default_stop_ids:
-            stop_token_ids = list(set(stop_token_ids or []) | set(default_stop_ids))
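The fallback-only semantics suggested above can be sketched as follows (illustrative names, not the exact servicer code): honor the request's stop_token_ids as-is, and only fall back to the generation-config default when the request has none.

```python
# Repeated proto fields have no presence bit, so an unset list and an
# explicitly empty list look identical; fall back, never union.
def resolve_stop_token_ids(request_ids, defaults):
    if request_ids:
        return list(request_ids)  # request wins when provided
    default_ids = defaults.get("stop_token_ids")
    return list(default_ids) if default_ids else None
```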
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py` around lines 495 - 500, The
current code merges request stop_token_ids with model defaults (using
params.stop_token_ids and defaults.get("stop_token_ids")), which changes caller
semantics; stop_token_ids is a repeated proto without presence so we must not
union them. Change the logic in servicer.py around stop_token_ids so you do NOT
take the set-union: instead honor the request value as-is (use
params.stop_token_ids directly when provided) and only fall back to
defaults.get("stop_token_ids") when the request did not provide any value,
avoiding combining the two lists; update references to stop_token_ids,
params.stop_token_ids, and defaults.get("stop_token_ids") accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/fork-ci-review.yml:
- Around line 53-55: The workflow currently uses actions/checkout@v6 (the two
occurrences of actions/checkout@v6) which is a floating tag contradicting the
repo's SHA-pinning policy; replace each actions/checkout@v6 with the exact
commit SHA pin used for other actions (e.g., actions/checkout@<commit-sha>) or
remove the SHA-pinning claim in the workflow header; update both occurrences of
actions/checkout so they match the pinned format used for
anthropics/claude-code-action and openai/codex-action.
- Around line 44-51: The "Validate secret" step currently exits 0 and does not
prevent downstream steps from running when ANTHROPIC_API_KEY is missing; change
the validation to emit a step output (e.g., set a has_key/has_anthropic output
via GITHUB_OUTPUT) indicating whether the secret exists, and then add if:
conditionals on the subsequent "checkout" and "Claude Code review" steps to run
only when that output is true; apply the same pattern to the codex-review job's
validation step and its dependent steps (use consistent output names like
has_key or has_codex_key and reference them in if: conditions).
- Around line 21-22: The workflow currently triggers on issue_comment and will
run for plain issues and default-branch refs; update the job condition (where
the workflow checks before running) to require github.event.issue.pull_request
!= null so it only runs for PR comments and add a secondary guard ensuring
github.event_name == 'issue_comment' && contains(github.event.comment.body,
'@claude') if applicable; also change the actions/checkout step (the checkout
job using actions/checkout) to pass an explicit ref:
github.event.issue.pull_request.head.ref or github.event.pull_request.head.sha
(use the correct event field in your job) so the job checks out the PR head
instead of the default branch.

---

Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/servicer.py`:
- Around line 495-500: The current code merges request stop_token_ids with model
defaults (using params.stop_token_ids and defaults.get("stop_token_ids")), which
changes caller semantics; stop_token_ids is a repeated proto without presence so
we must not union them. Change the logic in servicer.py around stop_token_ids so
you do NOT take the set-union: instead honor the request value as-is (use
params.stop_token_ids directly when provided) and only fall back to
defaults.get("stop_token_ids") when the request did not provide any value,
avoiding combining the two lists; update references to stop_token_ids,
params.stop_token_ids, and defaults.get("stop_token_ids") accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ed101e87-d959-449d-86c6-c498ffd200d3

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee8b93 and 81822bd.

📒 Files selected for processing (2)
  • .github/workflows/fork-ci-review.yml
  • grpc_servicer/smg_grpc_servicer/vllm/servicer.py

Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +21 to +22
issue_comment:
  types: [created]

@coderabbitai coderabbitai Bot Apr 12, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -type f -name "fork-ci-review.yml" | head -20

Repository: lightseekorg/smg

Length of output: 230


🏁 Script executed:

fd -t f "fork-ci-review.yml"

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

fd -t f "fork" .github/workflows/

Repository: lightseekorg/smg

Length of output: 96


🏁 Script executed:

cat -n .github/workflows/fork-ci-review.yml

Repository: lightseekorg/smg

Length of output: 5504


@claude comment runs are not scoped to pull requests—they can execute on plain issues or use the default branch.

The issue_comment trigger at line 21-22 fires on both pull requests and issues. Without a check for github.event.issue.pull_request != null in the job condition (line 35), a comment on a plain issue can trigger the job. Additionally, actions/checkout at line 53 lacks an explicit ref parameter, so it defaults to the repository's default branch instead of the PR head—even on valid PR comments.

Suggested fix
   claude-review:
     if: >-
       github.actor == 'ai-jz' && (
         github.event_name == 'pull_request' ||
-        (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude'))
+        (
+          github.event_name == 'issue_comment' &&
+          github.event.issue.pull_request != null &&
+          contains(github.event.comment.body, '@claude')
+        )
       )
@@
       - uses: actions/checkout@v6
         with:
           fetch-depth: 0
+          ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || format('refs/pull/{0}/head', github.event.issue.number) }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/fork-ci-review.yml around lines 21 - 22, The workflow
currently triggers on issue_comment and will run for plain issues and
default-branch refs; update the job condition (where the workflow checks before
running) to require github.event.issue.pull_request != null so it only runs for
PR comments and add a secondary guard ensuring github.event_name ==
'issue_comment' && contains(github.event.comment.body, '@claude') if applicable;
also change the actions/checkout step (the checkout job using actions/checkout)
to pass an explicit ref: github.event.issue.pull_request.head.ref or
github.event.pull_request.head.sha (use the correct event field in your job) so
the job checks out the PR head instead of the default branch.

Collaborator Author

This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

coderabbitai Bot

@ai-jz, understood — thanks for the clarification! Glad the findings were useful; feel free to apply them when revisiting that workflow on the fork. No further action needed here.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:42:14.058Z
Learning: In lightseekorg/smg, the self-hosted k8s-runner-gpu and k8s-runner-cpu runners appear to have a pre-checked-out workspace, which allows local composite action references (e.g., uses: ./.github/actions/run-e2e) to resolve without an explicit actions/checkout step. Per standard GitHub Actions behavior, checkout IS required before local `./` composite action references; this repo is an exception due to its persistent self-hosted runner workspace.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 851
File: docs/_audit/reliability-worker.md:216-228
Timestamp: 2026-03-21T06:14:03.701Z
Learning: In repo lightseekorg/smg, files under `docs/_audit/` (e.g., `docs/_audit/reliability-worker.md`) are internal audit checklists used during the review process and are not shipped as user-facing documentation. Do not flag markdown formatting nitpicks (e.g., missing code block language specifiers, lint warnings) in these files — formatting polish is out of scope for them.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Collaborator Author

This file was included by mistake and has been removed from this PR.

Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +44 to +51
```yaml
      - name: Validate secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
            exit 0
          fi
```
@coderabbitai coderabbitai Bot Apr 12, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
fd -t f fork-ci-review.yml
```

Repository: lightseekorg/smg

Length of output: 42

🏁 Script executed:

```shell
ls -la .github/workflows/ | grep -i fork
```

Repository: lightseekorg/smg

Length of output: 128

🏁 Script executed:

```shell
cat -n .github/workflows/fork-ci-review.yml
```

Repository: lightseekorg/smg

Length of output: 5504


Missing secrets do not skip either job; validation steps need output-based conditionals.

The validation steps exit successfully (exit 0), allowing subsequent steps to execute with empty API keys. In GitHub Actions, a step exiting with code 0 is successful; it does not prevent the next step from running. This causes the Claude Code review and Codex review actions to fail with missing credentials instead of cleanly skipping.

Add step outputs to the validation steps and apply if: conditionals to the checkout and review steps to actually skip them when secrets are missing:

Suggested fix
```diff
      - name: Validate secret
+       id: anthropic_secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
+            echo "present=false" >> "$GITHUB_OUTPUT"
             exit 0
           fi
+          echo "present=true" >> "$GITHUB_OUTPUT"

       - uses: actions/checkout@v6
+        if: steps.anthropic_secret.outputs.present == 'true'
         with:
           fetch-depth: 0

       - name: Claude Code review
+        if: steps.anthropic_secret.outputs.present == 'true'
         uses: anthropics/claude-code-action@c7c8889b30499b4e46f4c32b892e43cd364bc2fe  # v1
```

Apply the same pattern to the codex-review job (lines 89–96).
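Why `exit 0` alone cannot skip anything can be shown outside Actions. The sketch below is a local simulation under stated assumptions: it fakes the `$GITHUB_OUTPUT` file with `mktemp` and an empty `KEY`, and shows the step script succeeding while recording `present=false` for a later `if:` guard to consume.

```shell
# Local simulation (not a real runner): fake the $GITHUB_OUTPUT file.
GITHUB_OUTPUT="$(mktemp)"
KEY=""   # simulate a missing ANTHROPIC_API_KEY secret

# Same shape as the suggested validation step: it always exits 0,
# so by itself it cannot stop later steps from running.
if [ -z "$KEY" ]; then
  echo "::warning::ANTHROPIC_API_KEY not set"
  echo "present=false" >> "$GITHUB_OUTPUT"
else
  echo "present=true" >> "$GITHUB_OUTPUT"
fi
status=$?

echo "step exit status: $status"   # 0 -> the step is "successful"
cat "$GITHUB_OUTPUT"               # present=false -> what an if: guard would read
```

In a real workflow, a downstream step would then gate on `steps.<id>.outputs.present == 'true'`; that conditional, not the exit code, is what makes the skip happen.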

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

```yaml
      - name: Validate secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
            exit 0
          fi
```

After:

```yaml
      - name: Validate secret
        id: anthropic_secret
        env:
          KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          if [ -z "$KEY" ]; then
            echo "::warning::ANTHROPIC_API_KEY not set — skipping Claude review"
            echo "present=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "present=true" >> "$GITHUB_OUTPUT"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/fork-ci-review.yml around lines 44 - 51, The "Validate
secret" step currently exits 0 and does not prevent downstream steps from
running when ANTHROPIC_API_KEY is missing; change the validation to emit a step
output (e.g., set a has_key/has_anthropic output via GITHUB_OUTPUT) indicating
whether the secret exists, and then add if: conditionals on the subsequent
"checkout" and "Claude Code review" steps to run only when that output is true;
apply the same pattern to the codex-review job's validation step and its
dependent steps (use consistent output names like has_key or has_codex_key and
reference them in if: conditions).

Collaborator Author

This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

coderabbitai Bot

@ai-jz, understood — thanks for the clarification. Since the file has been dropped from this PR entirely, the findings are moot here. No action needed on your end for this review; glad the findings are useful for hardening the fork workflow separately.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 851
File: docs/_audit/reliability-worker.md:216-228
Timestamp: 2026-03-21T06:14:03.701Z
Learning: In repo lightseekorg/smg, files under `docs/_audit/` (e.g., `docs/_audit/reliability-worker.md`) are internal audit checklists used during the review process and are not shipped as user-facing documentation. Do not flag markdown formatting nitpicks (e.g., missing code block language specifiers, lint warnings) in these files — formatting polish is out of scope for them.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/actions/run-e2e/action.yml:65-83
Timestamp: 2026-03-04T20:39:43.892Z
Learning: Repo lightseekorg/smg uses custom/internal action versions that differ from public GitHub marketplace releases. Specifically: `actions/checkoutv6`, `actions/download-artifactv8`, and `actions/upload-artifactv7` are all valid and intentional throughout this repo's workflows (e.g., pr-test-rust.yml and composite actions under .github/actions/). Do not flag these versions as non-existent or incorrect.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Collaborator Author

This file was included by mistake and has been removed from this PR.

Comment thread .github/workflows/fork-ci-review.yml Outdated
Comment on lines +53 to +55
```yaml
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0
```
@coderabbitai coderabbitai Bot Apr 12, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
cat -n .github/workflows/fork-ci-review.yml | head -120
```

Repository: lightseekorg/smg

Length of output: 5079


Pin actions/checkout to an exact commit SHA or remove the security claim.

Line 6 asserts "Actions pinned to SHA (not floating tags) to prevent supply chain attacks," but actions/checkout@v6 on lines 53 and 98 uses a floating tag. Other actions in this file (anthropics/claude-code-action and openai/codex-action) are correctly pinned to commit SHAs. Either pin checkout to a specific commit hash like the others, or remove the SHA-pinning security claim from the header.
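As a sketch, the pinned form would read as below; `<full-40-char-sha>` is deliberately a placeholder, since the correct value is whatever commit the `v6` tag currently resolves to in the actions/checkout repository:

```yaml
      # Pin to the exact commit behind the tag; keep the tag as a comment.
      - uses: actions/checkout@<full-40-char-sha>  # v6
        with:
          fetch-depth: 0
```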

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/fork-ci-review.yml around lines 53 - 55, The workflow
currently uses actions/checkout@v6 (the two occurrences of actions/checkout@v6)
which is a floating tag contradicting the repo's SHA-pinning policy; replace
each actions/checkout@v6 with the exact commit SHA pin used for other actions
(e.g., actions/checkout@<commit-sha>) or remove the SHA-pinning claim in the
workflow header; update both occurrences of actions/checkout so they match the
pinned format used for anthropics/claude-code-action and openai/codex-action.

Collaborator Author

This file was included by mistake — fork-ci-review.yml is fork-only CI infrastructure and has been removed from this PR (force-pushed to drop the commit). These findings are valid and will be addressed on the fork separately.

coderabbitai Bot

@ai-jz, understood — thanks for the clarification! Makes sense to keep fork-only CI infrastructure out of the upstream PR. Good to hear the SHA-pinning findings will be addressed on the fork side.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/actions/run-e2e/action.yml:65-83
Timestamp: 2026-03-04T20:39:43.892Z
Learning: Repo lightseekorg/smg uses custom/internal action versions that differ from public GitHub marketplace releases. Specifically: `actions/checkoutv6`, `actions/download-artifactv8`, and `actions/upload-artifactv7` are all valid and intentional throughout this repo's workflows (e.g., pr-test-rust.yml and composite actions under .github/actions/). Do not flag these versions as non-existent or incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:42:14.058Z
Learning: In lightseekorg/smg, the self-hosted k8s-runner-gpu and k8s-runner-cpu runners appear to have a pre-checked-out workspace, which allows local composite action references (e.g., uses: ./.github/actions/run-e2e) to resolve without an explicit actions/checkout step. Per standard GitHub Actions behavior, checkout IS required before local `./` composite action references; this repo is an exception due to its persistent self-hosted runner workspace.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 851
File: docs/_audit/reliability-worker.md:216-228
Timestamp: 2026-03-21T06:14:03.701Z
Learning: In repo lightseekorg/smg, files under `docs/_audit/` (e.g., `docs/_audit/reliability-worker.md`) are internal audit checklists used during the review process and are not shipped as user-facing documentation. Do not flag markdown formatting nitpicks (e.g., missing code block language specifiers, lint warnings) in these files — formatting polish is out of scope for them.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 978
File: model_gateway/src/routers/grpc/regular/streaming.rs:2275-2276
Timestamp: 2026-03-31T03:43:30.981Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/streaming.rs: `logprobs: None` in `CompletionStreamChoice` within `process_completion_streaming_chunks` is intentional. Per-token logprob streaming is deferred across all streaming endpoints (completions, generate) because it requires backend-side plumbing not yet available. The same `// TODO: wire per-token logprob streaming` deferral exists in `process_generate_streaming` (around line 702). Non-streaming completions has the same TODO. Do not flag `logprobs: None` or the TODO comment in the completions or generate streaming paths as a missing feature or silent data loss until a dedicated follow-up PR wires backend per-token logprob support.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:41.243Z
Learning: In model_gateway gRPC metadata discovery, SGLang's `server_args` (a `prost_types::Struct`) natively uses short-form keys like `tp_size`, `dp_size`, `pp_size` that match `SGLANG_GRPC_KEYS` and are extracted via `pick_prost_fields()` without normalization. vLLM and TRT-LLM gRPC protos use long-form keys like `tensor_parallel_size`, `pipeline_parallel_size`, which go through `flat_labels()` and are normalized by `normalize_grpc_keys()` in `discover_metadata.rs` after both `model_info.to_labels()` and `server_info.to_labels()` return.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:600-600
Timestamp: 2026-03-04T04:05:00.273Z
Learning: In the `lightseekorg/smg` repository, as of PR `#606`, Harmony (gpt-oss) DOES support stop sequence trimming. `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py` correctly sets `STOP_SEQUENCE_TRIMMED = True` by default. The vLLM backend (is_vllm) overrides this to `False` in both `test_stop_sequences` and `test_stop_sequences_stream` because vLLM does not trim stop sequences. Do not flag `STOP_SEQUENCE_TRIMMED = True` in `TestChatCompletionGptOss` as incorrect.

Collaborator Author

This file was included by mistake and has been removed from this PR.

@ai-jz, confirmed — thanks for removing it!


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1003
File: .github/workflows/claude-code-review.yml:77-81
Timestamp: 2026-03-31T23:05:49.930Z
Learning: In repo lightseekorg/smg, `github-actions[bot]` does NOT post inline PR review comments, so it does not need to be included in cross-bot deduplication lists for Claude review workflows (e.g., in `.github/workflows/claude-code-review.yml`). Do not flag its absence from such lists.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 606
File: e2e_test/chat_completions/test_openai_server.py:611-612
Timestamp: 2026-03-04T04:43:55.458Z
Learning: In `lightseekorg/smg` PR `#606`, `TestChatCompletionGptOss` in `e2e_test/chat_completions/test_openai_server.py`: `STOP_SEQUENCE_TRIMMED = True` by default (Harmony/gpt-oss trims stop sequences). In `test_stop_sequences` (non-streaming), both `is_vllm()` and `is_sglang()` override this to `False` because neither backend trims stop sequences in non-streaming mode. In `test_stop_sequences_stream` (streaming), only `is_vllm()` overrides to `False`. This asymmetry is intentional. Do not flag `is_sglang()` in the non-streaming condition as incorrect.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:484-485
Timestamp: 2026-03-04T20:40:48.357Z
Learning: In repo lightseekorg/smg, the composite action `.github/actions/run-e2e` (at `.github/actions/run-e2e/action.yml`) includes a `actions/checkoutv6` step internally as its first step. Callers (e.g., `e2e-always`, `e2e-chat-completions`, `e2e-agentic` jobs in `.github/workflows/pr-test-rust.yml`) do NOT need a separate checkout step before invoking this action — do not flag missing checkout for these jobs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/actions/run-e2e/action.yml:65-83
Timestamp: 2026-03-04T20:39:43.892Z
Learning: Repo lightseekorg/smg uses custom/internal action versions that differ from public GitHub marketplace releases. Specifically: `actions/checkoutv6`, `actions/download-artifactv8`, and `actions/upload-artifactv7` are all valid and intentional throughout this repo's workflows (e.g., pr-test-rust.yml and composite actions under .github/actions/). Do not flag these versions as non-existent or incorrect.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/core/token_bucket.rs:58-63
Timestamp: 2026-02-21T02:30:58.753Z
Learning: For lint-only/Clippy enforcement PRs in this repo (e.g., PR `#489`), avoid introducing behavioral changes (like new input validation). Prefer a separate, follow-up issue/PR for such hardening.

@ai-jz force-pushed the feat/grpc-generation-config branch from 81822bd to b534cc2 on April 12, 2026 at 17:48
@github-actions Bot removed the ci (CI/CD configuration changes) label on Apr 12, 2026
@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b534cc23c1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

temperature=params.temperature
if params.HasField("temperature")
else defaults.get("temperature", 1.0),
top_p=params.top_p if params.HasField("top_p") else defaults.get("top_p", 1.0),
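Spelled out as a self-contained sketch (`resolve_sampling` and `FakeParams` are illustrative stand-ins, not the servicer's actual code), the zero-vs-unset distinction that the `optional` proto fields enable looks like this:

```python
def resolve_sampling(params, defaults):
    """Prefer an explicitly set request field; otherwise fall back to the
    model's generation-config default, then to vLLM's neutral default."""
    return {
        "temperature": (
            params.temperature
            if params.HasField("temperature")
            else defaults.get("temperature", 1.0)
        ),
        "top_p": (
            params.top_p
            if params.HasField("top_p")
            else defaults.get("top_p", 1.0)
        ),
    }


class FakeParams:
    """Mimics a proto message with `optional` fields: HasField() is true
    only for fields the caller actually set (hypothetical stand-in)."""

    def __init__(self, **set_fields):
        self._set = set_fields

    def HasField(self, name):
        return name in self._set

    def __getattr__(self, name):
        return self._set.get(name, 0.0)
```

With this, an explicit `temperature=0.0` survives intact, while an unset field picks up the generation-config default — exactly the signal that plain proto3 scalars lose.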
P1: Require updated proto before calling new HasField checks

These new presence checks rely on top_p/top_k/... being proto3 optional, but smg-grpc-servicer still allows older smg-grpc-proto installs (e.g. 0.4.6) where those fields are plain scalars; in that environment params.HasField("top_p") raises ValueError and generation requests fail at runtime. This commit introduces that hard dependency, so the servicer release needs a minimum proto-version bump (or a compatibility guard) before this code is safe to deploy.
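If a compatibility guard were wanted, one option (a sketch, relying on the Python protobuf behavior that `HasField()` raises `ValueError` for scalar fields without presence; `LegacyMsg` is a hypothetical stand-in for a message generated from the old proto) would be:

```python
def has_field(msg, name: str) -> bool:
    """Presence check that degrades gracefully on older generated protos
    where the field is a plain proto3 scalar (no `optional` keyword)."""
    try:
        return msg.HasField(name)
    except ValueError:
        # Pre-`optional` proto: presence is not tracked, so the best we can
        # do is treat the type's zero value as "not set".
        value = getattr(msg, name)
        return value != type(value)()


class LegacyMsg:
    """Stand-in for a message generated from the old proto (hypothetical)."""

    top_p = 0.0

    def HasField(self, name):
        raise ValueError(f"Field {name} does not have presence.")
```

The fallback branch reintroduces the zero-vs-unset ambiguity, so it only makes sense as a transitional shim; a minimum proto-version bump is the cleaner fix.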

Useful? React with 👍 / 👎.

Collaborator Author

The proto definition and servicer are in the same repo and released together — there's no independent versioning between them. A new servicer always ships with the matching proto.

@ai-jz
Collaborator Author

ai-jz commented Apr 12, 2026

@CatherineSue Addressed all 5 comments in b534cc2:

  1. Removed Harmony stop_token_ids block — handled in pipeline
  2. Renamed mc → model_config
  3. Added vLLM HTTP path source link
  4. Renamed "model defaults" → "generation-config defaults"
  5. Removed max_tokens clamping — engine handles this in input_processor.py

Also updated PR description to reflect the simplified scope.

Note: fork-ci-review.yml was included by mistake in an earlier commit — it's a fork-only workflow for local iteration before submitting to this PR. It has been removed. Sorry for the confusion.

Signed-off-by: ai-jz <ai-jz@users.noreply.github.com>
@ai-jz
Collaborator Author

ai-jz commented Apr 12, 2026

Addressed the stop_token_ids merging concern flagged by @coderabbitai — removed the set-union logic in c70ef85. The servicer now passes through the client's stop_token_ids as-is. The engine handles generation-config stop behavior internally via update_from_generation_config() in sampling_params.py, consistent with the HTTP Responses path.
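The pass-through can be illustrated with a minimal sketch (the function and parameter names here are hypothetical; the real merge with generation-config stop tokens happens inside vLLM's `SamplingParams.update_from_generation_config()`):

```python
def build_stop_kwargs(client_stop_token_ids, generation_config_defaults):
    """Forward the client's stop_token_ids unchanged; do NOT set-union them
    with generation-config defaults here -- the engine applies those itself."""
    kwargs = {}
    if client_stop_token_ids:
        kwargs["stop_token_ids"] = list(client_stop_token_ids)
    # generation_config_defaults.get("stop_token_ids") is deliberately ignored:
    # merging here would duplicate the engine-side behavior.
    return kwargs
```

Keeping the servicer out of the merge means gRPC and HTTP requests hit the identical engine-side code path for stop handling.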

@coderabbitai re-review

@coderabbitai

coderabbitai Bot commented Apr 12, 2026

@ai-jz Thanks for addressing the stop_token_ids concern — deferring to update_from_generation_config() in sampling_params.py is the correct approach and keeps the servicer aligned with the HTTP Responses path. Triggering a re-review now.


✏️ Learnings added
Learnt from: ai-jz
URL: https://github.com/lightseekorg/smg/pull/1103

Timestamp: 2026-04-12T20:47:06.274Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py`: `stop_token_ids` from generation-config defaults should NOT be merged (set-union) with the client's `stop_token_ids` in the servicer. The servicer passes through the client's `stop_token_ids` as-is; the vLLM engine handles generation-config stop behavior internally via `update_from_generation_config()` in `sampling_params.py`, consistent with the HTTP Responses path (`vllm/entrypoints/openai/responses/serving.py`).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 807
File: model_gateway/src/routers/openai/responses/streaming.rs:821-855
Timestamp: 2026-03-18T21:57:03.433Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/responses/streaming.rs: The early `return` statements inside `handle_streaming_with_tool_interception` (on `tx` send failures in `forward_streaming_event`, `send_mcp_list_tools_events`, and the `is_in_progress`/`mcp_list_tools_sent` branches) are pre-existing behavior that predates PR `#807`. They cause the persistence phase (final response / conversation-backed storage writes at the end of the tool loop) to be skipped when the client disconnects mid-stream with `store=true` or a conversation-backed request. This is a known pre-existing gap in the MCP streaming path, not a regression introduced by the storage context header changes in PR `#807`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 799
File: model_gateway/src/core/resilience.rs:42-49
Timestamp: 2026-03-18T04:52:09.084Z
Learning: In repo lightseekorg/smg, file model_gateway/src/core/resilience.rs: `RetryConfig` (from `model_gateway/src/config/types`) does NOT have a `retryable_status_codes` field. The `retryable_status_codes` field only exists on `ResilienceUpdate` (from `openai_protocol::worker`). The `DEFAULT_RETRYABLE_STATUS_CODES` constant ([408, 429, 500, 502, 503, 504]) IS the canonical router default and matches `is_retryable_status()` in `model_gateway/src/core/retry.rs` exactly. The fallback from `overrides.retryable_status_codes` to `DEFAULT_RETRYABLE_STATUS_CODES` in `resolve_resilience()` is correct — do not flag it as bypassing a router-level default, and do not suggest adding a `base_retryable_status_codes` parameter to `resolve_resilience()`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1031
File: model_gateway/src/routers/grpc/regular/streaming.rs:1579-1600
Timestamp: 2026-04-03T05:31:33.707Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/regular/streaming.rs` (`process_messages_streaming_chunks`): The reasoning parser is gated on explicit user opt-in (`ThinkingConfig::Enabled`) via `separate_reasoning = matches!(&original_request.thinking, Some(ThinkingConfig::Enabled { .. }))`, NOT on `thinking_override` (which only controls whether `mark_reasoning_started()` is called on the parser, not whether the parser runs at all). Chat API uses the `separate_reasoning` field on `ChatCompletionRequest` instead. Do not suggest removing or replacing the `ThinkingConfig::Enabled` gate in the Messages path.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ai-jz
Collaborator Author

ai-jz commented Apr 12, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


Labels

ci-approved, grpc (gRPC client and router changes)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants