feat: Add LiteLLM proxy for tool calling compatibility #6
Conversation
- Add setup-litellm-proxy.sh script to configure LiteLLM
- Add start-litellm-proxy.sh to run proxy on port 4000
- Update qwen.sh to disable vLLM tool parser (let LiteLLM handle it)
- Add comprehensive setup documentation

Architecture: Continue.dev → LiteLLM (port 4000) → vLLM (port 8000)

Benefits:
- Handles tool calling format translation
- Avoids vLLM JSON validation issues
- Compatible with Continue.dev OpenAI format
- No vLLM parser configuration needed

This solves the persistent JSON parsing errors by letting LiteLLM handle all tool call parsing and format conversion.
Pull request overview
This PR introduces a LiteLLM proxy layer to resolve tool calling compatibility issues between Continue.dev and vLLM when using Qwen3-Coder-30B. The proxy sits between Continue.dev and vLLM to handle OpenAI format translation, bypassing vLLM's JSON validation issues that occur with various tool parsers (qwen3_coder, openai, qwen3_xml, hermes).
Key Changes
- New setup and startup scripts for LiteLLM proxy installation and management
- Configuration changes to disable vLLM's built-in tool parser in favor of LiteLLM handling
- Comprehensive documentation covering installation, configuration, and troubleshooting
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| scripts/setup-litellm-proxy.sh | Creates LiteLLM configuration with model definitions, API keys, and proxy settings |
| scripts/start-litellm-proxy.sh | Starts LiteLLM proxy with validation checks for config file and vLLM availability |
| models/qwen.sh | Updates Qwen configuration to disable vLLM tool parser (set to empty string) |
| docs/setup/LITELLM-PROXY-SETUP.md | Provides complete setup guide including installation, port exposure, Continue.dev integration, and troubleshooting |
# Start LiteLLM proxy
echo "Starting LiteLLM..."
litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0

# If the above command fails, try with --detailed_debug
# litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug
The LiteLLM proxy process is started in the foreground without proper process management (no background execution, PID tracking, or log file redirection). This is inconsistent with how the vLLM server is managed in start-vllm-server.sh, which uses nohup, PID files, and log redirection. Consider implementing similar process management for LiteLLM to enable:
- Background execution with nohup
- PID file tracking at /workspace/logs/litellm-proxy.pid
- Log output to /workspace/logs/litellm-proxy.log
- Ability to stop/monitor the proxy using the same patterns as vLLM
Suggested change:
# Start LiteLLM proxy with process management
echo "Starting LiteLLM..."
# Ensure log directory exists
mkdir -p /workspace/logs
# Start LiteLLM in background with nohup, log output, and PID file
nohup litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 > /workspace/logs/litellm-proxy.log 2>&1 &
echo $! > /workspace/logs/litellm-proxy.pid
echo "LiteLLM proxy started with PID $(cat /workspace/logs/litellm-proxy.pid)"
echo "Logs: /workspace/logs/litellm-proxy.log"
# If the above command fails, try with --detailed_debug
# nohup litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug > /workspace/logs/litellm-proxy.log 2>&1 &
# echo $! > /workspace/logs/litellm-proxy.pid
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
      api_base: http://localhost:8000/v1
      api_key: sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381
The API key 'sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381' appears to be a fixed, hardcoded credential shared between vLLM and LiteLLM. This differs from the vLLM server's approach where API keys are either generated randomly or explicitly set via environment variables. Hardcoding credentials in configuration files can be a security risk if this configuration is committed to version control or shared. Consider generating a random key or reading from an environment variable like the vLLM server does.
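A minimal sketch of the suggested alternative, assuming the setup script is free to pick its own variable names (VLLM_API_KEY here is illustrative, not necessarily what start-vllm-server.sh uses): read the key from the environment, or generate a random one when none is set.

```bash
# Use an existing key from the environment, or generate a random one.
# Variable name is an assumption for this sketch.
VLLM_API_KEY="${VLLM_API_KEY:-sk-vllm-$(openssl rand -hex 16)}"
echo "Using vLLM API key: ${VLLM_API_KEY}"
```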
  success_callback: []

general_settings:
  master_key: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381  # Same key for simplicity
The master_key 'sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the configuration file. This key controls access to the LiteLLM proxy. Hardcoding credentials poses a security risk if the configuration is committed to version control or shared. Consider generating a random key or reading from an environment variable, similar to how the vLLM server handles API keys.
export VLLM_MAX_MODEL_LEN=131072  # 128K tokens
export VLLM_GPU_MEMORY_UTIL=0.95  # Maximum GPU memory utilization
export VLLM_TOOL_PARSER="hermes"  # Hermes-style tool calling (built into Qwen chat template)
export VLLM_TOOL_PARSER=""  # No parser - let LiteLLM proxy handle tool calling
Setting VLLM_TOOL_PARSER to an empty string will cause start-vllm-server.sh to use the default value 'openai' instead of disabling the tool parser. In start-vllm-server.sh line 28, the code uses 'TOOL_PARSER="${VLLM_TOOL_PARSER:-openai}"' which applies the default when the variable is unset OR empty. This means the tool parser will still be active with 'openai' parser, defeating the purpose of this PR. To properly disable the tool parser, either:
- Pass a special value that vLLM recognizes as "no parser"
- Modify start-vllm-server.sh to handle empty strings differently and conditionally include the --tool-call-parser flag
Suggested change:
# export VLLM_TOOL_PARSER=""  # No parser - let LiteLLM proxy handle tool calling
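One way start-vllm-server.sh could be adjusted so an empty VLLM_TOOL_PARSER actually disables the parser. This is a sketch only; the flag names follow the review comment, and the real script passes many more options.

```bash
# Apply the default only when the variable is unset, not when it is empty
# (note: no colon in the expansion).
TOOL_PARSER="${VLLM_TOOL_PARSER-openai}"

# Add the tool-parser flags only when a parser is actually configured.
VLLM_ARGS=()
if [ -n "${TOOL_PARSER}" ]; then
    VLLM_ARGS+=(--enable-auto-tool-choice --tool-call-parser "${TOOL_PARSER}")
fi

# Hypothetical invocation; the actual script includes model and memory options.
python -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
```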
  json_logs: false  # Easier to read logs
  num_retries: 2
  request_timeout: 600
  modify_params: true
The 'success_callback' is set to an empty array but there's no explanation of what this field does or when it might be useful. Consider either removing this line (if it's not needed) or adding a comment explaining its purpose for future maintainers who might want to configure callbacks.
Suggested change:
  modify_params: true
  # List of callback functions to execute on successful requests.
  # Leave empty unless you want to add custom success handlers.
if ! curl -s http://localhost:8000/v1/models > /dev/null 2>&1; then
    echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
The health check for vLLM uses 'curl -s' which silences all output including error messages. If the check fails, the user only sees a generic warning without understanding why the connection failed. Consider using 'curl -sf' to show HTTP errors or capture and display the actual error message to help with troubleshooting.
Suggested change:
VLLM_HEALTH_OUTPUT=$(curl -sf http://localhost:8000/v1/models 2>&1)
if [ $? -ne 0 ]; then
    echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
    echo "curl error message:"
    echo "$VLLM_HEALTH_OUTPUT"
provider: openai
model: qwen3-coder-30b  # Model name from litellm-config.yaml
apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1  # LiteLLM port
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
The documentation shows a hardcoded API key that matches the one in the setup script. This reinforces the pattern of using fixed credentials. Consider updating the documentation to mention that users should replace this with their own secure key, or indicate that the key will be generated during setup.
| echo "✅ LiteLLM config created at /workspace/litellm-config.yaml" | ||
| echo "" | ||
| echo "Next steps:" | ||
| echo "1. Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER=\"\")" |
The instruction says 'Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER="")' which is misleading. The qwen.sh script already sets VLLM_TOOL_PARSER="" in this PR, so users don't need to manually specify it. The parenthetical note could confuse users into thinking they need to modify something. Consider revising to: 'Start vLLM: ./models/qwen.sh (tool parser now disabled)'
| echo "1. Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER=\"\")" | |
| echo "1. Start vLLM: ./models/qwen.sh (tool parser now disabled)" |
set -e

echo "🔧 Installing LiteLLM..."
pip install 'litellm[proxy]' --quiet
Using '--quiet' flag for pip install suppresses all output including potential warnings or errors. If the installation fails or has issues, users won't see helpful error messages. Consider removing '--quiet' or replacing it with '--progress-bar off' to show errors while reducing visual noise.
Suggested change:
pip install 'litellm[proxy]' --progress-bar off
set -e

echo "🔧 Installing LiteLLM..."
pip install 'litellm[proxy]' --quiet
Running pip install 'litellm[proxy]' without pinning to a specific version or verifying integrity introduces a supply-chain risk: any compromise of the litellm package or its dependencies would automatically execute attacker-controlled code in your runtime. Since this script is part of your deployment path, an attacker could leverage a malicious package version to access data or modify model-serving behavior. Pin the dependency to a trusted version and/or verify hashes or signatures before installation to reduce this risk.
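A hedged sketch of the pinning idea; the version number below is a placeholder, not a vetted release, and the requirements file name is hypothetical.

```bash
# Pin LiteLLM to a version the team has reviewed (placeholder version shown).
LITELLM_VERSION="1.0.0"
pip install "litellm[proxy]==${LITELLM_VERSION}" --progress-bar off

# For stronger supply-chain guarantees, install from a hash-locked requirements file:
# pip install --require-hashes -r requirements-litellm.txt
```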
- Updated models/qwen.sh: Set VLLM_TOOL_PARSER='qwen_coder' for XML parsing
- Updated scripts/setup-litellm-proxy.sh: Added supports_parallel_function_calling
- Created scripts/update-runpod-config.sh: Automated deployment script with testing
- Created docs/setup/LITELLM-VLLM-TOOLCALLING.md: Comprehensive architecture guide
- Created docs/setup/VLLM-LITELLM-INTEGRATION-DOCS.md: Context7 documentation findings
- Updated QUICK-UPDATE.md: TL;DR deployment instructions
- Updated scripts/start-litellm-proxy.sh: Background execution with logging

Architecture: Continue.dev -> LiteLLM (port 4000) -> vLLM (port 8000) -> Model
- vLLM parses Qwen's XML tool calls using qwen_coder parser
- LiteLLM normalizes to OpenAI format and strips non-standard parameters
- Fixes 'Invalid \escape' JSON parsing errors in Continue.dev
Error from RunPod showed parser should be 'qwen3_coder' not 'qwen_coder'. Verified against vLLM official source code: vllm/entrypoints/openai/tool_parsers/__init__.py line 115-118

Updated all occurrences:
- models/qwen.sh: VLLM_TOOL_PARSER='qwen3_coder'
- scripts/update-runpod-config.sh: References to qwen3_coder
- docs/setup/LITELLM-VLLM-TOOLCALLING.md
- docs/setup/VLLM-LITELLM-INTEGRATION-DOCS.md
- QUICK-UPDATE.md

Available parsers from vLLM source: deepseek_v3, deepseek_v31, ernie45, glm45, granite, granite-20b-fc, hermes, hunyuan_a13b, internlm, jamba, kimi_k2, llama3_json, llama4_json, llama4_pythonic, longcat, minimax, minimax_m2, mistral, olmo3, openai, phi4_mini_json, pythonic, qwen3_coder, qwen3_xml, seed_oss, step3, xlam
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 21 comments.
| echo "✅ LiteLLM config created at /workspace/litellm-config.yaml" | ||
| echo "" | ||
| echo "Next steps:" | ||
| echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")" |
The instruction mentions 'VLLM_TOOL_PARSER="qwen_coder"', which matches the PR description and other documentation stating that qwen.sh should use the qwen_coder parser. However, the surrounding context in LITELLM-PROXY-SETUP.md (line 21) incorrectly states the parser should be empty. Ensure consistency across all documentation.
- **vLLM only** with various parsers (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: vLLM's `_postprocess_messages` validates all tool calls as JSON, causing failures with Qwen's XML output

### Current Approach (Working)
1. **vLLM** with `qwen3_coder` parser:
   - Parses Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`)
   - Converts to vLLM's internal tool call structure
   - No JSON validation issues because parsing happens BEFORE validation

2. **LiteLLM** as normalization layer:
   - Takes vLLM's parsed tool calls
   - Normalizes to OpenAI format expected by Continue.dev
   - Strips non-standard parameters like `supports_function_calling`
   - Handles retries and timeouts
This documentation states that the previous approach used 'vLLM only' but the current PR description and code changes indicate that vLLM now uses the 'qwen_coder' parser, not no parser. The description here should clarify that the previous approaches tried different parsers but had issues, and the current solution combines vLLM's qwen_coder parser with LiteLLM for normalization.
Suggested change:
- **vLLM with various parsers** (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: Each parser had issues: vLLM's `_postprocess_messages` step always validated tool calls as JSON, so Qwen's XML output (even when parsed) would fail validation, breaking tool calling.

### Current Approach (Working)
1. **vLLM** with `qwen_coder` parser:
   - Uses the `qwen_coder` parser to convert Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`) into vLLM's internal tool call structure.
   - Parsing happens before vLLM's JSON validation, so XML is correctly handled and converted.
2. **LiteLLM** as normalization layer:
   - Receives vLLM's parsed tool calls.
   - Normalizes them to the OpenAI format expected by Continue.dev.
   - Strips non-standard parameters like `supports_function_calling`.
   - Handles retries and timeouts.
apiBase: https://3clxt008hl0a3a-4000.proxy.runpod.net/v1
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
capabilities:
  - tool_use
The configuration includes a hardcoded API key and a specific RunPod URL. Replace with environment variable references or placeholders to avoid exposing sensitive information in documentation.
Suggested change:
apiBase: ${CONTINUE_API_BASE}  # e.g., https://<your-runpod-endpoint>:4000/v1
apiKey: ${CONTINUE_API_KEY}  # Set this environment variable to your API key
capabilities:
  - tool_use
# Replace the placeholders above with your actual API base URL and API key,
# or set the CONTINUE_API_BASE and CONTINUE_API_KEY environment variables.
provider: openai
model: qwen3-coder-30b  # Model name from litellm-config.yaml
apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1  # LiteLLM port
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
The configuration example includes a hardcoded API key. Replace with a placeholder like 'YOUR_LITELLM_API_KEY' to avoid exposing sensitive information in documentation.
Suggested change:
apiKey: YOUR_LITELLM_API_KEY
3. **vLLM** generates response (no tool parser, raw output)
4. **LiteLLM** parses the response and converts to OpenAI format
Step 3 in the 'How It Works' section states that vLLM generates response with 'no tool parser, raw output'. This contradicts the actual implementation where models/qwen.sh sets VLLM_TOOL_PARSER="qwen_coder". The documentation should accurately reflect that vLLM uses the qwen_coder parser to convert XML to JSON, which LiteLLM then normalizes to OpenAI format.
Suggested change:
3. **vLLM** uses the `qwen_coder` parser to convert XML tool calls to JSON
4. **LiteLLM** normalizes the JSON response to OpenAI format
| -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed" | ||
|
|
||
| echo "" | ||
| echo " Testing tool calling through LiteLLM..." | ||
| RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \ |
The API key 'sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the script and appears to be a real key used across multiple files in this PR. Hardcoding API keys in scripts and configuration files is a security risk, especially in version control. Consider using environment variables or a secrets management system instead.
      api_key: sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381
      supports_function_calling: true
      supports_parallel_function_calling: true
    model_info:
      mode: chat
      supports_function_calling: true
      supports_parallel_function_calling: true
      max_tokens: 8192  # Max completion tokens
      max_input_tokens: 131072  # 128K context

litellm_settings:
  drop_params: true  # Strip non-standard parameters
  json_logs: false  # Easier to read logs
  num_retries: 2
  request_timeout: 600
  modify_params: true
  success_callback: []

general_settings:
  master_key: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381  # Same key for simplicity
The vLLM API key 'sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the configuration file template. This same key is also used as the LiteLLM master key on line 38. Hardcoding API keys in configuration templates that are committed to version control is a security risk. Consider using environment variable substitution or prompting for keys during setup.
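A sketch of the env-substitution idea: write the config with an unquoted heredoc delimiter so the shell expands the key variables at setup time, leaving no literal secret in the committed script. It assumes VLLM_API_KEY and LITELLM_MASTER_KEY are exported beforehand (for example, generated as in the earlier key-generation sketch); the config excerpt is abbreviated.

```bash
# Unquoted EOF lets the shell expand ${...} when the config is written.
# Abbreviated excerpt; the real config contains the full model_list and settings.
cat > /workspace/litellm-config.yaml << EOF
model_list:
  - model_name: qwen3-coder-30b
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
      api_base: http://localhost:8000/v1
      api_key: ${VLLM_API_KEY}

general_settings:
  master_key: ${LITELLM_MASTER_KEY}
EOF
```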
| echo " tail -f ${LOG_FILE}" | ||
| exit 1 | ||
| fi | ||
|
|
The litellm command uses --host 0.0.0.0 which binds to all network interfaces, potentially exposing the proxy to external networks. While this might be intentional for RunPod's port forwarding, it could be a security concern. Consider documenting this choice or adding a warning about firewall configuration.
Suggested change:
# SECURITY WARNING
echo "⚠️ SECURITY WARNING: The LiteLLM proxy will bind to all network interfaces (0.0.0.0)."
echo "   This may expose the proxy to external networks. Ensure your firewall is configured"
echo "   to restrict access to trusted sources only."
echo ""
curl https://3clxt008hl0a3a-4000.proxy.runpod.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
The curl command includes a hardcoded API key and a specific RunPod URL. Replace with placeholders or environment variable references.
Suggested change:
curl <RUNPOD_URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <LITELLM_API_KEY>" \
# 5. Test
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
The curl command example includes a hardcoded API key. Replace with a placeholder or environment variable reference to avoid exposing sensitive information.
| -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \ | |
| -H "Authorization: Bearer YOUR_API_KEY_HERE" \ |
Matches the pattern of stop-server.sh for consistency:
- Checks PID file
- Graceful shutdown attempt
- Force kill if needed
- Cleans up PID file

Usage on RunPod:
./scripts/stop-litellm-proxy.sh
./scripts/start-litellm-proxy.sh
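A sketch of what such a stop script might contain, following the pattern the commit describes; the PID file path matches the one used elsewhere in this PR.

```bash
#!/bin/bash
# stop-litellm-proxy.sh (sketch): graceful stop with force-kill fallback.
PID_FILE="/workspace/logs/litellm-proxy.pid"

if [ ! -f "${PID_FILE}" ]; then
    echo "No PID file found at ${PID_FILE}; is the proxy running?"
    exit 1
fi

PID=$(cat "${PID_FILE}")
if kill -0 "${PID}" 2>/dev/null; then
    echo "Stopping LiteLLM proxy (PID ${PID})..."
    kill "${PID}"
    sleep 5
    # Force kill if the process is still alive after the grace period.
    if kill -0 "${PID}" 2>/dev/null; then
        echo "Still running; sending SIGKILL..."
        kill -9 "${PID}"
    fi
else
    echo "Process ${PID} not running; cleaning up stale PID file."
fi

rm -f "${PID_FILE}"
echo "Done."
```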
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
| echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")" | ||
| echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh" | ||
| echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000" | ||
| echo "" | ||
| echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen_coder parser) → Model" |
This next steps instruction contradicts the actual configuration in the script itself. It says to start vLLM with 'VLLM_TOOL_PARSER="qwen_coder"' (note: qwen_coder without the "3"), but line 14 of the generated config file and the models/qwen.sh file both use "qwen3_coder" (with the "3"). Additionally, the description states that vLLM should use the "qwen_coder parser" while the architecture diagram on line 54 shows "qwen3_coder parser". This inconsistency in the parser name will cause configuration errors.
| echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")" | |
| echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh" | |
| echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000" | |
| echo "" | |
| echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen_coder parser) → Model" | |
| echo "1. Start vLLM with qwen3_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen3_coder\")" | |
| echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh" | |
| echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000" | |
| echo "" | |
| echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen3_coder parser) → Model" |
curl -s http://localhost:8000/v1/models | jq -r '.data[].id' || echo "  ⚠️ vLLM test failed"

echo ""
echo "  Testing LiteLLM (port 4000)..."
curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo "  ⚠️ LiteLLM test failed"
The error handling here silently fails with a generic message. If the jq command fails (due to malformed JSON or connection issues), the user won't know the specific cause. Consider showing the actual error or the raw response to aid debugging.
Suggested change:
VLLM_MODELS_RESPONSE=$(curl -s http://localhost:8000/v1/models)
if ! echo "$VLLM_MODELS_RESPONSE" | jq -r '.data[].id'; then
    echo "  ⚠️ vLLM test failed"
    echo "  Raw response:"
    echo "$VLLM_MODELS_RESPONSE"
fi

echo ""
echo "  Testing LiteLLM (port 4000)..."
LITELLM_MODELS_RESPONSE=$(curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381")
if ! echo "$LITELLM_MODELS_RESPONSE" | jq -r '.data[].id'; then
    echo "  ⚠️ LiteLLM test failed"
    echo "  Raw response:"
    echo "$LITELLM_MODELS_RESPONSE"
fi
cat > /workspace/litellm-config.yaml << 'EOF'
model_list:
  # Qwen 3 Coder 30B - Primary model for tool calling
  # vLLM uses qwen_coder parser, LiteLLM normalizes to OpenAI format
The comment references "qwen_coder parser" but the correct parser name used throughout the codebase and vLLM is "qwen3_coder" (with the "3"). This typo could mislead users about which parser is actually being configured.
Suggested change:
  # vLLM uses qwen3_coder parser, LiteLLM normalizes to OpenAI format
curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo "  ⚠️ LiteLLM test failed"
Similar to line 70, this error handling is too generic. If the curl or jq command fails, the user won't know why. Consider capturing and displaying the actual error or response to help with troubleshooting.
Suggested change:
LITELLM_RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:4000/v1/models -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381")
LITELLM_BODY=$(echo "$LITELLM_RESPONSE" | sed '$d')
LITELLM_STATUS=$(echo "$LITELLM_RESPONSE" | tail -n1)
if [ "$LITELLM_STATUS" -ne 200 ]; then
    echo "  ⚠️ LiteLLM test failed (HTTP $LITELLM_STATUS)"
    echo "  Response:"
    echo "$LITELLM_BODY" | head -10
else
    if ! echo "$LITELLM_BODY" | jq -r '.data[].id'; then
        echo "  ⚠️ LiteLLM test failed (invalid JSON or missing .data[].id)"
        echo "  Response:"
        echo "$LITELLM_BODY" | head -10
    fi
fi
| echo "" | ||
| echo "2️⃣ Restarting vLLM server with qwen3_coder parser..." | ||
| echo " Stopping current vLLM server..." | ||
| pkill -f "python -m vllm.entrypoints.openai.api_server" || echo " No vLLM process found" |
The pkill command with '|| echo "No vLLM process found"' pattern could be misleading because pkill returns non-zero both when no process is found AND when the user lacks permissions to kill the process. Consider checking process existence separately before attempting to kill, or provide more specific error handling.
Suggested change:
if pgrep -f "python -m vllm.entrypoints.openai.api_server" > /dev/null; then
    if pkill -f "python -m vllm.entrypoints.openai.api_server"; then
        echo "   vLLM process stopped"
    else
        echo "   Failed to stop vLLM process (insufficient permissions or other error)"
    fi
else
    echo "   No vLLM process found"
fi
echo $! > "${PID_FILE}"

echo "✅ LiteLLM proxy started successfully!"
echo ""
echo "PID: $(cat ${PID_FILE})"
echo "Log file: ${LOG_FILE}"
echo ""
echo "📊 To monitor logs in real-time:"
echo "   tail -f ${LOG_FILE}"
echo ""
echo "🛑 To stop the proxy:"
echo "   kill $(cat ${PID_FILE})"
echo ""
echo "Waiting 5 seconds for proxy to initialize..."
sleep 5

# Check if process is still running
if kill -0 $(cat "${PID_FILE}") 2>/dev/null; then
Using command substitution within the echo message means if the PID file is removed between line 51 and this echo, the command will fail silently or show an empty value. Consider storing the PID in a variable after line 51 and using that variable throughout the script for consistency and reliability.
Suggested change:
LITELLM_PID=$!
echo "${LITELLM_PID}" > "${PID_FILE}"

echo "✅ LiteLLM proxy started successfully!"
echo ""
echo "PID: ${LITELLM_PID}"
echo "Log file: ${LOG_FILE}"
echo ""
echo "📊 To monitor logs in real-time:"
echo "   tail -f ${LOG_FILE}"
echo ""
echo "🛑 To stop the proxy:"
echo "   kill ${LITELLM_PID}"
echo ""
echo "Waiting 5 seconds for proxy to initialize..."
sleep 5

# Check if process is still running
if kill -0 "${LITELLM_PID}" 2>/dev/null; then
kill $(cat /workspace/logs/litellm-proxy.pid)

# 2. Update vLLM to use qwen3_coder parser
# Edit models/qwen.sh:
The instruction to manually edit models/qwen.sh contradicts the purpose of this PR, which includes that file with VLLM_TOOL_PARSER already set to "qwen3_coder". If users follow these steps, they would be manually making a change that should already be present from the PR. Either this step should be removed or clarified to indicate it's only needed if the file wasn't updated.
Suggested change:
# (If models/qwen.sh was not updated by the PR, set:)
Changed from 'openai/Qwen/...' to just 'Qwen/...' with explicit custom_llm_provider: openai This fixes the 'list index out of range' error when LiteLLM tries to connect to vLLM. The model name should match what vLLM is serving, not include a provider prefix.
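An illustrative excerpt of the model entry after this change, written as the heredoc the setup script would emit; the surrounding config is abbreviated and this is a sketch based on the commit message, not the exact committed file.

```bash
# Sketch of the updated model entry in /workspace/litellm-config.yaml (abbreviated).
cat > /workspace/litellm-config.yaml << 'EOF'
model_list:
  - model_name: qwen3-coder-30b
    litellm_params:
      # Matches the name vLLM serves; no "openai/" prefix in the model string.
      model: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
      # Provider declared explicitly instead of being encoded in the prefix.
      custom_llm_provider: openai
      api_base: http://localhost:8000/v1
EOF
```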
The router_settings with model_group_alias was causing 'list index out of range' errors. Since we only have one model, we don't need routing or fallback logic. Also set num_retries: 0 to prevent retry logic from interfering. This should fix the ServiceUnavailableError in Continue.dev.
The previous pkill wasn't reliably stopping vLLM. Now:
1. Use stop-server.sh script if available (proper graceful shutdown)
2. Multiple pkill patterns as fallback
3. Longer sleep to ensure process is stopped
4. Remove background execution of qwen.sh
This ensures vLLM truly restarts with the new qwen3_coder parser.
The qwen3_coder parser has a bug with streaming tool calling: IndexError: list index out of range in streamed_args_for_tool. vLLM logs show the error at serving_chat.py:1163 when Continue.dev uses streaming (which it does by default). The qwen3_xml parser should have better streaming support and still parses Qwen's XML tool format correctly.
qwen3_xml and qwen3_coder both have streaming issues causing JSON parsing errors. Hermes parser is more stable and widely tested for streaming tool calling.
All vLLM parsers have bugs:
- qwen3_coder: IndexError in streaming
- qwen3_xml: Malformed JSON responses
- hermes: Can't parse Qwen's XML format
Solution: Let vLLM output raw text, Continue.dev will handle tool parsing natively using the model's XML format.
Mistral parser uses simpler JSON format that may work better with Qwen model vs complex XML parsers.
qwen3_coder parser works perfectly in non-streaming mode. Force LiteLLM to disable streaming to avoid the IndexError bug.
The qwen3_coder parser has a streaming bug. The fix is to disable streaming in Continue.dev config by setting stream: false. Non-streaming mode works perfectly and is actually faster for tool calling scenarios.
LiteLLM proxy was unnecessary complexity. Continue.dev uses system message tools, not OpenAI tool calling format. Changes:
- Continue.dev points directly to vLLM port 8000
- Disabled vLLM tool parser (Continue.dev handles tools itself)
- Removed streaming workarounds (not needed without parser)
This allows Continue.dev to work with MCP tools using its native system message tool approach.
- Remove all LiteLLM proxy setup/restart steps
- Remove all Qwen references
- Use Gemma3-27B with native OpenAI tool parser
- Simplify to direct vLLM connection (port 8000)
- Update tests to target vLLM directly
Architecture: Continue.dev → vLLM → Gemma3-27B (openai parser)
- OpenAI parser failed with: 'requires token IDs and does not support text-based extraction'
- Hermes parser is more generic and works with instruction-tuned models
- No Gemma-specific parser exists in vLLM
- Disable vLLM tool parser completely (VLLM_TOOL_PARSER="")
- Continue.dev supports gpt-oss models natively via system message tools
- Update deployment script to use GPT-OSS instead of Gemma3
- This bypasses all vLLM parser bugs entirely

Problem
Tool calling with Qwen3-Coder-30B in Continue.dev fails with JSON parsing errors across all vLLM tool parsers:
- qwen3_coder: JSON escape character issues
- openai: Requires token IDs, not text-based
- qwen3_xml: vLLM still validates incoming messages as JSON
- hermes: Expects JSON but model outputs XML

Root Cause: vLLM's tool calling architecture expects consistent JSON format throughout, which Qwen3 doesn't reliably provide. The _postprocess_messages function validates incoming tool calls as JSON regardless of parser.

Solution: LiteLLM Proxy
Add LiteLLM proxy layer to handle tool calling format translation: Continue.dev → LiteLLM (port 4000) → vLLM (port 8000) → Model
How It Works
Benefits
Changes Made
New Scripts
- scripts/setup-litellm-proxy.sh - Installs LiteLLM and creates config
- scripts/start-litellm-proxy.sh - Starts LiteLLM proxy on port 4000
Configuration Updates
- models/qwen.sh - Removed tool parser (VLLM_TOOL_PARSER="")
Documentation
- docs/setup/LITELLM-PROXY-SETUP.md - Complete setup guide
Testing Required
1. Setup on RunPod
cd /workspace/llm-hosting
git pull origin feature/litellm-proxy-tool-calling
./scripts/setup-litellm-proxy.sh
2. Restart vLLM (no tool parser)
3. Start LiteLLM
4. Expose Port 4000
4000 → TCP
5. Update Continue.dev
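Taken together, steps 2–5 can be summarized in one sketch; the commands are taken from the scripts and docs in this PR, and the pod hostname is a placeholder.

```bash
# 2. Restart vLLM without a tool parser (qwen.sh now exports VLLM_TOOL_PARSER="")
./models/qwen.sh

# 3. Start the LiteLLM proxy on port 4000
./scripts/start-litellm-proxy.sh

# 4. Expose port 4000 (TCP) in the RunPod pod settings, then verify:
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id'

# 5. Point Continue.dev at the proxy:
#    apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1
#    apiKey:  sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
```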
6. Test Tool Calling
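A hedged example of step 6: exercising tool calling through the proxy with a minimal OpenAI-format tool definition. The get_weather tool and prompt are illustrative only; any tool schema works.

```bash
# Send a chat completion with one tool defined and inspect any returned tool calls.
curl -s http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
  -d '{
    "model": "qwen3-coder-30b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | jq '.choices[0].message.tool_calls'
```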
Alternative Considered
Disabling tool calling entirely was rejected because tool parsing is a priority requirement.
References