
feat: Add LiteLLM proxy for tool calling compatibility#6

Open

jsirish wants to merge 17 commits into main from feature/litellm-proxy-tool-calling

Conversation


@jsirish jsirish commented Dec 13, 2025

Problem

Tool calling with Qwen3-Coder-30B in Continue.dev fails with JSON parsing errors across all vLLM tool parsers:

  • qwen3_coder: JSON escape character issues
  • openai: Requires token IDs, not text-based
  • qwen3_xml: vLLM still validates incoming messages as JSON
  • hermes: Expects JSON but model outputs XML

Root Cause: vLLM's tool calling architecture expects consistent JSON format throughout, which Qwen3 doesn't reliably provide. The _postprocess_messages function validates incoming tool calls as JSON regardless of parser.

Solution: LiteLLM Proxy

Add LiteLLM proxy layer to handle tool calling format translation:

Continue.dev → LiteLLM Proxy (port 4000) → vLLM (port 8000, no parser)

How It Works

  1. Continue.dev sends OpenAI-format tool calling requests to LiteLLM
  2. LiteLLM forwards requests to vLLM (running without tool parser)
  3. vLLM generates raw text responses
  4. LiteLLM parses and converts responses to OpenAI format
  5. Continue.dev receives properly formatted tool calls
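The flow above starts with a standard OpenAI-format tools request. A minimal sketch of such a payload follows; the model name, tool definition, and file path are illustrative, not taken from the repo:

```shell
# Illustrative request payload in the OpenAI tools format that Continue.dev
# sends in step 1. The get_weather tool is a made-up example.
cat > /tmp/tool-call-request.json <<'EOF'
{
  "model": "qwen3-coder-30b",
  "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF
# It would be POSTed to the proxy (port 4000), not to vLLM directly, e.g.:
#   curl http://localhost:4000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#     -d @/tmp/tool-call-request.json
python3 -m json.tool /tmp/tool-call-request.json > /dev/null && echo "payload OK"
```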

Benefits

  • ✅ Bypasses vLLM's JSON validation issues
  • ✅ Handles format translation automatically
  • ✅ Compatible with Continue.dev's OpenAI expectations
  • ✅ No vLLM parser configuration needed
  • ✅ Can add retries, fallbacks, and load balancing

Changes Made

New Scripts

  • scripts/setup-litellm-proxy.sh - Installs LiteLLM and creates config
  • scripts/start-litellm-proxy.sh - Starts LiteLLM proxy on port 4000

Configuration Updates

  • models/qwen.sh - Removed tool parser (VLLM_TOOL_PARSER="")
    • Let vLLM generate raw output for LiteLLM to parse

Documentation

  • docs/setup/LITELLM-PROXY-SETUP.md - Complete setup guide
    • Installation steps
    • Port exposure configuration
    • Continue.dev integration
    • Troubleshooting tips

Testing Required

1. Setup on RunPod

cd /workspace/llm-hosting
git pull origin feature/litellm-proxy-tool-calling
./scripts/setup-litellm-proxy.sh

2. Restart vLLM (no tool parser)

./scripts/stop-server.sh
./models/qwen.sh

3. Start LiteLLM

./scripts/start-litellm-proxy.sh

4. Expose Port 4000

  • RunPod: Add port mapping 4000 → TCP
  • Get public URL

5. Update Continue.dev

apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1
model: qwen3-coder-30b

6. Test Tool Calling

  • Start fresh chat in Continue.dev
  • Try tool calling features
  • Verify no JSON parsing errors
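Step 3 can also be checked outside the IDE by inspecting the proxy's response shape. The sample response below is fabricated for illustration; the `jq` expressions assume the standard OpenAI chat-completions layout:

```shell
# Fabricated sample of a correctly normalized response (illustration only).
cat > /tmp/sample-response.json <<'EOF'
{"choices":[{"message":{"tool_calls":[{"type":"function",
 "function":{"name":"get_weather","arguments":"{\"city\":\"Boston\"}"}}]}}]}
EOF
# The failure mode this PR addresses shows up here: "arguments" must itself
# be valid JSON. jq's fromjson fails loudly if it is not.
jq -e '.choices[0].message.tool_calls[0].function.arguments | fromjson' \
  /tmp/sample-response.json > /dev/null && echo "tool call arguments parse cleanly"
```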

Alternative Considered

Disabling tool calling entirely was rejected because tool parsing is a priority requirement.

References

- Add setup-litellm-proxy.sh script to configure LiteLLM
- Add start-litellm-proxy.sh to run proxy on port 4000
- Update qwen.sh to disable vLLM tool parser (let LiteLLM handle it)
- Add comprehensive setup documentation

Architecture: Continue.dev → LiteLLM (port 4000) → vLLM (port 8000)

Benefits:
- Handles tool calling format translation
- Avoids vLLM JSON validation issues
- Compatible with Continue.dev OpenAI format
- No vLLM parser configuration needed

This solves the persistent JSON parsing errors by letting LiteLLM
handle all tool call parsing and format conversion.
Copilot AI review requested due to automatic review settings December 13, 2025 02:34

continue Bot commented Dec 13, 2025

All Green - Keep your PRs mergeable


All Green is an AI agent that automatically:

✅ Addresses code review comments

✅ Fixes failing CI checks

✅ Resolves merge conflicts

Copilot AI left a comment

Pull request overview

This PR introduces a LiteLLM proxy layer to resolve tool calling compatibility issues between Continue.dev and vLLM when using Qwen3-Coder-30B. The proxy sits between Continue.dev and vLLM to handle OpenAI format translation, bypassing vLLM's JSON validation issues that occur with various tool parsers (qwen3_coder, openai, qwen3_xml, hermes).

Key Changes

  • New setup and startup scripts for LiteLLM proxy installation and management
  • Configuration changes to disable vLLM's built-in tool parser in favor of LiteLLM handling
  • Comprehensive documentation covering installation, configuration, and troubleshooting

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

File Description
scripts/setup-litellm-proxy.sh Creates LiteLLM configuration with model definitions, API keys, and proxy settings
scripts/start-litellm-proxy.sh Starts LiteLLM proxy with validation checks for config file and vLLM availability
models/qwen.sh Updates Qwen configuration to disable vLLM tool parser (set to empty string)
docs/setup/LITELLM-PROXY-SETUP.md Provides complete setup guide including installation, port exposure, Continue.dev integration, and troubleshooting


Comment thread scripts/start-litellm-proxy.sh Outdated
Comment on lines +29 to +34
# Start LiteLLM proxy
echo "Starting LiteLLM..."
litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0

# If the above command fails, try with --detailed_debug
# litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug
Copilot AI Dec 13, 2025

The LiteLLM proxy process is started in the foreground without proper process management (no background execution, PID tracking, or log file redirection). This is inconsistent with how the vLLM server is managed in start-vllm-server.sh, which uses nohup, PID files, and log redirection. Consider implementing similar process management for LiteLLM to enable:

  1. Background execution with nohup
  2. PID file tracking at /workspace/logs/litellm-proxy.pid
  3. Log output to /workspace/logs/litellm-proxy.log
  4. Ability to stop/monitor the proxy using the same patterns as vLLM
Suggested change
# Start LiteLLM proxy
echo "Starting LiteLLM..."
litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0
# If the above command fails, try with --detailed_debug
# litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug
# Start LiteLLM proxy with process management
echo "Starting LiteLLM..."
# Ensure log directory exists
mkdir -p /workspace/logs
# Start LiteLLM in background with nohup, log output, and PID file
nohup litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 > /workspace/logs/litellm-proxy.log 2>&1 &
echo $! > /workspace/logs/litellm-proxy.pid
echo "LiteLLM proxy started with PID $(cat /workspace/logs/litellm-proxy.pid)"
echo "Logs: /workspace/logs/litellm-proxy.log"
# If the above command fails, try with --detailed_debug
# nohup litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug > /workspace/logs/litellm-proxy.log 2>&1 &
# echo $! > /workspace/logs/litellm-proxy.pid

litellm_params:
model: openai/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
api_base: http://localhost:8000/v1
api_key: sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381
Copilot AI Dec 13, 2025

The API key 'sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381' appears to be a fixed, hardcoded credential shared between vLLM and LiteLLM. This differs from the vLLM server's approach where API keys are either generated randomly or explicitly set via environment variables. Hardcoding credentials in configuration files can be a security risk if this configuration is committed to version control or shared. Consider generating a random key or reading from an environment variable like the vLLM server does.
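One way to follow that advice, sketched in shell; the `sk-litellm-` prefix and the 16-byte length are illustrative choices, and the env-variable fallback mirrors the vLLM server's env-or-random pattern described above:

```shell
# Use an externally supplied key if present, otherwise generate a random one,
# instead of committing a fixed credential. Prefix and length are illustrative.
LITELLM_MASTER_KEY="${LITELLM_MASTER_KEY:-sk-litellm-$(openssl rand -hex 16)}"
echo "master_key: ${LITELLM_MASTER_KEY}"
```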

success_callback: []

general_settings:
master_key: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381 # Same key for simplicity
Copilot AI Dec 13, 2025

The master_key 'sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the configuration file. This key controls access to the LiteLLM proxy. Hardcoding credentials poses a security risk if the configuration is committed to version control or shared. Consider generating a random key or reading from an environment variable, similar to how the vLLM server handles API keys.

Comment thread models/qwen.sh Outdated
export VLLM_MAX_MODEL_LEN=131072 # 128K tokens
export VLLM_GPU_MEMORY_UTIL=0.95 # Maximum GPU memory utilization
export VLLM_TOOL_PARSER="hermes" # Hermes-style tool calling (built into Qwen chat template)
export VLLM_TOOL_PARSER="" # No parser - let LiteLLM proxy handle tool calling
Copilot AI Dec 13, 2025

Setting VLLM_TOOL_PARSER to an empty string will cause start-vllm-server.sh to use the default value 'openai' instead of disabling the tool parser. In start-vllm-server.sh line 28, the code uses 'TOOL_PARSER="${VLLM_TOOL_PARSER:-openai}"' which applies the default when the variable is unset OR empty. This means the tool parser will still be active with 'openai' parser, defeating the purpose of this PR. To properly disable the tool parser, either:

  1. Pass a special value that vLLM recognizes as "no parser"
  2. Modify start-vllm-server.sh to handle empty strings differently and conditionally include the --tool-call-parser flag
Suggested change
export VLLM_TOOL_PARSER="" # No parser - let LiteLLM proxy handle tool calling
# export VLLM_TOOL_PARSER="" # No parser - let LiteLLM proxy handle tool calling
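The behavior the reviewer describes hinges on the colon in the parameter expansion. A minimal demonstration, safe to run anywhere (variable names mirror the scripts; the values are examples):

```shell
# qwen.sh sets an empty string intending "no parser", but the colon form of
# the default expansion also replaces empty values:
export VLLM_TOOL_PARSER=""

TOOL_PARSER="${VLLM_TOOL_PARSER:-openai}"    # ":-" -> default on unset OR empty
echo "colon form gives: ${TOOL_PARSER}"      # openai (parser still active!)

TOOL_PARSER_KEEP_EMPTY="${VLLM_TOOL_PARSER-openai}"  # "-" -> default only if unset
echo "colon-less form gives: '${TOOL_PARSER_KEEP_EMPTY}'"  # '' (empty preserved)
```

Switching `start-vllm-server.sh` to the colon-less form, plus conditionally appending the `--tool-call-parser` flag, is one way to honor an intentionally empty value.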

Comment thread scripts/setup-litellm-proxy.sh Outdated
json_logs: false # Easier to read logs
num_retries: 2
request_timeout: 600
modify_params: true
Copilot AI Dec 13, 2025

The 'success_callback' is set to an empty array but there's no explanation of what this field does or when it might be useful. Consider either removing this line (if it's not needed) or adding a comment explaining its purpose for future maintainers who might want to configure callbacks.

Suggested change
modify_params: true
modify_params: true
# List of callback functions to execute on successful requests.
# Leave empty unless you want to add custom success handlers.

Comment on lines +18 to +19
if ! curl -s http://localhost:8000/v1/models > /dev/null 2>&1; then
echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
Copilot AI Dec 13, 2025

The health check for vLLM uses 'curl -s' which silences all output including error messages. If the check fails, the user only sees a generic warning without understanding why the connection failed. Consider using 'curl -sf' to show HTTP errors or capture and display the actual error message to help with troubleshooting.

Suggested change
if ! curl -s http://localhost:8000/v1/models > /dev/null 2>&1; then
echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
VLLM_HEALTH_OUTPUT=$(curl -sf http://localhost:8000/v1/models 2>&1)
if [ $? -ne 0 ]; then
echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
echo "curl error message:"
echo "$VLLM_HEALTH_OUTPUT"

provider: openai
model: qwen3-coder-30b # Model name from litellm-config.yaml
apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1 # LiteLLM port
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
Copilot AI Dec 13, 2025

The documentation shows a hardcoded API key that matches the one in the setup script. This reinforces the pattern of using fixed credentials. Consider updating the documentation to mention that users should replace this with their own secure key, or indicate that the key will be generated during setup.

Comment thread scripts/setup-litellm-proxy.sh Outdated
echo "✅ LiteLLM config created at /workspace/litellm-config.yaml"
echo ""
echo "Next steps:"
echo "1. Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER=\"\")"
Copilot AI Dec 13, 2025

The instruction says 'Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER="")' which is misleading. The qwen.sh script already sets VLLM_TOOL_PARSER="" in this PR, so users don't need to manually specify it. The parenthetical note could confuse users into thinking they need to modify something. Consider revising to: 'Start vLLM: ./models/qwen.sh (tool parser now disabled)'

Suggested change
echo "1. Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER=\"\")"
echo "1. Start vLLM: ./models/qwen.sh (tool parser now disabled)"

set -e

echo "🔧 Installing LiteLLM..."
pip install 'litellm[proxy]' --quiet
Copilot AI Dec 13, 2025

Using '--quiet' flag for pip install suppresses all output including potential warnings or errors. If the installation fails or has issues, users won't see helpful error messages. Consider removing '--quiet' or replacing it with '--progress-bar off' to show errors while reducing visual noise.

Suggested change
pip install 'litellm[proxy]' --quiet
pip install 'litellm[proxy]' --progress-bar off

set -e

echo "🔧 Installing LiteLLM..."
pip install 'litellm[proxy]' --quiet
Copilot AI Dec 13, 2025

Running pip install 'litellm[proxy]' without pinning to a specific version or verifying integrity introduces a supply-chain risk: any compromise of the litellm package or its dependencies would automatically execute attacker-controlled code in your runtime. Since this script is part of your deployment path, an attacker could leverage a malicious package version to access data or modify model-serving behavior. Pin the dependency to a trusted version and/or verify hashes or signatures before installation to reduce this risk.
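A hedged sketch of that pinning suggestion. The version number below is purely illustrative, not a vetted release; pip's `--require-hashes` mode with per-package `--hash=sha256:...` entries is the stricter option the comment alludes to:

```shell
# Record a pinned version in a requirements file so deployments are
# reproducible. The version is a placeholder - substitute one you have vetted.
cat > /tmp/litellm-requirements.txt <<'EOF'
litellm[proxy]==1.44.0
EOF
# Then install with:  pip install -r /tmp/litellm-requirements.txt
echo "Pinned requirements written to /tmp/litellm-requirements.txt"
```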

- Updated models/qwen.sh: Set VLLM_TOOL_PARSER='qwen_coder' for XML parsing
- Updated scripts/setup-litellm-proxy.sh: Added supports_parallel_function_calling
- Created scripts/update-runpod-config.sh: Automated deployment script with testing
- Created docs/setup/LITELLM-VLLM-TOOLCALLING.md: Comprehensive architecture guide
- Created docs/setup/VLLM-LITELLM-INTEGRATION-DOCS.md: Context7 documentation findings
- Updated QUICK-UPDATE.md: TL;DR deployment instructions
- Updated scripts/start-litellm-proxy.sh: Background execution with logging

Architecture: Continue.dev -> LiteLLM (port 4000) -> vLLM (port 8000) -> Model
- vLLM parses Qwen's XML tool calls using qwen_coder parser
- LiteLLM normalizes to OpenAI format and strips non-standard parameters
- Fixes 'Invalid \escape' JSON parsing errors in Continue.dev

Error from RunPod showed parser should be 'qwen3_coder' not 'qwen_coder'.
Verified against vLLM official source code:
vllm/entrypoints/openai/tool_parsers/__init__.py line 115-118

Updated all occurrences:
- models/qwen.sh: VLLM_TOOL_PARSER='qwen3_coder'
- scripts/update-runpod-config.sh: References to qwen3_coder
- docs/setup/LITELLM-VLLM-TOOLCALLING.md
- docs/setup/VLLM-LITELLM-INTEGRATION-DOCS.md
- QUICK-UPDATE.md

Available parsers from vLLM source:
deepseek_v3, deepseek_v31, ernie45, glm45, granite, granite-20b-fc,
hermes, hunyuan_a13b, internlm, jamba, kimi_k2, llama3_json,
llama4_json, llama4_pythonic, longcat, minimax, minimax_m2, mistral,
olmo3, openai, phi4_mini_json, pythonic, qwen3_coder, qwen3_xml,
seed_oss, step3, xlam
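A quick guard against the `qwen_coder` / `qwen3_coder` mix-up this commit fixes: a sketch that validates a parser name against the list above before launching vLLM. The list is copied from this PR; the `VLLM_TOOL_PARSER` value is an example:

```shell
# Check a configured parser name against vLLM's known parser list (as quoted
# in this PR) before starting the server.
VALID_PARSERS="deepseek_v3 deepseek_v31 ernie45 glm45 granite granite-20b-fc \
hermes hunyuan_a13b internlm jamba kimi_k2 llama3_json llama4_json \
llama4_pythonic longcat minimax minimax_m2 mistral olmo3 openai \
phi4_mini_json pythonic qwen3_coder qwen3_xml seed_oss step3 xlam"
VLLM_TOOL_PARSER="qwen3_coder"
if echo "$VALID_PARSERS" | tr ' ' '\n' | grep -qx "$VLLM_TOOL_PARSER"; then
  echo "parser '${VLLM_TOOL_PARSER}' is valid"
else
  echo "unknown parser '${VLLM_TOOL_PARSER}' (did you mean qwen3_coder?)" >&2
fi
```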
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 21 comments.



echo "✅ LiteLLM config created at /workspace/litellm-config.yaml"
echo ""
echo "Next steps:"
echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")"
Copilot AI Dec 13, 2025

The instruction mentions 'VLLM_TOOL_PARSER="qwen_coder"' but this contradicts the PR description and other documentation which states that qwen.sh should use the qwen_coder parser. This comment is correct, but the surrounding context in LITELLM-PROXY-SETUP.md (line 21) incorrectly states the parser should be empty. Ensure consistency across all documentation.

Comment on lines +19 to +32
- **vLLM only** with various parsers (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: vLLM's `_postprocess_messages` validates all tool calls as JSON, causing failures with Qwen's XML output

### Current Approach (Working)
1. **vLLM** with `qwen3_coder` parser:
- Parses Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`)
- Converts to vLLM's internal tool call structure
- No JSON validation issues because parsing happens BEFORE validation

2. **LiteLLM** as normalization layer:
- Takes vLLM's parsed tool calls
- Normalizes to OpenAI format expected by Continue.dev
- Strips non-standard parameters like `supports_function_calling`
- Handles retries and timeouts
Copilot AI Dec 13, 2025

This documentation states that the previous approach used 'vLLM only' but the current PR description and code changes indicate that vLLM now uses the 'qwen_coder' parser, not no parser. The description here should clarify that the previous approaches tried different parsers but had issues, and the current solution combines vLLM's qwen_coder parser with LiteLLM for normalization.

Suggested change
- **vLLM only** with various parsers (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: vLLM's `_postprocess_messages` validates all tool calls as JSON, causing failures with Qwen's XML output
### Current Approach (Working)
1. **vLLM** with `qwen3_coder` parser:
- Parses Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`)
- Converts to vLLM's internal tool call structure
- No JSON validation issues because parsing happens BEFORE validation
2. **LiteLLM** as normalization layer:
- Takes vLLM's parsed tool calls
- Normalizes to OpenAI format expected by Continue.dev
- Strips non-standard parameters like `supports_function_calling`
- Handles retries and timeouts
- **vLLM with various parsers** (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: Each parser had issues—vLLM's `_postprocess_messages` step always validated tool calls as JSON, so Qwen's XML output (even when parsed) would fail validation, breaking tool calling.
### Current Approach (Working)
1. **vLLM** with `qwen_coder` parser:
- Uses the `qwen_coder` parser to convert Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`) into vLLM's internal tool call structure.
- Parsing happens before vLLM's JSON validation, so XML is correctly handled and converted.
2. **LiteLLM** as normalization layer:
- Receives vLLM's parsed tool calls.
- Normalizes them to the OpenAI format expected by Continue.dev.
- Strips non-standard parameters like `supports_function_calling`.
- Handles retries and timeouts.

Comment on lines +57 to +60
apiBase: https://3clxt008hl0a3a-4000.proxy.runpod.net/v1
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
capabilities:
- tool_use
Copilot AI Dec 13, 2025

The configuration includes a hardcoded API key and a specific RunPod URL. Replace with environment variable references or placeholders to avoid exposing sensitive information in documentation.

Suggested change
apiBase: https://3clxt008hl0a3a-4000.proxy.runpod.net/v1
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
capabilities:
- tool_use
apiBase: ${CONTINUE_API_BASE} # e.g., https://<your-runpod-endpoint>:4000/v1
apiKey: ${CONTINUE_API_KEY} # Set this environment variable to your API key
capabilities:
- tool_use
# Replace the placeholders above with your actual API base URL and API key,
# or set the CONTINUE_API_BASE and CONTINUE_API_KEY environment variables.

provider: openai
model: qwen3-coder-30b # Model name from litellm-config.yaml
apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1 # LiteLLM port
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
Copilot AI Dec 13, 2025

The configuration example includes a hardcoded API key. Replace with a placeholder like 'YOUR_LITELLM_API_KEY' to avoid exposing sensitive information in documentation.

Suggested change
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
apiKey: YOUR_LITELLM_API_KEY

Comment on lines +63 to +64
3. **vLLM** generates response (no tool parser, raw output)
4. **LiteLLM** parses the response and converts to OpenAI format
Copilot AI Dec 13, 2025

Step 3 in the 'How It Works' section states that vLLM generates response with 'no tool parser, raw output'. This contradicts the actual implementation where models/qwen.sh sets VLLM_TOOL_PARSER="qwen_coder". The documentation should accurately reflect that vLLM uses the qwen_coder parser to convert XML to JSON, which LiteLLM then normalizes to OpenAI format.

Suggested change
3. **vLLM** generates response (no tool parser, raw output)
4. **LiteLLM** parses the response and converts to OpenAI format
3. **vLLM** uses the `qwen_coder` parser to convert XML tool calls to JSON
4. **LiteLLM** normalizes the JSON response to OpenAI format

Comment thread scripts/update-runpod-config.sh Outdated
Comment on lines +75 to +81
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed"

echo ""
echo " Testing tool calling through LiteLLM..."
RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
Copilot AI Dec 13, 2025

The API key 'sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the script and appears to be a real key used across multiple files in this PR. Hardcoding API keys in scripts and configuration files is a security risk, especially in version control. Consider using environment variables or a secrets management system instead.

Comment on lines +19 to +38
api_key: sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381
supports_function_calling: true
supports_parallel_function_calling: true
model_info:
mode: chat
supports_function_calling: true
supports_parallel_function_calling: true
max_tokens: 8192 # Max completion tokens
max_input_tokens: 131072 # 128K context

litellm_settings:
drop_params: true # Strip non-standard parameters
json_logs: false # Easier to read logs
num_retries: 2
request_timeout: 600
modify_params: true
success_callback: []

general_settings:
master_key: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381 # Same key for simplicity
Copilot AI Dec 13, 2025

The vLLM API key 'sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the configuration file template. This same key is also used as the LiteLLM master key on line 38. Hardcoding API keys in configuration templates that are committed to version control is a security risk. Consider using environment variable substitution or prompting for keys during setup.

echo " tail -f ${LOG_FILE}"
exit 1
fi

Copilot AI Dec 13, 2025

The litellm command uses --host 0.0.0.0 which binds to all network interfaces, potentially exposing the proxy to external networks. While this might be intentional for RunPod's port forwarding, it could be a security concern. Consider documenting this choice or adding a warning about firewall configuration.

Suggested change
# SECURITY WARNING
echo "⚠️ SECURITY WARNING: The LiteLLM proxy will bind to all network interfaces (0.0.0.0)."
echo " This may expose the proxy to external networks. Ensure your firewall is configured"
echo " to restrict access to trusted sources only."
echo ""

Comment on lines +122 to +124
curl https://3clxt008hl0a3a-4000.proxy.runpod.net/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
Copilot AI Dec 13, 2025

The curl command includes a hardcoded API key and a specific RunPod URL. Replace with placeholders or environment variable references.

Suggested change
curl https://3clxt008hl0a3a-4000.proxy.runpod.net/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
curl <RUNPOD_URL>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <LITELLM_API_KEY>" \

Comment thread QUICK-UPDATE.md
# 5. Test
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
Copilot AI Dec 13, 2025

The curl command example includes a hardcoded API key. Replace with a placeholder or environment variable reference to avoid exposing sensitive information.

Suggested change
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
-H "Authorization: Bearer YOUR_API_KEY_HERE" \

Matches the pattern of stop-server.sh for consistency:
- Checks PID file
- Graceful shutdown attempt
- Force kill if needed
- Cleans up PID file

Usage on RunPod:
  ./scripts/stop-litellm-proxy.sh
  ./scripts/start-litellm-proxy.sh
Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.



Comment on lines +50 to +54
echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")"
echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh"
echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000"
echo ""
echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen_coder parser) → Model"
Copilot AI Dec 13, 2025

This next steps instruction contradicts the actual configuration in the script itself. It says to start vLLM with 'VLLM_TOOL_PARSER="qwen_coder"' (note: qwen_coder without the "3"), but line 14 of the generated config file and the models/qwen.sh file both use "qwen3_coder" (with the "3"). Additionally, the description states that vLLM should use the "qwen_coder parser" while the architecture diagram on line 54 shows "qwen3_coder parser". This inconsistency in the parser name will cause configuration errors.

Suggested change
echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")"
echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh"
echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000"
echo ""
echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen_coder parser) → Model"
echo "1. Start vLLM with qwen3_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen3_coder\")"
echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh"
echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000"
echo ""
echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen3_coder parser) → Model"

Comment thread scripts/update-runpod-config.sh Outdated
Comment on lines +70 to +75
curl -s http://localhost:8000/v1/models | jq -r '.data[].id' || echo " ⚠️ vLLM test failed"

echo ""
echo " Testing LiteLLM (port 4000)..."
curl -s http://localhost:4000/v1/models \
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed"
Copilot AI Dec 13, 2025

The error handling here silently fails with a generic message. If the jq command fails (due to malformed JSON or connection issues), the user won't know the specific cause. Consider showing the actual error or the raw response to aid debugging.

Suggested change
- curl -s http://localhost:8000/v1/models | jq -r '.data[].id' || echo " ⚠️ vLLM test failed"
- echo ""
- echo " Testing LiteLLM (port 4000)..."
- curl -s http://localhost:4000/v1/models \
-   -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed"
+ VLLM_MODELS_RESPONSE=$(curl -s http://localhost:8000/v1/models)
+ if ! echo "$VLLM_MODELS_RESPONSE" | jq -r '.data[].id'; then
+     echo " ⚠️ vLLM test failed"
+     echo " Raw response:"
+     echo "$VLLM_MODELS_RESPONSE"
+ fi
+ echo ""
+ echo " Testing LiteLLM (port 4000)..."
+ LITELLM_MODELS_RESPONSE=$(curl -s http://localhost:4000/v1/models \
+   -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381")
+ if ! echo "$LITELLM_MODELS_RESPONSE" | jq -r '.data[].id'; then
+     echo " ⚠️ LiteLLM test failed"
+     echo " Raw response:"
+     echo "$LITELLM_MODELS_RESPONSE"
+ fi

Comment thread scripts/setup-litellm-proxy.sh Outdated
cat > /workspace/litellm-config.yaml << 'EOF'
model_list:
# Qwen 3 Coder 30B - Primary model for tool calling
# vLLM uses qwen_coder parser, LiteLLM normalizes to OpenAI format

Copilot AI Dec 13, 2025


The comment references "qwen_coder parser" but the correct parser name used throughout the codebase and vLLM is "qwen3_coder" (with the "3"). This typo could mislead users about which parser is actually being configured.

Suggested change
- # vLLM uses qwen_coder parser, LiteLLM normalizes to OpenAI format
+ # vLLM uses qwen3_coder parser, LiteLLM normalizes to OpenAI format

Comment thread scripts/update-runpod-config.sh Outdated
Comment on lines +74 to +75
curl -s http://localhost:4000/v1/models \
-H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed"

Copilot AI Dec 13, 2025


Similar to line 70, this error handling is too generic. If the curl or jq command fails, the user won't know why. Consider capturing and displaying the actual error or response to help with troubleshooting.

Suggested change
- curl -s http://localhost:4000/v1/models \
-   -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed"
+ LITELLM_RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:4000/v1/models -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381")
+ LITELLM_BODY=$(echo "$LITELLM_RESPONSE" | sed '$d')
+ LITELLM_STATUS=$(echo "$LITELLM_RESPONSE" | tail -n1)
+ if [ "$LITELLM_STATUS" -ne 200 ]; then
+     echo " ⚠️ LiteLLM test failed (HTTP $LITELLM_STATUS)"
+     echo " Response:"
+     echo "$LITELLM_BODY" | head -10
+ else
+     if ! echo "$LITELLM_BODY" | jq -r '.data[].id'; then
+         echo " ⚠️ LiteLLM test failed (invalid JSON or missing .data[].id)"
+         echo " Response:"
+         echo "$LITELLM_BODY" | head -10
+     fi
+ fi

Comment thread scripts/update-runpod-config.sh Outdated
echo ""
echo "2️⃣ Restarting vLLM server with qwen3_coder parser..."
echo " Stopping current vLLM server..."
pkill -f "python -m vllm.entrypoints.openai.api_server" || echo " No vLLM process found"

Copilot AI Dec 13, 2025


The pkill command with '|| echo "No vLLM process found"' pattern could be misleading because pkill returns non-zero both when no process is found AND when the user lacks permissions to kill the process. Consider checking process existence separately before attempting to kill, or provide more specific error handling.

Suggested change
- pkill -f "python -m vllm.entrypoints.openai.api_server" || echo " No vLLM process found"
+ if pgrep -f "python -m vllm.entrypoints.openai.api_server" > /dev/null; then
+     if pkill -f "python -m vllm.entrypoints.openai.api_server"; then
+         echo " vLLM process stopped"
+     else
+         echo " Failed to stop vLLM process (insufficient permissions or other error)"
+     fi
+ else
+     echo " No vLLM process found"
+ fi

Comment on lines +51 to +68
echo $! > "${PID_FILE}"

echo "✅ LiteLLM proxy started successfully!"
echo ""
echo "PID: $(cat ${PID_FILE})"
echo "Log file: ${LOG_FILE}"
echo ""
echo "📊 To monitor logs in real-time:"
echo " tail -f ${LOG_FILE}"
echo ""
echo "🛑 To stop the proxy:"
echo " kill $(cat ${PID_FILE})"
echo ""
echo "Waiting 5 seconds for proxy to initialize..."
sleep 5

# Check if process is still running
if kill -0 $(cat "${PID_FILE}") 2>/dev/null; then

Copilot AI Dec 13, 2025


Using command substitution within the echo message means if the PID file is removed between line 51 and this echo, the command will fail silently or show an empty value. Consider storing the PID in a variable after line 51 and using that variable throughout the script for consistency and reliability.

Suggested change
- echo $! > "${PID_FILE}"
- echo "✅ LiteLLM proxy started successfully!"
- echo ""
- echo "PID: $(cat ${PID_FILE})"
- echo "Log file: ${LOG_FILE}"
- echo ""
- echo "📊 To monitor logs in real-time:"
- echo " tail -f ${LOG_FILE}"
- echo ""
- echo "🛑 To stop the proxy:"
- echo " kill $(cat ${PID_FILE})"
- echo ""
- echo "Waiting 5 seconds for proxy to initialize..."
- sleep 5
- # Check if process is still running
- if kill -0 $(cat "${PID_FILE}") 2>/dev/null; then
+ LITELLM_PID=$!
+ echo "${LITELLM_PID}" > "${PID_FILE}"
+ echo "✅ LiteLLM proxy started successfully!"
+ echo ""
+ echo "PID: ${LITELLM_PID}"
+ echo "Log file: ${LOG_FILE}"
+ echo ""
+ echo "📊 To monitor logs in real-time:"
+ echo " tail -f ${LOG_FILE}"
+ echo ""
+ echo "🛑 To stop the proxy:"
+ echo " kill ${LITELLM_PID}"
+ echo ""
+ echo "Waiting 5 seconds for proxy to initialize..."
+ sleep 5
+ # Check if process is still running
+ if kill -0 "${LITELLM_PID}" 2>/dev/null; then

Comment thread QUICK-UPDATE.md
kill $(cat /workspace/logs/litellm-proxy.pid)

# 2. Update vLLM to use qwen3_coder parser
# Edit models/qwen.sh:

Copilot AI Dec 13, 2025


The instruction to manually edit models/qwen.sh contradicts the purpose of this PR, which includes that file with VLLM_TOOL_PARSER already set to "qwen3_coder". If users follow these steps, they would be manually making a change that should already be present from the PR. Either this step should be removed or clarified to indicate it's only needed if the file wasn't updated.

Suggested change
- # Edit models/qwen.sh:
+ # (If models/qwen.sh was not updated by the PR, set:)

Copilot uses AI. Check for mistakes.
Changed from 'openai/Qwen/...' to just 'Qwen/...' with explicit
custom_llm_provider: openai

This fixes the 'list index out of range' error when LiteLLM tries
to connect to vLLM.

The model name should match what vLLM is serving, not include a
provider prefix.

The router_settings with model_group_alias was causing 'list index
out of range' errors. Since we only have one model, we don't need
routing or fallback logic.

Also set num_retries: 0 to prevent retry logic from interfering.

This should fix the ServiceUnavailableError in Continue.dev.
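The simplified proxy config described above could be sketched as follows (the model alias and served model id are assumptions; match them to what your vLLM instance actually serves):

```yaml
model_list:
  - model_name: qwen3-coder                      # alias Continue.dev requests (assumed name)
    litellm_params:
      model: Qwen/Qwen3-Coder-30B-A3B-Instruct   # must match vLLM's served model id, no provider prefix
      custom_llm_provider: openai                # treat vLLM as a plain OpenAI-compatible backend
      api_base: http://localhost:8000/v1
      num_retries: 0                             # keep retry logic from interfering with tool calls

# Note: no router_settings / model_group_alias — a single model needs no routing.
```

The key points are the bare model id, the explicit `custom_llm_provider: openai`, and the absence of any `router_settings` block.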
The previous pkill wasn't reliably stopping vLLM. Now:
1. Use stop-server.sh script if available (proper graceful shutdown)
2. Multiple pkill patterns as fallback
3. Longer sleep to ensure process is stopped
4. Remove background execution of qwen.sh

This ensures vLLM truly restarts with the new qwen3_coder parser.
The qwen3_coder parser has a bug with streaming tool calling:
IndexError: list index out of range in streamed_args_for_tool

vLLM logs show error at serving_chat.py:1163 when Continue.dev
uses streaming (which it does by default).

The qwen3_xml parser should have better streaming support and
still parses Qwen's XML tool format correctly.
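For context, the parser swap is just a change to vLLM's launch flags; roughly (model name is an assumption):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml
```

Swapping the `--tool-call-parser` value is all the later parser experiments (hermes, mistral, none) amount to.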
qwen3_xml and qwen3_coder both have streaming issues causing
JSON parsing errors. Hermes parser is more stable and widely tested
for streaming tool calling.

All vLLM parsers have bugs:
- qwen3_coder: IndexError in streaming
- qwen3_xml: Malformed JSON responses
- hermes: Can't parse Qwen's XML format

Solution: Let vLLM output raw text, Continue.dev will handle
tool parsing natively using the model's XML format.
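For reference, the raw text Qwen3-Coder emits for a tool call is XML-like rather than JSON, roughly of this shape (illustrative only, not an exact transcript):

```xml
<tool_call>
<function=read_file>
<parameter=path>
src/example.py
</parameter>
</function>
</tool_call>
```

This is why JSON-oriented parsers like hermes fail on it, while Continue.dev's own system-message tool handling can consume it directly.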
The Mistral parser uses a simpler JSON format that may work better
with the Qwen model than the complex XML parsers.

The qwen3_coder parser works perfectly in non-streaming mode.
Force LiteLLM to disable streaming to avoid the IndexError bug.

The qwen3_coder parser has a streaming bug. The fix is to disable
streaming in Continue.dev config by setting stream: false.

Non-streaming mode works perfectly and is actually faster for
tool calling scenarios.
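In Continue.dev's config.json, that workaround would look roughly like this (field names follow Continue's model schema, but the model alias, key placeholder, and the placement of the stream option are assumptions to verify against Continue's docs):

```json
{
  "models": [
    {
      "title": "Qwen3 Coder (vLLM via LiteLLM)",
      "provider": "openai",
      "model": "qwen3-coder",
      "apiBase": "http://localhost:4000/v1",
      "apiKey": "sk-litellm-...",
      "completionOptions": { "stream": false }
    }
  ]
}
```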
LiteLLM proxy was unnecessary complexity. Continue.dev uses
system message tools, not OpenAI tool calling format.

Changes:
- Continue.dev points directly to vLLM port 8000
- Disabled vLLM tool parser (Continue.dev handles tools itself)
- Removed streaming workarounds (not needed without parser)

This allows Continue.dev to work with MCP tools using its
native system message tool approach.

- Remove all LiteLLM proxy setup/restart steps
- Remove all Qwen references
- Use Gemma3-27B with native OpenAI tool parser
- Simplify to direct vLLM connection (port 8000)
- Update tests to target vLLM directly

Architecture: Continue.dev → vLLM → Gemma3-27B (openai parser)

- OpenAI parser failed with: 'requires token IDs and does not support text-based extraction'
- Hermes parser is more generic and works with instruction-tuned models
- No Gemma-specific parser exists in vLLM

- Disable vLLM tool parser completely (VLLM_TOOL_PARSER="")
- Continue.dev supports gpt-oss models natively via system message tools
- Update deployment script to use GPT-OSS instead of Gemma3
- This bypasses all vLLM parser bugs entirely