feat: Add LiteLLM proxy for tool calling compatibility #6
Conversation
- Add setup-litellm-proxy.sh script to configure LiteLLM
- Add start-litellm-proxy.sh to run proxy on port 4000
- Update qwen.sh to disable vLLM tool parser (let LiteLLM handle it)
- Add comprehensive setup documentation

Architecture: Continue.dev → LiteLLM (port 4000) → vLLM (port 8000)

Benefits:
- Handles tool calling format translation
- Avoids vLLM JSON validation issues
- Compatible with Continue.dev OpenAI format
- No vLLM parser configuration needed

This solves the persistent JSON parsing errors by letting LiteLLM handle all tool call parsing and format conversion.
Pull request overview
This PR introduces a LiteLLM proxy layer to resolve tool calling compatibility issues between Continue.dev and vLLM when using Qwen3-Coder-30B. The proxy sits between Continue.dev and vLLM to handle OpenAI format translation, bypassing vLLM's JSON validation issues that occur with various tool parsers (qwen3_coder, openai, qwen3_xml, hermes).
Key Changes
- New setup and startup scripts for LiteLLM proxy installation and management
- Configuration changes to disable vLLM's built-in tool parser in favor of LiteLLM handling
- Comprehensive documentation covering installation, configuration, and troubleshooting
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| scripts/setup-litellm-proxy.sh | Creates LiteLLM configuration with model definitions, API keys, and proxy settings |
| scripts/start-litellm-proxy.sh | Starts LiteLLM proxy with validation checks for config file and vLLM availability |
| models/qwen.sh | Updates Qwen configuration to disable vLLM tool parser (set to empty string) |
| docs/setup/LITELLM-PROXY-SETUP.md | Provides complete setup guide including installation, port exposure, Continue.dev integration, and troubleshooting |
# Start LiteLLM proxy
echo "Starting LiteLLM..."
litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0

# If the above command fails, try with --detailed_debug
# litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug
The LiteLLM proxy process is started in the foreground without proper process management (no background execution, PID tracking, or log file redirection). This is inconsistent with how the vLLM server is managed in start-vllm-server.sh, which uses nohup, PID files, and log redirection. Consider implementing similar process management for LiteLLM to enable:
- Background execution with nohup
- PID file tracking at /workspace/logs/litellm-proxy.pid
- Log output to /workspace/logs/litellm-proxy.log
- Ability to stop/monitor the proxy using the same patterns as vLLM
Suggested change:
# Start LiteLLM proxy with process management
echo "Starting LiteLLM..."
# Ensure log directory exists
mkdir -p /workspace/logs
# Start LiteLLM in background with nohup, log output, and PID file
nohup litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 > /workspace/logs/litellm-proxy.log 2>&1 &
echo $! > /workspace/logs/litellm-proxy.pid
echo "LiteLLM proxy started with PID $(cat /workspace/logs/litellm-proxy.pid)"
echo "Logs: /workspace/logs/litellm-proxy.log"
# If the above command fails, try with --detailed_debug
# nohup litellm --config /workspace/litellm-config.yaml --port 4000 --host 0.0.0.0 --detailed_debug > /workspace/logs/litellm-proxy.log 2>&1 &
# echo $! > /workspace/logs/litellm-proxy.pid
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
      api_base: http://localhost:8000/v1
      api_key: sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381
The API key 'sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381' appears to be a fixed, hardcoded credential shared between vLLM and LiteLLM. This differs from the vLLM server's approach where API keys are either generated randomly or explicitly set via environment variables. Hardcoding credentials in configuration files can be a security risk if this configuration is committed to version control or shared. Consider generating a random key or reading from an environment variable like the vLLM server does.
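A minimal sketch of the suggested alternative, assuming the setup script is free to pick its own variable names (VLLM_API_KEY here is illustrative, not necessarily what start-vllm-server.sh uses): read the key from the environment, or generate a random one when none is set.

```bash
# Use an existing key from the environment, or generate a random one.
# Variable name is an assumption for this sketch.
VLLM_API_KEY="${VLLM_API_KEY:-sk-vllm-$(openssl rand -hex 16)}"
echo "Using vLLM API key: ${VLLM_API_KEY}"
```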
  success_callback: []

general_settings:
  master_key: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381  # Same key for simplicity
The master_key 'sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the configuration file. This key controls access to the LiteLLM proxy. Hardcoding credentials poses a security risk if the configuration is committed to version control or shared. Consider generating a random key or reading from an environment variable, similar to how the vLLM server handles API keys.
export VLLM_MAX_MODEL_LEN=131072  # 128K tokens
export VLLM_GPU_MEMORY_UTIL=0.95  # Maximum GPU memory utilization
export VLLM_TOOL_PARSER="hermes"  # Hermes-style tool calling (built into Qwen chat template)
export VLLM_TOOL_PARSER=""  # No parser - let LiteLLM proxy handle tool calling
Setting VLLM_TOOL_PARSER to an empty string will cause start-vllm-server.sh to use the default value 'openai' instead of disabling the tool parser. In start-vllm-server.sh line 28, the code uses 'TOOL_PARSER="${VLLM_TOOL_PARSER:-openai}"' which applies the default when the variable is unset OR empty. This means the tool parser will still be active with 'openai' parser, defeating the purpose of this PR. To properly disable the tool parser, either:
- Pass a special value that vLLM recognizes as "no parser"
- Modify start-vllm-server.sh to handle empty strings differently and conditionally include the --tool-call-parser flag
Suggested change:
# export VLLM_TOOL_PARSER=""  # No parser - let LiteLLM proxy handle tool calling
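One way start-vllm-server.sh could be adjusted so an empty VLLM_TOOL_PARSER actually disables the parser. This is a sketch only; the flag names follow the review comment, and the real script passes many more options.

```bash
# Apply the default only when the variable is unset, not when it is empty
# (note: no colon in the expansion).
TOOL_PARSER="${VLLM_TOOL_PARSER-openai}"

# Add the tool-parser flags only when a parser is actually configured.
VLLM_ARGS=()
if [ -n "${TOOL_PARSER}" ]; then
    VLLM_ARGS+=(--enable-auto-tool-choice --tool-call-parser "${TOOL_PARSER}")
fi

# Hypothetical invocation; the actual script includes model and memory options.
python -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
```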
  json_logs: false  # Easier to read logs
  num_retries: 2
  request_timeout: 600
  modify_params: true
The 'success_callback' is set to an empty array but there's no explanation of what this field does or when it might be useful. Consider either removing this line (if it's not needed) or adding a comment explaining its purpose for future maintainers who might want to configure callbacks.
Suggested change:
  modify_params: true
  # List of callback functions to execute on successful requests.
  # Leave empty unless you want to add custom success handlers.
if ! curl -s http://localhost:8000/v1/models > /dev/null 2>&1; then
    echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
The health check for vLLM uses 'curl -s' which silences all output including error messages. If the check fails, the user only sees a generic warning without understanding why the connection failed. Consider using 'curl -sf' to show HTTP errors or capture and display the actual error message to help with troubleshooting.
Suggested change:
VLLM_HEALTH_OUTPUT=$(curl -sf http://localhost:8000/v1/models 2>&1)
if [ $? -ne 0 ]; then
    echo "⚠️ Warning: vLLM doesn't appear to be running on port 8000"
    echo "curl error message:"
    echo "$VLLM_HEALTH_OUTPUT"
provider: openai
model: qwen3-coder-30b  # Model name from litellm-config.yaml
apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1  # LiteLLM port
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
The documentation shows a hardcoded API key that matches the one in the setup script. This reinforces the pattern of using fixed credentials. Consider updating the documentation to mention that users should replace this with their own secure key, or indicate that the key will be generated during setup.
| echo "✅ LiteLLM config created at /workspace/litellm-config.yaml" | ||
| echo "" | ||
| echo "Next steps:" | ||
| echo "1. Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER=\"\")" |
The instruction says 'Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER="")' which is misleading. The qwen.sh script already sets VLLM_TOOL_PARSER="" in this PR, so users don't need to manually specify it. The parenthetical note could confuse users into thinking they need to modify something. Consider revising to: 'Start vLLM: ./models/qwen.sh (tool parser now disabled)'
| echo "1. Start vLLM WITHOUT tool parser: ./models/qwen.sh (with VLLM_TOOL_PARSER=\"\")" | |
| echo "1. Start vLLM: ./models/qwen.sh (tool parser now disabled)" |
set -e

echo "🔧 Installing LiteLLM..."
pip install 'litellm[proxy]' --quiet
Using '--quiet' flag for pip install suppresses all output including potential warnings or errors. If the installation fails or has issues, users won't see helpful error messages. Consider removing '--quiet' or replacing it with '--progress-bar off' to show errors while reducing visual noise.
Suggested change:
pip install 'litellm[proxy]' --progress-bar off
set -e

echo "🔧 Installing LiteLLM..."
pip install 'litellm[proxy]' --quiet
Running pip install 'litellm[proxy]' without pinning to a specific version or verifying integrity introduces a supply-chain risk: any compromise of the litellm package or its dependencies would automatically execute attacker-controlled code in your runtime. Since this script is part of your deployment path, an attacker could leverage a malicious package version to access data or modify model-serving behavior. Pin the dependency to a trusted version and/or verify hashes or signatures before installation to reduce this risk.
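A hedged sketch of the pinning idea; the version number below is a placeholder, not a vetted release, and the requirements file name is hypothetical.

```bash
# Pin LiteLLM to a version the team has reviewed (placeholder version shown).
LITELLM_VERSION="1.0.0"
pip install "litellm[proxy]==${LITELLM_VERSION}" --progress-bar off

# For stronger supply-chain guarantees, install from a hash-locked requirements file:
# pip install --require-hashes -r requirements-litellm.txt
```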
- Updated models/qwen.sh: Set VLLM_TOOL_PARSER='qwen_coder' for XML parsing
- Updated scripts/setup-litellm-proxy.sh: Added supports_parallel_function_calling
- Created scripts/update-runpod-config.sh: Automated deployment script with testing
- Created docs/setup/LITELLM-VLLM-TOOLCALLING.md: Comprehensive architecture guide
- Created docs/setup/VLLM-LITELLM-INTEGRATION-DOCS.md: Context7 documentation findings
- Updated QUICK-UPDATE.md: TL;DR deployment instructions
- Updated scripts/start-litellm-proxy.sh: Background execution with logging

Architecture: Continue.dev -> LiteLLM (port 4000) -> vLLM (port 8000) -> Model
- vLLM parses Qwen's XML tool calls using qwen_coder parser
- LiteLLM normalizes to OpenAI format and strips non-standard parameters
- Fixes 'Invalid \escape' JSON parsing errors in Continue.dev
Error from RunPod showed parser should be 'qwen3_coder' not 'qwen_coder'. Verified against vLLM official source code: vllm/entrypoints/openai/tool_parsers/__init__.py line 115-118

Updated all occurrences:
- models/qwen.sh: VLLM_TOOL_PARSER='qwen3_coder'
- scripts/update-runpod-config.sh: References to qwen3_coder
- docs/setup/LITELLM-VLLM-TOOLCALLING.md
- docs/setup/VLLM-LITELLM-INTEGRATION-DOCS.md
- QUICK-UPDATE.md

Available parsers from vLLM source: deepseek_v3, deepseek_v31, ernie45, glm45, granite, granite-20b-fc, hermes, hunyuan_a13b, internlm, jamba, kimi_k2, llama3_json, llama4_json, llama4_pythonic, longcat, minimax, minimax_m2, mistral, olmo3, openai, phi4_mini_json, pythonic, qwen3_coder, qwen3_xml, seed_oss, step3, xlam
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 21 comments.
| echo "✅ LiteLLM config created at /workspace/litellm-config.yaml" | ||
| echo "" | ||
| echo "Next steps:" | ||
| echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")" |
The instruction mentions 'VLLM_TOOL_PARSER="qwen_coder"', which matches the PR description and other documentation stating that qwen.sh should use the qwen_coder parser. However, the surrounding context in LITELLM-PROXY-SETUP.md (line 21) incorrectly states the parser should be empty. Ensure consistency across all documentation.
- **vLLM only** with various parsers (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: vLLM's `_postprocess_messages` validates all tool calls as JSON, causing failures with Qwen's XML output

### Current Approach (Working)
1. **vLLM** with `qwen3_coder` parser:
   - Parses Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`)
   - Converts to vLLM's internal tool call structure
   - No JSON validation issues because parsing happens BEFORE validation

2. **LiteLLM** as normalization layer:
   - Takes vLLM's parsed tool calls
   - Normalizes to OpenAI format expected by Continue.dev
   - Strips non-standard parameters like `supports_function_calling`
   - Handles retries and timeouts
This documentation states that the previous approach used 'vLLM only' but the current PR description and code changes indicate that vLLM now uses the 'qwen_coder' parser, not no parser. The description here should clarify that the previous approaches tried different parsers but had issues, and the current solution combines vLLM's qwen_coder parser with LiteLLM for normalization.
Suggested change:
- **vLLM with various parsers** (`qwen3_coder`, `qwen3_xml`, `hermes`, `openai`)
- **Problem**: Each parser had issues: vLLM's `_postprocess_messages` step always validated tool calls as JSON, so Qwen's XML output (even when parsed) would fail validation, breaking tool calling.

### Current Approach (Working)
1. **vLLM** with `qwen_coder` parser:
   - Uses the `qwen_coder` parser to convert Qwen's XML tool call format (`<tool_call>`, `<function>`, `<parameter>`) into vLLM's internal tool call structure.
   - Parsing happens before vLLM's JSON validation, so XML is correctly handled and converted.
2. **LiteLLM** as normalization layer:
   - Receives vLLM's parsed tool calls.
   - Normalizes them to the OpenAI format expected by Continue.dev.
   - Strips non-standard parameters like `supports_function_calling`.
   - Handles retries and timeouts.
apiBase: https://3clxt008hl0a3a-4000.proxy.runpod.net/v1
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
capabilities:
  - tool_use
The configuration includes a hardcoded API key and a specific RunPod URL. Replace with environment variable references or placeholders to avoid exposing sensitive information in documentation.
Suggested change:
apiBase: ${CONTINUE_API_BASE}  # e.g., https://<your-runpod-endpoint>:4000/v1
apiKey: ${CONTINUE_API_KEY}  # Set this environment variable to your API key
capabilities:
  - tool_use
# Replace the placeholders above with your actual API base URL and API key,
# or set the CONTINUE_API_BASE and CONTINUE_API_KEY environment variables.
provider: openai
model: qwen3-coder-30b  # Model name from litellm-config.yaml
apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1  # LiteLLM port
apiKey: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
The configuration example includes a hardcoded API key. Replace with a placeholder like 'YOUR_LITELLM_API_KEY' to avoid exposing sensitive information in documentation.
Suggested change:
apiKey: YOUR_LITELLM_API_KEY
3. **vLLM** generates response (no tool parser, raw output)
4. **LiteLLM** parses the response and converts to OpenAI format
Step 3 in the 'How It Works' section states that vLLM generates response with 'no tool parser, raw output'. This contradicts the actual implementation where models/qwen.sh sets VLLM_TOOL_PARSER="qwen_coder". The documentation should accurately reflect that vLLM uses the qwen_coder parser to convert XML to JSON, which LiteLLM then normalizes to OpenAI format.
Suggested change:
3. **vLLM** uses the `qwen_coder` parser to convert XML tool calls to JSON
4. **LiteLLM** normalizes the JSON response to OpenAI format
| -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo " ⚠️ LiteLLM test failed" | ||
|
|
||
| echo "" | ||
| echo " Testing tool calling through LiteLLM..." | ||
| RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \ |
The API key 'sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the script and appears to be a real key used across multiple files in this PR. Hardcoding API keys in scripts and configuration files is a security risk, especially in version control. Consider using environment variables or a secrets management system instead.
      api_key: sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381
      supports_function_calling: true
      supports_parallel_function_calling: true
    model_info:
      mode: chat
      supports_function_calling: true
      supports_parallel_function_calling: true
      max_tokens: 8192  # Max completion tokens
      max_input_tokens: 131072  # 128K context

litellm_settings:
  drop_params: true  # Strip non-standard parameters
  json_logs: false  # Easier to read logs
  num_retries: 2
  request_timeout: 600
  modify_params: true
  success_callback: []

general_settings:
  master_key: sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381  # Same key for simplicity
The vLLM API key 'sk-vllm-c9be6c31b9f1ebd5bc5a316ac7d71381' is hardcoded in the configuration file template. This same key is also used as the LiteLLM master key on line 38. Hardcoding API keys in configuration templates that are committed to version control is a security risk. Consider using environment variable substitution or prompting for keys during setup.
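A sketch of the env-substitution idea: write the config with an unquoted heredoc delimiter so the shell expands the key variables at setup time, leaving no literal secret in the committed script. It assumes VLLM_API_KEY and LITELLM_MASTER_KEY are exported beforehand (for example, generated as in the earlier key-generation sketch); the config excerpt is abbreviated.

```bash
# Unquoted EOF lets the shell expand ${...} when the config is written.
# Abbreviated excerpt; the real config contains the full model_list and settings.
cat > /workspace/litellm-config.yaml << EOF
model_list:
  - model_name: qwen3-coder-30b
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
      api_base: http://localhost:8000/v1
      api_key: ${VLLM_API_KEY}

general_settings:
  master_key: ${LITELLM_MASTER_KEY}
EOF
```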
| echo " tail -f ${LOG_FILE}" | ||
| exit 1 | ||
| fi | ||
|
|
The litellm command uses --host 0.0.0.0 which binds to all network interfaces, potentially exposing the proxy to external networks. While this might be intentional for RunPod's port forwarding, it could be a security concern. Consider documenting this choice or adding a warning about firewall configuration.
Suggested change:
# SECURITY WARNING
echo "⚠️ SECURITY WARNING: The LiteLLM proxy will bind to all network interfaces (0.0.0.0)."
echo "   This may expose the proxy to external networks. Ensure your firewall is configured"
echo "   to restrict access to trusted sources only."
echo ""
curl https://3clxt008hl0a3a-4000.proxy.runpod.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
The curl command includes a hardcoded API key and a specific RunPod URL. Replace with placeholders or environment variable references.
Suggested change:
curl <RUNPOD_URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <LITELLM_API_KEY>" \
# 5. Test
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
The curl command example includes a hardcoded API key. Replace with a placeholder or environment variable reference to avoid exposing sensitive information.
| -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \ | |
| -H "Authorization: Bearer YOUR_API_KEY_HERE" \ |
Matches the pattern of stop-server.sh for consistency:
- Checks PID file
- Graceful shutdown attempt
- Force kill if needed
- Cleans up PID file

Usage on RunPod:
./scripts/stop-litellm-proxy.sh
./scripts/start-litellm-proxy.sh
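A sketch of what such a stop script might contain, following the pattern the commit describes; the PID file path matches the one used elsewhere in this PR.

```bash
#!/bin/bash
# stop-litellm-proxy.sh (sketch): graceful stop with force-kill fallback.
PID_FILE="/workspace/logs/litellm-proxy.pid"

if [ ! -f "${PID_FILE}" ]; then
    echo "No PID file found at ${PID_FILE}; is the proxy running?"
    exit 1
fi

PID=$(cat "${PID_FILE}")
if kill -0 "${PID}" 2>/dev/null; then
    echo "Stopping LiteLLM proxy (PID ${PID})..."
    kill "${PID}"
    sleep 5
    # Force kill if the process is still alive after the grace period.
    if kill -0 "${PID}" 2>/dev/null; then
        echo "Still running; sending SIGKILL..."
        kill -9 "${PID}"
    fi
else
    echo "Process ${PID} not running; cleaning up stale PID file."
fi

rm -f "${PID_FILE}"
echo "Done."
```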
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
| echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")" | ||
| echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh" | ||
| echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000" | ||
| echo "" | ||
| echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen_coder parser) → Model" |
This next steps instruction contradicts the actual configuration in the script itself. It says to start vLLM with 'VLLM_TOOL_PARSER="qwen_coder"' (note: qwen_coder without the "3"), but line 14 of the generated config file and the models/qwen.sh file both use "qwen3_coder" (with the "3"). Additionally, the description states that vLLM should use the "qwen_coder parser" while the architecture diagram on line 54 shows "qwen3_coder parser". This inconsistency in the parser name will cause configuration errors.
| echo "1. Start vLLM with qwen_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen_coder\")" | |
| echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh" | |
| echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000" | |
| echo "" | |
| echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen_coder parser) → Model" | |
| echo "1. Start vLLM with qwen3_coder parser: ./models/qwen.sh (VLLM_TOOL_PARSER=\"qwen3_coder\")" | |
| echo "2. Start LiteLLM proxy: ./scripts/start-litellm-proxy.sh" | |
| echo "3. Update Continue.dev to use: http://localhost:4000 or https://...proxy.runpod.net:4000" | |
| echo "" | |
| echo "Architecture: Continue.dev → LiteLLM (format normalization) → vLLM (qwen3_coder parser) → Model" |
curl -s http://localhost:8000/v1/models | jq -r '.data[].id' || echo "  ⚠️ vLLM test failed"

echo ""
echo "  Testing LiteLLM (port 4000)..."
curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo "  ⚠️ LiteLLM test failed"
The error handling here silently fails with a generic message. If the jq command fails (due to malformed JSON or connection issues), the user won't know the specific cause. Consider showing the actual error or the raw response to aid debugging.
Suggested change:
VLLM_MODELS_RESPONSE=$(curl -s http://localhost:8000/v1/models)
if ! echo "$VLLM_MODELS_RESPONSE" | jq -r '.data[].id'; then
    echo "  ⚠️ vLLM test failed"
    echo "  Raw response:"
    echo "$VLLM_MODELS_RESPONSE"
fi

echo ""
echo "  Testing LiteLLM (port 4000)..."
LITELLM_MODELS_RESPONSE=$(curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381")
if ! echo "$LITELLM_MODELS_RESPONSE" | jq -r '.data[].id'; then
    echo "  ⚠️ LiteLLM test failed"
    echo "  Raw response:"
    echo "$LITELLM_MODELS_RESPONSE"
fi
cat > /workspace/litellm-config.yaml << 'EOF'
model_list:
  # Qwen 3 Coder 30B - Primary model for tool calling
  # vLLM uses qwen_coder parser, LiteLLM normalizes to OpenAI format
The comment references "qwen_coder parser" but the correct parser name used throughout the codebase and vLLM is "qwen3_coder" (with the "3"). This typo could mislead users about which parser is actually being configured.
Suggested change:
  # vLLM uses qwen3_coder parser, LiteLLM normalizes to OpenAI format
curl -s http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id' || echo "  ⚠️ LiteLLM test failed"
Similar to line 70, this error handling is too generic. If the curl or jq command fails, the user won't know why. Consider capturing and displaying the actual error or response to help with troubleshooting.
Suggested change:
LITELLM_RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:4000/v1/models -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381")
LITELLM_BODY=$(echo "$LITELLM_RESPONSE" | sed '$d')
LITELLM_STATUS=$(echo "$LITELLM_RESPONSE" | tail -n1)
if [ "$LITELLM_STATUS" -ne 200 ]; then
    echo "  ⚠️ LiteLLM test failed (HTTP $LITELLM_STATUS)"
    echo "  Response:"
    echo "$LITELLM_BODY" | head -10
else
    if ! echo "$LITELLM_BODY" | jq -r '.data[].id'; then
        echo "  ⚠️ LiteLLM test failed (invalid JSON or missing .data[].id)"
        echo "  Response:"
        echo "$LITELLM_BODY" | head -10
    fi
fi
| echo "" | ||
| echo "2️⃣ Restarting vLLM server with qwen3_coder parser..." | ||
| echo " Stopping current vLLM server..." | ||
| pkill -f "python -m vllm.entrypoints.openai.api_server" || echo " No vLLM process found" |
The pkill command with '|| echo "No vLLM process found"' pattern could be misleading because pkill returns non-zero both when no process is found AND when the user lacks permissions to kill the process. Consider checking process existence separately before attempting to kill, or provide more specific error handling.
Suggested change:
if pgrep -f "python -m vllm.entrypoints.openai.api_server" > /dev/null; then
    if pkill -f "python -m vllm.entrypoints.openai.api_server"; then
        echo "   vLLM process stopped"
    else
        echo "   Failed to stop vLLM process (insufficient permissions or other error)"
    fi
else
    echo "   No vLLM process found"
fi
echo $! > "${PID_FILE}"

echo "✅ LiteLLM proxy started successfully!"
echo ""
echo "PID: $(cat ${PID_FILE})"
echo "Log file: ${LOG_FILE}"
echo ""
echo "📊 To monitor logs in real-time:"
echo "   tail -f ${LOG_FILE}"
echo ""
echo "🛑 To stop the proxy:"
echo "   kill $(cat ${PID_FILE})"
echo ""
echo "Waiting 5 seconds for proxy to initialize..."
sleep 5

# Check if process is still running
if kill -0 $(cat "${PID_FILE}") 2>/dev/null; then
Using command substitution within the echo message means if the PID file is removed between line 51 and this echo, the command will fail silently or show an empty value. Consider storing the PID in a variable after line 51 and using that variable throughout the script for consistency and reliability.
Suggested change:
LITELLM_PID=$!
echo "${LITELLM_PID}" > "${PID_FILE}"

echo "✅ LiteLLM proxy started successfully!"
echo ""
echo "PID: ${LITELLM_PID}"
echo "Log file: ${LOG_FILE}"
echo ""
echo "📊 To monitor logs in real-time:"
echo "   tail -f ${LOG_FILE}"
echo ""
echo "🛑 To stop the proxy:"
echo "   kill ${LITELLM_PID}"
echo ""
echo "Waiting 5 seconds for proxy to initialize..."
sleep 5

# Check if process is still running
if kill -0 "${LITELLM_PID}" 2>/dev/null; then
kill $(cat /workspace/logs/litellm-proxy.pid)

# 2. Update vLLM to use qwen3_coder parser
# Edit models/qwen.sh:
The instruction to manually edit models/qwen.sh contradicts the purpose of this PR, which includes that file with VLLM_TOOL_PARSER already set to "qwen3_coder". If users follow these steps, they would be manually making a change that should already be present from the PR. Either this step should be removed or clarified to indicate it's only needed if the file wasn't updated.
Suggested change:
# (If models/qwen.sh was not updated by the PR, set:)
Changed from 'openai/Qwen/...' to just 'Qwen/...' with explicit custom_llm_provider: openai This fixes the 'list index out of range' error when LiteLLM tries to connect to vLLM. The model name should match what vLLM is serving, not include a provider prefix.
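An illustrative excerpt of the model entry after this change, written as the heredoc the setup script would emit; the surrounding config is abbreviated and this is a sketch based on the commit message, not the exact committed file.

```bash
# Sketch of the updated model entry in /workspace/litellm-config.yaml (abbreviated).
cat > /workspace/litellm-config.yaml << 'EOF'
model_list:
  - model_name: qwen3-coder-30b
    litellm_params:
      # Matches the name vLLM serves; no "openai/" prefix in the model string.
      model: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
      # Provider declared explicitly instead of being encoded in the prefix.
      custom_llm_provider: openai
      api_base: http://localhost:8000/v1
EOF
```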
The router_settings with model_group_alias was causing 'list index out of range' errors. Since we only have one model, we don't need routing or fallback logic. Also set num_retries: 0 to prevent retry logic from interfering. This should fix the ServiceUnavailableError in Continue.dev.
The previous pkill wasn't reliably stopping vLLM. Now:
1. Use stop-server.sh script if available (proper graceful shutdown)
2. Multiple pkill patterns as fallback
3. Longer sleep to ensure process is stopped
4. Remove background execution of qwen.sh
This ensures vLLM truly restarts with the new qwen3_coder parser.
The qwen3_coder parser has a bug with streaming tool calling: IndexError: list index out of range in streamed_args_for_tool. vLLM logs show the error at serving_chat.py:1163 when Continue.dev uses streaming (which it does by default). The qwen3_xml parser should have better streaming support and still parses Qwen's XML tool format correctly.
qwen3_xml and qwen3_coder both have streaming issues causing JSON parsing errors. Hermes parser is more stable and widely tested for streaming tool calling.
All vLLM parsers have bugs:
- qwen3_coder: IndexError in streaming
- qwen3_xml: Malformed JSON responses
- hermes: Can't parse Qwen's XML format
Solution: Let vLLM output raw text, Continue.dev will handle tool parsing natively using the model's XML format.
Mistral parser uses simpler JSON format that may work better with Qwen model vs complex XML parsers.
qwen3_coder parser works perfectly in non-streaming mode. Force LiteLLM to disable streaming to avoid the IndexError bug.
The qwen3_coder parser has a streaming bug. The fix is to disable streaming in Continue.dev config by setting stream: false. Non-streaming mode works perfectly and is actually faster for tool calling scenarios.
LiteLLM proxy was unnecessary complexity. Continue.dev uses system message tools, not OpenAI tool calling format. Changes:
- Continue.dev points directly to vLLM port 8000
- Disabled vLLM tool parser (Continue.dev handles tools itself)
- Removed streaming workarounds (not needed without parser)
This allows Continue.dev to work with MCP tools using its native system message tool approach.
- Remove all LiteLLM proxy setup/restart steps
- Remove all Qwen references
- Use Gemma3-27B with native OpenAI tool parser
- Simplify to direct vLLM connection (port 8000)
- Update tests to target vLLM directly
Architecture: Continue.dev → vLLM → Gemma3-27B (openai parser)
- OpenAI parser failed with: 'requires token IDs and does not support text-based extraction'
- Hermes parser is more generic and works with instruction-tuned models
- No Gemma-specific parser exists in vLLM
- Disable vLLM tool parser completely (VLLM_TOOL_PARSER="")
- Continue.dev supports gpt-oss models natively via system message tools
- Update deployment script to use GPT-OSS instead of Gemma3
- This bypasses all vLLM parser bugs entirely

Problem
Tool calling with Qwen3-Coder-30B in Continue.dev fails with JSON parsing errors across all vLLM tool parsers:
- qwen3_coder: JSON escape character issues
- openai: Requires token IDs, not text-based
- qwen3_xml: vLLM still validates incoming messages as JSON
- hermes: Expects JSON but model outputs XML

Root Cause: vLLM's tool calling architecture expects consistent JSON format throughout, which Qwen3 doesn't reliably provide. The _postprocess_messages function validates incoming tool calls as JSON regardless of parser.

Solution: LiteLLM Proxy
Add LiteLLM proxy layer to handle tool calling format translation: Continue.dev → LiteLLM (port 4000) → vLLM (port 8000) → Model
How It Works
Benefits
Changes Made
New Scripts
- scripts/setup-litellm-proxy.sh - Installs LiteLLM and creates config
- scripts/start-litellm-proxy.sh - Starts LiteLLM proxy on port 4000
Configuration Updates
- models/qwen.sh - Removed tool parser (VLLM_TOOL_PARSER="")
Documentation
- docs/setup/LITELLM-PROXY-SETUP.md - Complete setup guide
Testing Required
1. Setup on RunPod
cd /workspace/llm-hosting
git pull origin feature/litellm-proxy-tool-calling
./scripts/setup-litellm-proxy.sh
2. Restart vLLM (no tool parser)
3. Start LiteLLM
4. Expose Port 4000
4000 → TCP
5. Update Continue.dev
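Taken together, steps 2–5 can be summarized in one sketch; the commands are taken from the scripts and docs in this PR, and the pod hostname is a placeholder.

```bash
# 2. Restart vLLM without a tool parser (qwen.sh now exports VLLM_TOOL_PARSER="")
./models/qwen.sh

# 3. Start the LiteLLM proxy on port 4000
./scripts/start-litellm-proxy.sh

# 4. Expose port 4000 (TCP) in the RunPod pod settings, then verify:
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" | jq -r '.data[].id'

# 5. Point Continue.dev at the proxy:
#    apiBase: https://YOUR-POD.proxy.runpod.net:4000/v1
#    apiKey:  sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381
```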
6. Test Tool Calling
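A hedged example of step 6: exercising tool calling through the proxy with a minimal OpenAI-format tool definition. The get_weather tool and prompt are illustrative only; any tool schema works.

```bash
# Send a chat completion with one tool defined and inspect any returned tool calls.
curl -s http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-c9be6c31b9f1ebd5bc5a316ac7d71381" \
  -d '{
    "model": "qwen3-coder-30b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | jq '.choices[0].message.tool_calls'
```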
Alternative Considered
Disabling tool calling entirely was rejected because tool parsing is a priority requirement.
References