@peterkc commented Dec 11, 2025

Summary

Implements per-model, per-user, and per-team max_parallel_requests limits in the proxy rate limiter, addressing three TODOs in parallel_request_limiter.py.

  • Model-level: Configurable via key metadata (model_max_parallel_requests) or model_max_budget
  • User-level: New user_max_parallel_requests field on UserAPIKeyAuth
  • Team-level: New team_max_parallel_requests field on UserAPIKeyAuth

Changes

| File | Change |
| --- | --- |
| auth_utils.py | Added `get_key_model_max_parallel_requests()` helper |
| _types.py | Added `user_max_parallel_requests` and `team_max_parallel_requests` fields |
| parallel_request_limiter.py | Wired all 3 TODO locations (v1 limiter) |
| parallel_request_limiter_v3.py | Added model-level support to v3 limiter |
| test_max_parallel_requests.py | 10 unit tests |
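
A minimal sketch of what the new helper could look like, assuming the limit lives in key metadata under `model_max_parallel_requests` with a `model_max_budget` fallback (the merged implementation in auth_utils.py may differ):

```python
from typing import Optional


def get_key_model_max_parallel_requests(
    metadata: dict, model: Optional[str]
) -> Optional[int]:
    """Resolve a per-model parallel-request limit from key metadata.

    Hedged sketch: argument names and the model_max_budget fallback
    shape are assumptions, not the merged code.
    """
    if not model:
        return None
    # Preferred source: explicit per-model map in key metadata.
    per_model = metadata.get("model_max_parallel_requests") or {}
    if model in per_model:
        return per_model[model]
    # Fallback: a model_max_budget entry may carry its own limit.
    budget = (metadata.get("model_max_budget") or {}).get(model)
    if isinstance(budget, dict):
        return budget.get("max_parallel_requests")
    return None
```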

Documentation

📄 Model-Specific Max Parallel Requests - New doc with:

  • Quick start guide
  • Key, team, and priority override examples
  • Mermaid flowchart showing priority resolution (see the sketch below)
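
As a sketch of the resolution order the flowchart documents (per the discussion below, a key-level model limit wins over the team-level one; the function name is illustrative):

```python
from typing import Optional


def resolve_max_parallel_requests(
    key_limit: Optional[int],
    team_limit: Optional[int],
) -> Optional[int]:
    # Key-level model limit takes priority; fall back to the team limit.
    # None means "no cap configured at this level".
    return key_limit if key_limit is not None else team_limit
```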

Config Examples

Model-level (via API key metadata):

```json
{
  "model_max_parallel_requests": {
    "gpt-4": 5,
    "gpt-3.5-turbo": 20
  }
}
```
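
For example, the metadata above could be attached when issuing a key through the proxy's `/key/generate` endpoint (base URL and master key below are placeholders):

```python
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",  # placeholder proxy URL
    headers={"Authorization": "Bearer sk-1234"},  # placeholder master key
    json={
        "metadata": {
            "model_max_parallel_requests": {"gpt-4": 5, "gpt-3.5-turbo": 20}
        }
    },
)
print(resp.json()["key"])  # the newly issued virtual key
```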

User-level: Set user_max_parallel_requests on the user record
Team-level: Set team_max_parallel_requests on the team record
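
In code, the two new fields land on `UserAPIKeyAuth` (added to _types.py by this PR); a construction sketch with illustrative values:

```python
from litellm.proxy._types import UserAPIKeyAuth

auth = UserAPIKeyAuth(
    api_key="sk-...",               # the virtual key being checked
    max_parallel_requests=100,      # existing key-level cap
    user_max_parallel_requests=10,  # new: per-user cap
    team_max_parallel_requests=50,  # new: per-team cap
)
```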

Test plan

  • Unit tests pass (10/10)
  • v3 limiter support added
  • Documentation added
  • CI passes

Closes #17824 ([Feature]: Support max_parallel_requests in proxy-level parallel request limiter)

- Add `get_key_model_max_parallel_requests()` helper in auth_utils.py
- Add `user_max_parallel_requests` and `team_max_parallel_requests` fields to UserAPIKeyAuth
- Wire model-level limits in parallel_request_limiter.py (line 323 TODO)
- Wire user-level limits in parallel_request_limiter.py (line 366 TODO)
- Wire team-level limits in parallel_request_limiter.py (line 394 TODO)
- Add 10 unit tests for helper function and type fields

Closes BerriAI#17824
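
A hedged sketch of the wiring pattern at those TODO sites, reusing the helper sketched above (function and variable names are hypothetical, and the limiter's counter bookkeeping is omitted):

```python
from typing import Optional

from fastapi import HTTPException

# Helper added by this PR in auth_utils.py; the import path is assumed here.
from litellm.proxy.auth.auth_utils import get_key_model_max_parallel_requests


def check_model_parallel_limit(
    metadata: dict, model: Optional[str], current_parallel_requests: int
) -> None:
    """Raise a 429 if the key's per-model parallel limit is exhausted."""
    limit = get_key_model_max_parallel_requests(metadata or {}, model=model)
    if limit is not None and current_parallel_requests >= limit:
        raise HTTPException(
            status_code=429,
            detail=f"Max parallel requests reached for model {model}",
        )
```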

@peterkc marked this pull request as ready for review December 12, 2025 01:29
@krrishdholakia commented:

@peterkc can you add a loom of key model max parallel requests working as expected?

@JiangJiaWei1103 left a comment:

Overall LGTM. I'll test this integration in our use case. Thanks!

peterkc commented Dec 13, 2025

@krrishdholakia Happy to add a Loom demo! A couple of quick clarifications to make sure I cover what you're looking for:

v3 Limiter Support
While testing, I noticed the PR modifies parallel_request_limiter.py (v1), but the default is v3 (parallel_request_limiter_v3.py). I've drafted a small patch (~15 lines) to port model-level max_parallel_requests to v3 as well. Should I:

  1. Include the v3 port in this PR? (Recommended: ensures the feature works out of the box)
  2. Demo with v1 only using LEGACY_MULTI_INSTANCE_RATE_LIMITING=true?

Demo Scope
Planning to show:

  • Key-level model limits (metadata.model_max_parallel_requests)
  • Model isolation (GPT-4 limit=2 doesn't affect GPT-3.5 limit=5)
  • 429 responses when limits are exceeded

Want me to also include team-level limits and the priority override (key > team)?
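
For the 429 check, a small driver along these lines could exercise the limit (proxy URL, key, and model name are placeholders; assumes a key with gpt-4 capped at 2):

```python
import asyncio

import httpx


async def fire(client: httpx.AsyncClient) -> int:
    resp = await client.post(
        "http://localhost:4000/chat/completions",  # placeholder proxy URL
        headers={"Authorization": "Bearer sk-test-key"},  # placeholder key
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": "hi"}],
        },
    )
    return resp.status_code


async def main() -> None:
    # Fire 5 concurrent requests against a limit of 2; the overflow
    # should come back as 429s while the first 2 proceed.
    async with httpx.AsyncClient(timeout=30) as client:
        codes = await asyncio.gather(*(fire(client) for _ in range(5)))
    print(codes)  # expect a mix of 200s and 429s


asyncio.run(main())
```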

- Port model-level max_parallel_requests support to v3 rate limiter
- Add documentation with mermaid priority flowchart
- Covers key-level, team-level, and priority override patterns