MoonCell-ModelHub is a Spring Boot/WebFlux gateway for routing chat requests to multiple LLM providers with unified request/response handling.
- RPM/TPM-aware admission control per model instance
- Two hot-swappable load-balancing algorithms:
  - `TRADITIONAL`
  - `OBJECT_POOL`
- Runtime strategy switch via admin settings API/UI
- Flexible request/response conversion system with JSON-based rules
- SSE response normalization and configurable field mapping
- Redis-based idempotency guard
Routing decisions are budget-aware:
- RPM controls the request-rate budget
- TPM controls the token-consumption budget

An instance can receive a request only when both budget checks pass.
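
The two-budget rule can be illustrated with a minimal sketch. The class name `InstanceBudget`, the fixed-window counters, and the `resetWindow` method are assumptions made for illustration; the gateway's actual accounting (for example a sliding window, or counters kept in Redis) may differ.

```java
/** Hypothetical per-instance budget for one fixed window (e.g. one minute). */
final class InstanceBudget {
    private final long rpmLimit;   // max requests per window
    private final long tpmLimit;   // max tokens per window
    private long requestsUsed;
    private long tokensUsed;

    InstanceBudget(long rpmLimit, long tpmLimit) {
        this.rpmLimit = rpmLimit;
        this.tpmLimit = tpmLimit;
    }

    /** Admits the request only if BOTH budgets have room, then reserves capacity. */
    synchronized boolean tryAcquire(long estimatedTokens) {
        boolean rpmOk = requestsUsed + 1 <= rpmLimit;
        boolean tpmOk = tokensUsed + estimatedTokens <= tpmLimit;
        if (!(rpmOk && tpmOk)) {
            return false; // either check failing rejects the request
        }
        requestsUsed++;
        tokensUsed += estimatedTokens;
        return true;
    }

    /** Starts a new window; how and when this is triggered is an assumption. */
    synchronized void resetWindow() {
        requestsUsed = 0;
        tokensUsed = 0;
    }
}
```

With `new InstanceBudget(2, 100)`, a 60-token request is admitted, a second 60-token request is rejected on the TPM check, and a 40-token request is then admitted.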

`TRADITIONAL`:
- Randomly sample N healthy instances (`sampleCount`)
- Score sampled instances by:
  - current concurrency pressure
  - RPM headroom pressure
  - TPM headroom pressure
- Try to acquire from the lowest-score candidate first

This gives a stable baseline with low operational complexity.
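
A sample-then-score selection of this kind might look like the sketch below. The `InstanceStats` and `TraditionalSelector` names, the `maxConcurrency` field, and the equal weighting of the three pressure terms are all assumptions; the source does not state the actual scoring formula.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical snapshot of one instance's load. */
final class InstanceStats {
    final String id;
    final long inFlight, maxConcurrency, rpmUsed, rpmLimit, tpmUsed, tpmLimit;

    InstanceStats(String id, long inFlight, long maxConcurrency,
                  long rpmUsed, long rpmLimit, long tpmUsed, long tpmLimit) {
        this.id = id;
        this.inFlight = inFlight;
        this.maxConcurrency = maxConcurrency;
        this.rpmUsed = rpmUsed;
        this.rpmLimit = rpmLimit;
        this.tpmUsed = tpmUsed;
        this.tpmLimit = tpmLimit;
    }

    /** Sum of three utilization ratios in [0, 1]; equal weights are an assumption. */
    double score() {
        return ratio(inFlight, maxConcurrency) + ratio(rpmUsed, rpmLimit) + ratio(tpmUsed, tpmLimit);
    }

    private static double ratio(long used, long limit) {
        return limit <= 0 ? 1.0 : Math.min(1.0, (double) used / limit);
    }
}

final class TraditionalSelector {
    /** Randomly samples up to sampleCount instances and returns them lowest score first. */
    static List<InstanceStats> rank(List<InstanceStats> healthy, int sampleCount, Random rng) {
        List<InstanceStats> shuffled = new ArrayList<>(healthy);
        Collections.shuffle(shuffled, rng);
        int n = Math.max(0, Math.min(sampleCount, shuffled.size()));
        List<InstanceStats> sample = new ArrayList<>(shuffled.subList(0, n));
        sample.sort(Comparator.comparingDouble(InstanceStats::score));
        return sample;
    }
}
```

The caller would then attempt to acquire from the returned list in order, so the least-loaded sampled instance is tried first.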

`OBJECT_POOL`:
- Uses the same RPM/TPM gates as `TRADITIONAL`
- Adds per-instance pool slot control (`coreSize`, `maxSize`)
- Samples N instances and scores by:
  - pool pressure (active/allocated slots)
  - concurrency pressure
  - RPM/TPM pressure
- Prefers candidates with the lowest combined pressure

This improves behavior under heterogeneous traffic (especially when long requests dominate), and reduces allocation churn.
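
One way to picture per-instance slot control is the sketch below. The `SlotPool` class is hypothetical, and both the growth policy (grow on demand from `coreSize` up to `maxSize`, never shrink) and the pressure definition (active slots divided by `maxSize`) are assumptions rather than the gateway's confirmed behavior.

```java
/** Hypothetical slot pool for a single model instance. */
final class SlotPool {
    private final int maxSize;
    private int allocated; // slots created so far (starts at coreSize)
    private int active;    // slots currently held by in-flight requests

    SlotPool(int coreSize, int maxSize) {
        if (coreSize < 0 || maxSize < coreSize) {
            throw new IllegalArgumentException("require 0 <= coreSize <= maxSize");
        }
        this.maxSize = maxSize;
        this.allocated = coreSize; // core slots are pre-allocated
    }

    /** Reuses an idle slot if one exists, otherwise grows the pool up to maxSize. */
    synchronized boolean tryAcquire() {
        if (active < allocated) {
            active++;
            return true;
        }
        if (allocated < maxSize) {
            allocated++;
            active++;
            return true;
        }
        return false; // pool exhausted
    }

    /** Returns a slot to the pool; the slot stays allocated for reuse. */
    synchronized void release() {
        if (active > 0) {
            active--;
        }
    }

    synchronized int allocated() {
        return allocated;
    }

    /** Pool pressure in [0, 1]; this particular ratio is an assumption. */
    synchronized double pressure() {
        return maxSize == 0 ? 1.0 : (double) active / maxSize;
    }
}
```

Because released slots remain allocated, a later request reuses them instead of allocating again, which is one plausible reading of the reduced allocation churn mentioned above.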
For each request:
- estimate tokens from prompt input
- sample candidate instances
- rank candidates by algorithm-specific pressure score
- attempt acquire in sorted order
- release after completion/error in a `finally` block

If no candidate passes the admission checks, the gateway returns HTTP 503.
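
The request flow above can be sketched in blocking form as follows. The `Candidate` interface, `Dispatcher`, and `NoCapacityException` are hypothetical names, and the four-characters-per-token estimate is a crude placeholder heuristic, not the gateway's estimator. A WebFlux implementation would release in a reactive hook rather than a literal `finally`; this sketch only shows the ordering.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical view of an instance that can admit and release a request. */
interface Candidate {
    boolean tryAcquire(long estimatedTokens);
    void release();
}

/** Signals that no candidate passed admission; maps to HTTP 503. */
final class NoCapacityException extends RuntimeException {
    final int status = 503;

    NoCapacityException() {
        super("no instance passed admission checks");
    }
}

final class Dispatcher {
    /** Rough estimate: about 4 characters per token (an assumption). */
    static long estimateTokens(String prompt) {
        return Math.max(1, (prompt.length() + 3) / 4);
    }

    /** Tries candidates in ranked order; releases the acquired one in finally. */
    static <C extends Candidate, R> R dispatch(List<C> ranked, long estimatedTokens,
                                               Function<C, R> call) {
        for (C candidate : ranked) {
            if (!candidate.tryAcquire(estimatedTokens)) {
                continue; // admission failed, try the next-best candidate
            }
            try {
                return call.apply(candidate);
            } finally {
                candidate.release(); // runs on success and on error
            }
        }
        throw new NoCapacityException();
    }
}
```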
Settings endpoint:
- `GET /admin/load-balancing/settings`
- `PUT /admin/load-balancing/settings`

Key fields:
- `algorithm`: `TRADITIONAL` or `OBJECT_POOL`
- `sampleCount`: random-sampling size
- `objectPoolCoreSize`: initial slot count per instance (`OBJECT_POOL` only)
- `objectPoolMaxSize`: max slot count per instance (`OBJECT_POOL` only)

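
As an illustration, a settings update could be built with the JDK's `java.net.http` client as sketched below. The flat JSON body shape, the `application/json` content type, the base URL, and the absence of authentication are assumptions; only the path and the four field names come from the list above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class SettingsRequests {
    /** Builds a settings payload; a flat JSON object is assumed. */
    static String settingsJson(String algorithm, int sampleCount, int coreSize, int maxSize) {
        return "{\"algorithm\":\"" + algorithm + "\""
                + ",\"sampleCount\":" + sampleCount
                + ",\"objectPoolCoreSize\":" + coreSize
                + ",\"objectPoolMaxSize\":" + maxSize + "}";
    }

    /** Builds (but does not send) the PUT request that switches the strategy. */
    static HttpRequest updateSettings(String baseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/admin/load-balancing/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, using whatever host and port the gateway is deployed on.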
The gateway supports flexible conversion between standard OpenAPI format and provider-specific formats through a rule-based converter system.
Each model instance can be configured with JSON-based conversion rules:
- Request Conversion Rule (`requestConversionRule`): converts gateway requests to the provider-specific format
- Response Conversion Rule (`responseConversionRule`): converts provider responses to the standard OpenAPI format

Rule modes:
- Template Mode: uses JSON templates with placeholders (e.g., `$model`, `$messages`, `$content`)
- Mapping Mode: maps fields from the source format to the target format using JSONPath

Example configuration in template mode:
```json
{
  "requestConversionRule": {
    "type": "template",
    "template": {
      "model": "$model",
      "messages": "$messages",
      "stream": "$stream"
    }
  },
  "responseConversionRule": {
    "type": "template",
    "template": {
      "id": "$requestId",
      "choices": [{
        "index": "$seq",
        "delta": { "content": "$content" }
      }]
    }
  }
}
```

For detailed documentation, see docs/CONVERTER_ARCHITECTURE.md.
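
To make template mode concrete, the sketch below shows one way placeholder substitution over a parsed JSON template could work. `TemplateRenderer` is a hypothetical class, not the gateway's converter. It assumes the template has already been parsed into nested `Map`/`List`/`String` values, that a placeholder must be the entire string value, and that unknown placeholders are left unchanged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TemplateRenderer {
    /** Recursively replaces "$name" string values with vars.get("name"). */
    static Object render(Object template, Map<String, Object> vars) {
        if (template instanceof String) {
            String s = (String) template;
            if (s.startsWith("$") && vars.containsKey(s.substring(1))) {
                return vars.get(s.substring(1)); // keeps the variable's own type
            }
            return s; // literal string or unknown placeholder
        }
        if (template instanceof Map) {
            Map<String, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) template).entrySet()) {
                out.put(String.valueOf(e.getKey()), render(e.getValue(), vars));
            }
            return out;
        }
        if (template instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) template) {
                out.add(render(item, vars));
            }
            return out;
        }
        return template; // numbers, booleans, null pass through
    }
}
```

Rendering the response template from the example with `requestId`, `seq`, and `content` bound to values yields a map with the same shape and the placeholders replaced, including those nested inside the `choices` array.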

- `target/` contains build artifacts; do not rely on it as the source of truth.
- Load-balancer simulation scripts are under `experiments/ab_simulator/`.