diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
index 745f78e3f..0f6bf07bb 100644
--- a/docs/docs/concepts/services.md
+++ b/docs/docs/concepts/services.md
@@ -164,6 +164,57 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 > The `scaling` property requires creating a [gateway](gateways.md).
 
+??? info "Replica groups"
+    A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules.
+
+
+
+    ```yaml
+    type: service
+    name: llama-8b-service
+
+    image: lmsysorg/sglang:latest
+    env:
+      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+
+    replicas:
+      - count: 1..2
+        scaling:
+          metric: rps
+          target: 10
+        commands:
+          - |
+            python -m sglang.launch_server \
+              --model-path $MODEL_ID \
+              --port 8000 \
+              --trust-remote-code
+        resources:
+          gpu: 48GB
+
+      - count: 1..4
+        scaling:
+          metric: rps
+          target: 5
+        commands:
+          - |
+            python -m sglang.launch_server \
+              --model-path $MODEL_ID \
+              --port 8000 \
+              --trust-remote-code
+        resources:
+          gpu: 24GB
+
+    port: 8000
+    model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    ```
+
+
+
+    > Properties such as `regions`, `port`, `image`, `env`, and some others cannot yet be configured per replica group. Support for this is coming soon.
+
+??? info "Disaggregated serving"
+    Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
+
 ### Model
 
 If the service is running a chat model with an OpenAI-compatible interface,