🚀[v0.3.34] Release Note: Dynamic LoRA Routing, Control Vectors, and Assistant Prefill #100
JamePeng
announced in
Announcements
Hello everyone,
I am pleased to announce the release of llama-cpp-python v0.3.34. This update focuses on improving how the engine handles model adapters (LoRAs), generation continuity, and underlying memory state management.

By removing outdated static loading mechanisms and introducing dynamic routing, this version aims to better support multi-tenant serving environments and complex conversational workflows.
Here is a detailed breakdown of the new features and changes.
1. Dynamic LoRA Routing (Multi-Tenant Serving)
Historically, llama-cpp-python relied on static loading: a LoRA adapter was permanently bound to the context during `Llama` instance initialization. Serving different personas required either reloading the entire model or duplicating instances in VRAM.

v0.3.34 introduces Just-In-Time (JIT) dynamic adapter routing. You can now preload multiple adapters into VRAM and apply them on-the-fly for each individual request. The engine swaps the adapter weights in the compute graph atomically before evaluation, and includes an internal debounce mechanism to guarantee zero state contamination between requests.
Practical Scenario:
Imagine a single backend serving three different users concurrently.
Here is the pseudo-code demonstrating how this is now handled:
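The following is an illustrative pseudo-code sketch in plain Python: it models only the control flow of per-request routing (preload once, swap before evaluation, no cross-request leakage). None of these class or method names are the real llama-cpp-python API.

```python
# Illustrative sketch only -- simulates per-request adapter routing.
# All names here are hypothetical, not the confirmed v0.3.34 API.

class AdapterRouter:
    def __init__(self):
        self.adapters = {}   # name -> preloaded adapter handle (stub)
        self.active = None   # adapter currently applied to the graph

    def preload(self, name, path):
        # In the real engine, this loads the LoRA into VRAM exactly once.
        self.adapters[name] = {"path": path}

    def generate(self, prompt, adapter=None):
        # Swap the requested adapter in *before* evaluation; the debounce
        # described above guarantees no state leaks between requests.
        if adapter is not None and adapter not in self.adapters:
            raise KeyError(f"adapter {adapter!r} was never preloaded")
        self.active = adapter
        tag = adapter or "base"
        return f"[{tag}] response to: {prompt}"

router = AdapterRouter()
router.preload("support", "./loras/support.gguf")
router.preload("legal", "./loras/legal.gguf")

print(router.generate("Hi!", adapter="support"))  # served with 'support' persona
print(router.generate("Hi!"))                     # same instance, plain base model
```

The key point is that all three personas share one model instance in VRAM; only the lightweight adapter selection changes per request.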
2. Seamless Assistant Prefill (Resolves #97)
Based on community feedback regarding the difficulty of continuing an interrupted or partially generated assistant response (Issue #97), we have introduced the `assistant_prefill` parameter.

Previously, continuing a message required string manipulation (like `rstrip()`) or modifying internal Jinja2 formatter state, which was fragile and often broke specific control tokens.

By setting `assistant_prefill=True`, the engine now safely pops the final partial `assistant` message, renders the N-1 conversation history cleanly through the standard Jinja2 template, and appends the partial text directly to the prompt.

Usage Example:
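As a sketch of the mechanism just described (pure Python with illustrative names; `build_prefill_prompt` and `toy_render` are stand-ins, not llama-cpp-python internals):

```python
import copy

def build_prefill_prompt(messages, render):
    """Models the assistant_prefill flow described above: pop the trailing
    partial assistant message, render the remaining N-1 history through the
    normal chat template, then append the partial text to the prompt."""
    msgs = copy.deepcopy(messages)       # original history is never mutated
    assert msgs and msgs[-1]["role"] == "assistant"
    partial = msgs.pop()["content"]      # the partial assistant reply
    prompt = render(msgs)                # N-1 history through the template
    return prompt + partial              # generation continues where it stopped

# Toy template: concatenates role-tagged turns, then opens an assistant turn.
def toy_render(msgs):
    body = "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in msgs)
    return body + "<assistant>"

history = [
    {"role": "user", "content": "Tell me a story."},
    {"role": "assistant", "content": "Once upon a"},
]
prompt = build_prefill_prompt(history, toy_render)
print(prompt)  # ends with "<assistant>Once upon a"
```

In actual use you would simply pass your history (ending in the partial assistant message) to the chat-completion call with `assistant_prefill=True` and let the engine do the above internally.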
(Note: This operation deep-copies the input array internally, ensuring your original message history is never mutated).
3. Control Vector (CVec) Injection
For users working with representation engineering, this release adds support for dynamic Control Vector injection. You can now steer model behavior by modifying activation values at specific hidden layers directly via the API.
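A minimal sketch of assembling such a control-vector buffer, assuming the contiguous layer-1-based layout described in the note below; the toy sizes and the steering vector itself are placeholders, and how the buffer is handed to the API is not shown.

```python
import numpy as np

# Illustrative: build a contiguous control-vector buffer -- layers
# 1..layer_end, n_embd floats per layer, zero-padded for skipped layers.
n_embd = 8       # toy embedding width (real models: e.g. 4096)
layer_end = 6    # steer up through layer 6

buf = np.zeros(n_embd * layer_end, dtype=np.float32)

# Toy steering direction; in practice this comes from representation
# engineering (e.g. a difference of mean activations).
direction = np.full(n_embd, 0.5, dtype=np.float32)

# Inject only into layers 4..6; layers 1..3 stay zero (skipped).
for layer in range(4, layer_end + 1):
    start = (layer - 1) * n_embd   # layer 1 maps to offset 0
    buf[start:start + n_embd] = direction

print(buf.shape)  # (48,) == n_embd * layer_end
```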
Important: The C++ backend maps the buffer contiguously starting from layer 1. Ensure your `data` array length is at least `n_embd * layer_end`, with zero-padding for any early layers you wish to skip.

Under the Hood & Minor Fixes
- New `LlamaLoraAdapter` wrappers manage C++ pointer lifecycles and prevent VRAM leaks during adapter unloads.
- `clear_loras` and `clear_cvec` now track applied states to prevent unnecessary `sched_need_reserve` overhead in the GGML backend during standard generation.
- Removed `lora_base`, `lora_path`, and `lora_scale` from the `Llama` initialization parameters to enforce the new dynamic routing architecture.
- Improvements to the `longest_token_prefix` fast paths and hybrid cache clearance scenarios.
- Updated the `huggingface_hub` installation instructions.

You can compile locally using `git clone` or `git pull` to get the latest version of the code, or choose the appropriate pre-compiled wheel from Releases based on your environment.

If you encounter any errors or have any feedback on the new adapter management system, please feel free to submit an issue or leave a message in the discussion area.
A Note on Our Community and License Update
Finally, I've updated the old MIT license copyright from a single author to collective authorship: The llama-cpp-python authors (2023-2026).

Every llama-cpp-python contributor who participates and makes an effort makes this project more reliable, efficient, and user-friendly, and you all deserve to be remembered. You are welcome to join us in promoting the project and enriching the open-source community!

Best regards :)
JamePeng