🚀[v0.3.34] Release Note: Dynamic LoRA Routing, Control Vectors, and Assistant Prefill #100
JamePeng
announced in
Announcements
Hello everyone,
I am pleased to announce the release of llama-cpp-python v0.3.34. This update focuses on improving how the engine handles model adapters (LoRAs), generation continuity, and underlying memory state management.

By removing outdated static loading mechanisms and introducing dynamic routing, this version aims to better support multi-tenant serving environments and complex conversational workflows.
Here is a detailed breakdown of the new features and changes.
1. Dynamic LoRA Routing (Multi-Tenant Serving)
Historically, llama-cpp-python relied on static loading: a LoRA adapter was permanently bound to the context during `Llama` instance initialization. Serving different personas required either reloading the entire model or duplicating instances in VRAM.

v0.3.34 introduces Just-In-Time (JIT) dynamic adapter routing. You can now preload multiple adapters into VRAM and apply them on-the-fly for each individual request. The engine swaps the adapter weights in the compute graph atomically before evaluation, and includes an internal debounce mechanism to guarantee zero state contamination between requests.
Practical Scenario:
Imagine a single backend serving three different users concurrently.
Here is the pseudo-code demonstrating how this is now handled:
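The following is an illustrative pseudo-code sketch in plain Python: it models only the control flow of per-request routing (preload once, swap before evaluation, no cross-request leakage). None of these class or method names are the real llama-cpp-python API.

```python
# Illustrative sketch only -- simulates per-request adapter routing.
# All names here are hypothetical, not the confirmed v0.3.34 API.

class AdapterRouter:
    def __init__(self):
        self.adapters = {}   # name -> preloaded adapter handle (stub)
        self.active = None   # adapter currently applied to the graph

    def preload(self, name, path):
        # In the real engine, this loads the LoRA into VRAM exactly once.
        self.adapters[name] = {"path": path}

    def generate(self, prompt, adapter=None):
        # Swap the requested adapter in *before* evaluation; the debounce
        # described above guarantees no state leaks between requests.
        if adapter is not None and adapter not in self.adapters:
            raise KeyError(f"adapter {adapter!r} was never preloaded")
        self.active = adapter
        tag = adapter or "base"
        return f"[{tag}] response to: {prompt}"

router = AdapterRouter()
router.preload("support", "./loras/support.gguf")
router.preload("legal", "./loras/legal.gguf")

print(router.generate("Hi!", adapter="support"))  # served with 'support' persona
print(router.generate("Hi!"))                     # same instance, plain base model
```

The key point is that all three personas share one model instance in VRAM; only the lightweight adapter selection changes per request.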
2. Seamless Assistant Prefill (Resolves #97)
Based on community feedback regarding the difficulty of continuing an interrupted or partially generated assistant response (Issue #97), we have introduced the `assistant_prefill` parameter.

Previously, continuing a message required string manipulation (like `rstrip()`) or modifying internal Jinja2 formatter state, which was fragile and often broke specific control tokens.

By setting `assistant_prefill=True`, the engine now safely pops the final partial `assistant` message, renders the N-1 conversation history cleanly through the standard Jinja2 template, and appends the partial text directly to the prompt.

Usage Example:
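As a sketch of the mechanism just described (pure Python with illustrative names; `build_prefill_prompt` and `toy_render` are stand-ins, not llama-cpp-python internals):

```python
import copy

def build_prefill_prompt(messages, render):
    """Models the assistant_prefill flow described above: pop the trailing
    partial assistant message, render the remaining N-1 history through the
    normal chat template, then append the partial text to the prompt."""
    msgs = copy.deepcopy(messages)       # original history is never mutated
    assert msgs and msgs[-1]["role"] == "assistant"
    partial = msgs.pop()["content"]      # the partial assistant reply
    prompt = render(msgs)                # N-1 history through the template
    return prompt + partial              # generation continues where it stopped

# Toy template: concatenates role-tagged turns, then opens an assistant turn.
def toy_render(msgs):
    body = "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in msgs)
    return body + "<assistant>"

history = [
    {"role": "user", "content": "Tell me a story."},
    {"role": "assistant", "content": "Once upon a"},
]
prompt = build_prefill_prompt(history, toy_render)
print(prompt)  # ends with "<assistant>Once upon a"
```

In actual use you would simply pass your history (ending in the partial assistant message) to the chat-completion call with `assistant_prefill=True` and let the engine do the above internally.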
(Note: This operation deep-copies the input array internally, ensuring your original message history is never mutated).
3. Control Vector (CVec) Injection
For users working with representation engineering, this release adds support for dynamic Control Vector injection. You can now steer model behavior by modifying activation values at specific hidden layers directly via the API.
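A minimal sketch of assembling such a control-vector buffer, assuming the contiguous layer-1-based layout described in the note below; the toy sizes and the steering vector itself are placeholders, and how the buffer is handed to the API is not shown.

```python
import numpy as np

# Illustrative: build a contiguous control-vector buffer -- layers
# 1..layer_end, n_embd floats per layer, zero-padded for skipped layers.
n_embd = 8       # toy embedding width (real models: e.g. 4096)
layer_end = 6    # steer up through layer 6

buf = np.zeros(n_embd * layer_end, dtype=np.float32)

# Toy steering direction; in practice this comes from representation
# engineering (e.g. a difference of mean activations).
direction = np.full(n_embd, 0.5, dtype=np.float32)

# Inject only into layers 4..6; layers 1..3 stay zero (skipped).
for layer in range(4, layer_end + 1):
    start = (layer - 1) * n_embd   # layer 1 maps to offset 0
    buf[start:start + n_embd] = direction

print(buf.shape)  # (48,) == n_embd * layer_end
```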
Important: The C++ backend maps the buffer contiguously starting from layer 1. Ensure your `data` array length is at least `n_embd * layer_end`, with zero-padding for any early layers you wish to skip.

Under the Hood & Minor Fixes
- New `LlamaLoraAdapter` wrappers manage C++ pointer lifecycles and prevent VRAM leaks during adapter unloads.
- `clear_loras` and `clear_cvec` now track applied states to prevent unnecessary `sched_need_reserve` overhead in the GGML backend during standard generation.
- Removed `lora_base`, `lora_path`, and `lora_scale` from the `Llama` initialization parameters to enforce the new dynamic routing architecture.
- Improvements to the `longest_token_prefix` fast paths and hybrid cache clearance scenarios.
- Updated the `huggingface_hub` installation instructions.

You can compile locally using `git clone` or `git pull` to get the latest version of the code, or choose the appropriate pre-compiled wheel from Releases based on your environment.

If you encounter any errors or have any feedback on the new adapter management system, please feel free to submit an issue or leave a message in the discussion area.
A Note on Our Community and License Update
Finally, I've updated the old MIT license copyright from a single author to collective authorship: The llama-cpp-python authors (2023-2026).

Every llama-cpp-python contributor who participates and makes an effort makes this project more reliable, efficient, and user-friendly, and you all deserve to be remembered. You are welcome to join us in promoting the project and enriching the open-source community!

Best regards :)
JamePeng