A fully featured Vercel AI SDK provider for node-llama-cpp, enabling you to run local LLMs with the Vercel AI SDK's ergonomic API.
- 🚀 Auto-initializing - No manual setup required
- 🔧 Full AI SDK Integration - Works seamlessly with `generateText`, `streamText`, and more
- 🛠️ Multi-Step Tool Calling - Complete support for tools with automatic execution
- 🤔 Reasoning Support - Separate thinking from final answers for reasoning models
- 📡 Streaming & Non-Streaming - Both modes fully supported
- 🎮 GPU Acceleration - Optional GPU layers for faster inference
- 🔌 OpenAI-Compatible API - Drop-in replacement for OpenAI API
- 📝 TypeScript - Fully typed for great DX
```ts
import { createNodeLlamaCppProvider } from "./provider.js";
import { generateText } from "ai";

// Create provider - auto-initializes on first use
const provider = createNodeLlamaCppProvider({
  modelPath: "hf:username/model-name/model-file.gguf", // path to a Hugging Face model
  modelId: "my-model",
  contextSize: 8192,
});

// Generate text
const { text } = await generateText({
  model: provider.chat(),
  prompt: "Explain quantum computing in simple terms",
});

console.log(text);
```

```ts
import { streamText } from "ai";

const { textStream } = streamText({
  model: provider.chat(),
  prompt: "Write a haiku about programming",
});

for await (const chunk of textStream) {
  process.stdout.write(chunk);
}
```

```ts
import { generateText, tool, stepCountIs } from "ai";
import { z } from "zod";

const weatherTool = tool({
  description: "Get weather for a location",
  inputSchema: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => ({
    temperature: 72,
    condition: "sunny",
    location,
  }),
});

const { text } = await generateText({
  model: provider.chat(),
  prompt: "What's the weather in San Francisco?",
  tools: { weatherTool },
  stopWhen: stepCountIs(5), // Allow up to 5 steps
});

console.log(text);
// Output: "The weather in San Francisco is currently sunny with a temperature of 72°F."
```

```ts
import { streamText } from "ai";

const { fullStream } = streamText({
  model: provider.chat(),
  prompt: "Solve: If a train leaves at 2pm going 60mph...",
});

for await (const chunk of fullStream) {
  if (chunk.type === "reasoning-delta") {
    console.log("💭 Thinking:", chunk.text);
  }
  if (chunk.type === "text-delta") {
    console.log("📝 Answer:", chunk.text);
  }
}
```

```ts
createNodeLlamaCppProvider({
  // Required: Path to GGUF model (supports Hugging Face)
  modelPath: "hf:username/repo/file.gguf",

  // Required: Model identifier for AI SDK
  modelId: "my-model",

  // Optional: Context window size
  contextSize: 8192,

  // Optional: Directory to store downloaded models
  modelsDirectory: "./models",

  // Optional: Number of GPU layers to offload
  gpuLayers: 32,
});
```

You can use any GGUF model from Hugging Face:

```ts
// Mistral 7B
modelPath: "hf:TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"

// Llama 3
modelPath: "hf:QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"

// Qwen 2.5
modelPath: "hf:Qwen/Qwen2.5-7B-Instruct-GGUF/qwen2.5-7b-instruct-q4_k_m.gguf"

// DeepSeek R1 (reasoning model)
modelPath: "hf:deepseek-ai/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"
```

This provider fully supports multi-step tool calling, allowing the model to:
- Reason about which tool to call
- Call the tool with appropriate parameters
- See the tool result
- Continue reasoning or provide a final answer
- Model decides to call a tool - detection happens synchronously
- Generation aborts - the provider emits a tool-call event
- AI SDK executes the tool - your `execute` function runs
- Provider is called again - with the full conversation history, including tool results
- Model continues - sees the tool result and generates a response
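To make those steps concrete, here is a toy sketch of the loop using a stubbed model instead of node-llama-cpp. The names (`stubModel`, `runLoop`) are illustrative only and are not part of this provider's API; the real provider and AI SDK handle all of this for you.

```typescript
// Toy simulation of the multi-step tool-calling loop described above.
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

// Stub "model": requests the weather tool once, then answers using its result.
function stubModel(history: Message[]): {
  toolCall?: { name: string; input: { location: string } };
  text?: string;
} {
  const toolResult = history.find((m) => m.role === "tool");
  if (!toolResult) {
    return { toolCall: { name: "getCurrentWeather", input: { location: "Tokyo" } } };
  }
  return { text: `Based on the tool result (${toolResult.content}), no umbrella needed.` };
}

const tools = {
  getCurrentWeather: async ({ location }: { location: string }) => ({
    temperature: 72,
    condition: "sunny",
    location,
  }),
};

async function runLoop(prompt: string, maxSteps = 5): Promise<{ text: string; steps: number }> {
  const history: Message[] = [{ role: "user", content: prompt }];
  for (let step = 1; step <= maxSteps; step++) {
    const out = stubModel(history); // 1. model decides whether to call a tool
    if (out.toolCall) {
      // 2. generation yields a tool call; 3. the tool's execute function runs
      const result = await tools[out.toolCall.name as keyof typeof tools](out.toolCall.input);
      // 4. the result is appended to the conversation history
      history.push({ role: "tool", name: out.toolCall.name, content: JSON.stringify(result) });
      continue; // 5. the model is called again with the updated history
    }
    return { text: out.text ?? "", steps: step };
  }
  return { text: "", steps: maxSteps };
}

const { text, steps } = await runLoop("Weather in Tokyo?");
console.log(steps, text); // 2 steps: one tool call, then the final answer
```

The `stopWhen: stepCountIs(5)` option in the real API corresponds to the `maxSteps` bound here: it caps how many times the model is re-invoked before the loop gives up.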
```ts
import { generateText, tool, stepCountIs } from "ai";
import { z } from "zod";

const getCurrentWeather = tool({
  description: "Get current weather for a location",
  inputSchema: z.object({
    location: z.string().describe("City name"),
  }),
  execute: async ({ location }) => {
    // Call your weather API
    return {
      temperature: 72,
      condition: "sunny",
      location,
    };
  },
});

const { text, steps } = await generateText({
  model: provider.chat(),
  prompt: "What's the weather like in Tokyo and should I bring an umbrella?",
  tools: { getCurrentWeather },
  stopWhen: stepCountIs(5),
});

console.log("Steps taken:", steps.length);
console.log("Final answer:", text);
```

Run a local OpenAI-compatible API server:

```sh
npm run server
```

Then use it with any OpenAI-compatible client:
```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "gpt-oss-20b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```

```
src/
├── provider.ts           # Main AI SDK provider
├── server.ts             # OpenAI-compatible API server
├── example-ai-sdk.ts     # AI SDK examples
├── example-tools.ts      # Tool calling examples
├── example-local.ts      # Direct usage examples
└── example-api-client.ts # API client examples
```
The provider automatically reuses the same model session across calls for efficiency:
```ts
const provider = createNodeLlamaCppProvider({...});

// These share the same underlying session
await generateText({ model: provider.chat(), prompt: "Hello" });
await generateText({ model: provider.chat(), prompt: "How are you?" });
```

Offload layers to the GPU for faster inference:

```ts
const provider = createNodeLlamaCppProvider({
  modelPath: "...",
  modelId: "my-model",
  gpuLayers: 32, // Offload 32 layers to GPU
});
```

For models that output their thinking process (like DeepSeek R1 or QwQ):

```ts
import chalk from "chalk";

const { fullStream } = streamText({
  model: provider.chat(),
  prompt: "Solve this complex problem...",
});

for await (const chunk of fullStream) {
  switch (chunk.type) {
    case "reasoning-start":
      console.log("🤔 Starting to think...");
      break;
    case "reasoning-delta":
      process.stdout.write(chalk.gray(chunk.text));
      break;
    case "reasoning-end":
      console.log("\n✅ Done thinking");
      break;
    case "text-delta":
      process.stdout.write(chunk.text);
      break;
  }
}
```

```sh
# AI SDK integration
npm run example:ai-sdk

# Tool calling
npm run example:tools

# Direct usage
npm run example:local

# API client
npm run server             # In one terminal
npm run example:api-client # In another
```

| Feature | Status | Notes |
|---|---|---|
| `generateText` | ✅ | Fully supported |
| `streamText` | ✅ | Fully supported |
| Tool calling | ✅ | Multi-step with `stopWhen` |
| Reasoning | ✅ | Separates thinking from the answer |
| Temperature | ✅ | Full control |
| Top-P | ✅ | Full control |
| Max tokens | ✅ | Full control |
| Stop sequences | ✅ | Custom stop triggers |
| Streaming | ✅ | SSE format |
| Multi-modal | ❌ | Not yet supported |
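The reasoning support above relies on models that delimit their thinking in the raw output, for example DeepSeek R1's `<think>…</think>` tags. The provider does this splitting for you; the sketch below only illustrates the mechanics, and the tag format is a model-specific convention, not part of this provider's API.

```typescript
// Split a reasoning model's raw output into thinking and answer parts.
// Assumes DeepSeek-R1-style <think>...</think> delimiters.
function splitReasoning(raw: string): { reasoning: string; answer: string } {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/);
  if (!match) {
    // No delimiters: treat the whole output as the answer.
    return { reasoning: "", answer: raw.trim() };
  }
  return {
    reasoning: match[1].trim(),
    answer: raw.slice((match.index ?? 0) + match[0].length).trim(),
  };
}

const out = splitReasoning(
  "<think>60 mph for 2 hours is 120 miles.</think>The train travels 120 miles."
);
console.log(out.reasoning); // "60 mph for 2 hours is 120 miles."
console.log(out.answer);    // "The train travels 120 miles."
```

In the streaming API, the same boundary shows up as `reasoning-start` / `reasoning-end` events around the `reasoning-delta` chunks, with `text-delta` carrying the answer.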
```sh
# Manually download with node-llama-cpp CLI
npx --no node-llama-cpp download --model hf:username/repo/file.gguf
```

If you run out of memory:

- Reduce `contextSize`
- Use a smaller quantized model (Q4_K_M instead of Q6_K)
- Reduce `gpuLayers` if using GPU

If tool calling isn't working:

- Make sure to use `stopWhen: stepCountIs(n)`, not `maxSteps`
- Ensure your model supports function calling
- Some models require specific prompting for tools

If inference is slow:

- Increase `gpuLayers` if you have a GPU
- Use a smaller model
- Reduce `contextSize`
- AI SDK Docs - Official AI SDK documentation
- node-llama-cpp Docs - node-llama-cpp documentation
MIT
- Vercel AI SDK - Ergonomic AI SDK
- node-llama-cpp - Node.js bindings for llama.cpp
- llama.cpp - LLM inference in C/C++