A fully featured Vercel AI SDK provider for node-llama-cpp, enabling you to run local LLMs with the Vercel AI SDK's ergonomic API.
- 🚀 Auto-initializing - No manual setup required
- 🔧 Full AI SDK Integration - Works seamlessly with `generateText`, `streamText`, and more
- 🛠️ Multi-Step Tool Calling - Complete support for tools with automatic execution
- 🤔 Reasoning Support - Separate thinking from final answers for reasoning models
- 📡 Streaming & Non-Streaming - Both modes fully supported
- 🎮 GPU Acceleration - Optional GPU layers for faster inference
- 🔌 OpenAI-Compatible API - Drop-in replacement for OpenAI API
- 📝 TypeScript - Fully typed for great DX
```ts
import { createNodeLlamaCppProvider } from "./provider.js";
import { generateText } from "ai";

// Create provider - auto-initializes on first use
const provider = createNodeLlamaCppProvider({
  modelPath: "hf:username/model-name/model-file.gguf", // path to a Hugging Face model
  modelId: "my-model",
  contextSize: 8192,
});

// Generate text
const { text } = await generateText({
  model: provider.chat(),
  prompt: "Explain quantum computing in simple terms",
});

console.log(text);
```

```ts
import { streamText } from "ai";

const { textStream } = streamText({
  model: provider.chat(),
  prompt: "Write a haiku about programming",
});

for await (const chunk of textStream) {
  process.stdout.write(chunk);
}
```

```ts
import { generateText, tool, stepCountIs } from "ai";
import { z } from "zod";

const weatherTool = tool({
  description: "Get weather for a location",
  inputSchema: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => ({
    temperature: 72,
    condition: "sunny",
    location,
  }),
});

const { text } = await generateText({
  model: provider.chat(),
  prompt: "What's the weather in San Francisco?",
  tools: { weatherTool },
  stopWhen: stepCountIs(5), // Allow up to 5 steps
});

console.log(text);
// Output: "The weather in San Francisco is currently sunny with a temperature of 72°F."
```

```ts
import { streamText } from "ai";

const { fullStream } = streamText({
  model: provider.chat(),
  prompt: "Solve: If a train leaves at 2pm going 60mph...",
});

for await (const chunk of fullStream) {
  if (chunk.type === "reasoning-delta") {
    console.log("💭 Thinking:", chunk.text);
  }
  if (chunk.type === "text-delta") {
    console.log("📝 Answer:", chunk.text);
  }
}
```

```ts
createNodeLlamaCppProvider({
  // Required: Path to GGUF model (supports Hugging Face)
  modelPath: "hf:username/repo/file.gguf",

  // Required: Model identifier for AI SDK
  modelId: "my-model",

  // Optional: Context window size
  contextSize: 8192,

  // Optional: Directory to store downloaded models
  modelsDirectory: "./models",

  // Optional: Number of GPU layers to offload
  gpuLayers: 32,
});
```

You can use any GGUF model from Hugging Face:

```ts
// Mistral 7B
modelPath: "hf:TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"

// Llama 3
modelPath: "hf:QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"

// Qwen 2.5
modelPath: "hf:Qwen/Qwen2.5-7B-Instruct-GGUF/qwen2.5-7b-instruct-q4_k_m.gguf"

// DeepSeek R1 (reasoning model)
modelPath: "hf:deepseek-ai/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"
```

This provider fully supports multi-step tool calling, allowing the model to:
- Reason about which tool to call
- Call the tool with appropriate parameters
- See the tool result
- Continue reasoning or provide a final answer
- Model decides to call a tool - detection happens synchronously
- Generation aborts - the provider emits a tool-call event
- AI SDK executes the tool - your `execute` function runs
- Provider is called again - with the full conversation history, including tool results
- Model continues - sees the tool result and generates a response
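To make those steps concrete, here is a toy sketch of the loop using a stubbed model instead of node-llama-cpp. The names (`stubModel`, `runLoop`) are illustrative only and are not part of this provider's API; the real provider and AI SDK handle all of this for you.

```typescript
// Toy simulation of the multi-step tool-calling loop described above.
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

// Stub "model": requests the weather tool once, then answers using its result.
function stubModel(history: Message[]): {
  toolCall?: { name: string; input: { location: string } };
  text?: string;
} {
  const toolResult = history.find((m) => m.role === "tool");
  if (!toolResult) {
    return { toolCall: { name: "getCurrentWeather", input: { location: "Tokyo" } } };
  }
  return { text: `Based on the tool result (${toolResult.content}), no umbrella needed.` };
}

const tools = {
  getCurrentWeather: async ({ location }: { location: string }) => ({
    temperature: 72,
    condition: "sunny",
    location,
  }),
};

async function runLoop(prompt: string, maxSteps = 5): Promise<{ text: string; steps: number }> {
  const history: Message[] = [{ role: "user", content: prompt }];
  for (let step = 1; step <= maxSteps; step++) {
    const out = stubModel(history); // 1. model decides whether to call a tool
    if (out.toolCall) {
      // 2. generation yields a tool call; 3. the tool's execute function runs
      const result = await tools[out.toolCall.name as keyof typeof tools](out.toolCall.input);
      // 4. the result is appended to the conversation history
      history.push({ role: "tool", name: out.toolCall.name, content: JSON.stringify(result) });
      continue; // 5. the model is called again with the updated history
    }
    return { text: out.text ?? "", steps: step };
  }
  return { text: "", steps: maxSteps };
}

const { text, steps } = await runLoop("Weather in Tokyo?");
console.log(steps, text); // 2 steps: one tool call, then the final answer
```

The `stopWhen: stepCountIs(5)` option in the real API corresponds to the `maxSteps` bound here: it caps how many times the model is re-invoked before the loop gives up.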
```ts
import { generateText, tool, stepCountIs } from "ai";
import { z } from "zod";

const getCurrentWeather = tool({
  description: "Get current weather for a location",
  inputSchema: z.object({
    location: z.string().describe("City name"),
  }),
  execute: async ({ location }) => {
    // Call your weather API
    return {
      temperature: 72,
      condition: "sunny",
      location,
    };
  },
});

const { text, steps } = await generateText({
  model: provider.chat(),
  prompt: "What's the weather like in Tokyo and should I bring an umbrella?",
  tools: { getCurrentWeather },
  stopWhen: stepCountIs(5),
});

console.log("Steps taken:", steps.length);
console.log("Final answer:", text);
```

Run a local OpenAI-compatible API server:

```sh
npm run server
```

Then use it with any OpenAI-compatible client:
```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "gpt-oss-20b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```

```
src/
├── provider.ts           # Main AI SDK provider
├── server.ts             # OpenAI-compatible API server
├── example-ai-sdk.ts     # AI SDK examples
├── example-tools.ts      # Tool calling examples
├── example-local.ts      # Direct usage examples
└── example-api-client.ts # API client examples
```
The provider automatically reuses the same model session across calls for efficiency:
```ts
const provider = createNodeLlamaCppProvider({...});

// These share the same underlying session
await generateText({ model: provider.chat(), prompt: "Hello" });
await generateText({ model: provider.chat(), prompt: "How are you?" });
```

Offload layers to the GPU for faster inference:

```ts
const provider = createNodeLlamaCppProvider({
  modelPath: "...",
  modelId: "my-model",
  gpuLayers: 32, // Offload 32 layers to GPU
});
```

For models that output their thinking process (like DeepSeek R1 or QwQ):

```ts
import chalk from "chalk";

const { fullStream } = streamText({
  model: provider.chat(),
  prompt: "Solve this complex problem...",
});

for await (const chunk of fullStream) {
  switch (chunk.type) {
    case "reasoning-start":
      console.log("🤔 Starting to think...");
      break;
    case "reasoning-delta":
      process.stdout.write(chalk.gray(chunk.text));
      break;
    case "reasoning-end":
      console.log("\n✅ Done thinking");
      break;
    case "text-delta":
      process.stdout.write(chunk.text);
      break;
  }
}
```

```sh
# AI SDK integration
npm run example:ai-sdk

# Tool calling
npm run example:tools

# Direct usage
npm run example:local

# API client
npm run server             # In one terminal
npm run example:api-client # In another
```

| Feature | Status | Notes |
|---|---|---|
| `generateText` | ✅ | Fully supported |
| `streamText` | ✅ | Fully supported |
| Tool calling | ✅ | Multi-step with `stopWhen` |
| Reasoning | ✅ | Separates thinking from the answer |
| Temperature | ✅ | Full control |
| Top-P | ✅ | Full control |
| Max tokens | ✅ | Full control |
| Stop sequences | ✅ | Custom stop triggers |
| Streaming | ✅ | SSE format |
| Multi-modal | ❌ | Not yet supported |
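The reasoning support above relies on models that delimit their thinking in the raw output, for example DeepSeek R1's `<think>…</think>` tags. The provider does this splitting for you; the sketch below only illustrates the mechanics, and the tag format is a model-specific convention, not part of this provider's API.

```typescript
// Split a reasoning model's raw output into thinking and answer parts.
// Assumes DeepSeek-R1-style <think>...</think> delimiters.
function splitReasoning(raw: string): { reasoning: string; answer: string } {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/);
  if (!match) {
    // No delimiters: treat the whole output as the answer.
    return { reasoning: "", answer: raw.trim() };
  }
  return {
    reasoning: match[1].trim(),
    answer: raw.slice((match.index ?? 0) + match[0].length).trim(),
  };
}

const out = splitReasoning(
  "<think>60 mph for 2 hours is 120 miles.</think>The train travels 120 miles."
);
console.log(out.reasoning); // "60 mph for 2 hours is 120 miles."
console.log(out.answer);    // "The train travels 120 miles."
```

In the streaming API, the same boundary shows up as `reasoning-start` / `reasoning-end` events around the `reasoning-delta` chunks, with `text-delta` carrying the answer.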
```sh
# Manually download with node-llama-cpp CLI
npx --no node-llama-cpp download --model hf:username/repo/file.gguf
```

If you run out of memory:

- Reduce `contextSize`
- Use a smaller quantized model (Q4_K_M instead of Q6_K)
- Reduce `gpuLayers` if using GPU

If tool calling isn't working:

- Make sure to use `stopWhen: stepCountIs(n)`, not `maxSteps`
- Ensure your model supports function calling
- Some models require specific prompting for tools

If inference is slow:

- Increase `gpuLayers` if you have a GPU
- Use a smaller model
- Reduce `contextSize`
- AI SDK Docs - Official AI SDK documentation
- node-llama-cpp Docs - node-llama-cpp documentation
MIT
- Vercel AI SDK - Ergonomic AI SDK
- node-llama-cpp - Node.js bindings for llama.cpp
- llama.cpp - LLM inference in C/C++