Run Hugging Face models locally with the Vercel AI SDK — no API keys, no cloud, no cost.
This package wraps Transformers.js (ONNX Runtime) in a fully compatible `LanguageModelV3` provider, giving you `generateText`, `streamText`, and tool calling — all running on your own machine.
## Installation

```bash
npm install transformers-ai-provider @huggingface/transformers ai
```
## Quick Start

```ts
import { createTransformersModel } from 'transformers-ai-provider';
import { generateText } from 'ai';

const model = createTransformersModel({
  modelId: 'onnx-community/gemma-3-1b-it-ONNX',
});

const { text } = await generateText({
  model,
  prompt: 'Explain quantum computing in one sentence.',
});

console.log(text);
```

## Streaming

```ts
import { createTransformersModel } from 'transformers-ai-provider';
import { streamText } from 'ai';

const model = createTransformersModel({
  modelId: 'onnx-community/gemma-3-1b-it-ONNX',
});

const result = streamText({
  model,
  prompt: 'Write a haiku about TypeScript.',
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

## Tool Calling

Works with the standard Vercel AI SDK `tool()` helper. The provider injects tool definitions into the system prompt and parses structured JSON tool calls from the model output.
```ts
import { createTransformersModel } from 'transformers-ai-provider';
import { streamText, tool, stepCountIs } from 'ai';
import { z } from 'zod';

const model = createTransformersModel({
  modelId: 'onnx-community/gemma-3-1b-it-ONNX',
});

const result = streamText({
  model,
  system: 'You are a helpful assistant.',
  prompt: "What's the weather in Amsterdam?",
  tools: {
    weather: tool({
      description: 'Get the current weather for a location',
      inputSchema: z.object({
        location: z.string(),
        unit: z.enum(['celsius', 'fahrenheit']).default('celsius'),
      }),
      execute: async ({ location, unit }) => ({
        location,
        temperature: unit === 'celsius' ? 18 : 64,
        condition: 'partly cloudy',
      }),
    }),
  },
  stopWhen: stepCountIs(3),
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

## Configuration

```ts
createTransformersModel({
  // Required — any ONNX model from Hugging Face Hub
  modelId: 'onnx-community/gemma-3-1b-it-ONNX',

  // Quantization level (default: 'q4')
  dtype: 'q4',

  // Device: 'cpu', 'webgpu', etc.
  device: 'cpu',

  // Extra options passed to AutoModelForCausalLM.from_pretrained
  modelOptions: {},

  // Extra options passed to AutoTokenizer.from_pretrained
  tokenizerOptions: {},

  // Custom template for injecting tool definitions into the system prompt
  toolPromptTemplate: (toolDefs) => `\n\nTools:\n${toolDefs}`,
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `modelId` | `string` | — | HF model ID (e.g. `onnx-community/gemma-3-1b-it-ONNX`) |
| `dtype` | `string` | `'q4'` | Quantization dtype (`'q4'`, `'q8'`, `'fp16'`, `'fp32'`) |
| `device` | `string` | auto | Runtime device (`'cpu'`, `'webgpu'`) |
| `modelOptions` | `object` | `{}` | Passed to `AutoModelForCausalLM.from_pretrained` |
| `tokenizerOptions` | `object` | `{}` | Passed to `AutoTokenizer.from_pretrained` |
| `toolPromptTemplate` | `function` | built-in | Custom tool prompt injection template |
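To illustrate the `toolPromptTemplate` hook, the sketch below builds a custom template. It assumes — as the default template shown above suggests — that the provider hands the function a single pre-serialized string of tool definitions; the framing text is illustrative, not the package's built-in wording.

```ts
// Hypothetical custom template. The instruction mirrors the JSON shape the
// provider expects for tool calls; the surrounding wording is an assumption.
const toolPromptTemplate = (toolDefs: string): string =>
  [
    '',
    'You have access to the following tools.',
    'To call one, reply with {"name": "...", "arguments": {...}} and nothing else.',
    '',
    toolDefs,
  ].join('\n');

console.log(toolPromptTemplate('weather: Get the current weather for a location'));
```

The resulting function would be passed as the `toolPromptTemplate` option to `createTransformersModel`.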
## Supported Models

Any ONNX-format causal language model on Hugging Face Hub works. Some tested models:
| Model | ID | Size |
|---|---|---|
| Gemma 3 1B Instruct | `onnx-community/gemma-3-1b-it-ONNX` | ~700 MB (q4) |
| Phi-3 Mini | `onnx-community/Phi-3-mini-4k-instruct-onnx-web` | ~2.3 GB (q4) |
| SmolLM 135M | `HuggingFaceTB/SmolLM-135M-Instruct` | ~270 MB |
## How It Works

- **Model loading** — Uses `AutoTokenizer` and `AutoModelForCausalLM` from Transformers.js with lazy initialization (loaded once, reused across calls).
- **Prompt conversion** — Translates the Vercel AI SDK's `LanguageModelV3Prompt` (system/user/assistant/tool messages) into Hugging Face chat format, applying the model's chat template via `tokenizer.apply_chat_template`.
- **Tool calling** — When tools are provided, their JSON Schema definitions are injected into the system prompt. The model is instructed to respond with `{"name": "...", "arguments": {...}}` for tool calls. The provider parses these from the output and emits proper `tool-call` / `tool-input-*` stream events.
- **Streaming** — Uses `TextStreamer` from Transformers.js to push token-by-token deltas into a `ReadableStream<LanguageModelV3StreamPart>`, compatible with `streamText()`.
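The tool-call parsing step can be pictured with a simplified sketch. This is not the provider's actual implementation — just an illustration of extracting a `{"name": ..., "arguments": ...}` object from raw model output and falling back to plain text when none is found.

```ts
// Simplified sketch (hypothetical, not the package's internal parser).
interface ParsedToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCall(output: string): ParsedToolCall | null {
  // Take the outermost {...} span and try to interpret it as a tool call.
  const start = output.indexOf('{');
  const end = output.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    const parsed = JSON.parse(output.slice(start, end + 1));
    if (
      typeof parsed.name === 'string' &&
      typeof parsed.arguments === 'object' &&
      parsed.arguments !== null
    ) {
      return { name: parsed.name, arguments: parsed.arguments };
    }
  } catch {
    // Not valid JSON — treat the output as ordinary text.
  }
  return null;
}
```

A real parser also has to handle streaming (partial JSON) and surrounding prose, which is what the `tool-input-*` delta events are for.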
## Runtime Support

Works anywhere Transformers.js runs:
- Node.js ≥ 18
- Bun
- Deno
- Edge runtimes (with ONNX Runtime Web)
- Browsers (with WebGPU for acceleration)
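When targeting both browsers and server runtimes, the `device` option can be feature-detected. A minimal sketch — the `pickDevice` helper is not part of this package, and it assumes `navigator.gpu` (the standard WebGPU entry point) as the detection signal:

```ts
// Hypothetical helper: choose 'webgpu' when the runtime exposes the WebGPU
// API via navigator.gpu, otherwise fall back to 'cpu'. In Node.js this
// returns 'cpu', since Node's global navigator (where present) has no
// `gpu` property.
function pickDevice(): 'webgpu' | 'cpu' {
  return typeof navigator !== 'undefined' && 'gpu' in navigator
    ? 'webgpu'
    : 'cpu';
}

console.log(pickDevice());
```

The result would then be passed as the `device` option to `createTransformersModel`.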
## License

MIT