A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access.
- Multi-architecture support -- Gemma 4, Gemma 3, Qwen 3, Qwen 3.5
- Multimodal inference -- image, video, and audio inputs (Gemma 4); images for Gemma 3 / Qwen 3.5
- Thinking / reasoning mode -- structured chain-of-thought output with `<think>` / `<|channel>thought` tags (Qwen 3, Qwen 3.5, Gemma 4)
- Tool calling / function calling -- models can invoke user-defined tools; multi-turn tool-call conversations supported across all three API styles
- Quantized model support -- loads GGUF files with Q4_K_M, Q8_0, F16, and other quantization formats; performs native quantized matmul without dequantizing to FP32, including memory-efficient pure C# CPU loading for large GGUFs
- GPU-accelerated -- GGML Metal on macOS and GGML CUDA on Linux/NVIDIA, with fused whole-model GPU dispatch for Gemma 4 decode on Metal (~2.6x speedup over per-op dispatch)
- Optimized pure C# CPU backend -- managed GEMM fast paths plus fused SIMD kernels for RMSNorm, RoPE, softmax, fused activations, and other inference hot paths
- Ollama & OpenAI API compatibility -- drop-in replacement endpoints for existing tooling
- Configurable sampling -- temperature, top-k, top-p, min-p, repetition/presence/frequency penalties, seed, stop sequences
- Chat templates -- auto-loaded from GGUF metadata (Jinja2), with hardcoded fallbacks per architecture
- Request queue -- FIFO inference queue ensures single-request execution for KV cache stability, with real-time position tracking for clients
- Batch processing -- JSONL input support in the console application
- Streaming -- token-by-token output via SSE (web) or stdout (console)
- Mixture of Experts -- Gemma 4 MoE variants (e.g. gemma-4-26B-A4B)
- Large file uploads -- supports video/audio uploads up to 500 MB in the web interface
| Architecture | Example Models | Multimodal | Thinking | Tool Calling |
|---|---|---|---|---|
| Gemma 4 | gemma-4-E4B, gemma-4-31B, gemma-4-26B-A4B (MoE) | Image, Video, Audio | Yes | Yes |
| Gemma 3 | gemma-3-4b | Image | No | No |
| Qwen 3 | Qwen3-4B | Text only | Yes | Yes |
| Qwen 3.5 | Qwen3.5-9B | Image | Yes | Yes |
| Backend | Flag | Description |
|---|---|---|
| GGML Metal | `--backend ggml_metal` | GPU-accelerated via Apple Metal (macOS). Recommended for Apple Silicon. |
| GGML CUDA | `--backend ggml_cuda` | GPU-accelerated via GGML CUDA on Linux with an NVIDIA GPU. |
| GGML CPU | `--backend ggml_cpu` | CPU inference using native GGML with optimized kernels. |
| Pure C# CPU | `--backend cpu` | Portable CPU inference with no native dependencies. |
```
TensorSharp/
├── TensorSharp/              # Core tensor library (CPU operations, SIMD)
├── TensorSharp.GGML/         # GGML backend bindings (Metal/CUDA/CPU via native library)
├── TensorSharp.GGML.Native/  # Native C++ bridge to ggml (builds libGgmlOps)
├── AdvUtils/                 # Utility library
├── InferenceEngine/          # Model loading, tokenization, and inference logic
│   ├── Models/
│   │   ├── Gemma3/
│   │   ├── Gemma4/           # Vision encoder, audio encoder, MoE, fused GPU decode
│   │   ├── Qwen3/
│   │   └── Qwen35/
│   ├── GgufReader.cs         # GGUF file parser
│   ├── ModelBase.cs          # Base class for all model architectures
│   ├── ChatTemplate.cs       # Chat template rendering (hardcoded + Jinja2 from GGUF)
│   ├── Jinja2Template.cs     # Jinja2 template renderer
│   ├── OutputParser.cs       # Extracts thinking, content, and tool calls from model output
│   ├── SamplingConfig.cs     # Sampling parameter configuration
│   ├── TokenSampler.cs       # Token sampling (greedy, top-k, top-p, min-p, penalties)
│   └── MediaHelper.cs        # Video frame extraction, audio decoding
├── InferenceConsole/         # CLI application
├── InferenceWeb/             # Web chatbot + API server (ASP.NET Core)
│   ├── ModelService.cs       # Model lifecycle management
│   ├── InferenceQueue.cs     # FIFO request queue with position tracking
│   ├── wwwroot/index.html    # Chat UI
│   ├── testdata/             # Integration test suites (bash + Python)
│   └── API_EXAMPLES.md       # Detailed API documentation
└── ExternalProjects/         # Third-party dependencies (ggml)
```
- .NET 10 SDK
- macOS (Metal backend): CMake 3.20+ and Xcode command-line tools for building the native GGML library
- Linux (GGML CPU / CUDA backends): CMake 3.20+; for `ggml_cuda`, install an NVIDIA driver plus CUDA Toolkit 12.x or another compatible CUDA toolkit
- GGUF model files (e.g., from Hugging Face)
```bash
dotnet build TensorSharp.slnx

# Console application
dotnet build InferenceConsole/InferenceConsole.csproj

# Web application
dotnet build InferenceWeb/InferenceWeb.csproj
```

The native library is built automatically during the first `dotnet build` if it doesn't exist. To build it manually:

```bash
cd TensorSharp.GGML.Native

# macOS:
bash build-macos.sh

# Linux (CPU-only):
bash build-linux.sh

# Linux (GGML_CUDA enabled):
bash build-linux.sh --cuda
```

You can also request a CUDA-enabled native build from `dotnet build`:

```bash
TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON dotnet build InferenceConsole/InferenceConsole.csproj -c Release
```

On macOS this compiles `libGgmlOps.dylib` with Metal GPU support. On Linux, `build-linux.sh` builds `libGgmlOps.so` with the GGML CPU backend by default, and `build-linux.sh --cuda` enables GGML_CUDA support for NVIDIA GPUs. The build output is automatically copied to the application's output directory.
```bash
cd InferenceConsole/bin

# Text inference
./InferenceConsole --model <model.gguf> --input prompt.txt --output result.txt \
    --max-tokens 200 --backend ggml_metal

# Text inference on Linux + NVIDIA GPU
./InferenceConsole --model <model.gguf> --input prompt.txt --output result.txt \
    --max-tokens 200 --backend ggml_cuda

# Image inference (Gemma 3/4, Qwen 3.5)
./InferenceConsole --model <model.gguf> --image photo.png --backend ggml_metal

# Video inference (Gemma 4)
./InferenceConsole --model <model.gguf> --video clip.mp4 --backend ggml_metal

# Audio inference (Gemma 4)
./InferenceConsole --model <model.gguf> --audio speech.wav --backend ggml_metal

# Thinking / reasoning mode
./InferenceConsole --model <model.gguf> --input prompt.txt --backend ggml_metal --think

# Tool calling
./InferenceConsole --model <model.gguf> --input prompt.txt --backend ggml_metal \
    --tools tools.json

# With sampling parameters
./InferenceConsole --model <model.gguf> --input prompt.txt --backend ggml_metal \
    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.2 --seed 42

# Batch processing (JSONL)
./InferenceConsole --model <model.gguf> --input-jsonl requests.jsonl \
    --output results.txt --backend ggml_metal
```

Command-line options:
| Option | Description |
|---|---|
| `--model <path>` | Path to a GGUF model file (required) |
| `--input <path>` | Text file containing the user prompt |
| `--input-jsonl <path>` | JSONL file with batch requests (one JSON object per line) |
| `--output <path>` | Write generated text to this file |
| `--image <path>` | Image file for vision inference |
| `--video <path>` | Video file for video inference |
| `--audio <path>` | Audio file (WAV, MP3, OGG) for audio inference |
| `--mmproj <path>` | Path to the multimodal projector GGUF file |
| `--max-tokens <N>` | Maximum tokens to generate (default: 100) |
| `--backend <type>` | Compute backend: `cpu`, `ggml_cpu`, `ggml_metal`, or `ggml_cuda` |
| `--think` | Enable thinking/reasoning mode (chain-of-thought) |
| `--tools <path>` | JSON file with tool/function definitions |
| `--temperature <f>` | Sampling temperature (0 = greedy) |
| `--top-k <N>` | Top-K filtering (0 = disabled) |
| `--top-p <f>` | Nucleus sampling threshold (1.0 = disabled) |
| `--min-p <f>` | Minimum probability filtering (0 = disabled) |
| `--repeat-penalty <f>` | Repetition penalty (1.0 = none) |
| `--presence-penalty <f>` | Presence penalty (0 = disabled) |
| `--frequency-penalty <f>` | Frequency penalty (0 = disabled) |
| `--seed <N>` | Random seed (-1 = non-deterministic) |
| `--stop <string>` | Stop sequence (can be repeated) |
| `--test` | Run built-in test suite |
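To illustrate how these sampling options compose, here is a simplified Python sketch of the pipeline (temperature scaling, then top-k, top-p, and min-p filtering, then a seeded draw). It is an illustration only, not the actual `TokenSampler.cs` implementation, which also applies repetition, presence, and frequency penalties before this stage.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, min_p=0.0, seed=None):
    """Simplified sketch of temperature/top-k/top-p/min-p sampling."""
    if temperature == 0:
        # Greedy: pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature scaling, then softmax (numerically stabilized).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    # (probability, token_id) pairs, most probable first.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Top-k: keep only the k most probable tokens (0 = disabled).
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    if top_p < 1.0:
        kept, mass = [], 0.0
        for p, i in probs:
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    # Min-p: drop tokens below min_p times the top probability (0 = disabled).
    if min_p > 0:
        cutoff = min_p * probs[0][0]
        probs = [(p, i) for p, i in probs if p >= cutoff]
    # Renormalize over the survivors and draw.
    rng = random.Random(seed)
    total = sum(p for p, _ in probs)
    r = rng.random() * total
    for p, i in probs:
        r -= p
        if r <= 0:
            return i
    return probs[-1][1]
```

Note that with `--temperature 0` the draw collapses to a greedy argmax, which is why the seed only matters when temperature is greater than zero.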
The multimodal projector file is auto-detected if placed alongside the model file with a recognized name (e.g., gemma-4-mmproj-F16.gguf).
JSONL input format:

Each line is a JSON object with `messages`, an optional `prompt`, and optional sampling parameters:

```json
{"id": "q1", "messages": [{"role": "user", "content": "What is 2+3?"}], "max_tokens": 50}
{"id": "q2", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 100, "temperature": 0.8}
```

```bash
cd InferenceWeb/bin

# Set environment variables and run
MODEL_DIR=./models BACKEND=ggml_metal ./InferenceWeb

# Linux + NVIDIA GPU
MODEL_DIR=./models BACKEND=ggml_cuda ./InferenceWeb
```

Open http://localhost:5000 in your browser. The web interface supports:
- Multi-turn chat conversations
- Model selection from available GGUF files in `MODEL_DIR`
- Image, video, and audio uploads for multimodal inference (up to 500 MB)
- Thinking/reasoning mode toggle
- Tool calling with function definitions
- Streaming token generation via Server-Sent Events
- Request queue with real-time position feedback
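Streamed generation has to be reassembled on the client side. As a sketch, assuming Ollama-style newline-delimited JSON chunks (a `response` fragment per line and `"done": true` on the final chunk; the browser UI instead consumes Server-Sent Events, and the exact chunk schema may differ):

```python
import json

def accumulate_stream(lines):
    """Assemble the full response text from streamed NDJSON chunks.

    Assumes each chunk carries a "response" fragment and a "done" flag,
    as in the Ollama-style streaming format.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)
```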
Environment variables:
| Variable | Description |
|---|---|
| `MODEL_DIR` | Directory containing GGUF model files |
| `BACKEND` | Compute backend: `cpu`, `ggml_cpu`, `ggml_metal`, or `ggml_cuda` (default: `ggml_metal` on macOS, `ggml_cpu` elsewhere) |
| `PORT` | HTTP port (default: 5000) |
InferenceWeb exposes three API styles. See API_EXAMPLES.md for full documentation with curl and Python examples.
Ollama-compatible API:

```bash
# List models
curl http://localhost:5000/api/tags

# Generate text
curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf", "prompt": "Hello!", "stream": false}'

# Chat
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf", "messages": [{"role": "user", "content": "Hi"}], "stream": false}'

# Chat with thinking mode
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf", "messages": [{"role": "user", "content": "Solve 17*23"}], "think": true, "stream": false}'

# Chat with tool calling
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf", "messages": [{"role": "user", "content": "What is the weather?"}], "tools": [{"function": {"name": "get_weather", "description": "Get current weather", "parameters": {"properties": {"city": {"type": "string"}}, "required": ["city"]}}}], "stream": false}'
```

OpenAI-compatible API:
```bash
# Chat completions
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'
```

OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[{"role": "user", "content": "What is 2+3?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```

Queue status:
```bash
curl http://localhost:5000/api/queue/status
# {"busy":false,"pending_requests":0,"total_processed":42}
```

Models that support thinking mode (Qwen 3, Qwen 3.5, Gemma 4) can produce structured chain-of-thought reasoning before generating the final answer. The thinking content is separated from the main response and can be displayed or hidden by the client.
- Qwen 3 / Qwen 3.5: uses `<think>...</think>` tags
- Gemma 4: uses `<|channel>thought\n...<channel|>` tags

Enable via `--think` (console), `"think": true` (Ollama API), or the thinking toggle in the web UI.
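Client code that wants to display or hide the reasoning separately can split the raw output on these tags. A minimal sketch for the Qwen-style format (`split_thinking` is a hypothetical helper for illustration, not part of the library; the engine's own separation lives in OutputParser.cs):

```python
import re

# Matches a Qwen 3 / Qwen 3.5 thinking block, including multi-line content.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw):
    """Return (thinking, answer) extracted from raw Qwen-style model output.

    If no <think> block is present, thinking is empty and the whole
    output is treated as the answer.
    """
    m = THINK_RE.search(raw)
    if not m:
        return "", raw.strip()
    thinking = m.group(1).strip()
    # The answer is everything outside the thinking block.
    answer = (raw[:m.start()] + raw[m.end():]).strip()
    return thinking, answer
```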
Models can invoke user-defined tools and participate in multi-turn tool-call conversations. Define tools as JSON and pass them via `--tools` (console) or the `tools` parameter in the API.
Each architecture uses its own wire format for tool calls:
- Qwen 3 / Qwen 3.5: `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`
- Gemma 4: `<|tool_call>call:function_name{args}<tool_call|>`
The output parser (`OutputParser.cs`) automatically extracts tool calls from the model's raw output regardless of architecture.
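For the Qwen-style format, that extraction amounts to locating `<tool_call>` blocks and parsing the JSON inside each one. A sketch for illustration (`extract_tool_calls` is a hypothetical helper, not the OutputParser.cs API):

```python
import json
import re

# Matches one Qwen-style tool-call block; DOTALL allows multi-line JSON.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(raw):
    """Return a list of {"name": ..., "arguments": ...} dicts parsed from
    Qwen-style <tool_call> blocks in raw model output."""
    calls = []
    for block in TOOL_CALL_RE.findall(raw):
        calls.append(json.loads(block))
    return calls
```

In a multi-turn tool-call conversation the client would execute each extracted call, append the result as a tool message, and send the conversation back for the model to continue.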
Gemma 4 models support image, video, and audio inputs. Place the multimodal projector (`gemma-4-mmproj-F16.gguf`) in the same directory as the model file for automatic loading.
- Images: PNG, JPEG
- Video: MP4 (extracts up to 8 frames at 1 fps using OpenCV)
- Audio: WAV (16kHz mono), MP3, OGG Vorbis
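The "up to 8 frames at 1 fps" video policy above can be expressed as a small index-selection step. A sketch under stated assumptions (`frame_indices` is a hypothetical helper; the real extraction in MediaHelper.cs uses OpenCV to decode the frames):

```python
def frame_indices(total_frames, video_fps, max_frames=8, sample_fps=1.0):
    """Choose which frame indices to extract when sampling a video at
    sample_fps, capped at max_frames (the Gemma 4 policy: up to 8 frames
    at 1 fps)."""
    # Number of source frames between consecutive samples.
    step = max(1, round(video_fps / sample_fps))
    # Evenly spaced indices from the start, truncated to the cap.
    return list(range(0, total_frames, step))[:max_frames]
```

For a 10-second clip at 30 fps this selects every 30th frame and stops after 8 frames, so only the first 8 seconds contribute.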
These models support image inputs with their respective multimodal projector files.
TensorSharp is structured as a layered system:

- TensorSharp provides the core `Tensor` type, storage abstraction, and an extensible operation registry (`Ops`). CPU implementations use `System.Numerics.Vectors` for SIMD acceleration.
- TensorSharp.GGML registers accelerated implementations of the same operations via a native C++ bridge (`libGgmlOps`) that links against ggml. On macOS this provides Metal GPU compute; on Linux it can expose GGML CUDA for NVIDIA GPUs. Operations include native quantized matmul (Q4_K_M, Q8_0, etc.) without dequantizing to FP32.
- InferenceEngine implements model-specific logic: GGUF parsing, tokenization (SentencePiece BPE), chat template rendering (Jinja2 from GGUF metadata with hardcoded fallbacks), configurable token sampling, output parsing (thinking and tool-call extraction), and the forward pass for each architecture. Models are loaded via `ModelBase.Create()`, which auto-detects the architecture from GGUF metadata.
- InferenceConsole and InferenceWeb are application layers that handle I/O and user interaction. InferenceWeb provides Ollama-compatible and OpenAI-compatible REST APIs alongside a browser-based chat UI, with a FIFO inference queue to serialize concurrent requests.
- Fused GPU decode (Gemma 4): all transformer layers are executed in a single GGML compute graph dispatch on Metal, reducing CPU-GPU round-trips from hundreds per token to one. This achieves ~2.6x speedup over per-operation dispatch.
- Fused weight projections: Q/K/V projections are fused into a single QKV matmul; gate and up projections are fused into a single gate_up matmul.
- Native quantized compute: quantized weights (Q4_K_M, Q6_K, Q8_0, etc.) are used directly in matmul without expanding to FP32, saving memory and bandwidth.
- Optimized pure C# CPU path: managed GEMM fast paths and contiguous float32 kernels accelerate decode, softmax, RMSNorm, RoPE, fused activations, and other hot paths while keeping quantized GGUF weights compressed during CPU loading.
- Circular KV cache: sliding-window attention layers use a fixed-size circular buffer, bounding memory usage regardless of sequence length.
- Memory-efficient model loading: large tensors are streamed directly to native memory without intermediate managed allocations.
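The circular KV cache above can be illustrated with a toy ring buffer: absolute token positions map onto `pos % window` slots, so the newest entry overwrites the oldest and memory stays fixed regardless of sequence length. A sketch with an assumed layout (the real cache holds per-layer K/V tensors, not Python lists):

```python
class CircularKVCache:
    """Toy fixed-size circular KV buffer for sliding-window attention."""

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.next_pos = 0  # absolute position of the next token

    def append(self, k, v):
        # The slot of the oldest entry is overwritten once the buffer is full.
        slot = self.next_pos % self.window
        self.keys[slot] = k
        self.values[slot] = v
        self.next_pos += 1

    def visible(self):
        """Return the (key, value) pairs inside the window, oldest first."""
        start = max(0, self.next_pos - self.window)
        return [(self.keys[p % self.window], self.values[p % self.window])
                for p in range(start, self.next_pos)]
```

Attention for the current token then only ever scans `window` entries, which is what bounds memory (and per-token work) for sliding-window layers.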
Integration tests for InferenceWeb are in InferenceWeb/testdata/. They cover all three API styles (Web UI SSE, Ollama, OpenAI), multi-turn conversations, thinking mode, tool calling, queue behavior, concurrent requests, and abort support.
```bash
# Start InferenceWeb, then run:
python3 InferenceWeb/testdata/test_multiturn.py
# or
bash InferenceWeb/testdata/test_multiturn.sh
```

See InferenceWeb/testdata/README.md for the full test matrix.
Zhongkai Fu
See LICENSE for details.