hfl

Run HuggingFace models locally like Ollama.

Versión en Español

Why HFL?

Ollama has a curated catalog of ~500 models. HuggingFace Hub has 500,000+.

If you want to run a model that isn't in Ollama's catalog — a specific fine-tune, a recent release from a small lab, a niche model — you have to manually download from HuggingFace, convert to GGUF with llama.cpp, quantize, and configure inference. HFL automates all of this in a single command.

Feature	Ollama	HFL
Model catalog	~500 curated	500K+ (all HF Hub)
Auto-conversion	Not needed (pre-converted)	Yes (safetensors→GGUF)
Ease of use	Excellent	Good
OpenAI API compatible	Yes	Yes
Ollama API compatible	Native	Yes (drop-in)
Anthropic Messages API	No	Yes (Claude Code compatible)
Structured tool calling	Yes	Yes (qwen / llama3 / mistral)
Multi-backend	llama.cpp only	llama.cpp + Transformers + vLLM
License verification	No	Yes (5 risk levels)
Legal traceability	No	Yes (provenance log)
Maturity	High (established)	Beta (v0.13.0)

HFL doesn't compete with Ollama — it complements it. Use Ollama for curated models; use HFL when you need something from the full HuggingFace ecosystem.

What's new in V4

V4 (2026-05) ships features that exploit HFL's HF-Hub-native nature — things Ollama can't structurally have because of its curated registry model:

hfl discover — filter the live HF Hub (1.5M+ models) by family, quant, multimodal, license, popularity. Marks what you already have locally.
hfl recommend — picks top-N models that fit your hardware (probes RAM/VRAM/MLX, scores by hardware fit + capability + popularity + recency).
hfl pull-smart — given a base repo, finds the best community variant (mlx-community/...-4bit on Apple Silicon, bartowski/...-GGUF/Q5_K_M on CUDA, smallest fitting quant on CPU).
hfl verify — 5 sanity probes against a freshly-pulled model in seconds: tokenizer round-trip, chat-template render, smoke generation, tool-parser, embedding dim.
hfl bench — TTFT + tok/s + p50/p95 with golden prompts (16/256/2048 chars).
hfl lora apply|remove|list — hot-swap LoRA adapters at fractional scales without reloading base weights.
hfl snapshot save|load|list|delete — persist KV cache to disk for warm-start across restarts. Format-versioned.
hfl compliance-dashboard — license-risk overview of the local registry, gated repos pending HF_TOKEN, EU AI Act warnings.
hfl draft-recommend — auto-pick a small Hub sibling for speculative decoding. Measured 1.33× speedup on Qwen3-14B + 0.6B draft for structured prompts.
REST POST /v1/responses — OpenAI Responses API (2025 surface used by client.responses.create()).
WebSocket /ws/chat — bidirectional with frame-level cancellation (vs HTTP streaming where cancel = TCP close).
REST POST /api/push — upload a registered model to the HF Hub (closes the V1/V2 "out-of-scope" gap).

See docs/v4.md for the full V4 guide and docs/env-vars.md for the env var matrix (every HFL_* knob has an OLLAMA_* fallback so a drop-in replacement of an Ollama install works without re-reading the docs).

Features

CLI & API: Full CLI interface plus REST API compatible with OpenAI, Ollama, and Anthropic
Model Search: Interactive paginated search of HuggingFace Hub (like more)
Multiple Backends: llama.cpp (GGUF/CPU), Transformers (GPU native), vLLM (production)
Automatic Conversion: Downloads HuggingFace models and converts to GGUF automatically
Smart Quantization: Supports Q2_K through F16 quantization levels
Text-to-Speech: Native TTS support with Bark, SpeechT5, Coqui XTTS and more
Structured tool calling: Ollama-compatible tools / tool_calls wire protocol with per-family parsers for qwen, llama3, and mistral — agents work out of the box
Bounded inference queue: server-side serialisation of requests with explicit 429 / 503 backpressure, live X-Queue-Depth headers, and GET /healthz for orchestrators
Drop-in Compatible: Works as a replacement for Ollama with existing tooling
Internationalized: Full i18n support (English, Spanish) - set HFL_LANG to change language

How It Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                              HFL Architecture                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────┐         ┌──────────────────┐         ┌─────────────────┐   │
│  │  hfl pull   │───────▶ │  HuggingFace Hub │───────▶ │  Local Storage  │   │
│  │             │         │                  │         │   ~/.hfl/       │   │
│  └─────────────┘         │  • Search API    │         │   ├── models/   │   │
│        │                 │  • Download      │         │   ├── cache/    │   │
│        │                 │  • License info  │         │   └── registry  │   │
│        ▼                 └──────────────────┘         └─────────────────┘   │
│  ┌─────────────┐                                              │             │
│  │  Converter  │◀─────────────────────────────────────────────┘             │
│  │             │                                                            │
│  │ safetensors │──────────▶ GGUF (quantized Q2_K...F16)                     │
│  └─────────────┘                                                            │
│                                                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────┐         ┌──────────────────┐         ┌─────────────────┐   │
│  │  hfl run    │───────▶ │ Inference Engine │───────▶ │  Interactive    │   │
│  │             │         │                  │         │     Chat        │   │
│  └─────────────┘         │  • llama.cpp     │         └─────────────────┘   │
│                          │  • Transformers  │                               │
│                          │  • vLLM          │                               │
│                          └──────────────────┘                               │
│                                                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────┐         ┌──────────────────┐         ┌─────────────────┐   │
│  │  hfl serve  │───────▶ │   REST API       │───────▶ │  OpenAI SDK /   │   │
│  │             │         │                  │         │  Ollama clients │   │
│  └─────────────┘         │  • /v1/chat/...  │         └─────────────────┘   │
│                          │  • /api/chat     │                               │
│                          │  • /api/generate │                               │
│                          └──────────────────┘                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Flow Summary:

Pull: Download from HuggingFace Hub → Convert to GGUF (if needed) → Store locally
Run: Load model into inference engine → Start interactive chat session
Serve: Start API server → Accept OpenAI/Ollama-compatible requests

Prerequisites

Python 3.10+ (required)
git (for cloning llama.cpp during first conversion)
cmake and C++ compiler (for building llama.cpp quantization tools)
- macOS: xcode-select --install
- Ubuntu/Debian: sudo apt install build-essential cmake
- Windows: Install Visual Studio Build Tools

Note: Build tools are only needed if you convert safetensors models to GGUF. If you only use pre-quantized GGUF models, they're not required.

Installation

# Clone the repository
git clone https://github.com/ggalancs/hfl
cd hfl

# Basic installation (CPU + GGUF)
pip install .

# With GPU support (Transformers + bitsandbytes)
pip install ".[transformers]"

# With TTS support (Bark, SpeechT5)
pip install ".[tts]"

# With Coqui TTS (XTTS-v2, VITS)
pip install ".[coqui]"

# With vLLM for production
pip install ".[vllm]"

# Everything
pip install ".[all]"

Quick Start

Download a Model

# Download with default Q4_K_M quantization
hfl pull meta-llama/Llama-3.3-70B-Instruct

# Specify quantization level
hfl pull meta-llama/Llama-3.3-70B-Instruct --quantize Q5_K_M

# Keep as safetensors (for GPU inference)
hfl pull mistralai/Mistral-7B-Instruct-v0.3 --format safetensors

# Download with a custom alias for easier reference
hfl pull meta-llama/Llama-3.3-70B-Instruct --alias llama70b

Interactive Chat

# Start chat with a model
hfl run llama-3.3-70b-instruct-q4_k_m

# With system prompt
hfl run llama-3.3-70b-instruct-q4_k_m --system "You are a Python expert"

API Server

# Start server (default port 11434, same as Ollama)
hfl serve

# Pre-load a model
hfl serve --model llama-3.3-70b-instruct-q4_k_m

# Custom host/port
hfl serve --host 0.0.0.0 --port 8080

Text-to-Speech (TTS)

HFL supports TTS models from HuggingFace like Bark, SpeechT5, and Coqui XTTS.

# Download a TTS model (no GGUF conversion needed)
hfl pull suno/bark-small --alias bark

# Synthesize text to audio file
hfl tts bark "Hello, this is a test." -o output.wav

# Synthesize and play directly (requires sounddevice)
hfl speak bark "Hello, this is a test."

# With options
hfl tts bark "Hola mundo" --lang es --output spanish.wav --speed 0.9
hfl speak bark "Fast speech" --speed 1.5

TTS Options:

--output, -o: Output file path (default: output.wav)
--lang, -l: Language code (en, es, fr, etc.)
--voice, -v: Voice/speaker to use
--speed, -s: Speed multiplier (0.25-4.0)
--rate, -r: Sample rate in Hz
--format, -f: Audio format (wav, mp3, ogg)

TTS API:

# OpenAI-compatible endpoint
curl -X POST http://localhost:11434/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "bark", "input": "Hello world", "voice": "alloy"}' \
  --output speech.wav

# Native HFL endpoint
curl -X POST http://localhost:11434/api/tts \
  -H "Content-Type: application/json" \
  -d '{"model": "bark", "text": "Hello world", "language": "en"}' \
  --output speech.wav

Search Models on HuggingFace

# Search for models (paginated like 'more')
hfl search llama

# Search only models with GGUF files
hfl search mistral --gguf

# Customize pagination and results
hfl search phi --limit 50 --page-size 5

# Sort by likes instead of downloads
hfl search qwen --sort likes

Navigation controls:

SPACE / ENTER - Next page
p - Previous page
q / ESC - Exit

Model Management

# List all local models
hfl list

# Show model details
hfl inspect llama-3.3-70b-instruct-q4_k_m

# Remove a model
hfl rm llama-3.3-70b-instruct-q4_k_m

# Set an alias for an existing model
hfl alias llama-3.3-70b-instruct-q4_k_m llama70b

# Now use the alias in any command
hfl run llama70b
hfl inspect llama70b

API Endpoints

OpenAI-Compatible

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct-q4_k_m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Ollama-Compatible

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama-3.3-70b-instruct-q4_k_m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Using OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct-q4_k_m",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)

Tool Calling (Agents)

HFL implements the Ollama wire protocol for structured tool calling, so agents written against the Ollama Python SDK can drive multi-turn tool loops directly. When a client sends tools on /api/chat, HFL forwards them through the model's native chat template (qwen3 <tool_call>, llama3 <|python_tag|>, mistral [TOOL_CALLS]), parses the reply into canonical message.tool_calls with arguments as a parsed object, and accepts role: "tool" results on the next turn.

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-q4_k_m",
    "stream": false,
    "messages": [
      {"role":"system","content":"You MUST call write_wiki, never respond with text."},
      {"role":"user","content":"Save Hello at topics/hello.md"}
    ],
    "tools":[{
      "type":"function",
      "function":{
        "name":"write_wiki",
        "description":"Create or overwrite a wiki article",
        "parameters":{
          "type":"object",
          "properties":{"path":{"type":"string"},"content":{"type":"string"}},
          "required":["path","content"]
        }
      }
    }]
  }'

Response:

{
  "model": "qwen3-32b-q4_k_m",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "write_wiki",
          "arguments": {"path": "topics/hello.md", "content": "Hello"}
        }
      }
    ]
  },
  "done": true
}

The per-family parser also handles a generic {"tool_call": {...}} fallback for templates that weren't properly applied. Streaming (stream: true) accumulates the full reply and emits tool_calls on the final done: true chunk. Full spec is in hfl-tool-calling-spec.md and the acceptance suite lives at tests/test_tool_calling_acceptance.py.

Concurrency & Backpressure

Local inference backends (llama.cpp, transformers-GPU) share a single non-reentrant model instance. HFL protects them with an in-server inference dispatcher that serialises requests on the currently-loaded engine with a bounded wait queue:

Setting	Env var	Default	Meaning
Max in-flight	`HFL_QUEUE_MAX_INFLIGHT`	`1`	Parallel requests allowed
Wait queue size	`HFL_QUEUE_MAX_SIZE`	`16`	Requests allowed to wait
Acquire timeout	`HFL_QUEUE_ACQUIRE_TIMEOUT`	`60`	Seconds a request may wait
Enabled	`HFL_QUEUE_ENABLED`	`true`	Master switch

When the wait queue is saturated, HFL returns 429 with a structured envelope and Retry-After:

{
  "error": "Inference queue is full",
  "code": "QUEUE_FULL",
  "category": "rate_limit",
  "retryable": true,
  "details": {"retry_after_seconds": 60, "queue_depth": 1, "max_queued": 1}
}

When a caller has been queued longer than HFL_QUEUE_ACQUIRE_TIMEOUT, HFL returns 503 with code=QUEUE_TIMEOUT. Every response carries X-Queue-Depth, X-Queue-In-Flight, X-Queue-Max-Inflight and X-Queue-Max-Size so agents can back off proportionally. Live state is also available via:

curl http://localhost:11434/healthz
# { "status":"ok", "models_loaded":[...], "queue_depth":0,
#   "queue_in_flight":0, "uptime_seconds":12345 }

All three API surfaces (Ollama, OpenAI, Anthropic) share the same dispatcher, so a slow call on /api/chat correctly blocks /v1/chat/completions and /v1/messages.

Quantization Levels

Level	Bits/weight	Quality	Use Case
Q2_K	~2.5	~80%	Extreme compression
Q3_K_M	~3.5	~87%	Low RAM
Q4_K_M	~4.5	~92%	Default - best balance
Q5_K_M	~5.0	~96%	High quality
Q6_K	~6.5	~97%	Premium
Q8_0	~8.0	~98%+	Maximum quantized quality
F16	16.0	100%	No quantization

RAM Requirements

RAM needed ≈ (parameters × bits_per_weight) / 8 + 2GB overhead

Example: Llama 3.3 70B with Q4_K_M
= (70B × 4.5) / 8 + 2GB ≈ 41.4 GB

Model Size	Q4_K_M RAM	Recommended Hardware
7B	~5 GB	8 GB RAM
13B	~9 GB	16 GB RAM
30B	~20 GB	32 GB RAM
70B	~42 GB	48 GB+ RAM or GPU

Authentication

Configure your HuggingFace token for faster downloads and access to gated models:

# Interactive login (recommended - stores token securely)
hfl login

# Or use environment variable (more private - not persisted)
export HF_TOKEN=hf_your_token_here

Get your token at: https://huggingface.co/settings/tokens

Configuration

Environment variables:

HFL_HOME: Data directory (default: ~/.hfl)
HF_TOKEN: HuggingFace token for gated models (alternative to hfl login)
HFL_LANG: Interface language (en for English, es for Spanish). Defaults to English.

Language Support

hfl supports multiple languages. Set the HFL_LANG environment variable to change the CLI language:

# Use Spanish
export HFL_LANG=es
hfl --help

# Use English (default)
export HFL_LANG=en
hfl --help

Supported languages: English (en), Spanish (es)

Known Limitations

This is a v0.3.x alpha release. Known limitations include:

vLLM backend is experimental: Basic implementation without full streaming support
CORS is restrictive by default: same-origin only; opt in via cors_allow_all or explicit cors_origins
Windows support: Not fully tested; Unix-like systems recommended

API Authentication

The API server supports optional authentication via the --api-key flag:

# Start server with authentication
hfl serve --api-key your-secret-key

# Client requests must include the key
curl -H "Authorization: Bearer your-secret-key" http://localhost:11434/v1/models
# Or
curl -H "X-API-Key: your-secret-key" http://localhost:11434/v1/models

Documentation

Complete architecture documentation with diagrams is available:

📖 View Architecture Documentation - Interactive HTML documentation with architecture diagrams, module descriptions, and flow charts

The documentation covers:

System architecture and design patterns
Module structure and dependencies
Inference engine selection logic
GGUF conversion pipeline
Legal compliance features
API endpoints reference

Note: Documentation is also available in Spanish.

Development

# Clone and install in development mode
git clone https://github.com/ggalancs/hfl
cd hfl
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=hfl --cov-report=term-missing

# Format code
ruff format .
ruff check . --fix

Legal Notices

Export Compliance

hfl only downloads publicly available open-weight models from HuggingFace Hub. Users are responsible for compliance with applicable export control regulations in their jurisdiction.

hfl does not facilitate access to closed-weight or export-controlled model weights.

Model Licenses

Models downloaded through hfl may have their own license restrictions. hfl displays license information before download and stores it with the model metadata. Users are responsible for complying with model licenses.

Common restrictions include:

Non-commercial use only (CC-BY-NC, MRL)
Attribution required (Llama, Gemma)
Usage restrictions (OpenRAIL)

Use hfl inspect <model> to view license details for downloaded models.

Disclaimer

AI models may generate inaccurate, biased, or inappropriate content. Users are solely responsible for evaluating and using model outputs appropriately. See DISCLAIMER.md for full details.

Trademarks

"OpenAI" is a trademark of OpenAI, Inc. "Ollama" is a trademark of Ollama, Inc. "Hugging Face" and the Hugging Face logo are trademarks of Hugging Face, Inc. These marks are used here for identification purposes only.

hfl is an independent project and is not affiliated with, endorsed by, or officially connected to Hugging Face, Inc., OpenAI, Inc., or Ollama, Inc. References to these services describe technical interoperability only.

License

hfl is source-available under the hfl Responsible Use License (HRUL) v1.0.

This license allows free use, modification, and commercial distribution with one condition: derivative works that are publicly distributed must maintain the legal compliance features (license checking, AI disclaimers, provenance tracking, privacy protections, and gating respect).

You are free to rewrite, extend, rebrand, and sell derivatives — you just can't strip out the safety features.

Note: The HRUL is not an OSI-approved open-source license. It is a source-available license with responsible use requirements, inspired by Apache 2.0, GPL copyleft, and the RAIL family of AI licenses.

See LICENSE for the full text and LICENSE-FAQ.md for common questions.

Name		Name	Last commit message	Last commit date
Latest commit History 270 Commits
.github		.github
docs		docs
packaging		packaging
scripts		scripts
src/hfl		src/hfl
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitleaks.toml		.gitleaks.toml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DISCLAIMER.md		DISCLAIMER.md
Dockerfile		Dockerfile
LICENSE		LICENSE
LICENSE-DEPENDENCIES.md		LICENSE-DEPENDENCIES.md
LICENSE-FAQ.md		LICENSE-FAQ.md
NOTICE-EU-AI-ACT.md		NOTICE-EU-AI-ACT.md
OLLAMA_PARITY_PLAN.md		OLLAMA_PARITY_PLAN.md
OLLAMA_PARITY_PLAN_V2.md		OLLAMA_PARITY_PLAN_V2.md
PRIVACY.md		PRIVACY.md
README.es.md		README.es.md
README.md		README.md
SECURITY.md		SECURITY.md
docker-compose.yml		docker-compose.yml
hfl.spec		hfl.spec
pyproject.toml		pyproject.toml

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

hfl

Why HFL?

What's new in V4

Features

How It Works

Prerequisites

Installation

Quick Start

Download a Model

Interactive Chat

API Server

Text-to-Speech (TTS)

Search Models on HuggingFace

Model Management

API Endpoints

OpenAI-Compatible

Ollama-Compatible

Using OpenAI Python SDK

Tool Calling (Agents)

Concurrency & Backpressure

Quantization Levels

RAM Requirements

Authentication

Configuration

Language Support

Known Limitations

API Authentication

Documentation

Development

Legal Notices

Export Compliance

Model Licenses

Disclaimer

Trademarks

License

About

Topics

Resources

License

Licenses found

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages