AnimAIOS - AnimaRouter
One endpoint. Any provider. Every free tier. Smart routing that learns.
Fork of MLuqmanBR/api-gateway (which itself forks tashfeenahmed/freellmapi). What follows is only what AnimaRouter adds on top of api-gateway.
- What AnimaRouter adds
- In Development
- Quick Start
- Docker
- Cloud Proxy
- Using the API
- Context Handoff
- Supported Providers
- Custom Platforms and Models
- Screenshots
- Limitations
- Contributing
- Terms of Service Review
- Disclaimer
- License
Everything below is on top of MLuqmanBR/api-gateway — which already ships OpenAI-compatible routing, Thompson-sampling bandits, per-key rate tracking, recovery, custom provider CRUD, context handoff, encrypted storage, dark-mode dashboard, Docker support, and more. If it's not listed here, api-gateway already has it.
| Feature | What | Why it matters |
|---|---|---|
| Dynamic Degradation | Severity-weighted progressive penalties replacing the flat rateLimitFactor. 429 → minor (2 min half-life), 5xx → major (15 min), consecutive hard failures → critical (60 min). Compounding exponential penalties + sigmoid degradation factor. Dashboard-visible + env-configurable. | Flat rate-limit penalties can't distinguish a polite retry from a burning server. Degradation matches the penalty to the problem. (spec: docs/specs/dynamic-degradation/) |
| Strict model pinning | When you pin a model, it stays pinned — even on error. | Upstream silently falls through to a different model on failure, which breaks workflows that depend on a specific model's behaviour. |
| Free-only enforcement | OpenRouter and OpenCode routes restricted to :free models only. Monthly token budget system removed in favour of simpler quotas. | AnimaRouter is about stacking free tiers; accidentally hitting a paid endpoint through those routes costs real money. |
| Keyless custom providers | Local servers that don't need auth can be added without a key. | Upstream requires a key for every provider — a blocker for self-hosted/Ollama-style endpoints that have no auth layer. |
| Archive instead of hard-delete | Custom providers and models get archived (restorable), not cascade-deleted. | Upstream cascade-deletes on removal. One misclick and your whole custom provider config is gone. Archive is reversible. |
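The Dynamic Degradation row can be made concrete with a minimal Python sketch of severity-weighted penalties that decay with a half-life and are squashed through a sigmoid. Only the three half-lives (2, 15, 60 minutes) come from the table above; the weights, `MIDPOINT`, `STEEPNESS`, the failure threshold, and every function name are illustrative assumptions, not AnimaRouter's actual implementation.

```python
import math

# Half-lives in minutes, taken from the feature table.
HALF_LIFE_MIN = {"minor": 2.0, "major": 15.0, "critical": 60.0}
# Hypothetical severity weights and sigmoid shape.
WEIGHT = {"minor": 1.0, "major": 3.0, "critical": 6.0}
MIDPOINT = 3.0
STEEPNESS = 1.5


def classify(status: int, consecutive_hard_failures: int) -> str:
    """Map a failure to a severity bucket (the threshold of 3 is assumed)."""
    if consecutive_hard_failures >= 3:
        return "critical"
    if status >= 500:
        return "major"
    return "minor"  # e.g. a 429


def penalty(events: list[tuple[float, str]], now_min: float) -> float:
    """Sum the event weights, halving each one per elapsed half-life."""
    total = 0.0
    for t_min, severity in events:
        age = max(0.0, now_min - t_min)
        total += WEIGHT[severity] * 0.5 ** (age / HALF_LIFE_MIN[severity])
    return total


def degradation_factor(p: float) -> float:
    """Sigmoid in (0, 1): near 1 when healthy, near 0 when heavily penalised."""
    return 1.0 / (1.0 + math.exp(STEEPNESS * (p - MIDPOINT)))
```

In this sketch a router would multiply a model's routing score by `degradation_factor(penalty(events, now))`, so a single 429 barely registers while a run of 5xx errors pushes the provider far down the chain until the penalty decays.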
| Feature | What | Why it matters |
|---|---|---|
| Benchmark Unification | Merges Artificial Analysis, SWE-rebench, and NIMStats into a single composite intelligence score. Per-source columns, canonical model keys, incremental recomputation, configurable weights. | Upstream has no benchmark integration. A single composite score replaces guessing which single benchmark to trust. (spec: docs/specs/benchmark-unification/) |
| SWE-rebench for intelligence | Replaced SWE-bench with SWE-rebench for model intelligence scoring. | SWE-rebench is a more reliable, less-contaminated benchmark for real-world coding ability. |
| Reasoning token fairness | tokPerSec speed score includes reasoning/thinking tokens so reasoning models aren't unfairly penalised. | Upstream doesn't count reasoning tokens in the speed calculation: a thinking model that outputs 100 tok/s of reasoning + 30 tok/s of visible text looks "slower" than a 40 tok/s non-reasoning model. Unfair. (spec: docs/specs/reasoning-speed-fairness.md) |
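The arithmetic behind the reasoning-token fairness row is easy to check by hand. The sketch below is a hypothetical helper (the name `tok_per_sec` mirrors the score mentioned in the table, but the signature and the `count_reasoning` flag are assumptions) that computes throughput with and without reasoning tokens.

```python
def tok_per_sec(visible_tokens: int, reasoning_tokens: int, seconds: float,
                count_reasoning: bool = True) -> float:
    """Tokens per second, optionally counting reasoning/thinking tokens."""
    if seconds <= 0:
        return 0.0
    total = visible_tokens + (reasoning_tokens if count_reasoning else 0)
    return total / seconds


# The example from the table, over a 10-second response:
thinking_fair = tok_per_sec(300, 1000, 10.0)                            # 130.0
thinking_unfair = tok_per_sec(300, 1000, 10.0, count_reasoning=False)   # 30.0
plain_model = tok_per_sec(400, 0, 10.0)                                 # 40.0
```

Counting only visible text ranks the thinking model (30 tok/s) below the plain one (40 tok/s); counting all generated tokens ranks it above (130 tok/s).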
| Feature | What | Why it matters |
|---|---|---|
| CommandCode provider | Full NDJSON translation layer, built-in. | Upstream doesn't ship it. CommandCode has a non-standard streaming format that needs a dedicated adapter. |
| NVIDIA NIM hardening | Robust handling of NIM-specific API quirks. | NIM's API diverges from the OpenAI spec in subtle ways; without hardening, requests silently fail or return malformed data. |
| End-to-end thinking/reasoning passthrough | reasoning_content deltas, Gemini thought signatures, thinking block handling across all providers. | Reasoning models (DeepSeek, Gemini, etc.) emit thinking tokens that must be preserved end-to-end, not silently dropped. |
| max_output_tokens fallback | Uses the catalog max_output_tokens as the max_tokens fallback when the request doesn't specify one. | Prevents truncation when a client omits max_tokens but the provider enforces a default cap. |
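The max_output_tokens fallback amounts to a small request rewrite. A minimal sketch, assuming the request and the catalog entry are plain dicts (the function name and the dict shapes are hypothetical):

```python
def apply_max_tokens_fallback(request: dict, catalog_entry: dict) -> dict:
    """Return a copy of the request with max_tokens filled from the catalog if absent."""
    out = dict(request)  # never mutate the caller's request
    if out.get("max_tokens") is None and catalog_entry.get("max_output_tokens"):
        out["max_tokens"] = catalog_entry["max_output_tokens"]
    return out
```

An explicit `max_tokens` from the client always wins; the catalog value is only used when the field is missing.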
| Feature | What | Why it matters |
|---|---|---|
| Live model filter | Search/filter on Models page with punctuation normalisation (kimi k2.6 → kimi-k2.6). | Model names are inconsistent across providers; normalised search actually finds what you're looking for. |
| Enabled rows float to top | In the fallback chain, disabled models sink to the bottom. | Long chains are hard to scan when disabled entries clutter the top; float makes the active set visible at a glance. |
| Toast notifications | e.g. when auto-discovery adds new models. | Silent model additions are confusing — "where did that come from?" Toasts make changes visible. |
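One way to get the `kimi k2.6 → kimi-k2.6` behaviour of the live model filter is to normalise both the query and the model names before matching. The rule below (lowercase, then collapse runs of whitespace, underscores, and dashes into a single dash) is an assumption chosen to reproduce that one example; the dashboard's real rule may differ.

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace/underscore/dash runs into one dash."""
    return re.sub(r"[\s_-]+", "-", text.strip().lower())


def filter_models(query: str, model_ids: list[str]) -> list[str]:
    """Keep the model ids whose normalised form contains the normalised query."""
    needle = normalise(query)
    return [m for m in model_ids if needle in normalise(m)]
```

With this rule, typing `kimi k2` matches `kimi-k2.6` even though the query contains a space and the model id a dash.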
| Feature | What | Why it matters |
|---|---|---|
| api CLI | api start, api stop, api status, multi-instance tracking, log rotation. | Upstream has no CLI. Running the server means remembering node server/dist/index.js flags. |
| Server log rotation | Prevents disk fill from unbounded logs. | Long-running instances silently eat disk until something breaks. |
| DeepSource coverage | Integrated test-coverage reporting pipeline. | Continuous coverage visibility catches regressions before merge. |
| Spec | Status |
|---|---|
| docs/specs/dynamic-degradation/ | Implemented, dashboard controls pending polish |
| docs/specs/benchmark-unification/ | Composite scorer live, incremental recomputation in progress |
| docs/specs/reasoning-speed-fairness.md | Reasoning tokens counted in tokPerSec; per-model toggle pending |
| docs/specs/freellmproxy-integration/ | npm run proxy:up zero-config deploy replacing proxy:deploy |
| docs/specs/provider-outage-fast-fail/ | Spec complete: reactive within-request provider-level skip on repeated 5xx |
| docs/specs/provider-health-heartbeat/ | Spec complete: proactive cross-request periodic health pings feeding degradation engine |
| docs/specs/global-settings-panel/ | Spec complete: dashboard Settings tab to toggle experimental features with DB persistence |
Prerequisites: Node.js 20+, npm. (Docker also works — see Docker.)
git clone https://github.com/animaios/animarouter.git
cd animarouter
npm install
cp .env.example .env
# Generate an encryption key for at-rest key storage
ENCRYPTION_KEY="$(node -e 'console.log(require("crypto").randomBytes(32).toString("hex"))')"
printf "ENCRYPTION_KEY=%s\nPORT=3001\n" "$ENCRYPTION_KEY" > .env
npm run dev

Open http://localhost:5173 (the Vite dev UI), add your provider keys on the Keys page, reorder the Fallback Chain to taste, and grab your unified API key from the Keys page header. That unified key is what you point your OpenAI SDK at.

Reaching the dev UI from another device on your LAN? Use npm run dev:lan, which passes --host through to Vite. (Plain npm run dev -- --host does not work: the root dev script is a concurrently wrapper, so the flag never reaches Vite.)
For a production build without Docker:
npm run build
node server/dist/index.js # server + dashboard both served on :3001

ENCRYPTION_KEY is required for startup. The server only falls back to a database-stored development key when DEV_MODE=true and NODE_ENV is not production; do not use that fallback with real provider keys.
Request analytics are retained for 90 days or 100000 request rows by default, whichever limit prunes first. Set REQUEST_ANALYTICS_RETENTION_DAYS=0 or REQUEST_ANALYTICS_MAX_ROWS=0 in .env to disable either retention limit.
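The two retention limits combine as "keep a row only if it survives both". A rough sketch of that rule, assuming rows are represented by their timestamps (the function name and signature are hypothetical; the defaults and the "0 disables" behaviour come from the paragraph above):

```python
from datetime import datetime, timedelta, timezone


def rows_to_keep(timestamps: list[datetime], now: datetime,
                 retention_days: int = 90, max_rows: int = 100_000) -> list[datetime]:
    """Return the timestamps that survive both limits; 0 disables a limit."""
    kept = sorted(timestamps, reverse=True)  # newest first
    if retention_days > 0:
        cutoff = now - timedelta(days=retention_days)
        kept = [t for t in kept if t >= cutoff]
    if max_rows > 0:
        kept = kept[:max_rows]
    return kept


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
sample = [now - timedelta(days=d) for d in (1, 50, 100)]
```

With the defaults, the 100-day-old row in `sample` is pruned by the age limit; with `max_rows=1`, only the newest row survives.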
git clone https://github.com/animaios/animarouter.git
cd animarouter
# Generate an encryption key
ENCRYPTION_KEY="$(openssl rand -hex 32)"
printf "ENCRYPTION_KEY=%s\nPORT=3001\n" "$ENCRYPTION_KEY" > .env
docker compose up -d --build

Open http://localhost:3001. SQLite data is stored in the api-gateway-data volume at /app/server/data. Keep the same .env ENCRYPTION_KEY and volume when upgrading.

Reaching it from another machine? By default the container is published only on 127.0.0.1. To expose it on your LAN (e.g. a Raspberry Pi at http://192.168.1.x:3001), start it with HOST_BIND=0.0.0.0:

HOST_BIND=0.0.0.0 docker compose up -d --build

Only do this on a trusted network: the proxy is single-user and guarded only by the unified API key.
More Docker operations and examples live in docker/README.md.
AnimaRouter ships a Cloudflare Workers proxy layer for IP rotation and header stripping. Deploy it to route requests through geographically-distributed exit IPs so upstream providers see consistent, non-identifying IP addresses instead of your real one.
Prerequisites: wrangler installed and logged in (npm i -g wrangler && wrangler login).
npm run proxy:up

That's it. The first run automatically:
- Checks wrangler auth
- Initializes the freellmproxy git submodule
- Installs proxy dependencies
- Generates secure secrets (freellmproxy/.env)
- Deploys proxy workers + router to Cloudflare
- Detects and prints your working endpoint URL
After deployment, register the proxy as a custom provider in the dashboard:
- Base64url-encode your target URL: node -e "console.log(Buffer.from('https://api.example.com/v1').toString('base64url'))"
- Construct: https://{ROUTER_URL}/{AUTH_KEY}/{PROXY_NUM}/{BASE64_URL}
- Add as a custom provider with that URL as the base URL
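The same URL can be assembled in Python. This is a sketch under the assumption that the proxy expects exactly the four path segments shown above; the helper names and the example host and key are made up for illustration.

```python
import base64


def base64url(text: str) -> str:
    """URL-safe base64 without '=' padding, like Node's 'base64url' encoding."""
    raw = base64.urlsafe_b64encode(text.encode("utf-8"))
    return raw.rstrip(b"=").decode("ascii")


def proxy_base_url(router_url: str, auth_key: str, proxy_num: int, target_url: str) -> str:
    """Assemble https://{ROUTER_URL}/{AUTH_KEY}/{PROXY_NUM}/{BASE64_URL}."""
    host = router_url.removeprefix("https://").rstrip("/")
    return f"https://{host}/{auth_key}/{proxy_num}/{base64url(target_url)}"


# Hypothetical values; substitute the endpoint and key printed by the deploy.
print(proxy_base_url("router.example.workers.dev", "my-auth-key", 1,
                     "https://api.example.com/v1"))
```

Stripping the padding matters: Python's `urlsafe_b64encode` keeps trailing `=` characters, while Node's `base64url` omits them.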
Custom domain (optional): Add ROUTER_DOMAIN=your.domain.com to freellmproxy/.env before deploying. This replaces the workers.dev subdomain with your own domain. The domain must be a Cloudflare-proxied zone.
| Command | Purpose |
|---|---|
| npm run proxy:up | Deploy everything to Cloudflare |
| npm run proxy:dev | Local dev server via wrangler |
| npm run proxy:status | Show deployment status |
| npm run proxy:test | Run proxy test suite |
Adjust PROXY_COUNT in freellmproxy/.env. See the proxy's README for the full architecture.
Any OpenAI-compatible client works. Examples:
Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3001/v1",
api_key="api-gateway-your-unified-key",
)
# with_raw_response exposes the HTTP response headers; .parse() returns the usual completion object
raw = client.chat.completions.with_raw_response.create(
    model="auto",  # let the router pick; or specify e.g. "gemini-2.5-flash"
    messages=[{"role": "user", "content": "Summarise the fall of Rome in one sentence."}],
)
resp = raw.parse()
print(resp.choices[0].message.content)
print("Routed via:", raw.headers.get("x-routed-via"))

curl
curl http://localhost:3001/v1/chat/completions \
-H "Authorization: Bearer api-gateway-your-unified-key" \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "hi"}]
}'

Streaming
stream = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Stream me a haiku about SQLite."}],
stream=True,
)
for chunk in stream:
    if chunk.choices:  # skip chunks that carry no choices (e.g. usage-only chunks)
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Tool calling
Pass OpenAI-style tools and tool_choice; the assistant response round-trips back through the proxy exactly like the OpenAI API. Multi-step flows (assistant tool_calls → tool role follow-up → final answer) work across every provider the router can reach.
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city.",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}]
# 1. Model asks for a tool call
first = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
tools=tools,
tool_choice="required",
)
call = first.choices[0].message.tool_calls[0]
# 2. You execute the tool, feed the result back
final = client.chat.completions.create(
model="auto",
messages=[
{"role": "user", "content": "What's the weather in Karachi?"},
first.choices[0].message,
{"role": "tool", "tool_call_id": call.id, "content": '{"temp_c": 32, "cond": "sunny"}'},
],
tools=tools,
)
print(final.choices[0].message.content)

Vision / image input
Send images with the standard OpenAI image_url content blocks (base64 data: URLs or http(s) URLs). When a request contains an image, the router restricts itself to vision-capable models and ignores text-only ones. Vision models are tagged with a Vision badge on the Fallback Chain page.
resp = client.chat.completions.create(
model="auto", # auto-routes to a vision model
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
],
}],
)
print(resp.choices[0].message.content)

If no vision-capable model is enabled in your Fallback Chain, an image request returns 422 (code: "no_vision_model") rather than silently dropping the image.
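The vision routing rule can be sketched as a filter over the fallback chain. The dict fields (`supports_vision`, `enabled`) and the exception class are illustrative assumptions; only the 422 status and the `no_vision_model` code come from the text above.

```python
class NoVisionModel(Exception):
    """Raised when an image request has no vision-capable model to go to."""
    status = 422
    code = "no_vision_model"


def has_image(messages: list[dict]) -> bool:
    """True if any message carries an OpenAI-style image_url content block."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url"
            for part in content
        ):
            return True
    return False


def candidate_models(messages: list[dict], chain: list[dict]) -> list[dict]:
    """Enabled models in chain order, restricted to vision models for image requests."""
    enabled = [m for m in chain if m.get("enabled", True)]
    if not has_image(messages):
        return enabled
    vision = [m for m in enabled if m.get("supports_vision")]
    if not vision:
        raise NoVisionModel("no vision-capable model is enabled")
    return vision
```

Text-only requests see the whole enabled chain; image requests see only the vision-capable subset, and fail loudly when that subset is empty.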
Every response carries an X-Routed-Via: <platform>/<model> header so you can see which provider actually served each call. If a request fell over between providers, you'll also see X-Fallback-Attempts: N.
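Client code that logs routing decisions needs to split that header value. A small sketch, assuming the platform slug never contains a slash while the model id may (the helper names are hypothetical, and the example model ids are only for illustration):

```python
def parse_routed_via(value: str) -> tuple[str, str]:
    """Split '<platform>/<model>' on the first slash only."""
    platform, sep, model = value.strip().partition("/")
    if not sep or not platform or not model:
        raise ValueError(f"unexpected X-Routed-Via value: {value!r}")
    return platform, model


def fallback_attempts(headers: dict) -> int:
    """Read X-Fallback-Attempts from lower-cased headers; 0 when absent."""
    return int(headers.get("x-fallback-attempts", 0))
```

Splitting on the first slash keeps model ids such as `meta-llama/llama-3.3-70b-instruct:free` intact.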
/v1/embeddings is OpenAI-compatible, with one deliberate difference from chat routing: failover never crosses models. Vectors from different models live in incompatible spaces — silently switching models would corrupt any vector store built on top of the proxy. So embeddings route by family (one model identity + dimension), and failover only walks the providers serving that same family.
resp = client.embeddings.create(
model="auto", # default family; or a family name like "bge-m3"
input=["the quick brown fox", "pack my box with five dozen liquor jugs"],
)
print(len(resp.data), "vectors of", len(resp.data[0].embedding), "dims")

curl http://localhost:3001/v1/embeddings \
-H "Authorization: Bearer api-gateway-your-unified-key" \
-H "Content-Type: application/json" \
-d '{"model": "auto", "input": "hello world"}'

model accepts auto (the configured default family), a family name, or a provider-specific model id (which resolves to its family). Available families:
| Family (model) | Dims | Providers (failover order) |
|---|---|---|
| gemini-embedding-001 (default) | 3072 | |
| text-embedding-3-large | 3072 | GitHub Models |
| text-embedding-3-small | 1536 | GitHub Models |
| embed-v4.0 | 1536 | Cohere |
| bge-m3 | 1024 | Cloudflare → Hugging Face |
| qwen3-embedding-0.6b | 1024 | Cloudflare |
| nv-embedqa-e5-v5 | 1024 | NVIDIA |
| llama-nemotron-embed-1b-v2 | 2048 | NVIDIA |
| llama-nemotron-embed-vl-1b-v2 | 2048 | NVIDIA → OpenRouter |
| embeddinggemma-300m | 768 | Cloudflare |
The default family, per-provider toggles, and priorities live on the dashboard's Models → Embeddings page.
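Family-scoped failover for embeddings can be sketched as a loop that only ever walks the providers of one family. The two families and their provider order are taken from the table; the provider slugs, the `call_provider` callback, and the dimension check are assumptions made for illustration.

```python
from typing import Callable

# Two rows from the families table; slugs are hypothetical identifiers.
FAMILIES = {
    "bge-m3": {"dims": 1024, "providers": ["cloudflare", "huggingface"]},
    "text-embedding-3-small": {"dims": 1536, "providers": ["github-models"]},
}

Embedder = Callable[[str, str, list[str]], list[list[float]]]


def embed_with_failover(family: str, texts: list[str],
                        call_provider: Embedder) -> tuple[str, list[list[float]]]:
    """Try each provider serving this family in order; never switch family."""
    spec = FAMILIES[family]
    last_error: Exception | None = None
    for provider in spec["providers"]:
        try:
            vectors = call_provider(provider, family, texts)
        except Exception as exc:  # provider failed: try the next one in the family
            last_error = exc
            continue
        if any(len(v) != spec["dims"] for v in vectors):
            last_error = ValueError(f"{provider} returned the wrong dimension")
            continue
        return provider, vectors
    raise RuntimeError(f"all providers for {family} failed") from last_error
```

Because the loop never leaves `spec["providers"]`, a vector store built on `bge-m3` can only ever receive 1024-dimensional `bge-m3` vectors, whichever provider answers.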
When AnimaRouter falls over to a different model mid-conversation (quota, rate limit, cooldown), the new model has no idea it is picking up someone else's task. Context handoff adds a single compact system message to the outbound request that tells the new model exactly that:
AnimaRouter context handoff:
You are taking over an ongoing conversation from another model (groq:llama-3 → google:gemini-flash).
Continue the user's task using the conversation context already provided in this request.
Do not restart the task, re-ask already answered setup questions, or discard prior tool results.
Respect the user's latest message as the highest-priority instruction.
Recent session summary:
User: …
Assistant: …
Enable it in .env:
ANIMAROUTER_CONTEXT_HANDOFF=on_model_switch

How it works:
- Messages per session are stored in memory (TTL: 3 hours).
- Only injected when the selected model changes for a given session key.
- Not injected on the first request, on same-model continuations, or if a handoff message is already present.
- Session key: X-Session-Id header if present, otherwise SHA-1 of the first user message (same as sticky sessions).
- Storage is in-memory only. Nothing is written to disk or logged.
Important: Context Handoff improves continuity for conversations routed through AnimaRouter. It cannot recover provider-internal hidden state or messages that were never sent to the proxy.
Don't see yours? Add it. Any OpenAI-compatible endpoint — cloud service, local server, homelab GPU — becomes a provider in under a minute. It gets the same fallback chain, the same intelligent routing, the same rate-limit protection as every built-in. See Custom Providers →
The built-in provider list is a starting point, not a boundary. From the Keys page, the Platforms grid is the unified catalog — every built-in platform you've added a key for, alongside every custom platform you've registered. The grid ends with an Add New Platform tile that opens a modal for:
- Slug: a short identifier like my-ollama (lowercase letters, digits, dashes; 2-32 chars; cannot collide with a built-in).
- Display name: shown in the dashboard.
- Base URL: the OpenAI-compatible endpoint, e.g. http://192.168.1.10:11434/v1.
- Rate limits (optional): RPM, RPD, TPM, TPD caps enforced per-provider.
- Max parallel requests (optional): concurrency ceiling so this provider never hogs all connection slots.
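The slug rule is simple enough to check before submitting the form. A minimal sketch of the stated constraints; the built-in set below is an illustrative subset with guessed slugs, not the real list, and the function name is hypothetical.

```python
import re
from typing import Optional

SLUG_RE = re.compile(r"[a-z0-9-]{2,32}")
# Hypothetical subset of built-in slugs, for illustration only.
BUILT_IN_SLUGS = {"groq", "cerebras", "mistral", "openrouter"}


def validate_slug(slug: str) -> Optional[str]:
    """Return an error message, or None when the slug is acceptable."""
    if not SLUG_RE.fullmatch(slug):
        return "use 2-32 lowercase letters, digits, or dashes"
    if slug in BUILT_IN_SLUGS:
        return "slug collides with a built-in platform"
    return None
```

For example, `validate_slug("my-ollama")` returns `None`, while `validate_slug("My_Ollama")` and `validate_slug("groq")` each return an error message.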
Once a platform exists, its models are discovered automatically from the endpoint's /v1/models during creation. Discovered models start disabled and must be enabled manually, unless you choose the option to auto-enable them during creation. You can re-run discovery at any time via POST /api/custom-providers/:slug/sync-models. A model added by hand takes these fields:
- Model ID and Display name — required.
- Context window, Supports tools, Supports vision — basic flags.
- Advanced toggle exposes intelligence rank, speed rank, size label, per-model rate limits, and max output tokens.
The model joins the fallback chain at the lowest priority and shows up everywhere built-in models do — /v1/models, the Fallback page, the Analytics page. You can edit any model (built-in or custom) later: adjust ranks, toggle tools/vision, cap output tokens, change rate limits — all from the dashboard.
Adding an API key for a custom platform works the same as for a built-in: pick the custom slug in the Add a provider key form, paste the bearer (or leave blank for local servers that don't need one), and the key routes to your endpoint.
Removing a custom platform archives it — models, keys, and fallback entries are preserved and restorable. Hard-delete is available but off by default.
Manage provider credentials and grab the unified API key your apps connect with. Each key shows a status dot and when it was last health-checked. Custom platforms appear alongside built-in ones in the unified grid.
Send a chat completion through the router and see which provider served it, with the model ID and latency printed right on the message.
Request volume, success rate, tokens in and out, average latency, and per-provider breakdowns over 24h / 7d / 30d windows.
Stacking free tiers — even with custom providers in the mix — has real trade-offs:
- No frontier models out of the box. The free-tier catalog tops out around Llama 3.3 70B, GLM-4.5, Qwen 3 Coder, and Gemini 2.5 Pro. For hard problems, pay for a real API — or bring your own paid provider as a custom platform.
- Intelligence degrades as the day progresses. Your top-ranked models have the lowest daily caps. Once they hit their limits, the router falls down your priority chain to smaller/weaker models. Expect the effective intelligence to drop in the late hours — then reset at UTC midnight.
- Latency is highly variable. Cerebras and Groq are extremely fast; others are not. You get whichever one is available at the moment.
- Free tiers can change without notice. Providers regularly tighten, loosen, or remove free tiers. When that happens you'll see 429s or auth errors until the catalog catches up.
- No SLA, by definition. If you need reliability, use a paid provider with a contract — either directly or plugged into AnimaRouter as a custom platform.
- Local-first. No multi-tenant auth. Run this for yourself; don't expose it to the internet.
Contributors welcome! Good first PRs:
- Add a provider: copy server/src/providers/openai-compat.ts as a template, wire it into server/src/providers/index.ts, seed its models in server/src/db/migrations.ts, add a test in server/src/__tests__/providers/.
- Add an endpoint: images, moderations, audio. The provider base class can grow new methods; adapters declare which they support.
- Improve the router: cost-aware routing, better latency-weighted priority, regional pinning.
- Dashboard polish: charts on the Analytics page, key rotation UX, batch import of keys from .env.
- Docs: more examples, client library snippets, deployment recipes.
Development loop:
npm install
npm run dev # server on :3001, dashboard on :5173, both with HMR
npm test # server vitest + client typecheck
npm run build # compile server and dashboard

Or use the api CLI: api start, api stop, api status.
PRs should include a test, keep the existing test suite green, and match the .editorconfig / tsconfig defaults. Issues and discussions are open.
A self-hosted, single-user, personal-use setup was re-reviewed against each provider's ToS (May 2026). Summary:
| Provider | Verdict | Notes |
|---|---|---|
| Google Gemini | | March 2026 ToS narrows scope to "professional or business purposes, not for consumer use"; a self-hosted developer proxy is still defensible, but the clause is new. |
| Groq | ✅ Likely OK | GroqCloud Services Agreement permits Customer Application integration. |
| Cerebras | ✅ Likely OK | Permitted; explicitly forbids selling/transferring API keys. |
| Mistral | ✅ Likely OK | APIs allowed for personal/internal business use. |
| OpenRouter | ✅ Likely OK | April 2026 ToS sharpens the no-resale / no-competing-service clause; private single-user proxy still fine. |
| Cloudflare Workers AI | | No anti-proxy clause; covered by general Self-Serve Subscription Agreement. |
| NVIDIA NIM | | Trial ToS §1.2 / §1.4: "evaluation only, not production." Free access is a recurring 40 RPM rate limit (the 2025 credit system was discontinued), but the evaluation-only scope stands. |
| GitHub Models | | Free tier explicitly scoped to "experimentation" and "prototyping." |
| Cohere | ❌ Avoid | Terms §14 still forbids "personal, family or household purposes." |
| Zhipu (open.bigmodel.cn) | ✅ Likely OK | Personal/non-commercial research carve-out still in the platform docs. |
| Z.ai (api.z.ai) | | Singapore entity (distinct from Zhipu CN). §III.3(l) anti-traffic-redirect clause could plausibly be read against a proxy; no explicit personal-use carve-out. |
| Ollama Cloud | ✅ Likely OK | Free plan permits cloud-model access (1 concurrent, 5-hour session caps). No anti-proxy / anti-resale clauses found. |
| OVH AI Endpoints | ✅ Likely OK | Anonymous access is officially documented (2 req/min per IP per model). OVH reserves the right to introduce token/consumption caps. |
Rules of thumb: one account per provider, no reselling, no sharing your endpoint with other humans, don't hammer a free tier as a paid production backend. This is informational, not legal advice — read each provider's ToS and make your own call.
This project is for personal experimentation and learning, not production. Free tiers exist so developers can prototype against them; they aren't a stable, supported inference substrate and shouldn't be treated as one. If you build something real on top of AnimaRouter, swap in a paid API before you ship. Your relationship with each upstream provider is governed by the terms you accepted when you created your account — those terms still apply when the traffic is proxied through this project, and you're responsible for complying with them.
Built on MLuqmanBR/api-gateway (itself a fork of tashfeenahmed/freellmapi). Maintained by animaios.
MIT — © upstream contributors + animaios contributors.






































