A production-style intelligent prompt router that dynamically decides whether a request should be handled:
- locally using Ollama + Qwen
- or escalated to a cloud LLM (Gemini)
It includes:
- complexity-based routing
- semantic caching
- live dashboard
- OpenAI-compatible API
- cost optimization
- latency tracking
- routing observability
Most AI applications either:
- send everything to expensive cloud models
- or force everything through weaker local models
This project solves that.
The router first analyzes the complexity of a prompt, then decides:
| Prompt Type | Route |
|---|---|
| Simple / factual / coding help | Local Qwen |
| Complex reasoning / architecture / deep analysis | Gemini |
| Repeated prompts | Semantic cache |
This dramatically reduces:
- cloud API cost
- latency
- unnecessary escalations
while still preserving high-quality answers for difficult prompts.
```
        ┌─────────────────┐
        │ Incoming Prompt │
        └────────┬────────┘
                 │
                 ▼
      ┌─────────────────────┐
      │   Semantic Cache    │
      │  (SQLite / vector)  │
      └──────────┬──────────┘
                 │ hit ──► Cached Response
                 │ miss
                 ▼
    ┌─────────────────────────┐
    │   Complexity Analyzer   │
    │ (Local Qwen via Ollama) │
    └────────────┬────────────┘
                 │
        ┌────────┴─────────┐
        │                  │
        ▼                  ▼
┌──────────────────┐  ┌─────────────────────┐
│   Local Model    │  │  Cloud Escalation   │
│ Qwen via Ollama  │  │  Gemini Flash Lite  │
└────────┬─────────┘  └──────────┬──────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
              Final Response
```
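The flow above condenses to a few lines of routing logic. This sketch uses a plain exact-match dict in place of the semantic cache, and hypothetical callables (`score_complexity`, `ask_local`, `ask_cloud`) standing in for the project's real modules:

```python
COMPLEXITY_THRESHOLD = 65  # same default as the .env example below

def route(prompt, cache, score_complexity, ask_local, ask_cloud):
    """Return (answer, route_taken); model backends are passed in as callables."""
    if prompt in cache:                      # cache hit: skip all model calls
        return cache[prompt], "cache"
    score = score_complexity(prompt)         # 0-100, produced by local Qwen
    if score < COMPLEXITY_THRESHOLD:
        answer, taken = ask_local(prompt), "local"   # Ollama + Qwen
    else:
        answer, taken = ask_cloud(prompt), "cloud"   # Gemini escalation
    cache[prompt] = answer                   # populate cache for next time
    return answer, taken
```

The real router replaces the dict with similarity lookup, so near-duplicate prompts also hit the cache.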
Complexity classifier determines whether a prompt should remain local or go to the cloud.
Simple prompts are handled completely offline using:
- Ollama
- Qwen2.5-Coder
Complex prompts automatically route to Gemini for stronger reasoning.
Repeated prompts are served instantly from cache.
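A semantic cache matches prompts by similarity rather than exact text. This toy sketch uses bag-of-words cosine similarity in place of real model embeddings (class and method names are illustrative, not the project's `cache.py` API):

```python
import math
from collections import Counter

def _vec(text):
    # Toy embedding: bag-of-words token counts. The real project could
    # use model embeddings; this only illustrates the lookup mechanics.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (vector, prompt, response)

    def get(self, prompt):
        v = _vec(prompt)
        for ev, _, resp in self.entries:
            if _cosine(v, ev) >= self.threshold:
                return resp        # near-duplicate prompt: cache hit
        return None                # miss: caller falls through to a model

    def put(self, prompt, response):
        self.entries.append((_vec(prompt), prompt, response))
```

The threshold trades freshness for hit rate: lower values serve more prompts from cache at the risk of returning an answer to a subtly different question.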
Works with:
- Continue.dev
- OpenWebUI
- VSCode extensions
- custom agents
- OpenAI SDKs
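Because the endpoint follows the OpenAI chat-completions wire format, any OpenAI-style client can target the router by pointing at its base URL. A dependency-free sketch using only the standard library (the response path `choices[0].message.content` is the standard OpenAI shape; the URL assumes the default local port):

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt):
    # OpenAI-style chat payload; the router itself decides which
    # backend model actually answers.
    return {"messages": [{"role": "user", "content": prompt}]}

def ask(prompt):
    req = urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`ask` requires the router to be running locally; `build_payload` shows the minimal request body.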
Real-time observability dashboard showing:
- local vs cloud routing
- cache hits
- complexity scores
- latency
- request logs
Designed to minimize paid token usage.
Open:
http://localhost:8000/dashboard

You'll see:
- live request stream
- complexity scoring
- local/cloud/cache routing
- latency metrics
- routing percentages
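A dashboard like this aggregates per-request records. This sketch shows one plausible record shape and how routing percentages could be derived from it (field names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    route: str          # "local" | "cloud" | "cache"
    complexity: int     # 0-100 score from the analyzer
    latency_ms: float

def routing_percentages(logs):
    """Share of requests per route, as a dashboard chart would show them."""
    if not logs:
        return {}
    counts = {}
    for log in logs:
        counts[log.route] = counts.get(log.route, 0) + 1
    return {route: 100 * n / len(logs) for route, n in counts.items()}
```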
| Component | Tech |
|---|---|
| API | FastAPI |
| Local LLM | Ollama |
| Local Model | Qwen2.5-Coder |
| Cloud Model | Gemini Flash Lite |
| Cache | SQLite |
| HTTP Client | httpx |
| Dashboard | Vanilla HTML/CSS/JS |
```
.
├── main.py        # FastAPI server
├── router.py      # Complexity analysis + routing logic
├── dashboard.py   # Live monitoring dashboard
├── cache.py       # Semantic caching layer
├── cache.db       # SQLite cache database
├── .env           # Environment variables
└── README.md
```
Clone the repo:

```bash
git clone https://github.com/YOUR_USERNAME/llm-cascade-router.git
cd llm-cascade-router
```

Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate    # macOS/Linux
.venv\Scripts\activate       # Windows
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Pull the local model:

```bash
ollama pull qwen2.5-coder:7b
```

Create `.env`:

```
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b
GEMINI_API_KEY=your_key_here
COMPLEXITY_THRESHOLD=65
```

Start Ollama:

```bash
ollama serve
```

Then run FastAPI:

```bash
uvicorn main:app --reload
```

Endpoint: `POST /v1/chat/completions`

Example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Design a scalable notification system"
      }
    ]
  }'
```

| Prompt | Route |
|---|---|
| "Reverse a linked list" | Local |
| "Fix this Python syntax error" | Local |
| "Design a distributed event streaming platform" | Cloud |
| "Compare CQRS vs Event Sourcing tradeoffs" | Cloud |
The classifier considers:
- deep reasoning
- ambiguity
- generative requirements
- domain breadth
- architectural complexity
These signals are combined into a final `complexity_score`.
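The actual scoring is done by the local model, but as a rough illustration of how such signals could fold into a single 0-100 number, here is a toy keyword heuristic (signal names and keywords are invented for the example):

```python
# Illustrative only: the router actually asks local Qwen to score prompts.
# Each signal that fires contributes equally to the final score.
SIGNALS = {
    "deep reasoning": ("why", "prove", "tradeoff", "compare"),
    "architecture": ("design", "scalable", "distributed", "architecture"),
    "domain breadth": ("system", "platform", "end-to-end"),
}

def complexity_score(prompt):
    p = prompt.lower()
    hits = sum(any(k in p for k in kws) for kws in SIGNALS.values())
    return min(100, round(100 * hits / len(SIGNALS)))  # clamp to 0-100
```

An LLM-based classifier replaces the keyword lists with actual judgment, but the output contract is the same: one comparable score per prompt.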
- vector embeddings cache
- Redis cache backend
- streaming responses
- async queueing
- Prometheus metrics
- Docker support
- Kubernetes deployment
- adaptive thresholds
- multi-model routing
- token usage tracking
- reinforcement learning for routing
Reduce API costs for coding assistants.
Keep sensitive prompts local.
Route tasks intelligently.
Run hybrid local/cloud inference.
This project is experimental and intended for learning/research purposes.
Not production hardened yet.
Star the repo and feel free to fork/build on top of it.
Built by Rohith.
Focused on:
- AI infrastructure
- intelligent orchestration
- developer tooling
- cost-efficient LLM systems