A production-style intelligent prompt router that dynamically decides whether a request should be handled:
- locally using Ollama + Qwen
- or escalated to a cloud LLM (Gemini)
It includes:
- complexity-based routing
- semantic caching
- live dashboard
- OpenAI-compatible API
- cost optimization
- latency tracking
- routing observability
Most AI applications either:
- send everything to expensive cloud models
- or force everything through weaker local models
This project solves that.
The router first analyzes the complexity of a prompt, then decides:
| Prompt Type | Route |
|---|---|
| Simple / factual / coding help | Local Qwen |
| Complex reasoning / architecture / deep analysis | Gemini |
| Repeated prompts | Semantic cache |
This dramatically reduces:
- cloud API cost
- latency
- unnecessary escalations
while still preserving high-quality answers for difficult prompts.
```
        ┌─────────────────┐
        │ Incoming Prompt │
        └────────┬────────┘
                 │
                 ▼
      ┌─────────────────────┐
      │   Semantic Cache    │
      │  (SQLite / vector)  │
      └──────────┬──────────┘
                 │ hit ──► Cached Response
                 │ miss
                 ▼
    ┌─────────────────────────┐
    │   Complexity Analyzer   │
    │ (Local Qwen via Ollama) │
    └────────────┬────────────┘
                 │
        ┌────────┴─────────┐
        │                  │
        ▼                  ▼
┌──────────────────┐  ┌─────────────────────┐
│   Local Model    │  │  Cloud Escalation   │
│ Qwen via Ollama  │  │  Gemini Flash Lite  │
└────────┬─────────┘  └──────────┬──────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
              Final Response
```
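The flow above condenses to a few lines of routing logic. This sketch uses a plain exact-match dict in place of the semantic cache, and hypothetical callables (`score_complexity`, `ask_local`, `ask_cloud`) standing in for the project's real modules:

```python
COMPLEXITY_THRESHOLD = 65  # same default as the .env example below

def route(prompt, cache, score_complexity, ask_local, ask_cloud):
    """Return (answer, route_taken); model backends are passed in as callables."""
    if prompt in cache:                      # cache hit: skip all model calls
        return cache[prompt], "cache"
    score = score_complexity(prompt)         # 0-100, produced by local Qwen
    if score < COMPLEXITY_THRESHOLD:
        answer, taken = ask_local(prompt), "local"   # Ollama + Qwen
    else:
        answer, taken = ask_cloud(prompt), "cloud"   # Gemini escalation
    cache[prompt] = answer                   # populate cache for next time
    return answer, taken
```

The real router replaces the dict with similarity lookup, so near-duplicate prompts also hit the cache.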
Complexity classifier determines whether a prompt should remain local or go to the cloud.
Simple prompts are handled completely offline using:
- Ollama
- Qwen2.5-Coder
Complex prompts automatically route to Gemini for stronger reasoning.
Repeated prompts are served instantly from cache.
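A semantic cache matches prompts by similarity rather than exact text. This toy sketch uses bag-of-words cosine similarity in place of real model embeddings (class and method names are illustrative, not the project's `cache.py` API):

```python
import math
from collections import Counter

def _vec(text):
    # Toy embedding: bag-of-words token counts. The real project could
    # use model embeddings; this only illustrates the lookup mechanics.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (vector, prompt, response)

    def get(self, prompt):
        v = _vec(prompt)
        for ev, _, resp in self.entries:
            if _cosine(v, ev) >= self.threshold:
                return resp        # near-duplicate prompt: cache hit
        return None                # miss: caller falls through to a model

    def put(self, prompt, response):
        self.entries.append((_vec(prompt), prompt, response))
```

The threshold trades freshness for hit rate: lower values serve more prompts from cache at the risk of returning an answer to a subtly different question.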
Works with:
- Continue.dev
- OpenWebUI
- VSCode extensions
- custom agents
- OpenAI SDKs
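Because the endpoint follows the OpenAI chat-completions wire format, any OpenAI-style client can target the router by pointing at its base URL. A dependency-free sketch using only the standard library (the response path `choices[0].message.content` is the standard OpenAI shape; the URL assumes the default local port):

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt):
    # OpenAI-style chat payload; the router itself decides which
    # backend model actually answers.
    return {"messages": [{"role": "user", "content": prompt}]}

def ask(prompt):
    req = urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`ask` requires the router to be running locally; `build_payload` shows the minimal request body.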
Real-time observability dashboard showing:
- local vs cloud routing
- cache hits
- complexity scores
- latency
- request logs
Designed to minimize paid token usage.
Open:
http://localhost:8000/dashboard

You'll see:
- live request stream
- complexity scoring
- local/cloud/cache routing
- latency metrics
- routing percentages
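A dashboard like this aggregates per-request records. This sketch shows one plausible record shape and how routing percentages could be derived from it (field names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    route: str          # "local" | "cloud" | "cache"
    complexity: int     # 0-100 score from the analyzer
    latency_ms: float

def routing_percentages(logs):
    """Share of requests per route, as a dashboard chart would show them."""
    if not logs:
        return {}
    counts = {}
    for log in logs:
        counts[log.route] = counts.get(log.route, 0) + 1
    return {route: 100 * n / len(logs) for route, n in counts.items()}
```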
| Component | Tech |
|---|---|
| API | FastAPI |
| Local LLM | Ollama |
| Local Model | Qwen2.5-Coder |
| Cloud Model | Gemini Flash Lite |
| Cache | SQLite |
| HTTP Client | httpx |
| Dashboard | Vanilla HTML/CSS/JS |
```
.
├── main.py        # FastAPI server
├── router.py      # Complexity analysis + routing logic
├── dashboard.py   # Live monitoring dashboard
├── cache.py       # Semantic caching layer
├── cache.db       # SQLite cache database
├── .env           # Environment variables
└── README.md
```
Clone the repo:

```bash
git clone https://github.com/YOUR_USERNAME/llm-cascade-router.git
cd llm-cascade-router
```

Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate    # macOS/Linux
.venv\Scripts\activate       # Windows
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Pull the local model:

```bash
ollama pull qwen2.5-coder:7b
```

Create `.env`:

```
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b
GEMINI_API_KEY=your_key_here
COMPLEXITY_THRESHOLD=65
```

Start Ollama:

```bash
ollama serve
```

Then run FastAPI:

```bash
uvicorn main:app --reload
```

Endpoint: `POST /v1/chat/completions`

Example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Design a scalable notification system"
      }
    ]
  }'
```

| Prompt | Route |
|---|---|
| "Reverse a linked list" | Local |
| "Fix this Python syntax error" | Local |
| "Design a distributed event streaming platform" | Cloud |
| "Compare CQRS vs Event Sourcing tradeoffs" | Cloud |
The classifier considers:
- deep reasoning
- ambiguity
- generative requirements
- domain breadth
- architectural complexity
These signals are combined into a final `complexity_score`.
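The actual scoring is done by the local model, but as a rough illustration of how such signals could fold into a single 0-100 number, here is a toy keyword heuristic (signal names and keywords are invented for the example):

```python
# Illustrative only: the router actually asks local Qwen to score prompts.
# Each signal that fires contributes equally to the final score.
SIGNALS = {
    "deep reasoning": ("why", "prove", "tradeoff", "compare"),
    "architecture": ("design", "scalable", "distributed", "architecture"),
    "domain breadth": ("system", "platform", "end-to-end"),
}

def complexity_score(prompt):
    p = prompt.lower()
    hits = sum(any(k in p for k in kws) for kws in SIGNALS.values())
    return min(100, round(100 * hits / len(SIGNALS)))  # clamp to 0-100
```

An LLM-based classifier replaces the keyword lists with actual judgment, but the output contract is the same: one comparable score per prompt.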
- vector embeddings cache
- Redis cache backend
- streaming responses
- async queueing
- Prometheus metrics
- Docker support
- Kubernetes deployment
- adaptive thresholds
- multi-model routing
- token usage tracking
- reinforcement learning for routing
Reduce API costs for coding assistants.
Keep sensitive prompts local.
Route tasks intelligently.
Run hybrid local/cloud inference.
This project is experimental and intended for learning/research purposes.
Not production hardened yet.
Star the repo and feel free to fork/build on top of it.
Built by Rohith.
Focused on:
- AI infrastructure
- intelligent orchestration
- developer tooling
- cost-efficient LLM systems