
⚡ LLM Cascade Router

A production-style intelligent prompt router that dynamically decides whether a request should be:

  • handled locally using Ollama + Qwen
  • or escalated to a cloud LLM (Gemini)

It includes:

  • complexity-based routing
  • semantic caching
  • live dashboard
  • OpenAI-compatible API
  • cost optimization
  • latency tracking
  • routing observability

🚀 Why This Exists

Most AI applications either:

  • send everything to expensive cloud models
  • or force everything through weaker local models

This project takes the middle path: each prompt is routed to the cheapest model that can answer it well.

The router first analyzes the complexity of a prompt, then decides:

Prompt Type                                        Route
Simple / factual / coding help                     Local Qwen
Complex reasoning / architecture / deep analysis   Gemini
Repeated prompts                                   Semantic cache

This dramatically reduces:

  • cloud API cost
  • latency
  • unnecessary escalations

while still preserving high-quality answers for difficult prompts.


🧠 Architecture

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ Incoming Prompt β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚ Semantic Cache     β”‚
                 β”‚ (SQLite / vector)  β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ hit
                          β–Ό
                    Cached Response

                          β”‚ miss
                          β–Ό

              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚ Complexity Analyzer    β”‚
              β”‚ (Local Qwen via Ollama)β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                                β”‚
         β–Ό                                β–Ό

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Local Model      β”‚          β”‚ Cloud Escalation   β”‚
β”‚ Qwen via Ollama  β”‚          β”‚ Gemini Flash Lite  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                               β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β–Ό
                 Final Response
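
The cascade above boils down to a short decision function. Here is a minimal sketch in Python, with stub helpers standing in for the real cache.py and model calls; the actual router.py logic may differ:

COMPLEXITY_THRESHOLD = 65  # default from the .env example below

def cache_lookup(prompt: str) -> str | None:
    return None  # stub: the real version queries the SQLite semantic cache

def score_complexity(prompt: str) -> int:
    return 80 if "design" in prompt.lower() else 20  # stub; really graded by local Qwen

def ask_local(prompt: str) -> str:
    return "<answer from Qwen via Ollama>"  # stub

def ask_cloud(prompt: str) -> str:
    return "<answer from Gemini>"  # stub

def route(prompt: str) -> str:
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached                       # cache hit: answered instantly
    if score_complexity(prompt) < COMPLEXITY_THRESHOLD:
        return ask_local(prompt)            # simple prompt stays local
    return ask_cloud(prompt)                # complex prompt escalates

print(route("Design a distributed event streaming platform"))  # takes the Gemini path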

✨ Features

Intelligent Prompt Routing

A complexity classifier determines whether a prompt should remain local or go to the cloud.

Local-First Inference

Simple prompts are handled completely offline using:

  • Ollama
  • Qwen2.5-Coder

Cloud Escalation

Complex prompts automatically route to Gemini for stronger reasoning.

Semantic Cache

Repeated prompts are served instantly from cache.
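
A minimal sketch of the cache idea: an exact-match SQLite lookup keyed on a normalized prompt hash. The table name and schema here are invented, and the real cache.py may instead compare vector embeddings for true semantic matching:

import hashlib
import sqlite3

# Hypothetical schema; not taken from cache.py.
conn = sqlite3.connect("cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

def _key(prompt: str) -> str:
    return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

def cache_lookup(prompt: str) -> str | None:
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (_key(prompt),)).fetchone()
    return row[0] if row else None

def cache_store(prompt: str, response: str) -> None:
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (_key(prompt), response))
    conn.commit()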

OpenAI-Compatible API

Works with:

  • Continue.dev
  • OpenWebUI
  • VSCode extensions
  • custom agents
  • OpenAI SDKs
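
Because the endpoint speaks the OpenAI chat-completions protocol, the official openai Python SDK works once its base URL points at the router. The model name below is a placeholder, since routing is decided server-side:

from openai import OpenAI

# Point the standard OpenAI client at the local router instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="cascade-router",  # placeholder name; the router picks the real backend
    messages=[{"role": "user", "content": "Reverse a linked list in Python"}],
)
print(resp.choices[0].message.content)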

Live Dashboard

Real-time observability dashboard showing:

  • local vs cloud routing
  • cache hits
  • complexity scores
  • latency
  • request logs

Cost Optimization

Designed to minimize paid token usage.


🖥 Dashboard

Open:

http://localhost:8000/dashboard

You'll see:

  • live request stream
  • complexity scoring
  • local/cloud/cache routing
  • latency metrics
  • routing percentages

📦 Tech Stack

Component     Tech
API           FastAPI
Local LLM     Ollama
Local Model   Qwen2.5-Coder
Cloud Model   Gemini Flash Lite
Cache         SQLite
HTTP Client   httpx
Dashboard     Vanilla HTML/CSS/JS

📂 Project Structure

.
β”œβ”€β”€ main.py           # FastAPI server
β”œβ”€β”€ router.py         # Complexity analysis + routing logic
β”œβ”€β”€ dashboard.py      # Live monitoring dashboard
β”œβ”€β”€ cache.py          # Semantic caching layer
β”œβ”€β”€ cache.db          # SQLite cache database
β”œβ”€β”€ .env              # Environment variables
└── README.md

⚙️ Installation

1. Clone the repo

git clone https://github.com/rohith-nandan-6/LLM-Cascade-Router.git
cd LLM-Cascade-Router

2. Create virtual environment

python -m venv .venv

Activate:

macOS/Linux

source .venv/bin/activate

Windows

.venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Install Ollama

Download it from the official Ollama website: https://ollama.com


5. Pull Qwen model

ollama pull qwen2.5-coder:7b

6. Configure environment variables

Create .env

OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b

GEMINI_API_KEY=your_key_here

COMPLEXITY_THRESHOLD=65
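
A minimal sketch of loading these values in Python with python-dotenv (assumed here; plain os.environ works just as well):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5-coder:7b")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # required for cloud escalation
COMPLEXITY_THRESHOLD = int(os.getenv("COMPLEXITY_THRESHOLD", "65"))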

▶️ Running The Project

Start Ollama:

ollama serve

Then run FastAPI:

uvicorn main:app --reload

🔌 API Usage

Endpoint:

POST /v1/chat/completions

Example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Design a scalable notification system"
      }
    ]
  }'
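
The same request in Python via httpx, the HTTP client already in the tech stack:

import httpx

resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Design a scalable notification system"}]},
    timeout=120.0,  # cloud escalation plus generation can take a while
)
print(resp.json())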

📊 Example Routing Decisions

Prompt                                            Route
"Reverse a linked list"                           Local
"Fix this Python syntax error"                    Local
"Design a distributed event streaming platform"   Cloud
"Compare CQRS vs Event Sourcing tradeoffs"        Cloud

🧠 Complexity Signals

The classifier considers:

  • deep reasoning
  • ambiguity
  • generative requirements
  • domain breadth
  • architectural complexity

These are combined into a final complexity_score.
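
How the signals are combined is not specified; one plausible sketch is a weighted sum producing a 0-100 complexity_score, with both the weights and the signal names below invented for illustration:

# Illustrative only: weights and signal names are not taken from router.py.
WEIGHTS = {
    "deep_reasoning": 0.30,
    "ambiguity": 0.15,
    "generative_requirements": 0.15,
    "domain_breadth": 0.15,
    "architectural_complexity": 0.25,
}

def complexity_score(signals: dict[str, float]) -> float:
    """Combine per-signal scores (each 0-100) into a single 0-100 score."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

# An architecture-heavy prompt crosses the 65 threshold and escalates:
print(complexity_score({
    "deep_reasoning": 80,
    "ambiguity": 60,
    "generative_requirements": 70,
    "domain_breadth": 50,
    "architectural_complexity": 90,
}))  # 73.5 -> routed to Gemini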


📈 Future Improvements

  • vector embeddings cache
  • Redis cache backend
  • streaming responses
  • async queueing
  • Prometheus metrics
  • Docker support
  • Kubernetes deployment
  • adaptive thresholds
  • multi-model routing
  • token usage tracking
  • reinforcement learning for routing

🔥 Example Use Cases

AI IDE Backend

Reduce API costs for coding assistants.

Enterprise Gateways

Keep sensitive prompts local.

Multi-LLM Agents

Route tasks intelligently.

Edge AI Systems

Run hybrid local/cloud inference.


🛑 Disclaimer

This project is experimental and intended for learning/research purposes.

It is not yet production-hardened.


⭐ If You Like This Project

Star the repo and feel free to fork/build on top of it.


πŸ‘¨β€πŸ’» Author

Built by Rohith.

Focused on:

  • AI infrastructure
  • intelligent orchestration
  • developer tooling
  • cost-efficient LLM systems
