A memory-aware, conversational RAG engine that finds the single best podcast clip or episode for a natural-language question.
EchoFind is the conversational front door to a large podcast catalog. Ask it something the way you'd ask a friend ("what did that founder say about pricing last month?", "find the latest episode with the neuroscientist", "who was that guy again?") and it returns one precise audio/video clip (with exact start/end timestamps) or a whole episode, plus a few one-tap follow-ups.
It is not keyword search. EchoFind routes intent, expands the query with hypothetical documents, runs hybrid dense + sparse retrieval over time-aware buckets, reranks, re-scores with metadata, and asks an LLM to pick the winning clip, all while keeping a structured memory of the conversation so it can resolve pronouns, follow-ups, and topic shifts across turns.
Built with FastAPI + Server-Sent Events, Google Gemini, OpenAI embeddings, Pinecone hybrid search, Cohere reranking, and PostgreSQL. ~17k lines of Python.
- Why it's interesting
- How it works
- Standout engineering
- Performance
- Evaluation
- Tech stack
- Project structure
- Getting started
- API reference
- Testing
- Design notes & limitations
Most "chat with your data" demos are a single embedding lookup glued to an LLM. EchoFind is a full agentic retrieval pipeline that tackles the problems that actually show up in production conversational search:
| Problem | How EchoFind handles it |
|---|---|
| "Find a clip" vs "explain this" vs "find an episode" are different tasks | A 3-branch LLM router dispatches each query to a dedicated pipeline |
| Users speak in pronouns and follow-ups ("what else did he say?") | A deterministic conversation memory state machine resolves entities, topics, and "the other one" |
| The system keeps surfacing the same result | A 5-turn exclusion window hard-drops already-shown clips |
| "Latest" / "before the election" / "last month" need time reasoning | A multi-bucket time-search planner builds and fuses temporal buckets |
| One query rarely matches the best transcript wording | HyDE expansion + multi-level Reciprocal Rank Fusion over dense & sparse vectors |
| Semantic similarity alone misranks by recency/person/show | Hybrid metadata-aware re-scoring with intent-driven weight profiles |
| LLMs return malformed JSON | A robustness ladder: structured output → JSON mode → a hand-written JSON repairer |
| Latency matters | Selection and the memory update happen in a single LLM call; recommendations are pre-computed |
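The exclusion-window row above can be illustrated with a minimal sketch. It assumes the window is counted in turns and keyed by clip ID; the `ExclusionWindow` class and its method names are hypothetical and are not taken from engine/memory.py.

```python
from collections import deque


class ExclusionWindow:
    """Remember clip IDs shown in the last `max_turns` turns (hypothetical helper)."""

    def __init__(self, max_turns: int = 5) -> None:
        # One set of clip IDs per turn; the deque evicts the oldest turn automatically.
        self._turns = deque(maxlen=max_turns)

    def record_turn(self, shown_clip_ids: list) -> None:
        self._turns.append(set(shown_clip_ids))

    def is_excluded(self, clip_id: str) -> bool:
        return any(clip_id in turn for turn in self._turns)

    def drop_shown(self, candidates: list) -> list:
        """Hard-drop candidates shown inside the window, preserving order."""
        return [c for c in candidates if not self.is_excluded(c)]


window = ExclusionWindow(max_turns=5)
window.record_turn(["clip-a"])
for turn in range(4):
    window.record_turn([f"clip-{turn}"])

# "clip-a" and "clip-3" were both shown within the last five turns.
filtered = window.drop_shown(["clip-a", "clip-z", "clip-3"])

# A sixth turn pushes the turn that showed "clip-a" out of the window.
window.record_turn(["clip-x"])
after_expiry = window.drop_shown(["clip-a", "clip-z"])
```

A bounded deque keeps the bookkeeping constant-size, which fits the goal of keeping memory operations far cheaper than a model call.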
                 ┌──────────────────────────┐
                 │ Web UI (web/index.html)  │  single-file, no build step
                 └────────────┬─────────────┘
                              │  POST /api/chat/stream (Server-Sent Events)
                 ┌────────────▼─────────────┐
                 │ FastAPI (server.py)      │  streams staged progress + result
                 │ API routes (api/)        │
                 └────────────┬─────────────┘
                              │
                 ┌────────────▼─────────────┐
                 │ Agent orchestrator       │  engine/agent.py
                 │ + conversation memory    │  engine/memory.py
                 └────────────┬─────────────┘
                              │  Stage 0: LLM ROUTER (engine/router.py)
           ┌──────────────────┼──────────────────┐
           ▼                  ▼                  ▼
    ┌──────────────┐ ┌────────────────┐ ┌──────────────────┐
    │ small_talk   │ │ episode_search │ │ clip_search      │  ← default RAG
    │ (+grounding) │ │                │ │ full pipeline    │
    └──────────────┘ └────────────────┘ └──────────────────┘
A single request, such as "what did the neuroscientist say recently about sleep?", flows through these stages, each streamed to the UI as an SSE progress event:
- Route: Gemini classifies the query into small_talk / episode_search / clip_search with a confidence score (low confidence → safe fallback).
- Analyze: resolve pronouns against memory, detect follow-ups, extract guests/hosts/show/time filters via a gazetteer (fuzzy entity index), and generate HyDE hypothetical transcript snippets, all in parallel.
- Embed: one batched OpenAI text-embedding-3-large call (dense) plus BM25 sparse vectors, in parallel. HyDE vectors are weighted by their cosine-similarity rank to the original query.
- Retrieve: a multi-bucket time-search plan (latest / relative / between / before / after, with pre/post-event splitting) runs each bucket in parallel against Pinecone's dense + sparse indexes, fused with three-level RRF (dense+sparse per query, then original+HyDE across queries, then weighted across buckets). Already-shown clips are hard-excluded.
- Hydrate: fetch full rows (transcript, titles, speakers, timestamps, media URLs) from PostgreSQL and merge with vector scores.
- Rerank: Cohere rerank over a constraint-enriched query.
- Re-score: apply_hybrid_metadata_scoring blends semantic + date + person-match + show-match using a weight profile chosen from detected intent.
- Select: Gemini reads the top candidates as XML documents, extracts supporting quotes before choosing, and returns the winning clip, a user-facing answer, a confidence score, and the conversation-memory update, all in one call.
- Recommend: the top-3 alternative clips are pre-computed and cached so a follow-up tap resolves instantly.
The episode_search and small_talk branches are analogous; small_talk
optionally uses Google Search grounding for explanatory answers.
For a much deeper walkthrough of every subsystem, see
docs/ARCHITECTURE.md.
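The three-level fusion in the Retrieve stage can be sketched as repeated weighted Reciprocal Rank Fusion. This is a minimal illustration, not the code in retrieval/data_fetcher.py: the function names, the constant k = 60 (a common RRF default), and the example weights are all assumptions, and the HyDE weight is simply passed in rather than derived from cosine similarity.

```python
from collections import defaultdict
from typing import Optional


def rrf(rankings: list, weights: Optional[list] = None, k: int = 60) -> list:
    """Weighted RRF: score(d) = sum_i w_i / (k + rank_i(d)), ranks starting at 1."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def fuse_bucket(per_query: list, query_weights: list) -> list:
    """Levels 1 and 2 for one time bucket."""
    # Level 1: fuse the dense and sparse rankings of each query variant.
    fused_per_query = [rrf([dense, sparse]) for dense, sparse in per_query]
    # Level 2: fuse the original query with its HyDE variants, weighted per variant.
    return rrf(fused_per_query, query_weights)


# (dense ranking, sparse ranking) for the original query and one HyDE variant.
latest_bucket = fuse_bucket(
    per_query=[
        (["c1", "c2", "c3"], ["c2", "c1", "c4"]),  # original query
        (["c2", "c5"], ["c2", "c3"]),              # HyDE variant
    ],
    query_weights=[1.0, 0.5],  # assumed: the HyDE variant counts half as much
)
all_time_bucket = ["c9", "c2"]

# Level 3: fuse the time buckets by bucket weight.
final = rrf([latest_bucket, all_time_bucket], weights=[1.0, 0.3])
```

Because RRF works on ranks rather than raw scores, dense similarities and BM25 scores never need to be normalized against each other, which is what makes stacking the three levels straightforward.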
These are the parts worth a closer look:
- Single-call select + memory. The selection step makes the LLM both pick the best chunk and emit the full structured memory update in one call, deliberately saving a round-trip and roughly 1-2 s of latency.
- Deterministic memory as a separate source of truth. Pronoun resolution, topic-drift detection, "the other one" disambiguation, and topic-thread tracking live in explicit, testable state machinery (engine/memory.py), not left to the model. Per-component renderers tailor exactly what each LLM sees.
- Multi-level Reciprocal Rank Fusion with weighted HyDE. Three fusion stages: dense + sparse fused per query, then original + HyDE variants fused across queries (each HyDE document weighted by its cosine-similarity rank to the base query), then time buckets fused by bucket weight.
- Time-aware multi-bucket retrieval. A planner builds latest/oldest/relative/before/after/between buckets (and can regex-infer event anchors), each independently filtered, fused, and re-weighted.
- Intent-driven metadata re-scoring. One of several weight profiles (pure-recency / recency+topic / person-focused / show-focused / standard) is selected per query, with tiered person matching and multiplicative penalties.
- A robustness ladder for unreliable LLM JSON: structured output → JSON-object mode → a hand-written character-state-machine JSON repairer.
- Pervasive concurrency & graceful degradation: batched embeddings, asyncio.gather over HyDE calls / buckets / branches, exponential-backoff retries, and fallbacks at every stage (optional grounding, entity-filter and recall fallbacks, per-stage fallback selections).
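To make the last rung of the JSON robustness ladder concrete, here is a minimal sketch of a character-state-machine repairer. It assumes the failure mode is truncated output (an unterminated string, a dangling comma, unclosed brackets); `repair_json` and `parse_llm_json` are hypothetical names, and the project's own repairer may handle more cases.

```python
import json


def repair_json(text: str) -> str:
    """Best-effort fix for truncated JSON: close open strings and brackets."""
    out = []
    stack = []          # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        out.append(ch)
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    if in_string:
        if escaped:
            out.pop()   # drop a dangling backslash before closing the string
        out.append('"')
    repaired = "".join(out).rstrip()
    if repaired.endswith(","):
        repaired = repaired[:-1]  # a trailing comma before a closer is invalid
    return repaired + "".join(reversed(stack))


def parse_llm_json(text: str):
    """Try strict parsing first, then the repaired text; None if both fail."""
    for candidate in (text, repair_json(text)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


truncated = '{"clip_id": "c42", "answer": "She said sleep deb'
recovered = parse_llm_json(truncated)
```

The sketch deliberately returns None rather than guessing when the text cannot be repaired (for example, output cut off right after a key), so the caller can fall back to another stage.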
Deep dives: the architecture and these two techniques are written up in
detail; see docs/DESIGN.md and the engineering write-ups in
docs/blog/ (multi-level RRF + weighted HyDE,
single-call select + memory).
A design goal is that the deterministic memory layer is never the bottleneck:
LLM calls should dominate latency, not bookkeeping. A reproducible micro-benchmark
(benchmarks/memory_bench.py) measures the pure-Python
per-turn cost at a realistic steady state (~50 turns of history), on commodity hardware:
| Operation (per turn) | p50 | p95 |
|---|---|---|
| Full memory update (entity decay, topic thread, exclusion window, compression) | ~0.04 ms | ~0.14 ms |
| Render router context | ~0.005 ms | ~0.008 ms |
| Render query-analyzer context | ~0.009 ms | ~0.027 ms |
| Render full memory prompt | ~0.006 ms | ~0.018 ms |
| Exclusion-window lookup | <0.001 ms | <0.001 ms |
The entire memory layer costs well under 0.2 ms per turn, three to four orders
of magnitude below a single model round-trip. Reproduce: python benchmarks/memory_bench.py.
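For readers who want to see how p50/p95 figures like those in the table can be collected, the following is a minimal timing sketch. It is not benchmarks/memory_bench.py: the `bench` helper and the stand-in workload (a set-membership lookup) are assumptions chosen only to illustrate the measurement.

```python
import statistics
import time


def bench(fn, iterations: int = 2000) -> dict:
    """Time `fn` repeatedly and report p50/p95 latency in milliseconds."""
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}


# Stand-in workload: an exclusion-style lookup over ~50 turns of shown clips.
shown = {f"clip-{i}" for i in range(250)}
result = bench(lambda: "clip-17" in shown)
print(f"p50={result['p50_ms']:.4f} ms  p95={result['p95_ms']:.4f} ms")
```

Reporting percentiles rather than a mean keeps occasional scheduler or garbage-collection spikes from distorting the typical per-turn cost.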
Retrieval quality is measured by a reproducible harness
(evals/retrieval_eval.py) that runs the real
clip-search pipeline against a labeled query set and reports Hit@k, Recall@k,
MRR, end-to-end selection accuracy, and latency percentiles, with optional
RAGAS answer scoring (--ragas). The harness captures
the ranked candidate set by wrapping the scoring stage at runtime, so it
measures the production path without touching it.
Because EchoFind's accuracy depends on a private podcast index and relational store, the harness has two modes:
- Skeleton (default, no credentials): validates the labeled set and metric wiring and prints what would run, but emits zero numbers. It will never fabricate a metric.
- Live (with OPENAI/PINECONE/GEMINI/RDS_* set): boots the same agent server.py builds and prints real metrics.
python evals/retrieval_eval.py # skeleton check, no creds
python evals/retrieval_eval.py --repeats 3 # live: retrieval + latency
python evals/retrieval_eval.py --ragas # live + answer faithfulness/relevancy

See evals/README.md for the metric definitions and the
labeled-query-set format.
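As a rough guide to what the reported retrieval metrics mean, here is a sketch using the standard textbook definitions of Hit@k, Recall@k, and MRR. The harness's authoritative definitions are in evals/README.md and may differ in detail; the function names and the two labeled queries below are made up for illustration.

```python
def hit_at_k(ranked: list, relevant: set, k: int) -> float:
    """1.0 if any relevant clip appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the relevant clips that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant clip, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Illustrative labeled set: (ranked candidates, relevant clip IDs) per query.
labeled = [
    (["c3", "c1", "c7"], {"c1"}),
    (["c2", "c9"], {"c2", "c5"}),
]

n = len(labeled)
hit_at_1 = sum(hit_at_k(r, rel, 1) for r, rel in labeled) / n
recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in labeled) / n
mrr = sum(reciprocal_rank(r, rel) for r, rel in labeled) / n
```

Hit@k and MRR reward getting one right answer near the top, which matches a product that returns a single clip, while Recall@k shows how much of the relevant material reaches the reranker at all.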
| Layer | Technology |
|---|---|
| API & streaming | FastAPI, Uvicorn, Server-Sent Events (sse-starlette) |
| Frontend | Vanilla HTML/CSS/JS, single file, no build step |
| LLM | Google Gemini (gemini-3-flash-preview + gemini-2.5-flash-lite) via OpenAI-compatible API; google-genai for Search grounding |
| Embeddings | OpenAI text-embedding-3-large (dense) |
| Sparse | BM25 via pinecone-text |
| Vector search | Pinecone hybrid (dense + sparse indexes) |
| Reranking | Cohere rerank via Pinecone inference |
| Relational store | PostgreSQL (clip + episode metadata) via psycopg2 |
| Object storage | AWS S3 (BM25 model) via boto3 |
| Retrieval techniques | HyDE, multi-level RRF, fuzzy matching (rapidfuzz/thefuzz) |
| Validation | Pydantic v2 structured outputs |
| Concurrency | asyncio throughout; in-RAM session memory |
EchoFind/
├── server.py                  # FastAPI app: clients, lifespan, CORS, serves the UI
├── config.py                  # All settings via env vars (no secrets committed)
├── run_local.py               # Local dev server launcher
├── api/
│   └── routes.py              # /chat, /chat/stream (SSE), recommendations, sessions
├── engine/                    # The agentic RAG engine
│   ├── agent.py               # Orchestrator: routing, the clip pipeline, fusion, scoring
│   ├── router.py              # 3-branch LLM query router
│   ├── query_analyzer.py      # Pronoun resolution, entity/time extraction, HyDE
│   ├── selection.py           # Single-call clip selection + memory update
│   ├── episode_search.py      # Episode-level retrieval branch
│   ├── recommendations.py     # Pre-computed follow-up clips
│   ├── episode_recommendations.py
│   ├── small_talk.py          # Greetings / explanations (+ optional grounding)
│   ├── memory.py              # Deterministic conversation-memory state machine
│   └── schemas.py             # Pydantic request/response/memory schemas
├── retrieval/                 # Search & data-access layer
│   ├── data_fetcher.py        # Pinecone search, RRF, PostgreSQL hydration
│   ├── search.py              # Cohere reranking
│   ├── search_filter.py       # Metadata + date filter construction
│   ├── sparse_encoder.py      # BM25 sparse encoding (graceful default fallback)
│   └── gazetteer.py           # Fast fuzzy entity lookup (hosts/guests/shows)
├── web/
│   └── index.html             # Streaming chat UI
├── data/
│   └── entities.sample.json   # Synthetic catalog for local dev/demo
├── tests/                     # Memory-behavior verification scripts
├── docs/
│   └── ARCHITECTURE.md        # Deep-dive design documentation
├── .env.example               # Copy to .env and fill in
└── requirements.txt
EchoFind talks to several managed services (Gemini, OpenAI, Pinecone, Cohere via Pinecone, PostgreSQL). To run it end-to-end you need accounts/keys for those and a populated index + database. The code, structure, and pipeline are fully readable without them.
# 1. Clone & enter
git clone https://github.com/akira231097/echofind.git
cd echofind
# 2. (Recommended) create a virtualenv
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure environment
cp .env.example .env # then fill in your keys
# 5. Run the dev server
python run_local.py
# → UI: http://localhost:8000
# → API docs: http://localhost:8000/docs
# → Health: http://localhost:8000/api/health

The BM25 sparse encoder gracefully falls back to a default model if no trained model is available, so the service still boots without the (proprietary) index.
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/chat | Non-streaming chat (full response) |
| POST | /api/chat/stream | Streaming chat via Server-Sent Events |
| POST | /api/recommendation/click | Resolve a pre-computed clip recommendation |
| POST | /api/episode-recommendation/click | Resolve an episode recommendation |
| POST | /api/session/reset | Clear a session's memory |
| GET | /api/session/{id} | Session info (turns, entities, themes) |
| DELETE | /api/session/{id} | Delete a session |
| GET | /api/sessions | List active sessions |
| POST | /api/cleanup | Evict sessions older than N hours |
| GET | /api/health | Health check |
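A client of the streaming endpoint has to decode the Server-Sent Events wire format. The sketch below parses a captured stream using only the generic SSE rules (fields `event:` and `data:`, events separated by a blank line, lines starting with `:` treated as comments). The event names `progress` and `result` and the payload fields are hypothetical; the README does not specify what /api/chat/stream actually emits.

```python
import json


def parse_sse(stream_text: str) -> list:
    """Split raw SSE text into a list of {"event": ..., "data": ...} dicts."""
    events = []
    event_name, data_lines = "message", []
    # The extra "" flushes a final event that lacks a trailing blank line.
    for line in stream_text.splitlines() + [""]:
        if line == "":
            if data_lines:  # events without data are not dispatched
                events.append({"event": event_name, "data": "\n".join(data_lines)})
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is stripped
            if field == "event":
                event_name = value
            elif field == "data":
                data_lines.append(value)
    return events


# Hypothetical capture of a streamed response (names and fields are made up).
sample = (
    "event: progress\n"
    'data: {"stage": "route"}\n'
    "\n"
    ": keep-alive\n"
    "event: result\n"
    'data: {"clip_id": "c42"}\n'
    "\n"
)
events = parse_sse(sample)
payloads = [json.loads(e["data"]) for e in events]
```

In a real client the same loop would run over lines as they arrive from the HTTP response instead of over a complete string.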
Streaming request body:
{ "session_id": "session-1234", "question": "find the latest episode about sleep" }

The tests/ directory contains executable specifications for the conversation
memory: they feed synthetic per-turn updates through the memory state machine
and print the resulting state, demonstrating entity tracking, topic-shift
resets, and the exclusion window:
python tests/test_memory_branches.py # verbose, full memory dumps
python tests/test_memory_samples.py # compact snapshots
python tests/run_live_test.py # live end-to-end (needs API keys + data)

This repository is a portfolio/reference implementation. A few things are intentionally demo-grade and would be hardened before a real deployment:
- CORS is wide open (allow_origins=["*"]); restrict it for production.
- Session memory is in-RAM; swap for Redis/DB for horizontal scaling.
- A debug endpoint (/api/session/{id}/memory/debug) dumps full memory; remove or guard behind auth.
- The UI renders some server content via innerHTML; sanitize/escape before exposing to untrusted content (links already use rel="noopener noreferrer").
- The sample catalog (data/entities.sample.json) is synthetic; real retrieval requires populated Pinecone indexes and a PostgreSQL database.
MIT. Feel free to read, learn from, and build on it.