Legacy Code Archaeologist is a production-grade Retrieval-Augmented Generation (RAG) tool designed to analyze how and why a codebase evolved over time.
Instead of reading static snapshots, it mines actual Git diffs, allowing you to ask high-impact questions like:
- “Who introduced the timeout bug..?”
- “Why was the authentication logic rewritten in 2021?”
- “When did this API contract change?”
Built for scalability, accuracy, and real-world engineering workflows, now supercharged by Groq's lightning-fast inference engine.
- Parses real commit diffs (added/removed lines) — not just file snapshots.
- Understands why code changed, not just what changed.
- Generator-based architecture enables O(1) memory usage.
- Handles massive repositories (Linux, React, Kubernetes) efficiently.
- Uses Groq API (LLaMA-3.3-70B) for lightning-fast, highly accurate contextual reasoning.
- Powered by ChromaDB for semantic search over commit history.
- Switch between:
- Recent History (Fast Scan)
- Deep Excavation (Full Repo Analysis)
- Automatically generates PDF audit reports of chat sessions.
- Ideal for compliance, audits, and engineering reviews.
- Shallow Git clones (
--depth) for 99% faster fetches. - Auto-cleanup of cloned repos and vector databases.
Backend
- Python 3.12+
- GitPython (Custom Mining Engine)
- ChromaDB (Local Vector Store)
AI Engine
- Groq API (
llama-3.3-70b-versatilevia OpenAI compatible endpoint)
Frontend
- Streamlit
DevOps / Tooling
- uv (Fast Python package manager)
- python-dotenv
┌──────────────┐
│ Git Repo │
└──────┬───────┘
↓
┌─────────────────────┐
│ Custom Miner Engine│ ← Generator-based diff processing
└──────┬──────────────┘
↓
┌─────────────────────┐
│ ChromaDB Vector │ ← Semantic indexing
└──────┬──────────────┘
↓
┌─────────────────────┐
│ Groq API │ ← Lightning-fast LLM reasoning
└──────┬──────────────┘
↓
┌─────────────────────┐
│ Streamlit UI │ ← Chat + Time Controls
└─────────────────────┘
- Paste any public GitHub repository URL
- Select:
- Fast Mode → Recent commits only
- Deep Mode → Full historical analysis
- Provide your Free Groq API Key via the UI (if not set in
.env). - Ask natural language questions:
- “Why was this function refactored?”
- “Who changed the authentication logic?”
- Export a PDF audit report if needed.
git clone https://github.com/kartik0905/git-archaeologist.git
cd git-archaeologistpip install -r requirements.txt(Or use uv pip install -r requirements.txt for faster installation)
Create a .env file:
GROQ_API_KEY=your_groq_api_key_herestreamlit run app.py.
├── app.py # Streamlit UI & user interaction
├── miner.py # Core mining engine (diff parsing, batching)
├── requirements.txt
└── pyproject.toml
- Designed like a real production system, not a demo.
- Handles large-scale repositories efficiently.
- Solves a real developer pain point — understanding legacy code.
- Uses Groq to eliminate LLM latency, making log parsing instantaneous.
MIT License
Star the repository and feel free to contribute or fork it for your own tooling.
Built with engineering discipline, not just prompts.