codesearch

A semantic code search tool that indexes code repositories using embeddings and AST analysis for intelligent code discovery.

Features

Hybrid search (default): combines semantic vector search with BM25-style keyword matching, fused via Reciprocal Rank Fusion (RRF) for best-of-both precision and recall
Semantic search: uses ML embeddings to find conceptually similar code even without exact keyword matches
AST-aware: parses code using tree-sitter for structure-aware indexing
Multi-language support: Rust, Python, JavaScript, TypeScript, Go, HCL, PHP, and C++
Persistent storage: DuckDB with VSS (Vector Similarity Search) acceleration
Fast indexing: efficient batch processing with ONNX embedding generation

Architecture

This project follows Domain-Driven Design (DDD) principles:

src/
├── domain/
├── application/
├── connector/
└── cli/

Installation

cargo build --release
# Copy the release binary into bin/
cp target/release/codesearch bin/

Usage

Getting Started

No external services required! CodeSearch uses DuckDB by default for persistent storage.

# Build the project
cargo build --release

# Index a repository
./target/release/codesearch index /path/to/repo --name my-repo

# Search
./target/release/codesearch search "function that handles authentication"

# List indexed repositories
./target/release/codesearch list

Commands

# Index a repository
codesearch index /path/to/repo

# Search for code
codesearch search "function that handles authentication"

# Show indexed repositories
codesearch list

# Delete a repository by name or path
codesearch delete my-repo
codesearch delete /path/to/repo

# Show the blast radius of a symbol change (BFS over call graph)
codesearch impact authenticate

# Show 360-degree caller/callee context for a symbol
codesearch context authenticate

# Start MCP server (stdio, for AI tool integration)
codesearch mcp

# Start MCP server over HTTP
codesearch mcp --http 8080

Configuration Options

Flag	Default	Description
`--data-dir`	`~/.codesearch`	Directory for DuckDB database files
`--namespace`	`search`	DuckDB schema namespace for vector storage
`--memory-storage`	`false`	Use in-memory storage (no persistence)
`--mock-embeddings`	`false`	Use mock embeddings (for testing)
`--no-rerank`	`false`	Disable reranking
`-v, --verbose`	`false`	Enable debug logging

Search Options

Flag	Default	Description
`--num`	`10`	Number of results to return
`-m, --min-score`	(none)	Minimum relevance score threshold (0.0-1.0)
`-L, --language`	(none)	Filter by programming language (can specify multiple)
`-r, --repository`	(none)	Filter by repository (can specify multiple)
`-F, --format`	`text`	Output format: `text`, `json`, or `vimgrep`
`--no-text-search`	(off)	Disable the keyword leg; use only vector/semantic search

Output Formats

Format	Description
`text`	Human-readable output with code previews (default)
`json`	Structured JSON array for programmatic use and editor integrations
`vimgrep`	`file:line:col:text` format for Neovim quickfix list and Telescope
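
For scripts that consume the `vimgrep` format, the following is a minimal Rust sketch of a parser for one `file:line:col:text` line. The `VimgrepHit` struct and `parse_vimgrep` function are hypothetical names used for illustration and are not part of codesearch. The sketch assumes that file paths contain no colons.

```rust
/// One parsed line of `--format vimgrep` output (`file:line:col:text`).
/// Hypothetical type for illustration; not part of codesearch.
#[derive(Debug, PartialEq)]
struct VimgrepHit {
    file: String,
    line: u32,
    col: u32,
    text: String,
}

/// Splits on the first three colons only, so colons inside the code text
/// survive. Returns `None` for malformed lines. Paths that contain ':'
/// are not handled by this sketch.
fn parse_vimgrep(line: &str) -> Option<VimgrepHit> {
    let mut parts = line.splitn(4, ':');
    let file = parts.next()?.to_string();
    let line_no = parts.next()?.parse().ok()?;
    let col = parts.next()?.parse().ok()?;
    let text = parts.next()?.to_string();
    Some(VimgrepHit { file, line: line_no, col, text })
}
```

Limiting the split to four parts keeps code such as `req: Request` intact in the `text` field.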

Examples

# Index with a custom data directory
codesearch --data-dir /var/lib/codesearch index /path/to/repo --name my-repo

# Use a separate namespace for different projects
codesearch --namespace project-a index /path/to/repo-a --name repo-a
codesearch --namespace project-b index /path/to/repo-b --name repo-b

# Verbose logging with debug output
codesearch -v search "authentication error handling"

# Use mock embeddings for testing
codesearch --mock-embeddings index ./test-repo --name test

# Return more results
codesearch search "error handling" --num 25

# Filter by language
codesearch search "async function" --language rust

# JSON output for scripts or editor integrations
codesearch search "error handling" --format json

# Vimgrep format for Neovim quickfix
codesearch search "error handling" --format vimgrep

Call Graph Analysis

CodeSearch builds a call graph during indexing and exposes two commands to query it: impact for blast-radius analysis and context for 360-degree dependency views.

Impact Analysis

Shows every symbol that would be affected (transitively) if a given symbol changes. Uses BFS over the call graph up to a configurable depth.

# Show what breaks if `authenticate` changes (default depth: 5)
codesearch impact authenticate

# Limit hop depth
codesearch impact authenticate --depth 3

# Restrict to a specific repository
codesearch impact authenticate --repository my-api

# JSON output (for scripts)
codesearch impact authenticate --format json

Example output:

Impact analysis for 'authenticate'
─────────────────────────────────────────
process_request [call]  src/router.rs:10
└── handle_login [call]  src/api/auth.rs:42
    └── authenticate

run_tests [call]  tests/integration.rs:5
└── verify_token [call]  src/middleware/auth.rs:18
    └── authenticate
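
The traversal behind impact analysis can be sketched as a depth-limited BFS over a reverse call graph (callee to direct callers). This is an illustrative Rust sketch, not the codesearch implementation: the `impact` function, the `callers_of` map, and its in-memory shape are assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first search over a reverse call graph (callee -> direct callers).
/// Returns every transitive caller of `symbol` within `max_depth` hops,
/// paired with its hop distance, in BFS order.
/// Hypothetical helper for illustration; the real storage layout may differ.
fn impact(
    callers_of: &HashMap<&str, Vec<&str>>,
    symbol: &str,
    max_depth: usize,
) -> Vec<(String, usize)> {
    let mut seen: HashSet<&str> = HashSet::from([symbol]);
    let mut queue: VecDeque<(&str, usize)> = VecDeque::from([(symbol, 0)]);
    let mut affected = Vec::new();

    while let Some((current, depth)) = queue.pop_front() {
        // Do not expand nodes that already sit at the depth limit.
        if depth == max_depth {
            continue;
        }
        for &caller in callers_of.get(current).into_iter().flatten() {
            // `insert` returns false for already-visited symbols, which
            // also protects against cycles (recursive calls).
            if seen.insert(caller) {
                affected.push((caller.to_string(), depth + 1));
                queue.push_back((caller, depth + 1));
            }
        }
    }
    affected
}
```

With the graph from the example output above, a depth of 1 would yield only the direct callers of `authenticate`, while the default depth of 5 also reaches `process_request` and `run_tests`.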

Symbol Context

Shows the 360-degree dependency view for a symbol: who calls it (callers) and what it calls (callees).

# Show callers and callees of `authenticate`
codesearch context authenticate

# Limit the number of results per direction
codesearch context authenticate --limit 10

# Restrict to a specific repository
codesearch context authenticate --repository my-api

# JSON output
codesearch context authenticate --format json

Example output:

Context for 'authenticate'
─────────────────────────────────────────
Callers (2 total) — who uses this symbol:
  ← handle_login [call]  src/api/auth.rs:42
  ← verify_session [call]  src/middleware/session.rs:18

Callees (3 total) — what this symbol uses:
  → hash_password [call]  src/crypto/hash.rs:10
  → lookup_user [call]  src/db/users.rs:55
  → generate_token [call]  src/crypto/token.rs:7

Call Graph Options

Flag	Command	Default	Description
`--depth`	`impact`	`5`	Maximum BFS hop depth
`-l, --limit`	`context`	(none)	Max callers/callees per direction
`-r, --repository`	both	(none)	Restrict to a specific repository
`-F, --format`	both	`text`	Output format: `text` or `json`

Note: Call graph data is populated during codesearch index. Re-index after code changes to keep the graph up to date.

Editor Integrations

Neovim / Telescope

A Telescope extension is included under ide/nvim/. It provides a fuzzy picker over semantic search results, with file preview at the correct line.

Setup:

Add ide/nvim to your Neovim runtime path (Neovim resolves the lua/ subdirectory automatically):

vim.opt.runtimepath:append("/path/to/codesearch/ide/nvim")

Load the extension:

require("telescope").load_extension("codesearch")

Bind a key:

vim.keymap.set("n", "<leader>cs", function()
  require("telescope").extensions.codesearch.codesearch()
end, { desc = "Semantic code search" })

Configuration (optional):

require("telescope").setup({
  extensions = {
    codesearch = {
      bin = "codesearch",     -- path to binary
      num = 20,               -- number of results
      min_score = 0.3,        -- minimum relevance score
      data_dir = nil,         -- custom data directory
      namespace = nil,        -- custom namespace
    },
  },
})

Quick use without Telescope:

# Load results directly into Neovim's quickfix list
codesearch search "error handling" --format vimgrep | nvim -q /dev/stdin

MCP Server

CodeSearch can run as a Model Context Protocol (MCP) server, allowing AI tools (Claude, Cursor, etc.) to search your codebase semantically.

Stdio mode (default, for local AI tool integration):

codesearch mcp

HTTP mode (for network-accessible deployments):

# Listen on localhost:8080
codesearch mcp --http 8080

# Listen on all interfaces (public)
codesearch mcp --http 8080 --public

The HTTP server exposes the MCP endpoint at /mcp.

Exposed tool: search_code — accepts query, limit, min_score, languages, and repositories parameters.

Storage Backends

Mode	Persistence	Use Case
DuckDB (default)	Persistent	Fast semantic search with VSS acceleration, no external dependencies
In-memory (`--memory-storage`)	None	Testing, development, ephemeral indexing

Storage Details:

Metadata: Always stored in DuckDB locally via DuckdbMetadataRepository (repository info, chunks, file paths, statistics)
Vectors: DuckDB with HNSW (Hierarchical Navigable Small World) for Vector Similarity Search with cosine distance
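
For reference, the cosine distance metric named above can be written out directly. This Rust sketch shows only the metric; HNSW itself is an approximate index that avoids comparing the query against every stored vector. The `cosine_distance` function is a hypothetical name for illustration.

```rust
/// Cosine distance = 1 - cosine similarity.
/// 0.0 means the vectors point the same way, 1.0 means they are orthogonal,
/// and 2.0 means they point in opposite directions.
/// Returns `None` for mismatched lengths or zero-length (zero-norm) vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}
```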

Hybrid Search

By default, every search query runs two complementary retrieval legs and fuses them with Reciprocal Rank Fusion (RRF):

Semantic leg — vector similarity via HNSW cosine distance (finds conceptually related code)
Keyword leg — BM25-style LIKE matching on content and symbol names (finds exact keyword occurrences)

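RRF assigns each result a score of 1 / (60 + rank) from each leg it appears in and sums the contributions, so items found by both legs accumulate the highest fused scores. Top-ranked results therefore score between roughly 0.016 (rank 1 in one leg only, 1/61) and 0.033 (rank 1 in both legs, 2/61); lower-ranked results score less.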
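
The fusion rule can be sketched in a few lines of Rust. This is an illustration under assumptions, not the codesearch implementation: the `rrf_fuse` function, the use of string ids for results, and the tie-breaking by id are all hypothetical choices. Only the constant 60 and the 1 / (60 + rank) formula come from the description above.

```rust
use std::collections::HashMap;

/// RRF constant from the formula 1 / (60 + rank).
const RRF_K: f64 = 60.0;

/// Fuses ranked lists of result ids (best first) into one list sorted by
/// descending RRF score. Ranks are 1-based. Hypothetical helper.
fn rrf_fuse(legs: &[Vec<&str>]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for leg in legs {
        for (i, id) in leg.iter().enumerate() {
            let rank = (i + 1) as f64;
            // Each leg that contains the id adds its own contribution.
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by score descending; break ties by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

An id ranked first by both the semantic and keyword legs receives 2/61, about 0.033, while an id ranked first by only one leg receives 1/61, about 0.016.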

# Hybrid search (default — no flag needed)
codesearch search "parse configuration file"

# Semantic-only (disable keyword leg)
codesearch search "parse configuration file" --no-text-search

Reranking

CodeSearch reranks search results with a cross-encoder model to improve relevance. Reranking is enabled by default; pass --no-rerank to disable it.

How It Works

Initial hybrid/vector search retrieves candidates using inverse-log scaling: num + ⌈num / ln(num)⌉ (defaults to 20 base candidates when num ≤ 10)
For semantic-only results, candidates with vector similarity score below 0.1 are excluded (too irrelevant to benefit from reranking); hybrid RRF results bypass this filter because RRF scores are intentionally small (~0.016–0.033)
A cross-encoder model (bge-reranker-base) reranks remaining candidates based on query-document relevance
Top num reranked results are returned
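
The candidate-count formula in the first step can be checked with a small Rust sketch. The `rerank_candidates` function is a hypothetical name, and the sketch takes one reading of the text: a fixed 20 candidates when num ≤ 10, and num + ⌈num / ln(num)⌉ otherwise. The actual implementation may treat 20 as a floor instead.

```rust
/// Number of candidates to fetch before reranking:
/// num + ceil(num / ln(num)), with a fixed 20 when num <= 10.
/// Hypothetical helper; the handling of the num <= 10 case is an assumption.
fn rerank_candidates(num: usize) -> usize {
    if num <= 10 {
        return 20;
    }
    let n = num as f64;
    num + (n / n.ln()).ceil() as usize
}
```

Under this reading, `--num 25` fetches 33 candidates (25 + ⌈25 / 3.219⌉ = 25 + 8) and `--num 100` fetches 122, so the relative over-fetch shrinks as num grows.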

Usage

# Basic search (reranking is applied unless --no-rerank is set)
codesearch search "authentication"

# Customize number of results
codesearch search "error handling" --num 20

# Combine with filters
codesearch search "validation" --language rust --min-score 0.7

Models

Default: BAAI/bge-reranker-base (278M parameters, ONNX)
Downloaded automatically from HuggingFace Hub on first use
No API key or external service required

Logging

CodeSearch uses structured logging with sensible defaults to keep output clean while providing detailed information when needed.

Default Behavior

By default, only application-level logs are shown:

Indexing progress and completion
Search queries and results
Reranking operations
Repository deletion

Logs from external dependencies (ONNX runtime, tokenizers, database drivers, etc.) are suppressed to reduce noise.

Verbose Mode

Use -v or --verbose to enable debug-level logging for troubleshooting:

codesearch -v index /path/to/repo
codesearch -v search "authentication"

This shows additional details like:

File processing progress
Model initialization
Storage backend configuration

Advanced: External Crate Logs

To debug issues with external dependencies, use the RUST_LOG environment variable:

# Debug ONNX runtime issues
RUST_LOG=warn,codesearch=info,ort=debug codesearch index /path/to/repo

# Debug database issues
RUST_LOG=warn,codesearch=info,duckdb=debug codesearch search "query"

# Debug everything (very verbose)
RUST_LOG=debug codesearch index /path/to/repo

Development

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run -- index /path/to/repo

# Format code
cargo fmt

# Run linter
cargo clippy

Dependencies

ort - ONNX Runtime for ML embedding inference
tree-sitter - AST parsing and code extraction
duckdb-rs - DuckDB Rust bindings with VSS extension
tokio - Async runtime

License

MIT License
