bartdegoede/python-searchengine

python-searchengine

A simple search engine implemented in Python for illustrative purposes, written to accompany a blog post.

Requirements

Python 3.10 or greater, and uv.

Usage

Install dependencies:

uv sync

Run the full-text search from the command line. On first run, the Wikipedia dataset (~20GB) will be downloaded from Hugging Face and cached automatically:

uv run python run.py

Run the semantic (vector) search:

uv run python run_semantic.py

On first run this builds a vector index by embedding all 6.4M documents. Embeddings are checkpointed to data/checkpoints/ so you can resume if interrupted. The finished index is saved to data/vector_index.* and memory-mapped on subsequent runs.
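The checkpoint-and-resume behaviour described above can be sketched roughly as follows. This is a minimal illustration of the pattern, not the code in this repository: `fake_embed`, `embed_with_checkpoints`, the `batch_*.json` file naming and the batch size are all hypothetical, and a real run would call an embedding model and store arrays rather than JSON lists.

```python
import json
import tempfile
from pathlib import Path


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_with_checkpoints(docs, checkpoint_dir, batch_size=2):
    """Embed docs in batches, reusing any batch already saved to disk.

    Returns the vectors and the number of batches that had to be computed.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    vectors = []
    computed = 0
    for start in range(0, len(docs), batch_size):
        # Hypothetical naming scheme: one file per batch, keyed by offset.
        path = checkpoint_dir / f"batch_{start:08d}.json"
        if path.exists():
            # Resume: this batch was finished by an earlier run.
            batch = json.loads(path.read_text())
        else:
            batch = [fake_embed(d) for d in docs[start:start + batch_size]]
            # Write to a temporary file and rename it, so an interrupted
            # run never leaves a half-written checkpoint behind.
            tmp = path.with_suffix(".tmp")
            tmp.write_text(json.dumps(batch))
            tmp.replace(path)
            computed += 1
        vectors.extend(batch)
    return vectors, computed


if __name__ == "__main__":
    docs = ["alpha", "beta", "gamma", "delta", "epsilon"]
    with tempfile.TemporaryDirectory() as tmp_dir:
        _, first_run = embed_with_checkpoints(docs, tmp_dir)
        _, second_run = embed_with_checkpoints(docs, tmp_dir)
        print(first_run, second_run)
```

The second call finds every batch on disk and computes nothing, which is the property that makes an interrupted multi-hour encoding run cheap to restart.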

To skip the multi-hour encoding step, download the pre-computed embeddings from Hugging Face, place the JSON and .npy files in data/checkpoints/, and run uv run python run_semantic.py.

If you'd like to download the dataset separately (e.g. before a demo):

uv run python download.py

To get higher download rate limits, set a Hugging Face token:

export HF_TOKEN=hf_...

Run from an interactive console:

uv run ipython

In [1]: run run.py
In [2]: index.search('python programming language', rank=True)[:5]
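For readers who want a feel for what a call like `index.search(..., rank=True)` might do, here is a small sketch of an inverted index with AND matching and a TF-IDF style score. It is an assumption-laden illustration rather than this repository's implementation: `TinyIndex`, `tokenize` and the scoring formula are hypothetical, and it returns document IDs only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical toy index; not the Index class from this repository."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of docs containing it
        self.term_counts = {}             # doc id -> Counter of its tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.term_counts[doc_id] = Counter(tokens)
        for token in tokens:
            self.postings[token].add(doc_id)

    def search(self, query, rank=False):
        tokens = tokenize(query)
        if not tokens:
            return []
        # AND semantics: a document must contain every query token.
        matches = set.intersection(
            *(self.postings.get(t, set()) for t in tokens)
        )
        if not rank:
            return sorted(matches)

        n_docs = len(self.term_counts)

        def score(doc_id):
            # Term frequency times inverse document frequency, summed
            # over the query tokens.
            return sum(
                self.term_counts[doc_id][t]
                * math.log(n_docs / len(self.postings[t]))
                for t in tokens
            )

        # Highest score first; ties broken by document id.
        return sorted(matches, key=lambda d: (-score(d), d))


if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, "Python is a programming language")
    index.add(2, "Python python python programming language language")
    index.add(3, "A python is a snake")
    print(index.search("python programming language", rank=True)[:5])
```

Because "python" occurs in every document of this toy corpus, its inverse document frequency is zero and it does not affect the ordering; the rarer query terms decide the ranking.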

Development

Lint and type check:

uv run ruff check .
uv run mypy search/

Run tests:

uv run pytest -v
