This repository contains code for doing retrieval-augmented generation (RAG) with CSC user documentation using models run locally on workstation CPUs.
Make sure you have a working installation of Python and Docker Engine, then create a virtual environment and install the dependencies.
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

The models are run using the llama.cpp framework, which requires models to be converted to the GGUF format. Fortunately, there are many pre-converted models available on Hugging Face. The hf CLI tool used here is installed by the huggingface-hub Python package, which is included in the dependencies.
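To make the GGUF requirement a bit more concrete: a GGUF file starts with the four ASCII bytes `GGUF`, followed by a 32-bit version number (little-endian in the common case). The sketch below uses that to sanity-check downloaded files. It is not part of this repository; the function name `read_gguf_version` is made up for illustration, and the directory is simply the one used by the download commands.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF.

    Hypothetical helper for illustration; assumes a little-endian file.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version is an unsigned 32-bit integer right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # Directory taken from the hf download commands; adjust if yours differs.
    model_dir = Path("./data/llama.cpp")
    if model_dir.is_dir():
        for model in sorted(model_dir.glob("*.gguf")):
            version = read_gguf_version(model)
            status = f"GGUF v{version}" if version is not None else "not a GGUF file"
            print(f"{model.name}: {status}")
```

A truncated or mislabelled download shows up as "not a GGUF file". Run it after the models have been fetched with the hf commands that follow.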
hf download --local-dir ./data/llama.cpp unsloth/embeddinggemma-300m-GGUF embeddinggemma-300m-Q4_0.gguf
hf download --local-dir ./data/llama.cpp unsloth/gemma-3-4b-it-qat-GGUF gemma-3-4b-it-qat-Q4_K_M.gguf

Start the llama.cpp inference server and Qdrant vector database containers using Docker Compose. The container images are pulled automatically.
docker compose up

The containers can be stopped using a similar command.
docker compose down

If you encounter any issues, it can be helpful to remove any stopped containers before running Compose.
docker container prune

Build the vector database using the provided script. This takes around 25 minutes on my workstation.
python3 build_index.py

Open the Chainlit chat interface. The chainlit CLI tool is installed by a Python package of the same name, which is included in the dependencies. The server's startup script takes a few seconds to run, so if you see an error message when trying to access the web UI, try refreshing the page.
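If you would rather not refresh by hand, a small polling helper can tell you when the UI starts answering. This is only a sketch using the standard library; `wait_for_http` is a hypothetical name, and the URL in the note below assumes Chainlit's default port 8000, so adjust it if your setup differs.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=30.0, interval=0.5):
    """Poll `url` until something answers over HTTP or `timeout` expires.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1.0):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_http("http://localhost:8000")` returns `True` once the page is reachable. Run it from a second terminal after starting the chainlit command that follows.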
chainlit run app.py

RAG is disabled by default, but can be enabled from the settings menu, which is accessed by clicking on the gear icon located on the left side of the input field. The retrieved documents can be viewed by expanding the "Used retrieve" step in the chat and clicking on one of the source icons at the end of the step.
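To illustrate what enabling RAG changes, here is a minimal sketch of the retrieve-then-prompt step. It is not the code in app.py: the function names, the toy three-dimensional vectors, the example passages, and the prompt wording are all made up for illustration. In this setup the vectors would presumably come from the embedding model and the similarity search would be done by Qdrant rather than in plain Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, top_k=3):
    """Return the top_k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, passages):
    """Prepend the retrieved passages to the question as numbered context."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index with hand-made vectors (hypothetical data, not real embeddings).
index = [
    ("How to submit a batch job.", [0.9, 0.1, 0.0]),
    ("How to reset your password.", [0.0, 0.2, 0.9]),
    ("How to check job queue status.", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded user question

hits = retrieve(query_vector, index, top_k=2)
prompt = build_prompt("How do I run a job?", [text for _, text in hits])
print(prompt)
```

With RAG disabled, only the question would reach the chat model. With it enabled, the model also sees the retrieved passages, which is what the retrieval step in the chat lets you inspect.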
