Local Gemma GGUF chat utilities powered by llama-cpp-python.
This project provides three ways to run the same local model:
llm_cli.py- interactive terminal chat.llm_web.py- Gradio web chat with optional file attachments.llm_server.py- minimal OpenAI-compatible API server for tools such as Continue.dev or VS Code extensions.
- Python 3.10+
- A GGUF model file
llama-cpp-pythonPyYAML- Optional web/API dependencies:
gradio,fastapi,uvicorn
For CUDA builds of llama-cpp-python, install it with the flags appropriate for your system. For example:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dirInstall the remaining Python dependencies as needed:
pip install pyyaml gradio fastapi uvicornPlace your .gguf model under models/ or update model.path in config.yaml.
The current config expects:
~/ai-gemma/models/gemma-4-E2B-it-Q4_K_M.gguf
GGUF files are intentionally ignored by Git because they are large local artifacts.
Edit config.yaml to control:
- Model path, GPU layers, context size, and flash attention.
- Generation settings such as temperature, top-p, top-k, and max tokens.
- CLI session defaults.
- Gradio web UI host, port, title, and upload file types.
- OpenAI-compatible API host and port.
CLI arguments override values from config.yaml.
Run the interactive CLI:
python llm_cli.pyUse a custom config or model:
python llm_cli.py --config /path/to/config.yaml
python llm_cli.py --model /path/to/model.gguf --gpu-layers 99Run the Gradio web UI:
python llm_web.pyRun the OpenAI-compatible API server:
python llm_server.pyThe API server exposes:
GET /v1/modelsPOST /v1/chat/completions
By default, the server base URL is:
http://127.0.0.1:8000/v1
Inside the interactive CLI:
/help- show available commands./exitor/quit- exit./clear- clear conversation history./system <msg>- set the system prompt./history- show conversation history./info- show model and config details./save <file>- save the conversation.