v0.1.1 — Initial release
What's included
- GGUF quantization (Q2_K through Q8_0) via llama.cpp
- GPTQ quantization (INT4/INT8) via gptqmodel — runs on Kaggle T4
- TFLite conversion via onnx2tf — runs on Google Colab
- Real benchmark: tok/s + perplexity via llama-cpp-python
- Simulated mobile benchmark: MAC count + latency across 7 SoC profiles
- Pareto frontier chart (interactive Plotly HTML)
- Gradio web UI with 4 tabs
- CLI: `run` and `ui` commands
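The simulated mobile benchmark boils down to latency ≈ MAC count ÷ effective SoC throughput. A minimal sketch of that arithmetic is below; the SoC profiles, utilization factors, and function name are illustrative assumptions, not the tool's actual profiles or implementation.

```python
# Hypothetical sketch of a MAC-count-based latency estimate.
# The profiles and utilization factors are assumed values for
# illustration, not the tool's real SoC data.

# Peak throughput in MACs/second, with an assumed fraction of peak
# that is actually sustained during inference.
SOC_PROFILES = {
    "mid-range": {"peak_macs_per_s": 2.0e12, "utilization": 0.30},
    "flagship":  {"peak_macs_per_s": 8.0e12, "utilization": 0.35},
}

def estimate_latency_ms(total_macs: float, profile: dict) -> float:
    """Latency = work / effective throughput, in milliseconds."""
    effective_macs_per_s = profile["peak_macs_per_s"] * profile["utilization"]
    return total_macs / effective_macs_per_s * 1e3

# Rough example: assume ~0.5 GMACs per generated token.
lat = estimate_latency_ms(0.5e9, SOC_PROFILES["mid-range"])
```

Repeating this across a set of SoC profiles gives the per-device latency spread without needing the physical hardware.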
Verified on
- Qwen2-0.5B, GTX 1060 6GB, Windows, CPU inference
Known limitations
- TFLite not supported on Windows (use Colab notebook)
- GPTQ requires 16GB+ VRAM locally (use Kaggle notebook)