
Initial Release

Bhabesh-Rath released this 09 Apr 05:41

v0.1.1 — Initial release

What's included

  • GGUF quantization (Q2_K through Q8_0) via llama.cpp
  • GPTQ quantization (INT4/INT8) via gptqmodel — runs on Kaggle T4
  • TFLite conversion via onnx2tf — runs on Google Colab
  • Real benchmark: tokens/s and perplexity via llama-cpp-python
  • Simulated mobile benchmark: MAC count and estimated latency across 7 SoC profiles
  • Pareto frontier chart (interactive Plotly HTML)
  • Gradio web UI with 4 tabs
  • CLI with `run` and `ui` commands
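The Pareto frontier chart plots the trade-off between quantized model size and quality across quantization levels. A minimal sketch of the underlying frontier computation (the data points and quant labels below are hypothetical examples, not real measurements; the actual tool renders an interactive Plotly HTML chart):

```python
# Pareto frontier over (size_mb, perplexity) pairs: keep only points that
# no other point beats on BOTH axes (smaller size and lower perplexity).
# All numbers here are hypothetical, for illustration only.

def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point p is dominated if some other point q is <= p on both
    size and perplexity (i.e. q is at least as small AND at least
    as accurate).
    """
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical (size_mb, perplexity) per quant level for a small model:
quants = {
    "Q2_K":   (220, 14.1),  # smallest file, worst perplexity
    "Q4_K_M": (350, 9.8),
    "Q5_K_M": (420, 9.9),   # dominated by Q4_K_M (bigger AND worse)
    "Q8_0":   (560, 9.2),   # largest file, best perplexity
}
frontier = pareto_frontier(list(quants.values()))
```

With these example numbers, Q5_K_M drops off the frontier because Q4_K_M is both smaller and lower-perplexity; the remaining points are the ones worth charting.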

Verified on

  • Qwen2-0.5B, GTX 1060 6GB, Windows, CPU inference

Known limitations

  • TFLite not supported on Windows (use Colab notebook)
  • GPTQ requires 16GB+ VRAM locally (use Kaggle notebook)