The LLM Evaluation Framework
[NeurIPS D&B '25] The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods with easy feature extensibility.
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Runs a prompt against all, or a selection, of your models running on Ollama and creates web pages with the output, performance statistics, and model info, all in a single Bash shell script.
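For context, a minimal sketch of the kind of call such a script makes, written in Python rather than Bash. It assumes an Ollama server on its default port (localhost:11434) and uses an illustrative model name; it is not the repository's own script.

```python
import json
import urllib.request

# Assumes an Ollama server on its default port; the model name is illustrative.
OLLAMA_URL = "http://localhost:11434/api/generate"

def run_prompt(model: str, prompt: str) -> dict:
    """Send a single non-streaming prompt to Ollama and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    reply = run_prompt("llama3", "Explain what an LLM evaluation metric is in one sentence.")
    # Non-streaming replies include the generated text plus timing counters
    # (total_duration, eval_count, ...) that a report page could summarize.
    print(reply["response"])
```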
Estimates a confidence score that outputs generated by large language models are free of hallucinations.
Create an evaluation framework for your LLM-based app, incorporate it into your test suite, and lay the foundation for monitoring.
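As an illustration of wiring an evaluation into a test suite, here is a minimal pytest-style sketch. `generate_answer` is a hypothetical stand-in for whatever entry point your app exposes, and the keyword check is a deliberately simple placeholder for a real metric.

```python
import pytest

# Hypothetical stand-in for the app under test; replace with your own entry point.
def generate_answer(question: str) -> str:
    raise NotImplementedError("call your LLM-backed app here")

# Tiny evaluation set; real suites would load cases from a file or dataset.
EVAL_CASES = [
    ("What does LLM stand for?", ["large language model"]),
    ("Name one LLM evaluation benchmark.", ["mmlu", "tofu", "muse", "wmdp"]),
]

@pytest.mark.parametrize("question,expected_keywords", EVAL_CASES)
def test_answer_mentions_expected_keywords(question, expected_keywords):
    answer = generate_answer(question).lower()
    # Keyword containment is a crude metric; swap in exact match, similarity,
    # or an LLM-as-judge scorer as the framework matures.
    assert any(kw in answer for kw in expected_keywords)
```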
Evaluates LLM responses and measures their accuracy.
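Accuracy over a set of responses reduces to the fraction judged correct. A minimal sketch, assuming responses and reference answers are parallel lists of strings and exact (case-insensitive) match counts as correct:

```python
def response_accuracy(responses, references):
    """Fraction of responses that exactly match their reference answer (case-insensitive)."""
    correct = sum(
        r.strip().lower() == ref.strip().lower()
        for r, ref in zip(responses, references)
    )
    return correct / len(references) if references else 0.0

# Example: two of three responses match their references -> accuracy ~0.667
print(response_accuracy(["Paris", "4", "blue"], ["Paris", "5", "Blue"]))
```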
Tools for systematic large language model evaluations
VerifyAI is a simple UI application to test GenAI outputs
Official implementation of Spectral Scaling Laws (EMNLP 2025)
This repo contains a Streamlit application that provides a user-friendly interface for evaluating large language models (LLMs) using the beyondllm package.