Code, datasets, models for the paper "Automatic Evaluation of Attribution by Large Language Models"
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. MLLM-CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
This repo contains detailed notes on LangSmith concepts including traces, runs, observability, and integrations with LangChain, RAG, and LangGraph.