A benchmark for log reduction tools (RTK, grep, tail, hybrid routers, LLM-summary) — do they preserve enough evidence for LLM root-cause diagnosis?
LogDx-CI compares 11 context providers: raw, tail, grep, three RTK modes (rtk-read, rtk-log, rtk-err-cat), two real LLM summarizers (llm-summary-v1-haiku from Anthropic and llm-summary-v1-gpt-5-mini from OpenAI), and three hybrid routers. Each provider hands the same CI failure log to three debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini), and the resulting root-cause diagnoses are scored against AI-drafted, author-verified ground truths.
It optimizes for method ranking stability across model families, not "which LLM scored highest."
Across 35 real CI failure cases and 3 model families, the intersection of the per-family top-3 sets is {hybrid-grep-120k-rtk-tail, hybrid-grep-120k-tail}. The bottom-4 set is also stable across all three families.
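As a rough illustration of what "stable across families" means here, the sketch below intersects per-family top-k sets. It uses made-up scores for hypothetical methods `m1`..`m4`, not leaderboard values, and the helper names `top_k` and `stable_top_k` are assumptions rather than functions from this repo.

```python
from typing import Dict, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k highest-scoring method names (ties broken by name)."""
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return set(ranked[:k])


def stable_top_k(per_family: Dict[str, Dict[str, float]], k: int) -> Set[str]:
    """Intersect the top-k sets computed separately for each model family."""
    sets = [top_k(scores, k) for scores in per_family.values()]
    return set.intersection(*sets) if sets else set()


# Made-up scores: three hypothetical families, four hypothetical methods.
toy = {
    "family_a": {"m1": 0.70, "m2": 0.65, "m3": 0.60, "m4": 0.30},
    "family_b": {"m1": 0.68, "m2": 0.72, "m3": 0.40, "m4": 0.50},
    "family_c": {"m1": 0.64, "m2": 0.66, "m3": 0.61, "m4": 0.20},
}
print(sorted(stable_top_k(toy, k=3)))  # ['m1', 'm2']
```

Only `m1` and `m2` make every family's top 3, so they form the stable set; `m3` and `m4` each drop out in at least one family.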
Macro `diagnosis_score_v1_1`, aggregated with case-count weighting across the 35-case corpus:
| Rank | Method | Haiku 4.5 | Sonnet 4.6 | gpt-5-mini | Overall |
|---|---|---|---|---|---|
| 1 | hybrid-grep-120k-rtk-tail | 0.624 | 0.679 | 0.706 | 0.670 |
| 2 | hybrid-grep-120k-tail | 0.610 | 0.730 | 0.658 | 0.666 |
| 3 | llm-summary-v1-gpt-5-mini (new in v1.2; agent-loop #1 at 0.749) | 0.654 | 0.686 | 0.652 | 0.664 |
| 4 | grep | 0.578 | 0.684 | 0.655 | 0.639 |
| 5 | llm-summary-v1-haiku (promoted to headline in v1.1) | 0.583 | 0.704 | 0.608 | 0.632 |
| 6 | tail-200 | 0.595 | 0.624 | 0.623 | 0.614 |
| 7 | hybrid-grep-4k-rtk-err-cat (earlier 4k-threshold hybrid; replaced) | 0.552 | 0.597 | 0.571 | 0.573 |
| 8 | rtk-err-cat | 0.455 | 0.488 | 0.467 | 0.470 |
| 9 | raw | 0.324 | 0.368 | 0.367 | 0.353 |
| 10 | rtk-read | 0.329 | 0.369 | 0.349 | 0.349 |
| 11 | rtk-log | 0.238 | 0.262 | 0.249 | 0.249 |
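For readers unfamiliar with case-count weighting, here is a minimal sketch of the arithmetic. It assumes the weighting unit is a corpus split; the split names, means, and counts below are hypothetical and chosen only so the counts sum to 35. The actual aggregation lives in the evaluation tooling and may differ.

```python
from typing import Dict


def case_count_weighted_mean(
    split_means: Dict[str, float], split_counts: Dict[str, int]
) -> float:
    """Combine per-split mean scores, weighting each split by its case count."""
    total_cases = sum(split_counts.values())
    if total_cases == 0:
        raise ValueError("no cases to aggregate")
    weighted_sum = sum(split_means[s] * split_counts[s] for s in split_counts)
    return weighted_sum / total_cases


# Hypothetical splits: names and numbers are illustrative only.
means = {"split_a": 0.60, "split_b": 0.80}
counts = {"split_a": 20, "split_b": 15}

weighted = case_count_weighted_mean(means, counts)
unweighted = sum(means.values()) / len(means)
print(f"weighted={weighted:.3f} unweighted={unweighted:.3f}")
```

With these toy numbers the weighted mean (24/35, about 0.686) sits below the unweighted mean of 0.700, because the larger split has the lower score.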
The legacy llm-summary-v1-mock stub (used as the LLM-summary representative through v1.1) is retained as an appendix entry on the leaderboard, not in the headline. The top-2 hybrids replaced an earlier 4k-threshold hybrid that was overfit during methodology development. See the technical report (`reports/technical_report.md`) for the v1.2 paper, and `reports/legacy/e10_v1_3_to_v2_transition_study.md` for the original prototype-vs-formal corpus analysis.

Full leaderboard at https://logdx-bench.github.io/leaderboard.html.

| Resource | Link |
|---|---|
| 🏠 Homepage | https://logdx-bench.github.io/ |
| 📊 Leaderboard | https://logdx-bench.github.io/leaderboard.html |
| 📄 Full report | reports/technical_report.md |
| 📦 Cases corpus mirror | https://huggingface.co/datasets/eyuansu71/logdx-ci |
| 📋 Release notes | latest: RELEASE_NOTES_v1_2.md · history: RELEASE_NOTES.md (v1.0), v1.1.1, v1.1.2 |
| 📑 Cite | CITATION.cff · BibTeX |

```bash
git clone https://github.com/eyuansu62/LogDx.git
cd LogDx
# Each case lives under cases/<split>/<case_id>/{raw.log,case.json,
# ground_truth.json,tags.json,privacy_audit.json}. See the dataset
# card for the schema:
# https://huggingface.co/datasets/eyuansu71/logdx-ci
```

To reproduce a number from the leaderboard:

```bash
python3 tools/evaluate_diagnosis.py \
  --split v2/dev --diagnoser real-debugger-v3
# → results/v2/dev/eval_diagnosis_real-debugger-v3.json
```

For a fresh run that actually hits the OpenAI / Anthropic APIs (vs. cache replay), see the reproducibility section in RELEASE_NOTES.md.
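
If you want to poke at the corpus directly, the sketch below walks the `cases/<split>/<case_id>/` layout described above and loads the five files of each case. The file names come from that layout; the helper names `load_case` and `iter_cases` are hypothetical, and nothing is assumed about the keys inside each JSON file (see the dataset card for the schema).

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator

# JSON files expected in every case directory, per the layout above.
CASE_JSON_FILES = ("case.json", "ground_truth.json", "tags.json", "privacy_audit.json")


def load_case(case_dir: Path) -> Dict[str, Any]:
    """Load raw.log plus the four JSON files of a single case directory."""
    case_dir = Path(case_dir)
    case: Dict[str, Any] = {
        "case_id": case_dir.name,
        # CI logs can contain odd bytes, so decode leniently.
        "raw_log": (case_dir / "raw.log").read_text(encoding="utf-8", errors="replace"),
    }
    for name in CASE_JSON_FILES:
        # Key each payload by file stem, e.g. "ground_truth".
        case[Path(name).stem] = json.loads((case_dir / name).read_text(encoding="utf-8"))
    return case


def iter_cases(repo_root: Path, split: str) -> Iterator[Dict[str, Any]]:
    """Yield every case under cases/<split>/ in sorted directory order."""
    split_dir = Path(repo_root) / "cases" / split
    for case_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        yield load_case(case_dir)
```

From the repo root, `for case in iter_cases(Path("."), "v2/dev"): print(case["case_id"], len(case["raw_log"]))` would list each case with its raw log size.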
Current release: v1.2 (preprint). We'll add cases + model
families before calling it stable.
- 35 cases (target: 50+ with broader ecosystem coverage)
- Ground truth is AI-drafted + single-author verified (not independent human annotation)
- Three model families tested (Haiku / Sonnet / gpt-5-mini); adding GPT-4o, Gemini, and Llama is the highest-leverage follow-up
- 20 documented historical exclusions in `configs/historical_provider_error_exclusions.json` appear as zero-score abstentions in the eval denominator
Full caveats in the technical report §5.
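
To make the last caveat concrete, this sketch contrasts counting abstentions as zeros in the denominator with dropping them. The scores are made up, and the function names are hypothetical; it also assumes an exclusion corresponds to a single scored run, which is not spelled out here.

```python
from typing import Sequence


def mean_with_abstentions(scores: Sequence[float], n_abstained: int) -> float:
    """Mean over scored plus abstained runs; each abstention contributes 0."""
    denom = len(scores) + n_abstained
    return sum(scores) / denom if denom else 0.0


def mean_dropping_abstentions(scores: Sequence[float]) -> float:
    """Mean over scored runs only (abstentions removed from the denominator)."""
    return sum(scores) / len(scores) if scores else 0.0


# Made-up example: three scored runs and one abstention.
scored = [0.8, 0.6, 0.7]
print(round(mean_with_abstentions(scored, n_abstained=1), 3))  # 0.525
print(round(mean_dropping_abstentions(scored), 3))             # 0.7
```

Keeping abstentions in the denominator lowers the reported score, so a method cannot look better simply because some of its runs were excluded.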
```bibtex
@misc{qin2026logdx,
  title = {{LogDx-CI}: Benchmarking Log Reduction Tools
           for LLM Root-Cause Diagnosis},
  author = {Qin, Bowen},
  year = {2026},
  howpublished = {\url{https://github.com/eyuansu62/LogDx}},
  note = {v1.2 release; cases corpus at
          \url{https://huggingface.co/datasets/eyuansu71/logdx-ci}},
}
```

- Code (`tools/`, `examples/`, `schemas/`, `configs/`, `prompts/`, tests, scripts): Apache-2.0 (LICENSE)
- Data + reports + protocol locks (`cases/`, `results/`, `reports/`, `protocols/`, `docs/`): CC-BY-4.0 (LICENSE-DATA)

LogDx-CI benchmarks third-party log-reduction tools alongside its own baselines. Specifically:
- RTK (Rust Token Killer) by rtk-ai: the `rtk-read`, `rtk-log`, and `rtk-err-cat` baselines are three different invocations of the `rtk` CLI binary. The hybrid routers `hybrid-grep-120k-rtk-tail` and `hybrid-grep-4k-rtk-err-cat` use rtk's `err-cat` mode as an intermediate / fallback context provider. See `docs/methods/rtk.md` for setup + invocation details.

CI failure logs are sourced from publicly visible GitHub Actions runs. Diagnoses are produced by Claude (Anthropic) and gpt-5-mini (OpenAI).
New context-provider methods, debugger families, and case
contributions are welcome — see CONTRIBUTING.md
for the dev environment, repo layout, validator scripts, and the
"add a new method" checklist.
Bowen Qin · National University of Singapore · contact via GitHub Issues