LogDx-CI

A benchmark for log-reduction tools (RTK, grep, tail, hybrid routers, LLM summary): do they preserve enough evidence for LLM root-cause diagnosis?

LogDx-CI compares 11 context providers: raw, tail, grep, three RTK modes (rtk-read, rtk-log, rtk-err-cat), two real LLM summarizers (llm-summary-v1-haiku from Anthropic and llm-summary-v1-gpt-5-mini from OpenAI), and three hybrid routers. Each provider's output for the same CI failure log is handed to three debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini), and the resulting root-cause diagnoses are scored against AI-drafted, author-verified ground truths.

It optimizes for method ranking stability across model families, not "which LLM scored highest."

Headline finding

Across 35 real CI failure cases and 3 model families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini), the intersection of the per-family top-3 sets is {hybrid-grep-120k-rtk-tail, hybrid-grep-120k-tail}. The bottom-4 set is also stable across all three families.
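A minimal sketch of how a cross-family top-k intersection can be computed. The method names and scores below are hypothetical placeholders, not benchmark results, and the helper names `top_k` and `stable_top_k` are illustrative; the repository's own ranking code may differ.

```python
def top_k(scores: dict[str, float], k: int) -> set[str]:
    """Return the names of the k highest-scoring methods."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def stable_top_k(per_family: dict[str, dict[str, float]], k: int) -> set[str]:
    """Intersect the top-k sets across every model family."""
    return set.intersection(*(top_k(s, k) for s in per_family.values()))


# Hypothetical scores for illustration only (not LogDx-CI numbers).
per_family = {
    "family_a": {"m1": 0.70, "m2": 0.65, "m3": 0.60, "m4": 0.40},
    "family_b": {"m1": 0.68, "m2": 0.72, "m3": 0.50, "m4": 0.61},
    "family_c": {"m1": 0.64, "m2": 0.66, "m3": 0.63, "m4": 0.30},
}

stable = stable_top_k(per_family, k=3)
print(sorted(stable))
```

A method survives the intersection only if it ranks in the top k for every family, which is what makes the resulting set a statement about ranking stability rather than about any single model.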

Macro diagnosis_score_v1_1, aggregated with case-count weighting across the 35-case corpus:

Rank	Method	Haiku 4.5	Sonnet 4.6	gpt-5-mini	Overall
1	`hybrid-grep-120k-rtk-tail`	0.624	0.679	0.706	0.670
2	`hybrid-grep-120k-tail`	0.610	0.730	0.658	0.666
3	`llm-summary-v1-gpt-5-mini` (new in v1.2; agent-loop #1 at 0.749)	0.654	0.686	0.652	0.664
4	`grep`	0.578	0.684	0.655	0.639
5	`llm-summary-v1-haiku` (promoted to headline in v1.1)	0.583	0.704	0.608	0.632
6	`tail-200`	0.595	0.624	0.623	0.614
7	`hybrid-grep-4k-rtk-err-cat` (earlier 4k-threshold hybrid; replaced)	0.552	0.597	0.571	0.573
8	`rtk-err-cat`	0.455	0.488	0.467	0.470
9	`raw`	0.324	0.368	0.367	0.353
10	`rtk-read`	0.329	0.369	0.349	0.349
11	`rtk-log`	0.238	0.262	0.249	0.249
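As a rough illustration of case-count-weighted aggregation, the sketch below combines per-group scores using the number of cases behind each score. It assumes, without confirmation from the repository, that the Overall column weights the three family columns and that each family is scored on all 35 cases; the function name `case_weighted_mean` is hypothetical.

```python
def case_weighted_mean(scores: list[float], case_counts: list[int]) -> float:
    """Average the scores, weighting each by its number of cases."""
    total_cases = sum(case_counts)
    return sum(s * n for s, n in zip(scores, case_counts)) / total_cases


# Rank-1 row of the table (hybrid-grep-120k-rtk-tail):
# Haiku 4.5, Sonnet 4.6, gpt-5-mini.
row = [0.624, 0.679, 0.706]

# Assumption: every family covers the same 35 cases, so the weighted
# mean reduces to a plain mean of the three columns.
overall = case_weighted_mean(row, [35, 35, 35])
print(round(overall, 3))
```

Under these assumptions the result rounds to the 0.670 shown in the Overall column for that row. With unequal case counts the weighted and unweighted means would diverge, which is why the weighting scheme is worth stating.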

The legacy llm-summary-v1-mock stub (used as the LLM-summary representative through v1.1) is retained as an appendix entry on the leaderboard, not in the headline. The top-2 hybrids replaced an earlier 4k-threshold hybrid that was overfit during methodology development. See the technical report for the v1.2 paper, and reports/legacy/e10_v1_3_to_v2_transition_study.md for the original prototype-vs-formal corpus analysis.

Full leaderboard at https://logdx-bench.github.io/leaderboard.html.

Quick links


🏠 Homepage	https://logdx-bench.github.io/
📊 Leaderboard	https://logdx-bench.github.io/leaderboard.html
📄 Full report	`reports/technical_report.md`
📦 Cases corpus mirror	https://huggingface.co/datasets/eyuansu71/logdx-ci
📋 Release notes	latest: `RELEASE_NOTES_v1_2.md` · history: `RELEASE_NOTES.md` (v1.0), v1.1.1, v1.1.2
📑 Cite	`CITATION.cff` · BibTeX

Use the data

git clone https://github.com/eyuansu62/LogDx.git
cd LogDx

# Each case lives under cases/<split>/<case_id>/{raw.log,case.json,
# ground_truth.json,tags.json,privacy_audit.json}. See the dataset
# card for the schema:
# https://huggingface.co/datasets/eyuansu71/logdx-ci
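The per-case layout described in the comment above can be read with a few lines of Python. This is a sketch only: `load_case` is a hypothetical helper, not part of the repository's tooling, and it treats each JSON file as an opaque object because the schema is defined in the dataset card rather than here.

```python
import json
from pathlib import Path

# File names taken from the layout comment above.
JSON_FILES = ("case.json", "ground_truth.json", "tags.json", "privacy_audit.json")


def load_case(case_dir):
    """Load raw.log plus the JSON sidecar files of one case directory.

    Missing JSON files are returned as None instead of raising.
    """
    case_dir = Path(case_dir)
    case = {
        # CI logs can contain stray bytes, so decode leniently.
        "raw_log": (case_dir / "raw.log").read_text(encoding="utf-8", errors="replace")
    }
    for name in JSON_FILES:
        path = case_dir / name
        key = Path(name).stem  # e.g. "ground_truth"
        case[key] = json.loads(path.read_text(encoding="utf-8")) if path.exists() else None
    return case
```

Usage would look like `load_case("cases/<split>/<case_id>")`, with the placeholders replaced by a real split and case ID from the cloned repository.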

To reproduce a number from the leaderboard:

python3 tools/evaluate_diagnosis.py \
    --split v2/dev --diagnoser real-debugger-v3
# → results/v2/dev/eval_diagnosis_real-debugger-v3.json

For a fresh run that actually hits the OpenAI / Anthropic APIs (vs. cache replay), see the reproducibility section in RELEASE_NOTES.md.

Caveats

Current release: v1.2 (preprint). We'll add cases + model families before calling it stable.

35 cases (target: 50+ with broader ecosystem coverage)
Ground truth is AI-drafted + single-author verified (not independent human annotation)
Three model families tested (Haiku / Sonnet / gpt-5-mini); adding GPT-4o / Gemini / Llama is the highest-leverage follow-up
20 documented historical exclusions in configs/historical_provider_error_exclusions.json appear as zero-score abstentions in the eval denominator
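To make the last caveat concrete, the sketch below shows what "zero-score abstentions in the eval denominator" means arithmetically: an excluded case still counts toward the number of cases but contributes a score of 0. The case IDs, scores, and the name `mean_with_abstentions` are hypothetical and do not reflect the repository's evaluator.

```python
def mean_with_abstentions(scored: dict[str, float], all_case_ids: list[str]) -> float:
    """Mean score over all cases, counting unscored cases as 0.0."""
    return sum(scored.get(case_id, 0.0) for case_id in all_case_ids) / len(all_case_ids)


# Hypothetical example: four cases, one of which ("c4") was excluded.
scored = {"c1": 0.8, "c2": 0.6, "c3": 0.4}
all_ids = ["c1", "c2", "c3", "c4"]

score = mean_with_abstentions(scored, all_ids)   # (0.8 + 0.6 + 0.4 + 0.0) / 4
skipped = sum(scored.values()) / len(scored)     # excludes c4 from the denominator
print(score, skipped)
```

Keeping exclusions in the denominator lowers the reported score relative to dropping them, so a method is never rewarded for failing to produce an answer.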

Full caveats in the technical report §5.

Cite

@misc{qin2026logdx,
  title  = {{LogDx-CI}: Benchmarking Log Reduction Tools
           for LLM Root-Cause Diagnosis},
  author = {Qin, Bowen},
  year   = {2026},
  howpublished = {\url{https://github.com/eyuansu62/LogDx}},
  note   = {v1.2 release; cases corpus at
           \url{https://huggingface.co/datasets/eyuansu71/logdx-ci}},
}

License

Code (tools/, examples/, schemas/, configs/, prompts/, tests, scripts) — Apache-2.0 (LICENSE)
Data + reports + protocol locks (cases/, results/, reports/, protocols/, docs/) — CC-BY-4.0 (LICENSE-DATA)

Acknowledgements

LogDx-CI benchmarks third-party log-reduction tools alongside its own baselines. Specifically:

RTK (Rust Token Killer) by rtk-ai — the rtk-read, rtk-log, and rtk-err-cat baselines are three different invocations of the rtk CLI binary. The hybrid routers hybrid-grep-120k-rtk-tail and hybrid-grep-4k-rtk-err-cat use rtk's err-cat mode as an intermediate / fallback context provider. See docs/methods/rtk.md for setup + invocation details.

CI failure logs are sourced from publicly visible GitHub Actions runs. Diagnoses are produced by Claude (Anthropic) and gpt-5-mini (OpenAI).

Contributing

New context-provider methods, debugger families, and case contributions are welcome — see CONTRIBUTING.md for the dev environment, repo layout, validator scripts, and the "add a new method" checklist.

Contact

Bowen Qin · National University of Singapore · contact via GitHub Issues
