Reliability Inference Drives Cue Extraction in LLMs

When do large language models compute answers versus extract them from cues in external reasoning traces?

This repository contains all experimental code, data, and analysis for studying how LLMs process external reasoning traces under varying reliability conditions.


Overview

Large language models increasingly consume external reasoning traces—retrieved solutions, tool logs, or multi-agent transcripts. We study when such traces support computation and when models instead shift toward cue extraction.

Key Findings

  • Consumer model determines vulnerability: The model reading the trace (not the one that generated it) determines susceptibility to misleading cues
  • Corruption increases cue extraction: Wrong-cue following rises from 0.5% to 8.5% as trace corruption increases (GPT-4o)
  • Naturalistic formats amplify vulnerability: Tool-log traces produce ~3× higher cue extraction (29.1%) vs standard format (10.1%)
  • Cue style doesn't matter: Implicit cues embedded in natural-language conclusions are as effective as explicit "Answer: X" markers
  • Discrete trust labels backfire: "Verified" tags increase cue following (19.1%) vs "Unverified" (11.1%)
  • Semantic corruption dominates: Meaning-altering corruptions (especially quantity shifts) are most effective
  • Defenses work: Simple verification prompts reduce cue following from 12% to 0%

Important Notice

History experiments (history/) are retained for transparency and are not used for the main claims. All reported results are computed from the Final experiments unless explicitly stated.


Repository Structure

cot-reliability-gating/
├── data/                 # Shared datasets (problems, traces)
├── experiments/          # Final experiment results
│   ├── final/            # Main experiments (A-S series)
│   └── N_naturalistic/   # Extension experiments (N series)
├── figures/              # Paper figures (PDF/PNG)
├── history/              # Superseded experiments (transparency only)
├── notebook/             # Jupyter notebooks (34 notebooks)
├── docs/                 # Documentation
├── EXPERIMENT_INDEX.md   # Complete experiment catalog
└── CHANGELOG.md          # Version history

Experimental Summary

| Category | Experiments | Trials | Purpose |
|---|---|---|---|
| A: Baseline | 4 | 13,532 | Establish base phenomena |
| B: Mechanism | 7 | 25,476 | Identify causal factors |
| C: Generator Bias | 1 | 3,184 | Producer vs consumer effects |
| D: Length | 1 | 2,388 | Trace length analysis |
| E: Task Generalization | 5 | 10,400 | Cross-task validation |
| F: Attack/Future | 2 | 3,188 | Security implications |
| G: Defense | 2 | 5,776 | Mitigation strategies |
| H: Analysis | 1 | derived | Quantitative modeling |
| S: Sensitivity | 1 | 60 | Edge cases |
| N: Naturalistic Extension | 4 | 7,164 | Format & cue style effects |
| Final Total | 28 | 71,168 | |
  • History experiments: 4 (C1, E1v2, E3v2, F2) — 7,192 trials
  • Grand total: 32 experiments, 78,360 trials

N Series: Naturalistic Extension Experiments

| ID | Name | Trials | Key Finding |
|---|---|---|---|
| N1 | Naturalistic Trace | 3,184 | Tool-log format amplifies vulnerability ~3× |
| N2 | Implicit Cue | 1,194 | Implicit = explicit cue effect (p=0.75) |
| N3 | Reliability Tags | 2,388 | "Verified" labels increase cue following |
| N1C | Instrumentation | 398 (+6 retry) | Claude measurement integrity confirmed |

Trial Definition

A trial is a single evaluation instance recorded as one row in *_results.json. Each trial corresponds to one API call to a consumer model.
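The trial definition above can be sketched in code. This is a minimal illustration, assuming each `*_results.json` file is a JSON array of per-trial records; the field names (`trial_id`, `model`) and the `demo_results.json` filename are hypothetical, not part of the repository's documented schema.

```python
import json
from pathlib import Path

def count_trials(results_path):
    """Count trials in a *_results.json file.

    Assumes the file is a JSON array where each element is one
    trial record (i.e., one API call to a consumer model).
    """
    with open(results_path) as f:
        records = json.load(f)
    return len(records)

# Hypothetical example: a results file containing three trial records.
sample = [{"trial_id": i, "model": "gpt-4o"} for i in range(3)]
Path("demo_results.json").write_text(json.dumps(sample))
print(count_trials("demo_results.json"))  # → 3
```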


Models

  • Claude Sonnet 4 (claude-sonnet-4-20250514)
  • GPT-4o (gpt-4o)

Quick Start

Requirements

```bash
pip install anthropic openai pandas numpy scipy matplotlib seaborn tqdm
```

Running Experiments

Notebooks are in notebook/. Each experiment has a corresponding notebook (e.g., B1b_lambda_sweep_v1.ipynb).

See docs/REPRODUCIBILITY.md for detailed instructions.


Reproducibility

  • All experiments use fixed random seeds
  • Temperature = 0 for all API calls
  • Complete data and code available in this repository
  • CI-verified pipeline (deterministic results)
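The fixed-seed pattern behind these guarantees can be sketched as follows. The seed value and the sampling helper are illustrative only; the actual per-experiment seeds live in the notebooks.

```python
import random

SEED = 42  # hypothetical value; actual seeds are set per experiment

def seeded_sample(seed=SEED, k=5):
    """Draw a reproducible sample of problem indices.

    With a fixed seed for sampling (and temperature=0 for all API
    calls), re-running the pipeline yields identical trial sets.
    """
    rng = random.Random(seed)  # local RNG avoids global-state coupling
    return rng.sample(range(100), k)

# Two independent runs with the same seed produce identical draws.
assert seeded_sample() == seeded_sample()
```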

Citation

@article{hideki2026reliability,
  author = {HIDEKI},
  title = {Reliability Inference Drives Cue Extraction in Large Language Models 
           Consuming External Reasoning Traces},
  year = {2026},
  note = {Preprint available at Zenodo},
  doi = {10.5281/zenodo.18203731}
}

Versions

| Version | Date | Experiments | Trials | Notes |
|---|---|---|---|---|
| v3.0.0 | 2026-01-09 | 24 | 64,004 | Initial submission version |
| v4.0.0 | 2026-01-10 | 28 | 71,168 | Added N series (naturalistic extension) |

Author

HIDEKI
Independent Researcher, Japan
ORCID: 0009-0002-0019-6608
Email: hideki@r3776.jp


License

MIT License
