When do large language models compute answers versus extract them from cues in external reasoning traces?
This repository contains all experimental code, data, and analysis for studying how LLMs process external reasoning traces under varying reliability conditions.
Large language models increasingly consume external reasoning traces—retrieved solutions, tool logs, or multi-agent transcripts. We study when such traces support computation and when models instead shift toward cue extraction.
- Consumer model determines vulnerability: The model reading the trace (not the one that generated it) determines susceptibility to misleading cues
- Corruption increases cue extraction: Wrong-cue following rises from 0.5% to 8.5% as trace corruption increases (GPT-4o)
- Naturalistic formats amplify vulnerability: Tool-log traces produce ~3× higher cue extraction (29.1%) vs standard format (10.1%)
- Cue style doesn't matter: Implicit cues in natural conclusions are as effective as explicit "Answer: X" markers
- Discrete trust labels backfire: "Verified" tags increase cue following (19.1%) vs "Unverified" (11.1%)
- Semantic corruption dominates: Meaning-altering corruptions (especially quantity shifts) are most effective
- Defenses work: Simple verification prompts reduce cue following from 12% to 0%
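The percentages above are simple proportions over trials. As a minimal sketch, a cue-following rate could be computed from trial records like this (the field names here are hypothetical, not the repository's actual `*_results.json` schema):

```python
def cue_following_rate(trials):
    """Fraction of trials in which the model reproduced a wrong cued answer.

    Field names ("model_answer", "cue_answer", "correct_answer") are
    illustrative; the real result-file schema may differ.
    """
    followed = sum(
        1
        for t in trials
        if t["model_answer"] == t["cue_answer"]
        and t["cue_answer"] != t["correct_answer"]
    )
    return followed / len(trials)

# Two trials: one follows the wrong cue, one computes the right answer.
trials = [
    {"model_answer": "42", "cue_answer": "42", "correct_answer": "7"},
    {"model_answer": "7", "cue_answer": "42", "correct_answer": "7"},
]
print(cue_following_rate(trials))  # → 0.5
```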
History experiments (history/) are retained for transparency and are not used for the main claims. All reported results are computed from the Final experiments unless explicitly stated.
```
cot-reliability-gating/
├── data/                 # Shared datasets (problems, traces)
├── experiments/          # Final experiment results
│   ├── final/            # Main experiments (A-S series)
│   └── N_naturalistic/   # Extension experiments (N series)
├── figures/              # Paper figures (PDF/PNG)
├── history/              # Superseded experiments (transparency only)
├── notebook/             # Jupyter notebooks (34 notebooks)
├── docs/                 # Documentation
├── EXPERIMENT_INDEX.md   # Complete experiment catalog
└── CHANGELOG.md          # Version history
```
| Category | Experiments | Trials | Purpose |
|---|---|---|---|
| A: Baseline | 4 | 13,532 | Establish base phenomena |
| B: Mechanism | 7 | 25,476 | Identify causal factors |
| C: Generator Bias | 1 | 3,184 | Producer vs consumer effects |
| D: Length | 1 | 2,388 | Trace length analysis |
| E: Task Generalization | 5 | 10,400 | Cross-task validation |
| F: Attack/Future | 2 | 3,188 | Security implications |
| G: Defense | 2 | 5,776 | Mitigation strategies |
| H: Analysis | 1 | derived | Quantitative modeling |
| S: Sensitivity | 1 | 60 | Edge cases |
| N: Naturalistic Extension | 4 | 7,164 | Format & cue style effects |
| Final Total | 28 | 71,168 | |
- History experiments: 4 (C1, E1v2, E3v2, F2) — 7,192 trials
- Grand total: 32 experiments, 78,360 trials
| ID | Name | Trials | Key Finding |
|---|---|---|---|
| N1 | Naturalistic Trace | 3,184 | Tool-log format amplifies vulnerability ~3× |
| N2 | Implicit Cue | 1,194 | Implicit ≈ explicit cue effect (p=0.75) |
| N3 | Reliability Tags | 2,388 | "Verified" labels increase cue following |
| N1C | Instrumentation | 398 (+6 retry) | Claude measurement integrity confirmed |
A trial is a single evaluation instance recorded as one row in a `*_results.json` file. Each trial corresponds to one API call to a consumer model.
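Because each trial is one JSON row, a trial count is just the row count. A self-contained sketch with a synthetic file (the real result files live under experiments/ and their schema may differ):

```python
import json
import tempfile

# Synthetic stand-in for a *_results.json file: one dict per trial.
rows = [{"trial_id": i, "consumer_model": "gpt-4o"} for i in range(398)]

with tempfile.NamedTemporaryFile("w+", suffix="_results.json") as f:
    json.dump(rows, f)
    f.seek(0)
    trials = json.load(f)

print(len(trials))  # → 398
```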
- Claude Sonnet 4 (`claude-sonnet-4-20250514`)
- GPT-4o (`gpt-4o`)

```
pip install anthropic openai pandas numpy scipy matplotlib seaborn tqdm
```

Notebooks are in `notebook/`. Each experiment has a corresponding notebook (e.g., `B1b_lambda_sweep_v1.ipynb`).
See docs/REPRODUCIBILITY.md for detailed instructions.
- All experiments use fixed random seeds
- Temperature = 0 for all API calls
- Complete data and code available in this repository
- CI-verified pipeline (deterministic results)
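The fixed-seed policy can be illustrated with a small sketch (the function name and sampling scheme are illustrative, not the repository's actual code):

```python
import random

def sample_problems(problems, k, seed=0):
    # A dedicated Random instance keyed by the seed makes the draw
    # reproducible without touching global random state.
    rng = random.Random(seed)
    return rng.sample(problems, k)

run1 = sample_problems(list(range(100)), 5, seed=42)
run2 = sample_problems(list(range(100)), 5, seed=42)
assert run1 == run2  # identical selection on every run
```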
```bibtex
@article{hideki2026reliability,
  author = {HIDEKI},
  title  = {Reliability Inference Drives Cue Extraction in Large Language Models
            Consuming External Reasoning Traces},
  year   = {2026},
  note   = {Preprint available at Zenodo},
  doi    = {10.5281/zenodo.18203731}
}
```

| Version | Date | Experiments | Trials | Notes |
|---|---|---|---|---|
| v3.0.0 | 2026-01-09 | 24 | 64,004 | Initial submission version |
| v4.0.0 | 2026-01-10 | 28 | 71,168 | Added N series (naturalistic extension) |
HIDEKI
Independent Researcher, Japan
ORCID: 0009-0002-0019-6608
Email: hideki@r3776.jp
MIT License