Reliability Inference Drives Cue Extraction in LLMs

When do large language models compute answers versus extract them from cues in external reasoning traces?

This repository contains all experimental code, data, and analysis for studying how LLMs process external reasoning traces under varying reliability conditions.


Overview

Large language models increasingly consume external reasoning traces—retrieved solutions, tool logs, or multi-agent transcripts. We study when such traces support computation and when models instead shift toward cue extraction.

Key Findings

  • Consumer model determines vulnerability: The model reading the trace (not the one that generated it) determines susceptibility to misleading cues
  • Corruption increases cue extraction: Wrong-cue following rises from 0.5% to 8.5% as trace corruption increases (GPT-4o)
  • Naturalistic formats amplify vulnerability: Tool-log traces produce ~3× higher cue extraction (29.1%) vs standard format (10.1%)
  • Cue style doesn't matter: Implicit cues embedded in natural-language conclusions are as effective as explicit "Answer: X" markers
  • Discrete trust labels backfire: "Verified" tags increase cue following (19.1%) vs "Unverified" (11.1%)
  • Semantic corruption dominates: Meaning-altering corruptions (especially quantity shifts) are most effective
  • Defenses work: Simple verification prompts reduce cue following from 12% to 0%

Important Notice

History experiments (history/) are retained for transparency and are not used for the main claims. All reported results are computed from the Final experiments unless explicitly stated.


Repository Structure

cot-reliability-gating/
├── data/                 # Shared datasets (problems, traces)
├── experiments/          # Final experiment results
│   ├── final/            # Main experiments (A-S series)
│   └── N_naturalistic/   # Extension experiments (N series)
├── figures/              # Paper figures (PDF/PNG)
├── history/              # Superseded experiments (transparency only)
├── notebook/             # Jupyter notebooks (34 notebooks)
├── docs/                 # Documentation
├── EXPERIMENT_INDEX.md   # Complete experiment catalog
└── CHANGELOG.md          # Version history

Experimental Summary

| Category | Experiments | Trials | Purpose |
|---|---|---|---|
| A: Baseline | 4 | 13,532 | Establish base phenomena |
| B: Mechanism | 7 | 25,476 | Identify causal factors |
| C: Generator Bias | 1 | 3,184 | Producer vs consumer effects |
| D: Length | 1 | 2,388 | Trace length analysis |
| E: Task Generalization | 5 | 10,400 | Cross-task validation |
| F: Attack/Future | 2 | 3,188 | Security implications |
| G: Defense | 2 | 5,776 | Mitigation strategies |
| H: Analysis | 1 | derived | Quantitative modeling |
| S: Sensitivity | 1 | 60 | Edge cases |
| N: Naturalistic Extension | 4 | 7,164 | Format & cue style effects |
| Final Total | 28 | 71,168 | |
  • History experiments: 4 (C1, E1v2, E3v2, F2) — 7,192 trials
  • Grand total: 32 experiments, 78,360 trials

N Series: Naturalistic Extension Experiments

| ID | Name | Trials | Key Finding |
|---|---|---|---|
| N1 | Naturalistic Trace | 3,184 | Tool-log format amplifies vulnerability ~3× |
| N2 | Implicit Cue | 1,194 | Implicit = explicit cue effect (p=0.75) |
| N3 | Reliability Tags | 2,388 | "Verified" labels increase cue following |
| N1C | Instrumentation | 398 (+6 retry) | Claude measurement integrity confirmed |

Trial Definition

A trial is a single evaluation instance recorded as one row in *_results.json. Each trial corresponds to one API call to a consumer model.
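The trial definition above can be sketched in code. This is a minimal illustration, assuming each `*_results.json` file is a JSON array of per-trial records; the field names (`trial_id`, `model`) and the `demo_results.json` filename are hypothetical, not part of the repository's documented schema.

```python
import json
from pathlib import Path

def count_trials(results_path):
    """Count trials in a *_results.json file.

    Assumes the file is a JSON array where each element is one
    trial record (i.e., one API call to a consumer model).
    """
    with open(results_path) as f:
        records = json.load(f)
    return len(records)

# Hypothetical example: a results file containing three trial records.
sample = [{"trial_id": i, "model": "gpt-4o"} for i in range(3)]
Path("demo_results.json").write_text(json.dumps(sample))
print(count_trials("demo_results.json"))  # → 3
```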


Models

  • Claude Sonnet 4 (claude-sonnet-4-20250514)
  • GPT-4o (gpt-4o)

Quick Start

Requirements

```bash
pip install anthropic openai pandas numpy scipy matplotlib seaborn tqdm
```

Running Experiments

Notebooks are in notebook/. Each experiment has a corresponding notebook (e.g., B1b_lambda_sweep_v1.ipynb).

See docs/REPRODUCIBILITY.md for detailed instructions.


Reproducibility

  • All experiments use fixed random seeds
  • Temperature = 0 for all API calls
  • Complete data and code available in this repository
  • CI-verified pipeline (deterministic results)
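The fixed-seed pattern behind these guarantees can be sketched as follows. The seed value and the sampling helper are illustrative only; the actual per-experiment seeds live in the notebooks.

```python
import random

SEED = 42  # hypothetical value; actual seeds are set per experiment

def seeded_sample(seed=SEED, k=5):
    """Draw a reproducible sample of problem indices.

    With a fixed seed for sampling (and temperature=0 for all API
    calls), re-running the pipeline yields identical trial sets.
    """
    rng = random.Random(seed)  # local RNG avoids global-state coupling
    return rng.sample(range(100), k)

# Two independent runs with the same seed produce identical draws.
assert seeded_sample() == seeded_sample()
```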

Citation

@article{hideki2026reliability,
  author = {HIDEKI},
  title = {Reliability Inference Drives Cue Extraction in Large Language Models 
           Consuming External Reasoning Traces},
  year = {2026},
  note = {Preprint available at Zenodo},
  doi = {10.5281/zenodo.18203731}
}

Versions

| Version | Date | Experiments | Trials | Notes |
|---|---|---|---|---|
| v3.0.0 | 2026-01-09 | 24 | 64,004 | Initial submission version |
| v4.0.0 | 2026-01-10 | 28 | 71,168 | Added N series (naturalistic extension) |

Author

HIDEKI
Independent Researcher, Japan
ORCID: 0009-0002-0019-6608
Email: hideki@r3776.jp


License

MIT License
