An evaluation-only research prototype testing whether frontier LLMs reason better on real Pokémon Showdown battle telemetry than on shuffled versions of the same data.
Follow-up work: Project Ditto v2 — applying this methodology to programming-task trajectories.
Real battle chains have causal consistency (HP trajectories, PP depletion, hidden-information reveals) that models can exploit. Shuffled chains violate this consistency. A real-vs-shuffled gap ≥ 0.05 on top-3 action-match rate (p < 0.05) is the pre-registered success threshold.
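As a concrete illustration, here is a minimal sketch of the headline metric; the function and argument names are assumptions for illustration, not the repo's actual API:

```python
# Illustrative sketch of the pre-registered metric. Names are assumptions,
# not the repo's actual API.
def top3_match(ranked_actions: list[str], actual: str) -> bool:
    """True if the ground-truth action appears in the model's top-3 picks."""
    return actual in ranked_actions[:3]

def real_vs_shuffled_gap(real_preds, real_truth, shuf_preds, shuf_truth) -> float:
    """Top-3 action-match rate on real chains minus the rate on shuffled chains."""
    real_rate = sum(top3_match(p, t) for p, t in zip(real_preds, real_truth)) / len(real_truth)
    shuf_rate = sum(top3_match(p, t) for p, t in zip(shuf_preds, shuf_truth)) / len(shuf_truth)
    return real_rate - shuf_rate  # pre-registered success: gap >= 0.05 at p < 0.05
```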
A post-hoc methodological review found that the original scoring applied unpaired statistical tests (a two-sample proportion z-test for Layer 1, Welch's t-test for Layer 2) to data that is inherently paired: each real chain has corresponding shuffled variants produced from the same source match. The review also flagged the absence of a multiple-comparisons correction across the two primary model cells.
The corrected analysis applies McNemar's test (Layer 1), a paired t-test (Layer 2), and Bonferroni correction across the two primary cells (Haiku, Sonnet). It preserves both published findings:
- Sonnet 4.6: gap 0.206, Bonferroni-corrected p ≪ 10⁻²¹² (strong_positive, unchanged)
- Haiku 4.5: gap 0.066, Bonferroni-corrected p ≪ 10⁻²⁵ (moderate_positive, unchanged)
Both findings clear the pre-registered minimum publishable threshold by substantial margins. The methodology correction does not change the qualitative or quantitative headline of this study; it provides statistically appropriate inference for paired data.
Implementation: src/scorer_corrected.py
Full comparison: CORRECTED_SCORING.md
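A minimal sketch of this paired inference follows, assuming NumPy boolean arrays for Layer-1 hits and float arrays for Layer-2 scores, aligned per source match. Function and argument names are illustrative; src/scorer_corrected.py is the actual implementation:

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

def corrected_pvalues(real_hit, shuf_hit, real_score, shuf_score, n_cells=2):
    """real_hit/shuf_hit: paired bool arrays (Layer 1);
    real_score/shuf_score: paired float arrays (Layer 2);
    n_cells: number of primary model cells for Bonferroni (Haiku, Sonnet)."""
    # Layer 1: McNemar's test on the 2x2 table of paired hit/miss outcomes.
    table = [
        [np.sum(real_hit & shuf_hit),  np.sum(real_hit & ~shuf_hit)],
        [np.sum(~real_hit & shuf_hit), np.sum(~real_hit & ~shuf_hit)],
    ]
    p_layer1 = mcnemar(table, exact=False, correction=True).pvalue

    # Layer 2: paired t-test on per-pair score differences.
    p_layer2 = ttest_rel(real_score, shuf_score).pvalue

    # Bonferroni: scale each p-value by the number of primary cells.
    return min(p_layer1 * n_cells, 1.0), min(p_layer2 * n_cells, 1.0)
```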
Quickstart:

```bash
pip install -r requirements.txt

# Generate synthetic data (no internet needed)
python scripts/generate_synthetic_data.py --n 2500 --out data/raw/

# Build chains + reference distribution (the reference builds directly from raw logs; chains not required)
python scripts/build_chains.py --data data/raw/ --out-real chains/real/ --out-shuffled chains/shuffled/
python scripts/build_reference.py build-raw --raw data/raw/ --out data/reference_dist.pkl

# Dry-run evaluation (no API key needed)
python -m src.runner --model haiku --chains chains/real/ --seed 42 --dry-run --n 5

# Full evaluation (requires ANTHROPIC_API_KEY in .env)
python -m src.runner --model haiku --chains chains/real/ --seed 42
python -m src.runner --model opus --chains chains/real/ --seed 42
```
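The full-evaluation step reads ANTHROPIC_API_KEY from a .env file; a minimal example with a placeholder value (exact location and loading convention are assumptions):

```
# .env — placeholder, not a real key
ANTHROPIC_API_KEY=sk-ant-xxxxxxxx
```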
Pipeline:

```
data/raw/*.jsonl → scripts/build_chains.py → chains/real/ + chains/shuffled/
        ↓
scripts/build_reference.py → data/reference_dist.pkl
        ↓
src/runner.py → results/raw/
        ↓
src/scorer.py → results/scored.json
```
See CLAUDE.md for full architecture documentation and showdown prototype spec v1.pdf for the complete research specification.

Run the test suite with:

```bash
pytest tests/
```