Judgment Stability Probing (JSP): Instrument, Data, and Analysis

This repository contains the complete instrument, raw data, analysis pipeline, and interactive report for the paper:

Judgment Stability Under Cultural Perturbation: Probing Eight Large Language Models for Framing Compliance

Declan Michaels (2026). Pre-registered on the Open Science Framework: osf.io/xnv5f

What This Measures

Judgment Stability Probing (JSP) presents a language model with pairs of concepts and asks it to rate their similarity on a 1 to 7 scale, then explain the relationship. The same pairs are rated under different framing conditions: no framing (baseline), four cultural framings ("In a collectivist society," etc.), and two nonsense framings ("In a geometric society," "In a glorbic society"). If the model has stable similarity judgments, they should not change in response to meaningless framing. JSP measures whether they do, how much, and what form the instability takes.

The instrument operates entirely through the API, requires no access to model internals, and produces quantitative measures (drift, Spearman rho, Procrustes distance, compliance rate) that can be compared across models and tracked over time.
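As a rough illustration of two of these measures, the sketch below computes Spearman rho and a simple drift score between baseline and framed ratings for the same pairs. It is not the code in build_report.py: the function names are hypothetical, and defining drift as the mean absolute rating change is an assumption made for this example (the paper gives the exact definitions).

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_abs_drift(baseline, framed):
    """Assumed drift definition: mean absolute rating change per pair."""
    return mean(abs(f - b) for b, f in zip(baseline, framed))


# Toy ratings for five pairs (invented for illustration, not real data).
baseline = [1, 2, 4, 6, 7]
framed = [2, 1, 4, 7, 6]
print(spearman_rho(baseline, framed), mean_abs_drift(baseline, framed))
```

A rho near 1 with nonzero drift would indicate that ratings shifted while their ordering was mostly preserved; a low rho indicates the framing reordered the similarity judgments.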

Key Findings (8 models, 390,654 valid ratings)

Every model tested produces geometric framing keywords in its explanations at rates between 19.4% and 80.6%. For one model (Grok 4.20), nonsense framing produces deeper similarity reordering than any cultural framing. The one model with always-on reasoning shows 80.6% geometric compliance; its non-reasoning counterpart shows 30.8%. In the main task, no model refuses nonsense framing or flags it as meaningless.

For full results, see the paper in experiments/rcp-v2/papers/ or the interactive report.

Repository Contents

experiments/rcp-v2/
  papers/
    rcp-v2-full-paper-draft.md    The paper (markdown)
    rcp-v2-paper.pdf              The paper (PDF with figures)
    rcp-v2-preregistration.md     Pre-registration document
    rcp-v2-preregistration.pdf    Pre-registration (PDF)
  stimuli/
    concepts.json                 54 concepts across 3 domains
    probes-v2.json                1,431 concept pairs
    templates/                    7 framing preamble templates
  config.json                    Experiment configuration
  parse.py                       Response parser (raw -> structured)
  build_report.py                Analysis pipeline (produces report.json)
  split_report.py                Splits report into lite + explanations
  test_build_report.py           Tests for the analysis pipeline
  report.html                    Interactive report viewer (entry point)
  report-lite.json               Pre-built report data (no explanation text)
  report.json                    Full report data (includes explanations)
  reports/                       Report viewer JS/CSS assets
  analysis/
    permutation_tests.py          Domain-level permutation tests
    factor_analysis.py            Factor analysis on response data
    embedding_validation.py       Concept inventory embedding validation

results/rcp-v2/                  Raw response data (16 runs: 8 models x 2 temperatures)
  anthropic_claude-sonnet-4-6/   Sonnet 4.6 (pre-registered)
  openai_gpt-5.4-mini/           GPT-5.4 Mini (pre-registered)
  google_gemini-2.5-flash/       Gemini 2.5 Flash (pre-registered)
  together_meta-llama_.../       Llama 3.3 70B (pre-registered)
  xai_grok-4-1-fast-.../         Grok 4.1 Fast (pre-registered)
  anthropic_claude-opus-4-6/     Opus 4.6 (exploratory)
  openai_gpt-5.4/                GPT-5.4 (exploratory)
  xai_grok-4.20/                 Grok 4.20 (exploratory, always-on reasoning)

runner/                          Shared execution engine (multi-vendor API adapters)
models/                          Model registry (vendor configs)
ui/                              Browser-based experiment runner UI

Each model directory under results/rcp-v2/ contains two timestamped run directories: one for temperature 0 (1 iteration, 10,017 calls) and one for temperature 0.7 (5 iterations for pre-registered models, 2 for exploratory). Each run contains responses.jsonl (one JSON object per API call) and run_meta.json (run parameters and counts).
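A minimal sketch of reading a run's responses.jsonl and averaging ratings per framing condition is shown below. The field names follow the record layout in the Data Format section; the function name is hypothetical, the template names other than "collectivist" are guesses, and representing an unparsed response as a null "parsed" field is an assumption for this example.

```python
import io
import json
from collections import defaultdict
from statistics import mean


def mean_rating_by_framing(lines):
    """Average parsed ratings per prompt_template, skipping unparsed rows."""
    ratings = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: rows that failed parsing carry a null "parsed" field.
        rating = (record.get("parsed") or {}).get("rating")
        if rating is not None:
            ratings[record["prompt_template"]].append(rating)
    return {framing: mean(values) for framing, values in ratings.items()}


# In-memory stand-in for a responses.jsonl file (values invented).
sample = io.StringIO(
    '{"prompt_template": "unframed", "parsed": {"rating": 4}}\n'
    '{"prompt_template": "unframed", "parsed": {"rating": 6}}\n'
    '{"prompt_template": "glorbic", "parsed": {"rating": 3}}\n'
    '{"prompt_template": "glorbic", "parsed": null}\n'
)
summary = mean_rating_by_framing(sample)
print(summary)
```

To run this against real data, pass an open file handle for a run's responses.jsonl in place of the in-memory sample.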

Quick Start: View the Report

Viewing the report requires only Python 3. The interactive report is pre-built from the raw data.

chmod +x view-report.sh
./view-report.sh

This opens the report in your browser. Use the temperature toggle to switch between temperature 0 and temperature 0.7 results. Sections include data quality, cluster validation, drift analysis, FSI heatmap, permutation tests, PCA analysis, compliance gradient, Procrustes alignment, variance comparison, explanation viewer, and temperature comparison.

Rebuilding the Report from Raw Data

To verify the analysis by rebuilding the report from the raw response data in results/:

pip install numpy scipy
cd experiments/rcp-v2
python3 build_report.py
python3 split_report.py          # produces report-lite.json + explanations.json
../../view-report.sh             # view the rebuilt report

Replicating the Experiment

To run the experiment against your own models, start.sh bootstraps a virtual environment with all dependencies and starts the experiment platform:

chmod +x start.sh
./start.sh

Open http://localhost:8002 in your browser. The UI walks you through API key configuration, model selection, and experiment execution. No manual pip installs are required.

Requirements: Python 3.10+, API keys for the vendors you want to test.

The experiment sends 10,017 API calls per model at temperature 0 (1,431 pairs x 7 framings x 1 iteration). A temperature 0.7 pass with 5 iterations adds 50,085 calls. Budget accordingly.
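The call counts above follow from simple multiplication, which the sketch below reproduces so you can estimate a budget for a different iteration count. The function names are hypothetical; the constants come from this README.

```python
N_PAIRS = 1431     # concept pairs (stimuli/probes-v2.json)
N_FRAMINGS = 7     # 1 unframed + 4 cultural + 2 nonsense


def calls_per_pass(iterations, n_pairs=N_PAIRS, n_framings=N_FRAMINGS):
    """API calls for one temperature pass: pairs x framings x iterations."""
    return n_pairs * n_framings * iterations


def calls_per_model(temp07_iterations):
    """Temperature 0 pass (1 iteration) plus a temperature 0.7 pass."""
    return calls_per_pass(1) + calls_per_pass(temp07_iterations)


print(calls_per_pass(1))    # 10017
print(calls_per_pass(5))    # 50085
print(calls_per_model(5))   # 60102 (pre-registered protocol)
print(calls_per_model(2))   # 30051 (exploratory protocol)
```

Multiply the per-model total by your vendor's per-call cost, which depends on token counts and is not estimated here.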

Concept Inventory

54 concepts across three domains, 18 per domain:

Physical: acceleration, amplitude, buoyancy, conduction, convection, crystallization, density, diffusion, elasticity, erosion, evaporation, friction, magnetism, oscillation, refraction, sublimation, turbulence, viscosity.

Institutional: arbitration, bureaucracy, census, citizenship, constitution, federation, jurisdiction, legislation, naturalization, parliament, prosecution, ratification, referendum, regulation, republic, sovereignty, tariff, taxation.

Moral: altruism, compassion, conscience, courage, devotion, dignity, forgiveness, generosity, gratitude, honesty, honor, humility, integrity, loyalty, obedience, sacrifice, tolerance, wisdom.

The inventory was validated using dual-model embedding analysis and pilot cluster accuracy data. In the main experiment, hierarchical clustering recovers the three-domain structure with 92.6% to 100.0% accuracy across all eight models.
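The 1,431 pairs match the number of unordered pairs that can be drawn from 54 concepts. The sketch below enumerates them from the lists above and splits them into within-domain and cross-domain pairs. It assumes the probe set is exactly the set of all unordered pairs, which is consistent with the count but should be confirmed against stimuli/probes-v2.json; the variable names are hypothetical.

```python
from itertools import combinations

DOMAINS = {
    "physical": (
        "acceleration amplitude buoyancy conduction convection crystallization "
        "density diffusion elasticity erosion evaporation friction magnetism "
        "oscillation refraction sublimation turbulence viscosity"
    ).split(),
    "institutional": (
        "arbitration bureaucracy census citizenship constitution federation "
        "jurisdiction legislation naturalization parliament prosecution "
        "ratification referendum regulation republic sovereignty tariff taxation"
    ).split(),
    "moral": (
        "altruism compassion conscience courage devotion dignity forgiveness "
        "generosity gratitude honesty honor humility integrity loyalty "
        "obedience sacrifice tolerance wisdom"
    ).split(),
}

# Map each concept to its domain, then enumerate every unordered pair.
domain_of = {c: d for d, concepts in DOMAINS.items() for c in concepts}
pairs = list(combinations(sorted(domain_of), 2))

within = [p for p in pairs if domain_of[p[0]] == domain_of[p[1]]]
cross = [p for p in pairs if domain_of[p[0]] != domain_of[p[1]]]

print(len(pairs), len(within), len(cross))  # 1431 459 972
```

The within-domain versus cross-domain split is the kind of structure that clustering on the ratings would be expected to recover.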

Framing Conditions

Each condition prepends a single sentence to the probe. The prompt contains no instruction to adopt, inhabit, or role-play any perspective.

Unframed (no preamble, baseline)
"In an individualist society."
"In a collectivist society."
"In a hierarchical society."
"In an egalitarian society."
"In a geometric society." (interpretable nonsense)
"In a glorbic society." (uninterpretable nonsense)

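A minimal sketch of how a framing preamble could be prepended to a probe is shown below. The actual templates live in stimuli/templates/; the dictionary keys (other than "collectivist", which appears in the data format), the probe wording, and the single-space join are all assumptions made for illustration.

```python
# Preamble sentences as listed above; keys are hypothetical template names.
FRAMING_PREAMBLES = {
    "unframed": "",
    "individualist": "In an individualist society.",
    "collectivist": "In a collectivist society.",
    "hierarchical": "In a hierarchical society.",
    "egalitarian": "In an egalitarian society.",
    "geometric": "In a geometric society.",
    "glorbic": "In a glorbic society.",
}


def build_prompt(framing, probe):
    """Prepend the framing sentence to the probe; unframed returns the probe as-is."""
    preamble = FRAMING_PREAMBLES[framing]
    return f"{preamble} {probe}" if preamble else probe


# Hypothetical probe wording, not the text used in the experiment.
probe = "Rate the similarity of 'friction' and 'viscosity' from 1 to 7, then explain."
print(build_prompt("unframed", probe))
print(build_prompt("glorbic", probe))
```

Keeping the probe text identical across conditions means any change in ratings can be attributed to the one prepended sentence.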
Models Tested

Model	Vendor	Status	Reasoning	Temp 0.7 Iterations
Sonnet 4.6	Anthropic	pre-registered	off	5
GPT-5.4 Mini	OpenAI	pre-registered	off	5
Gemini 2.5 Flash	Google	pre-registered	off	5
Llama 3.3 70B	Together (Meta)	pre-registered	off	5
Grok 4.1 Fast	xAI	pre-registered	off	5
Opus 4.6	Anthropic	exploratory	off	2
GPT-5.4	OpenAI	exploratory	off	2
Grok 4.20	xAI	exploratory	always on	2

Data Format

Each line in responses.jsonl is a JSON object:

{
  "experiment": "rcp-v2",
  "model": "claude-sonnet-4-6",
  "vendor": "anthropic",
  "timestamp": "2026-04-14T03:27:20Z",
  "stimulus_id": "phys_01",
  "stimulus_text": "...",
  "prompt_template": "collectivist",
  "iteration": 1,
  "raw_response": "Rating: 3\nExplanation: ...",
  "parsed": {
    "rating": 3,
    "explanation": "..."
  },
  "meta": {
    "tokens_in": 245,
    "tokens_out": 180,
    "latency_ms": 1200,
    "temperature": 0.7,
    "max_tokens": 500
  }
}
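The "parsed" object is derived from "raw_response". The sketch below shows one way such a response could be parsed, assuming the "Rating: N" and "Explanation: ..." layout shown in the example record. It is not the repository's parse.py; the function name, the regular expressions, and the choice to return None for out-of-range or missing ratings are assumptions.

```python
import re

# Accept a single digit 1-7 after "Rating:"; \b rejects values such as 10.
RATING_RE = re.compile(r"Rating:\s*([1-7])\b")
# Capture everything after "Explanation:", including later newlines.
EXPLANATION_RE = re.compile(r"Explanation:\s*(.*)", re.DOTALL)


def parse_response(raw):
    """Return {"rating", "explanation"}, or None when no valid rating is found."""
    rating_match = RATING_RE.search(raw)
    if rating_match is None:
        return None
    explanation_match = EXPLANATION_RE.search(raw)
    explanation = explanation_match.group(1).strip() if explanation_match else ""
    return {"rating": int(rating_match.group(1)), "explanation": explanation}


print(parse_response("Rating: 3\nExplanation: Both involve resistance to flow."))
print(parse_response("I cannot rate this pair."))
```

Returning None rather than raising lets a caller count invalid responses separately from valid ratings.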

Methodology Note

This research was conducted with AI assistance (Claude, Anthropic). The methodology, analysis decisions, and writing reflect the author's judgment. AI tools were used for implementation, literature review, cross-model review, and drafting. This disclosure is made in accordance with the author's position that explicit AI-assisted methodology acknowledgment is more honest than omission.

License

Apache License 2.0. See LICENSE.

Contact

Declan Michaels, declan@moral-os.com, linkedin.com/in/declanmichaels
