OOD deception probes via subspaces.
- Small transferable subspace exists
- Source/target rankings fail to recover it
- LLM judge partially recovers the subspace
Install the Python dependencies:
pip install -r requirements.txtSet:
export OPENAI_API_KEY=...Download the activations:
mkdir -p data/deception-activations
huggingface-cli download Yooniel/deception-activations \
--repo-type dataset \
--include "roleplaying__plain*" \
--include "insider_trading__upscale*" \
--include "insider_trading_doubledown__upscale*" \
--include "sandbagging_v2__wmdp_mmlu*" \
--local-dir data/deception-activationsGenerated roleplaying interpretation results are available on Huggingface:
Train a linear probe on the source dataset and evaluate it on target datasets.
python scripts/baseline_full.py \
--source roleplaying__plain \
--targets insider_trading__upscale insider_trading_doubledown__upscale sandbagging_v2__wmdp_mmluTrain a linear probe using target labels in source PCA basis.
python scripts/baseline_target_trained.py \
--source roleplaying__plain \
--targets insider_trading__upscale insider_trading_doubledown__upscale sandbagging_v2__wmdp_mmluGreedily select PCs using target validation, then evaluate on held-out target data.
python scripts/greedy_cv.py \
--source roleplaying__plain \
--targets insider_trading__upscale insider_trading_doubledown__upscale sandbagging_v2__wmdp_mmlu \
--output greedy_results.jsonGreedily select PCs using full target dataset.
python scripts/greedy_cv.py \
--source roleplaying__plain \
--targets insider_trading__upscale insider_trading_doubledown__upscale sandbagging_v2__wmdp_mmlu \
--selection-scope full_targetTrain target-label probes, and rank source PCs by weight contribution.
python scripts/hypothesis_target_pca.py \
--source roleplaying__plain \
--targets insider_trading__upscale insider_trading_doubledown__upscale sandbagging_v2__wmdp_mmluTrain a source probe on chosen PCs and evaluate it on targets.
python scripts/control_chosen_PCs.py \
--source roleplaying__plain \
--targets insider_trading__upscale insider_trading_doubledown__upscale sandbagging_v2__wmdp_mmlu \
--pcs 1 2 3 4 5Build an interpretation prompt for a single source PCA direction. The tokenizer can be a local tokenizer directory or a Hugging Face model id already available in your local Transformers cache.
python scripts/interpret_ood_score.py \
--source roleplaying__plain \
--pc 1 \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--openai-model gpt-5-mini \
--output interp/roleplaying_pc1_interp.jsonThis codebase builds on Detecting Strategic Deception Using Linear Probes and The Truthfulness Spectrum Hypothesis
