TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems.
Run all training-free defenders on dataset/gpt52-gen_filter:
bash scripts/evaluate_all_baselines.sh
Edit the TRAINING_FREE_METHODS array in the script to enable or disable specific defenders.
scripts/eval.sh auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):
# Naive SFT checkpoint
bash scripts/eval.sh checkpoints/naive_sft_full/final_model
# TurnGate checkpoint
bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
# HuggingFace repo with explicit type overrides
bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
To train the trainable defenders (Naive SFT, Reweighted SFT, TurnGate), use the provided scripts in the scripts/ directory:
bash scripts/train_naive_sft.sh
bash scripts/train_reweighted_sft.sh
bash scripts/train_turngate.sh
Configurable parameters for each script are available in the respective files.
The online-battle/ codebase provides an online battle environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with or without a defense layer, measuring real attack success rates.
cd online-battle
# Run CKA-Agent attack without any defense
bash run_no_defense.sh
# Run CKA-Agent attack with TurnGate (RL) defense enabled
bash run_rl_defense.sh
See online-battle/config/config_no_defense.yml and online-battle/config/config_rl_defense.yml for configuration details (target model, dataset, defense settings).
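As a rough illustration of the metric being compared across the two runs, attack success rate is commonly computed as the fraction of attack attempts judged successful. The sketch below assumes that definition. The function name `attack_success_rate` and the boolean outcome lists are hypothetical and are not taken from the online-battle/ code, whose judging criteria are not described here.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts marked successful (True).

    `outcomes` is a hypothetical list of booleans, one per attack attempt.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Made-up outcomes purely for illustration, not results from the paper.
no_defense = [True, True, False, True]
with_defense = [False, True, False, False]

print(f"ASR without defense: {attack_success_rate(no_defense):.2f}")
print(f"ASR with defense:    {attack_success_rate(with_defense):.2f}")
```

Comparing the two numbers on the same attack set is what makes the with-defense and no-defense runs directly comparable.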
We include the MTID (Multi-Turn Intent Dataset) at dataset/gpt52-gen_filter. It contains multi-turn interactions for training and evaluating defenses against correlated knowledge attacks.
The dataset is split into train, valid, and test sets for both benign and harmful categories:
- Total Unique Samples: 800 (400 Benign, 400 Harmful)
- Rollouts per Sample: 20 (Total of 16,000 trajectories)
- Format: Each line is a JSON object representing a single rollout.
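Because each line is a standalone JSON object, the files can be read with a plain JSON Lines loop. The following is a minimal sketch under that assumption. The helper names `load_rollouts` and `expected_trajectories` are hypothetical, and no field names are assumed because the record schema is not listed here.

```python
import json
from pathlib import Path


def load_rollouts(path):
    """Read a JSON Lines file and return one dict per non-empty line."""
    rollouts = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rollouts.append(json.loads(line))
    return rollouts


def expected_trajectories(unique_samples, rollouts_per_sample):
    """Total trajectories implied by the dataset statistics."""
    return unique_samples * rollouts_per_sample


# 800 unique samples x 20 rollouts each, as stated in the list above.
print(expected_trajectories(800, 20))  # 16000
```

Summing `len(load_rollouts(...))` over every split file and comparing it with the expected total is a quick sanity check that a download is complete.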
If you find this repository useful for your research, please consider citing the following paper:
@misc{shen2026turnlateresponseawaredefense,
title={One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue},
author={Xinjie Shen and Rongzhe Wei and Peizhi Niu and Haoyu Wang and Ruihan Wu and Eli Chien and Bo Li and Pin-Yu Chen and Pan Li},
year={2026},
eprint={2605.05630},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.05630},
}