This is the repository for the paper "In-Context Probing for Membership Inference in Fine-Tuned Language Models", accepted at NDSS 2026. It is a cleaned-up version of our ICP-MIA framework repository that contains only the scripts essential for reproducing the results in the paper. Our paper is available here.
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample’s intrinsic properties—such as content difficulty or rarity—leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP)—a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
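The Optimization Gap idea from the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the code in this repository: loss_fn is a hypothetical callable that returns the target model's loss on a sample given a prepended context, and averaging the loss reductions over contexts is an assumed choice.

```python
def optimization_gap(loss_fn, sample, contexts):
    """Estimate the remaining loss-reduction potential of `sample`.

    loss_fn(context, sample) is a hypothetical callable returning the target
    model's loss on `sample` when `context` is prepended ("" means no context).
    """
    base_loss = loss_fn("", sample)
    reductions = [base_loss - loss_fn(ctx, sample) for ctx in contexts]
    # Assumption: average the reductions; other aggregations are possible.
    return sum(reductions) / len(reductions)


def membership_score(loss_fn, sample, contexts):
    """Higher score = more member-like, since members should show a smaller gap."""
    return -optimization_gap(loss_fn, sample, contexts)


if __name__ == "__main__":
    # Toy stand-in for a model: "seen" samples already have low loss and gain
    # little from context; unseen samples start high and drop a lot.
    seen = {"member sample"}

    def toy_loss(context, sample):
        if sample in seen:
            return 0.5 if not context else 0.45
        return 2.0 if not context else 1.2

    ctxs = ["similar example 1", "similar example 2"]
    print(membership_score(toy_loss, "member sample", ctxs))
    print(membership_score(toy_loss, "unseen sample", ctxs))
```

In the toy run, the member sample receives the higher score because adding context barely lowers its loss.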
conda create -n LLamaFactory python=3.10
conda activate LLamaFactory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
Run the following command to download the dataset and split it:
python prepare_data.py --dataset lavita/ChatDoctor-HealthCareMagic-100k --output_dir ./data/healthcaremagic
python prepare_data.py --dataset lavita/AlpaCare-MedInstruct-52k --output_dir ./data/MedInstruct
This will create the following files in ./data/healthcaremagic/:
- healthcaremagic_train.json (80% of data) - for model training
- healthcaremagic_val.json (10% of data) - for validation
- healthcaremagic_test.json (10% of data) - for testing
- healthcaremagic_attack.json (1000 members + 1000 non-members with labels) - for attack evaluation
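The split listed above could be reproduced roughly as follows. This is a minimal sketch, not the actual prepare_data.py: the function name split_dataset, the fixed seed, the "sample"/"label" field layout, and the choice to draw members from the training split and non-members from the test split are all assumptions for illustration.

```python
import random


def split_dataset(records, n_attack=1000, seed=0):
    """Hypothetical 80/10/10 split plus a labeled attack set (not the real prepare_data.py)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]

    # Assumption: members come from the training split, non-members from the test split.
    n = min(n_attack, len(train), len(test))
    members = [{"sample": r, "label": 1} for r in rng.sample(train, n)]
    non_members = [{"sample": r, "label": 0} for r in rng.sample(test, n)]
    return train, val, test, members + non_members


if __name__ == "__main__":
    data = [{"id": i, "instruction": f"question {i}"} for i in range(20000)]
    train, val, test, attack = split_dataset(data)
    print(len(train), len(val), len(test), len(attack))
```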
Copy the data files to LLaMA-Factory's data directory:
cp ./data/healthcaremagic/*.json ./LLaMA-Factory/data/
cp ./data/MedInstruct/*.json ./LLaMA-Factory/data/
Then add the following entries to ./LLaMA-Factory/data/dataset_info.json:
"healthcaremagic_train": {
"file_name": "healthcaremagic_train.json"
},
"healthcaremagic_val": {
"file_name": "healthcaremagic_val.json"
},
"healthcaremagic_test": {
"file_name": "healthcaremagic_test.json"
},
"MedInstruct_train": {
"file_name": "MedInstruct_train.json"
},
"MedInstruct_val": {
"file_name": "MedInstruct_val.json"
},
"MedInstruct_test": {
"file_name": "MedInstruct_test.json"
}
Train your target model using LLaMA-Factory on one GPU:
cd LLaMA-Factory
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../config/config_training.yaml
Train your target model using LLaMA-Factory on multiple GPUs:
First, uncomment the deepspeed setting in config_training.yaml.
Then:
cd LLaMA-Factory
pip install deepspeed
llamafactory-cli train ../config/config_training.yaml
Generate perturbations for the attack dataset created by prepare_data.py:
python generate_perturbations.py convert \
--input ./data/healthcaremagic/healthcaremagic_attack.json \
--output ./data/healthcaremagic/healthcaremagic_attack_perturbed.json \
--mask_rate 0.7 \
--num_perturbations 20
The output will be in target_example format with mask_perturbations and label fields.
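To illustrate what --mask_rate and --num_perturbations control, the sketch below produces masked variants of one text. It is an illustration under assumptions rather than the logic of generate_perturbations.py: word-level masking, the "[MASK]" placeholder, and the function name mask_perturbations are hypothetical choices.

```python
import random


def mask_perturbations(text, mask_rate=0.7, num_perturbations=20,
                       mask_token="[MASK]", seed=0):
    """Return num_perturbations copies of text with a mask_rate fraction of words masked."""
    rng = random.Random(seed)
    words = text.split()
    n_mask = round(mask_rate * len(words))
    variants = []
    for _ in range(num_perturbations):
        # Pick a fresh random set of word positions for every variant.
        masked_positions = set(rng.sample(range(len(words)), n_mask))
        variants.append(" ".join(
            mask_token if i in masked_positions else w
            for i, w in enumerate(words)
        ))
    return variants


if __name__ == "__main__":
    sample = "I have had a persistent dry cough for two weeks"
    for variant in mask_perturbations(sample, num_perturbations=3):
        print(variant)
```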
The attack environment is separate from the training environment, so create a new environment for the attack:
conda deactivate
conda create -n ICPMIA python=3.10
conda activate ICPMIA
pip install -r requirements.txt
Create a configuration file based on the provided templates:
# For similarity-based ICP
cp config/config_icp_ref.yaml your_config.yaml
# For self-perturbation ICP
cp config/config_icp_sp.yaml your_config.yaml
Edit the configuration file with your model and data paths:
Use the second saved checkpoint as the target model.
model:
  target_model_path: "/path/to/your/model"
  device: "cuda:0"
data:
  train_data_path: "/path/to/your/train_data.json"
  test_data_path: "/path/to/your/test_data.json"
Run the attack:
python icp_mia_attack.py --config your_config.yaml
Data parameters:
- train_data_path: Path to training data (JSON format as shown before)
- test_data_path: Path to test data for evaluation
- data_format: Format type ("instruction" or "pretrain")
- test_size: Number of test samples to evaluate
Similarity-based ICP parameters (similarity_based_icp):
- enabled: Enable/disable similarity-based attack
- prefix_pool_source: Source dataset for finding similar prefixes
- top_k: Number of top similar prefixes to use
- max_prefix_candidates: Maximum candidates to consider
- aggregation_strategy: How to aggregate scores ("max", "min", "mean", "median")
- embedding_model: Sentence transformer model for calculating similarity
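The top_k and max_prefix_candidates parameters suggest a retrieval step over the prefix pool. A minimal sketch of such a step is shown here, assuming the embeddings have already been computed (in the real attack they would come from the configured embedding_model). The function names and the simple truncation to max_prefix_candidates are hypothetical and may differ from icp_mia_attack.py.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prefixes(target_vec, pool, top_k=3, max_prefix_candidates=100):
    """pool is a list of (text, embedding) pairs; return the top_k most similar texts."""
    # Assumption: limit the search by simply truncating the pool.
    candidates = pool[:max_prefix_candidates]
    ranked = sorted(candidates,
                    key=lambda item: cosine(target_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]


if __name__ == "__main__":
    pool = [("prefix A", [1.0, 0.0]),
            ("prefix B", [0.7, 0.7]),
            ("prefix C", [0.0, 1.0])]
    print(select_prefixes([0.9, 0.1], pool, top_k=2))
```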
Self-perturbation ICP parameters:
- enabled: Enable/disable self-perturbation attack
- perturbation_file_path: Path to file containing perturbations
- top_k: Number of perturbations to use
- aggregation_strategy: Score aggregation method
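The aggregation_strategy values ("max", "min", "mean", "median") can be read as different ways of collapsing per-context scores into one number per sample. The following sketch shows one plausible reading. The function name aggregate_scores and the interpretation of top_k as "keep the first top_k scores" are assumptions, not the repository's implementation.

```python
import statistics

# Map each documented strategy name to a standard reduction.
AGGREGATORS = {
    "max": max,
    "min": min,
    "mean": statistics.mean,
    "median": statistics.median,
}


def aggregate_scores(scores, strategy="mean", top_k=None):
    """Combine per-context scores into one value; top_k keeps only the first top_k scores."""
    if strategy not in AGGREGATORS:
        raise ValueError(f"unknown aggregation_strategy: {strategy!r}")
    selected = list(scores)[:top_k] if top_k else list(scores)
    if not selected:
        raise ValueError("no scores to aggregate")
    return AGGREGATORS[strategy](selected)


if __name__ == "__main__":
    per_context_scores = [1, 4, 2, 3]
    for name in AGGREGATORS:
        print(name, aggregate_scores(per_context_scores, name))
```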
The attack generates detailed results including:
- AUC Score: Area under the ROC curve
- TPR@FPR: True Positive Rate at specified False Positive Rates
Results are saved in the specified output directory with timestamps.
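For readers who want to check the reported metrics by hand, here is a small dependency-free sketch of AUC and TPR at a fixed FPR. It assumes label 1 means member and that a higher score means more member-like. The function names are illustrative and are not taken from the attack script.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), sweeping the threshold from high to low scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr):
    """Highest TPR among ROC points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    pts = roc_points(labels, scores)
    print("AUC:", auc(pts))
    print("TPR@FPR=0:", tpr_at_fpr(pts, 0.0))
```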
You can provide custom prefix pools for similarity-based attacks:
similarity_based_icp:
  prefix_pool_source: "/path/to/custom/prefix_pool.json"
Datasets:
- iCliniq: lavita/ChatDoctor-iCliniq
- AlpaCare-Med-52k: lavita/AlpaCare-MedInstruct-52k
- TOFU: locuslab/TOFU