This is the SNU team's solution for privacy-preserving genotype imputation based on Homomorphic Encryption (HE). The repository consists of four directories: /ModHEaaN, /data_origin, /plain and /encrypted. To run our HE-based genotype imputation solution, follow the instructions below in order.
First, download the four datasets "Total", "EUR", "AMR" and "AFR". Please follow the steps below:
- Download original data files from here and save in folder /data_origin.
- Download modified data files from here and save in folder /plain/data_mod.
The file locations for each dataset are:
- Total: /data_origin/*.txt, /plain/data_mod/Total_mod/*.txt
- EUR: /data_origin/*_EUR.txt, /plain/data_mod/EUR_mod/*.txt
- AMR: /data_origin/*_AMR.txt, /plain/data_mod/AMR_mod/*.txt
- AFR: /data_origin/*_AFR.txt, /plain/data_mod/AFR_mod/*.txt
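Before moving on, it can help to confirm that the downloaded files are where the later steps expect them. The following is a minimal sketch, assuming the layout listed above and paths taken relative to the repository root; the helper name `missing_datasets` and its `root` argument are hypothetical and not part of the repository.

```python
from pathlib import Path

# Glob patterns per dataset, copied from the layout listed above.
# Paths are interpreted relative to the repository root (an assumption).
EXPECTED = {
    "Total": ["data_origin/*.txt", "plain/data_mod/Total_mod/*.txt"],
    "EUR": ["data_origin/*_EUR.txt", "plain/data_mod/EUR_mod/*.txt"],
    "AMR": ["data_origin/*_AMR.txt", "plain/data_mod/AMR_mod/*.txt"],
    "AFR": ["data_origin/*_AFR.txt", "plain/data_mod/AFR_mod/*.txt"],
}


def missing_datasets(root="."):
    """Return {dataset: [patterns with no matching file]} (hypothetical helper)."""
    root = Path(root)
    missing = {}
    for name, patterns in EXPECTED.items():
        # A pattern counts as satisfied if at least one file matches it.
        unmatched = [p for p in patterns if not any(root.glob(p))]
        if unmatched:
            missing[name] = unmatched
    return missing


if __name__ == "__main__":
    for name, patterns in missing_datasets().items():
        print(f"{name}: no files matching {', '.join(patterns)}")
```

An empty result means every pattern matched at least one file; it does not verify that the files are complete or uncorrupted.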
We generate neural network models with one hidden layer for several datasets, each representing a population. For each dataset, one can choose among models determined by window_size, the number of adjacent tag SNPs used for each target SNP. Experiments on the given datasets show that window_size = 40 gives the best genotype imputation accuracy. Note that a larger window_size implies a higher computational cost. We also provide a multi-processing option for acceleration, so that one can choose the number of processes to suit their computing environment.
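To illustrate what window_size controls, the sketch below picks the window_size tag SNPs nearest to a target SNP. It assumes that "adjacent" means closest by genomic position and that tag positions are sorted; both are assumptions made for illustration, and the function name `nearest_tag_window` is hypothetical. The repository's own selection rule may differ.

```python
from bisect import bisect_left


def nearest_tag_window(tag_positions, target_position, window_size):
    """Return indices of the window_size tag SNPs closest to the target.

    Assumes tag_positions is sorted ascending. Ties go to the left neighbour.
    """
    if window_size > len(tag_positions):
        raise ValueError("window_size exceeds the number of tag SNPs")
    # Every tag before index `right` lies strictly below the target position.
    right = bisect_left(tag_positions, target_position)
    left = right - 1
    for _ in range(window_size):
        # Take the left candidate if the right side is exhausted, or if the
        # left candidate is at least as close to the target as the right one.
        take_left = right >= len(tag_positions) or (
            left >= 0
            and target_position - tag_positions[left]
            <= tag_positions[right] - target_position
        )
        if take_left:
            left -= 1
        else:
            right += 1
    # Everything strictly between `left` and `right` was selected.
    return list(range(left + 1, right))
```

Under this reading, each target SNP gets an input vector of length window_size, which is consistent with the cost growing as the window grows.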
To generate the imputation models, run the following:
$ cd ./plain
$ python New_gen_model_W.py -p <population> -w <window_size> -n <number_of_processes>
- population: Total, EUR, AMR, AFR
- window_size: 8, 16, 24, 32, 40, 48, 56, 64, 72
- number_of_processes: 1, 2, 4, 8, 16
The generated plain models will be saved in the /encrypted/<population>_DNNmodels/DNNmodels_<window_size>_c directory.
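When scripting several model-generation runs, the documented values can be checked before the script is launched. The following is a rough sketch built only from the options and output path stated above; the function names `pick_processes`, `model_command` and `model_directory` are hypothetical helpers, not part of the repository.

```python
# Values documented above for New_gen_model_W.py.
ALLOWED_POPULATIONS = ("Total", "EUR", "AMR", "AFR")
ALLOWED_WINDOW_SIZES = (8, 16, 24, 32, 40, 48, 56, 64, 72)
ALLOWED_PROCESSES = (1, 2, 4, 8, 16)


def pick_processes(cpu_count):
    """Largest documented process count that does not exceed cpu_count."""
    return max(n for n in ALLOWED_PROCESSES if n <= max(cpu_count, 1))


def model_command(population, window_size, processes):
    """Build the argument list for New_gen_model_W.py after validating inputs."""
    if population not in ALLOWED_POPULATIONS:
        raise ValueError(f"unknown population: {population}")
    if window_size not in ALLOWED_WINDOW_SIZES:
        raise ValueError(f"unsupported window_size: {window_size}")
    if processes not in ALLOWED_PROCESSES:
        raise ValueError(f"unsupported number_of_processes: {processes}")
    return [
        "python", "New_gen_model_W.py",
        "-p", population,
        "-w", str(window_size),
        "-n", str(processes),
    ]


def model_directory(population, window_size):
    """Directory where the generated plain models are expected to appear."""
    return f"/encrypted/{population}_DNNmodels/DNNmodels_{window_size}_c"
```

For example, `subprocess.run(model_command("Total", 40, pick_processes(os.cpu_count() or 1)), cwd="plain", check=True)` would launch one run from the ./plain directory, assuming the script is invoked as shown above.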
We have also added a model for data with low minor allele frequency (MAF). This dataset contains 117k target SNPs. To use the low MAF data, run New_gen_model_w_117k instead of the original script, with population=Total. The generated plain models will then be saved in the /encrypted/Total_DNNmodels/DNNmodels_lowMAF_<window_size> directory.
For the encryption of test data, we use the ModHEaaN library, a lightweight implementation of the approximate HE scheme CKKS. Unlike HEAAN, the original implementation of the CKKS scheme, our ModHEaaN library does not depend on the multi-precision libraries GMP and NTL, and it supports only homomorphic addition and depth-1 constant multiplication (hence bootstrapping is disabled).
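To make this restriction concrete, the plaintext mock-up below tracks multiplicative depth and rejects anything beyond addition and a single constant multiplication. It is only an illustration: `MockCiphertext` and `weighted_sum` are hypothetical names, nothing is actually encrypted, and none of this reflects the ModHEaaN API.

```python
class MockCiphertext:
    """Plaintext stand-in for a ciphertext: a slot vector plus its depth."""

    def __init__(self, values, depth=0):
        self.values = list(values)
        self.depth = depth

    def __add__(self, other):
        # Addition is slot-wise and consumes no multiplicative depth.
        if len(self.values) != len(other.values):
            raise ValueError("slot counts differ")
        summed = [a + b for a, b in zip(self.values, other.values)]
        return MockCiphertext(summed, max(self.depth, other.depth))

    def mul_const(self, constant):
        # Multiplication by a plaintext constant is allowed only once.
        if self.depth >= 1:
            raise RuntimeError("depth budget of 1 already used")
        scaled = [a * constant for a in self.values]
        return MockCiphertext(scaled, self.depth + 1)


def weighted_sum(ciphertexts, weights):
    """Compute sum_i weights[i] * ciphertexts[i] within the depth-1 budget."""
    terms = [ct.mul_const(w) for ct, w in zip(ciphertexts, weights)]
    total = terms[0]
    for term in terms[1:]:
        total = total + term
    return total
```

A weighted sum of encrypted inputs with plaintext weights fits within these operations, whereas multiplying the result by a further constant would raise an error in this mock-up.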
To build ModHEaaN, run the following:
$ cd ./ModHEaaN/heaan
$ cmake CMakeLists.txt
$ make all
Then build the imputation program (paths are relative to the repository root):
$ cd ./encrypted/impute_dnn
$ cmake CMakeLists.txt
$ make all
The executable file is generated as enc_impute in the directory ./encrypted/impute_dnn.
The command line to execute enc_impute depends on the target dataset.
One needs to choose two input arguments, window_size and number_of_targetSNP. window_size is the number of adjacent tag SNPs for each target SNP, and number_of_targetSNP is the number of target SNPs in thousands (e.g. 80 for 80k).
$ cd ./encrypted/impute_dnn
$ ./enc_impute <window_size> <number_of_targetSNP>
- window_size: 8, 16, 24, 32, 40, 48, 56, 64, 72
- number_of_targetSNP: 20, 40, 80, 117
For instance, the command below runs the genotype imputation of 80k target SNPs with window size 40.
$ ./enc_impute 40 80
To process the low MAF data, set number_of_targetSNP = 117; the other parameters are the same as for the original Total data.
For the EUR/AMR/AFR datasets, the command line is fixed as
$ ./enc_impute populations 80
The output will be genotype scores on the EUR, AMR, and AFR datasets for the fixed window_size=40 and number_of_targetSNP=80.
- NOTE: The running time grows linearly with window_size, but the accuracy does not. For the given datasets, the choice window_size=40 shows the best accuracy.
If the run succeeds, the genotype score results ("genotype_score") are saved in /encrypted/impute_dnn as score_window<window_size>_<number_of_targetSNP>k.csv. The real genotypes of the test data ("genotype_real") are saved in /plain as real_<number_of_targetSNP>k.csv. To run evaluation.py in the ./plain directory, use
$ python3 evaluation.py -i <genotype_score> -t <genotype_real> -o output.png
e.g. after running $ ./enc_impute 40 80 in the previous step:
$ cd ./plain
$ python3 evaluation.py -i ../encrypted/impute_dnn/score_window40_80k.csv -t real_80k.csv -o output.png
e.g. after running $ ./enc_impute populations 80 in the previous step:
$ cd ./plain
$ python3 evaluation.py -i ../encrypted/impute_dnn/score_EUR_80k.csv -t real_EUR_80k.csv -o output_EUR.png
$ python3 evaluation.py -i ../encrypted/impute_dnn/score_AMR_80k.csv -t real_AMR_80k.csv -o output_AMR.png
$ python3 evaluation.py -i ../encrypted/impute_dnn/score_AFR_80k.csv -t real_AFR_80k.csv -o output_AFR.png
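The three per-population evaluation commands follow one naming pattern, so they can also be generated in a loop. Below is a small sketch that assumes the file names shown in the commands above and that it is run from the ./plain directory; `evaluation_command` and `population_commands` are hypothetical helper names.

```python
# Populations and target count as used in the commands above.
POPULATIONS = ("EUR", "AMR", "AFR")
N_TARGETS_K = 80  # the population run is fixed at 80k target SNPs


def evaluation_command(score_csv, real_csv, output_png):
    """Argument list for evaluation.py with the -i, -t and -o options."""
    return ["python3", "evaluation.py",
            "-i", score_csv, "-t", real_csv, "-o", output_png]


def population_commands():
    """One evaluation command per population, mirroring the lines above."""
    commands = []
    for pop in POPULATIONS:
        score = f"../encrypted/impute_dnn/score_{pop}_{N_TARGETS_K}k.csv"
        real = f"real_{pop}_{N_TARGETS_K}k.csv"
        commands.append(evaluation_command(score, real, f"output_{pop}.png"))
    return commands


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute it.
    for command in population_commands():
        print(" ".join(command))
```

Printing the commands first makes it easy to compare them against the lines above before executing anything.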