DeeperNull

DeeperNull provides a library and workflows for integrating spatiotemporal covariates (e.g., time of day, time of year, birth/home location) into polygenic scores. It also includes workflows for training and evaluating different classes of null models with various covariate features, and for interpreting those null models to identify important covariate effects and interactions. This README describes how to fit models, compute Shapley values, and run the UK Biobank workflows. Additional details are provided in the docstrings of the underlying files.

Preprint

Installation

Clone the repository:

git clone https://github.com/gymrek-lab/DeeperNull.git
cd DeeperNull

Install the required dependencies:

pip install matplotlib numpy pandas pytorch-lightning scikit-learn scipy seaborn shapiq "torch>=2.0" torchmetrics tqdm xgboost

See Dependencies for the tested versions. Each workflow's Dockerfile lists that workflow's full dependencies.

Training Models with fit_model.py

The deeper_null/fit_model.py script trains DeepNull-style models that predict a phenotype from covariates. It supports several model types:

Scikit-learn linear and penalized linear models
XGBoost models
PyTorch neural network models

Requirements

A whitespace delimited covariate file with a header row (used as model input)
A whitespace delimited phenotype file with sample IDs and phenotype values
A model configuration JSON file describing the model to fit
An output directory for results
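
As a minimal sketch of the two input tables, the snippet below writes a tiny covariate file and phenotype file using only the standard library. The covariate names (`age`, `sex`, `birth_x`), the phenotype column name `phenotype`, the presence of a header row in the phenotype file, and the file names are all assumptions made for illustration; only the `IID` sample ID column matches the script's documented default.

```python
import csv
import tempfile
from pathlib import Path


def write_whitespace_table(path, header, rows):
    """Write rows as a tab-delimited table with a single header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical example data: sample IDs plus made-up covariates and phenotype.
out_dir = Path(tempfile.mkdtemp())
covar_path = out_dir / "covariates.tsv"
pheno_path = out_dir / "phenotype.tsv"

write_whitespace_table(
    covar_path,
    ["IID", "age", "sex", "birth_x"],
    [["s1", 54, 0, 431000], ["s2", 61, 1, 389500], ["s3", 47, 1, 512250]],
)
write_whitespace_table(
    pheno_path,
    ["IID", "phenotype"],
    [["s1", 1.72], ["s2", -0.35], ["s3", 0.08]],
)

# Read the header back to confirm the file splits cleanly on whitespace.
covar_header = covar_path.read_text().splitlines()[0].split()
print(covar_header)
```

Tabs are used here, but any whitespace delimiter should satisfy the stated requirement as long as no field contains spaces.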

Key Command Line Arguments

--covar_file / -c: Path to covariate file
--pheno_file / -p: Path to phenotype file(s)
--model_config / -m: Path to model configuration JSON file
--out_dir / -o: Path to output directory (default: current directory)
--save_models: Save models to output directory
--sample_id_col / -s: Column name for sample IDs (default: 'IID')
--n_folds / -n: Number of cross-validation folds (default: 5)
--train_samples: File containing training sample IDs
--pred_samples: File(s) containing sample IDs for prediction (not training)
--train_one_fold: Train only one fold (useful for model evaluation)

Output Files

The script creates:

model_config.json: Model configuration with number of folds
ho_preds.csv: Holdout predictions for training samples (from cross-validation)
ens_preds.csv: Ensemble predictions for non-training samples (if provided)
ho_scores.json: Performance metrics for holdout predictions
Scatter plots and joint plots visualizing predictions vs. true values
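
The difference between holdout and ensemble predictions can be sketched as follows. This is an illustration of the general cross-fitting idea, not the script's implementation: it uses a one-covariate least-squares fit in place of a real model, and it assumes that ensemble predictions are the mean across fold models (the script may combine them differently). The function names `fit_line` and `cross_fit` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one covariate; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_fit(train, pred_x, n_folds=5, seed=0):
    """train: {sample_id: (x, y)}; pred_x: {sample_id: x} for non-training samples."""
    ids = sorted(train)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]

    ho_preds, models = {}, []
    for held_out in folds:
        held = set(held_out)
        fit_ids = [i for i in ids if i not in held]
        slope, intercept = fit_line(
            [train[i][0] for i in fit_ids], [train[i][1] for i in fit_ids]
        )
        models.append((slope, intercept))
        # Each training sample is predicted by the one model that never saw it.
        for i in held_out:
            ho_preds[i] = slope * train[i][0] + intercept

    # Assumption: non-training samples get the mean prediction over fold models.
    ens_preds = {
        i: sum(s * x + b for s, b in models) / len(models)
        for i, x in pred_x.items()
    }
    return ho_preds, ens_preds


# Toy data lying exactly on y = 2x + 1, so every fold model fits the same line.
train = {f"s{k}": (float(k), 2.0 * k + 1.0) for k in range(10)}
ho_preds, ens_preds = cross_fit(train, {"new1": 20.0}, n_folds=5)
print(len(ho_preds), ens_preds["new1"])
```

Because every training sample is scored by a model that did not see it, the holdout predictions can be used downstream without the optimism of in-sample fits.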

Example Usage

python fit_model.py \
    --covar_file ../data/dev/covariates.tsv \
    --pheno_file ../data/dev/phenotype_0_5.tsv \
    --model_config ../data/dev/model_config.json \
    --out_dir ../../output \
    --train_samples ../data/dev/train_samples.txt \
    --pred_samples ../data/dev/val_samples.txt ../data/dev/test_samples.txt
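
When fitting many phenotypes, it can help to assemble the command programmatically. The helper below is a hypothetical convenience wrapper (`build_fit_model_cmd` is not part of the repository); it only uses flags listed under Key Command Line Arguments and returns an argument list suitable for `subprocess.run`.

```python
import shlex
import sys


def build_fit_model_cmd(covar_file, pheno_files, model_config, out_dir,
                        train_samples=None, pred_samples=(), n_folds=None,
                        save_models=False):
    """Assemble an argv list for fit_model.py from the documented flags."""
    cmd = [sys.executable, "fit_model.py", "--covar_file", covar_file]
    cmd += ["--pheno_file", *pheno_files]
    cmd += ["--model_config", model_config, "--out_dir", out_dir]
    if train_samples is not None:
        cmd += ["--train_samples", train_samples]
    if pred_samples:
        cmd += ["--pred_samples", *pred_samples]
    if n_folds is not None:
        cmd += ["--n_folds", str(n_folds)]
    if save_models:
        cmd.append("--save_models")
    return cmd


cmd = build_fit_model_cmd(
    covar_file="../data/dev/covariates.tsv",
    pheno_files=["../data/dev/phenotype_0_5.tsv"],
    model_config="../data/dev/model_config.json",
    out_dir="../../output",
    train_samples="../data/dev/train_samples.txt",
    pred_samples=["../data/dev/val_samples.txt", "../data/dev/test_samples.txt"],
    save_models=True,
)

# Print the shell-quoted command; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True) from the directory containing fit_model.py.
print(shlex.join(cmd))
```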

Getting Shapley Values with get_shapley_values.py

The deeper_null/get_shapley_values.py script computes Shapley values and first-order Shapley Interaction Index (SII) values for trained models. Shapley values are local explanations calculated for each individual, helping identify important covariate effects and interactions.
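
To make the two quantities concrete, here is a small worked example that computes exact Shapley values and the pairwise Shapley interaction index from their textbook definitions on a three-feature toy game. It is not the script's implementation, and it makes no claim about which interaction order the script's '1-SII' output corresponds to. The value function `toy_value` and the feature names `a`, `b`, `c` are hypothetical; in a model explanation setting, the value of a coalition would be the model's output when only those features are known.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of items as a frozenset."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def shapley_value(value, players, i):
    """Weighted average of i's marginal contribution over all coalitions."""
    n = len(players)
    others = [p for p in players if p != i]
    total = Fraction(0)
    for s in subsets(others):
        weight = Fraction(
            factorial(len(s)) * factorial(n - len(s) - 1), factorial(n)
        )
        total += weight * (value(s | {i}) - value(s))
    return total


def pairwise_interaction(value, players, i, j):
    """Shapley interaction index for the pair (i, j)."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = Fraction(0)
    for s in subsets(others):
        weight = Fraction(
            factorial(len(s)) * factorial(n - len(s) - 2), factorial(n - 1)
        )
        # Joint effect of i and j beyond the sum of their separate effects.
        delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
        total += weight * delta
    return total


def toy_value(coalition):
    """Hypothetical payoff: 'a' adds 2, 'b' adds 1, 'a' together with 'b' adds 3 more."""
    return (2 * ("a" in coalition) + 1 * ("b" in coalition)
            + 3 * ({"a", "b"} <= coalition))


players = ["a", "b", "c"]
values = {p: shapley_value(toy_value, players, p) for p in players}
interaction_ab = pairwise_interaction(toy_value, players, "a", "b")
print(values, interaction_ab)
```

The Shapley values sum to the payoff of the full coalition minus that of the empty one, and the a-b interaction recovers the joint bonus that the individual values split evenly between `a` and `b`.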

Command Line Arguments

--model_files / -m: Path(s) to one or more model save files (required)
--covar_file / -c: Path to covariate file (required)
--pred_samples / -p: File with sample IDs to compute Shapley values for (optional; uses all samples if not provided)
--model_type / -t: Model type - 'linear', 'xgb', or 'nn' (required; currently only 'xgb' supported)
--out_dir / -o: Directory to save output JSON files (default: current directory)
--sample_id_col: Column name for sample IDs (default: 'IID')
--classification: Flag for classification models (default: False)

Output Files

The script generates two JSON files:

shapley_individual_values.json: Individual-level Shapley values and SII values
  - Keys: model names (from save file names), 'feature_names'
  - Second level: individual IDs
  - Third level: 'Shapley' or '1-SII' arrays
shapley_agg_values.json: Aggregated Shapley values across all individuals and models
  - Keys: 'Shapley', '1-SII', 'feature_names'
  - Aggregation methods: 'mean', 'median', 'std' (of absolute values)
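
The relationship between the two files can be sketched as follows: pool the per-individual Shapley arrays across models, take absolute values, and summarize each feature. The nesting of the toy dictionary follows the layout described above, but the exact schema, the feature names, the numbers, and the use of the population standard deviation are assumptions; check the script's docstrings before relying on them.

```python
import statistics

# Toy stand-in for shapley_individual_values.json (values are made up).
individual = {
    "feature_names": ["age", "sex", "birth_x"],
    "model_fold_0": {
        "s1": {"Shapley": [0.40, -0.10, 0.05]},
        "s2": {"Shapley": [-0.20, 0.30, 0.00]},
    },
    "model_fold_1": {
        "s1": {"Shapley": [0.60, -0.10, 0.05]},
        "s2": {"Shapley": [-0.40, 0.10, 0.10]},
    },
}


def aggregate_abs(rows):
    """Summarize |value| per feature over a list of per-sample arrays."""
    abs_columns = [[abs(v) for v in column] for column in zip(*rows)]
    return {
        "mean": [statistics.mean(c) for c in abs_columns],
        "median": [statistics.median(c) for c in abs_columns],
        "std": [statistics.pstdev(c) for c in abs_columns],  # population std assumed
    }


# Pool every individual from every model; skip the feature name entry.
rows = [
    per_sample["Shapley"]
    for key, per_model in individual.items()
    if key != "feature_names"
    for per_sample in per_model.values()
]
agg = aggregate_abs(rows)

# Rank features by mean absolute Shapley value, largest first.
ranking = sorted(
    zip(individual["feature_names"], agg["mean"]),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```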

Example Usage

python get_shapley_values.py \
    --model_files model_fold_0.json model_fold_1.json \
    --covar_file ../data/covariates.tsv \
    --pred_samples ../data/test_samples.txt \
    --model_type xgb \
    --out_dir ../../shap_output

UK Biobank RAP Workflows (ukb_rap_workflows)

The ukb_rap_workflows directory contains workflow scripts for running analyses on the UK Biobank Research Analysis Platform (RAP). Each subdirectory serves a specific purpose:

GWAS Workflows

GWAS_plink: PLINK2 GWAS workflow
GWAS_plotting: Create visualization plots for GWAS results
PRS_PRScs_GWAS_manhattan_local: Create Manhattan plots and scatter plots for the GWAS run as part of PRScs. Also creates tables showing how GWAS hits change when a null model with additional covariates is added.

PRS (Polygenic Risk Score) Workflows

PRS_PRScs: Launch PRScs PRS workflow with optional null model integration
PRS_basil: BASIL PRS workflow launcher with optional null model integration
PRS_score_preds: Score PRS predictions and create evaluation plots
PRS_eval_local: Local evaluation of PRS results including plotting scores, paired comparisons with bootstrap confidence intervals, and score table generation
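
For readers unfamiliar with paired comparisons using bootstrap confidence intervals, the general technique can be sketched as below. This is a generic percentile bootstrap on the difference in R² between two sets of predictions for the same samples, not the code in PRS_eval_local; the function names, the choice of R² as the metric, and the toy data are assumptions.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def paired_bootstrap_ci(y_true, preds_a, preds_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for R2(b) - R2(a), resampling the same samples for both."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y_boot = [y_true[i] for i in idx]
        if len(set(y_boot)) < 2:
            continue  # R2 is undefined when the resampled target is constant
        r2_a = r_squared(y_boot, [preds_a[i] for i in idx])
        r2_b = r_squared(y_boot, [preds_b[i] for i in idx])
        diffs.append(r2_b - r2_a)
    diffs.sort()
    lower = diffs[int((alpha / 2) * len(diffs))]
    upper = diffs[int((1 - alpha / 2) * len(diffs)) - 1]
    point = r_squared(y_true, preds_b) - r_squared(y_true, preds_a)
    return point, lower, upper


# Toy data: method b is off by 1 on every sample, method a is off by 3.
y_true = [float(i) for i in range(50)]
preds_a = [y + (3.0 if i % 2 == 0 else -3.0) for i, y in enumerate(y_true)]
preds_b = [y + (1.0 if i % 2 == 0 else -1.0) for i, y in enumerate(y_true)]

point, lower, upper = paired_bootstrap_ci(y_true, preds_a, preds_b)
print(point, lower, upper)
```

Resampling the same indices for both methods keeps the comparison paired, so sample-to-sample variation that affects both methods equally cancels in the difference.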

DeepNull Model Workflows

fit_dn_model: Launch DeeperNull model fitting workflows on UK Biobank RAP
dn_eval_local: Local evaluation of DeepNull models including score plotting, score table generation, and binary classification evaluation
dn_shap: Compute Shapley values for DeepNull models on UK Biobank data
dn_shap_eval_local: Evaluate and visualize Shapley values including SII bar plots and aggregated Shapley value analysis

Comparison and Preprocessing

compare_null_and_prs_improvements: Compare performance improvements between null models and PRS methods. Plots the improvements of PGS and null models with additional covariates over baseline, and the improvement of PGS over the null model alone.
geno_prepro: Genotype data preprocessing workflows for UK Biobank

Resources

resources: Docker image resources including PLINK2 binary and PRSice-2 executable

Dependencies

Some workflows install this library before running a script. To avoid installing its dependencies at that point, we do not provide a requirements.txt file. The dependencies for the training and Shapley value scripts are listed below. Dependencies for all workflows can be found in their associated Dockerfiles. The Docker images created by the Makefiles can also be used to run these workflows.

Training and Shapley value dependencies

matplotlib
numpy
pandas
pytorch-lightning
scikit-learn
scipy
seaborn
shapiq
torch>=2.0
torchmetrics
tqdm
xgboost

Tested versions

The following configuration was tested and confirmed to work:

Python 3.10.20
matplotlib 3.10.8
numpy 2.2.6
pandas 2.3.3
pytorch-lightning 2.6.1
scikit-learn 1.7.2
scipy 1.15.2
seaborn 0.13.2
shapiq 1.4.1
torch 2.10.0
torchmetrics 1.8.2
tqdm 4.67.3
xgboost 3.2.0
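
To compare a local environment against the tested configuration, a short check along these lines may help. It is a sketch using only the standard library; the dictionary copies the versions listed above and assumes that the PyTorch entry refers to the `torch` distribution on PyPI. The helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Versions copied from the tested configuration above.
TESTED_VERSIONS = {
    "matplotlib": "3.10.8",
    "numpy": "2.2.6",
    "pandas": "2.3.3",
    "pytorch-lightning": "2.6.1",
    "scikit-learn": "1.7.2",
    "scipy": "1.15.2",
    "seaborn": "0.13.2",
    "shapiq": "1.4.1",
    "torch": "2.10.0",  # assumed PyPI name for the PyTorch entry
    "torchmetrics": "1.8.2",
    "tqdm": "4.67.3",
    "xgboost": "3.2.0",
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(TESTED_VERSIONS)
for name, installed in report.items():
    tested = TESTED_VERSIONS[name]
    if installed is None:
        status = "not installed"
    elif installed == tested:
        status = "matches tested version"
    else:
        status = f"differs from tested {tested}"
    print(f"{name}: {installed} ({status})")
```

A version that differs is not necessarily a problem; the list above records one configuration that was confirmed to work.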
