Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 52 additions & 51 deletions code/SoS/mnm_analysis/mnm_methods/mnm_regression.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
"source": [
"# Advanced regression models for association analysis with individual-level data\n",
"\n",
"This is a new-user-friendly minimal working example (MWE) of the `mnm_regression` pipeline. It runs end-to-end on the bundled chr22 toy dataset and shows how to compute SuSiE fine-mapping results and multi-method TWAS weights for a list of molecular-phenotype regions, then build the ensemble TWAS weights."
"This is a new-user-friendly minimal working example (MWE) of the `mnm_regression` pipeline. It runs end-to-end on the bundled chr22 toy dataset (60 samples) and shows how to compute SuSiE fine-mapping results and multi-method TWAS weights for a list of molecular-phenotype regions, then build the ensemble (multi-context) TWAS weights."
]
},
{
Expand Down Expand Up @@ -37,12 +37,12 @@
"- **Peak / chromatin input** (`input/`, Mode B): the *same* pipeline also accepts a list of chr22 peaks (or any other molecular phenotype) - the command is identical, you just swap in the phenotype BED and its manifest.\n",
"\n",
"**Workflows used here:**\n",
"- `susie_twas` -- univariate SuSiE fine-mapping plus multi-method TWAS weights (one `*.univariate_twas_weights.rds` per region).\n",
"- `ensemble_twas_weight` -- reads the univariate weights and adds the ensemble weight (one `*.ensemble_twas_weights.rds` per region).\n",
"- `susie_twas` -- univariate SuSiE fine-mapping plus multi-method TWAS weights (one `*.univariate_twas_weights.rds` per region). Run with `--save-data` so the combined fine-mapping object is written for the next step.\n",
"- `mnm` -- multi-context fine-mapping (mvSuSiE + mr.mash) that consumes the `susie_twas --save-data` output through a `--fine-mapping-meta` table and produces the ensemble weights (one `*.multicontext_twas_weights.rds` per gene).\n",
"\n",
"The notebook also documents the `mvsusie` (multivariate) and `fsusie` (functional) workflows that the pipeline supports, but the MWE focuses on the univariate + ensemble path that is runnable on the toy data.\n",
"The notebook also documents the `mnm_genes` (multi-gene), `mvfsusie`/`fsusie` (functional), and `mvsusie` (multivariate) workflows that the pipeline supports, but the MWE focuses on the univariate (`susie_twas`) + multi-context (`mnm`) path that is runnable on the toy data.\n",
"\n",
"**Note on data:** this example uses a small synthetic chr22 toy dataset (1,995 variants after subsetting). No access-controlled individual-level human data is used."
"**Note on data:** this example uses a small synthetic chr22 toy dataset (60 samples, 1,995 variants after subsetting). No access-controlled individual-level human data is used."
]
},
{
Expand Down Expand Up @@ -78,13 +78,13 @@
"source": [
"## Steps\n",
"\n",
"Run the two steps in order. **Step 1** (`susie_twas`) produces the per-region univariate TWAS weights; **Step 2** (`ensemble_twas_weight`) reads those weights and adds the ensemble. Run each command from inside `data/example_10peaks_susie_twas/`; the pipeline notebook is referenced as `../../pipeline/mnm_regression.ipynb`.\n",
"Run the two steps in order. **Step 1** (`susie_twas`) produces the per-region univariate TWAS weights; **Step 2** (`mnm`) reads those weights and adds the multi-context ensemble. The pipeline notebook is referenced as `../../pipeline/mnm_regression.ipynb`.\n",
"\n",
"**Step 1.** SuSiE fine-mapping + multi-method TWAS weights for every region in the phenotype list.\n",
"**Step 1.** SuSiE fine-mapping + multi-method TWAS weights for every region in the phenotype list. Run with `--save-data` so the combined fine-mapping object is written for Step 2.\n",
"\n",
"**Step 2.** Build the ensemble TWAS weight from the Step-1 output. The ensemble is computed by default on the toy data (the `ensemble_r2_threshold` is relaxed so no method is silently dropped).\n",
"**Step 2.** Build the ensemble (multi-context) TWAS weight from the Step-1 output with `mnm`. The ensemble is computed by default on the toy data (the `ensemble_r2_threshold` is relaxed so no method is silently dropped).\n",
"\n",
"_Memory note:_ run the peak loop with `-j1`. Running many regions in parallel (`-j4`) -- or even all 10 serially in one process -- can exhaust memory on heavier regions and trigger `ERROR: One of the local workers has been killed`. If that happens, re-run the remaining heavy regions one at a time using a single-region phenotype list."
"_Memory note:_ run multi-region loops with `-j1`. Running many regions in parallel (`-j4`) -- or even all of them serially in one process -- can exhaust memory on heavier regions and trigger `ERROR: One of the local workers has been killed`. If that happens, re-run the remaining heavy regions one at a time using a single-region phenotype list."
]
},
{
Expand All @@ -106,9 +106,9 @@
"kernel": "SoS"
},
"source": [
"**Timing** (toy chr22 dataset):\n",
"**Timing** (toy chr22 dataset, 60 samples):\n",
"- `susie_twas`: ~3 min (chr22 toy, 1 gene)\n",
"- `ensemble_twas_weight`: ~30 sec\n"
"- `mnm`: ~30 sec"
]
},
{
Expand All @@ -119,16 +119,13 @@
},
"outputs": [],
"source": [
"# Univariate fine-mapping (susie_twas) - VERIFIED end-to-end on the toy gene-expression data.\n",
"# Run from the toy_example root. --save-data writes univariate_data.rds, which supplies\n",
"# the combined_data_sumstats that the downstream mnm_genes workflow consumes.\n",
"sos run pipeline/mnm_regression.ipynb susie_twas \\\n",
" --name test_uni --cwd output_uni \\\n",
"sos run pipeline/mnm_regression.ipynb qtl_dataset_construct+susie_twas \\\n",
" --name protocol_example --cwd output_uni \\\n",
" --genoFile input/colocboost/example.chr22.bed \\\n",
" --phenoFile input/colocboost/pheno_manifest.tsv \\\n",
" --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
" --covFile input/colocboost/example_covariates.tsv \\\n",
" --customized-association-windows input/colocboost/association_windows.bed \\\n",
" --save-data -j1"
" --region-name ENSG00000130538 --transpose-covariates --save-data -j1"
]
},
{
Expand All @@ -137,7 +134,7 @@
"kernel": "SoS"
},
"source": [
"**Step 2 (gene expression): ensemble TWAS weights.**"
"**Step 2 (gene expression): multi-context fine-mapping + ensemble TWAS weights (mnm).**"
]
},
{
Expand All @@ -148,14 +145,14 @@
},
"outputs": [],
"source": [
"sos run pipeline/mnm_regression.ipynb ensemble_twas_weight \\\n",
" --name geneexpr_twas \\\n",
" --cwd output/colocboost \\\n",
"sos run pipeline/mnm_regression.ipynb mnm \\\n",
" --name protocol_example --cwd output_mnm \\\n",
" --genoFile input/colocboost/example.chr22.bed \\\n",
" --phenoFile input/colocboost/pheno_manifest_1gene.tsv \\\n",
" --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
" --covFile input/colocboost/example_covariates.tsv \\\n",
" --customized-association-windows input/colocboost/association_windows_1gene.bed \\\n",
" -j1"
" --customized-association-windows input/colocboost/association_windows.bed \\\n",
" --fine-mapping-meta input/colocboost/fine_mapping_meta.tsv \\\n",
" --transpose-covariates --save-data -j1"
]
},
{
Expand All @@ -180,7 +177,7 @@
"outputs": [],
"source": [
"sos run pipeline/mnm_regression.ipynb susie_twas \\\n",
" --name example_10peaks_twas \\\n",
" --name protocol_example \\\n",
" --cwd output_10peaks \\\n",
" --genoFile input/genotype/protocol_example.genotype.chr22.bed \\\n",
" --phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \\\n",
Expand All @@ -195,7 +192,7 @@
"kernel": "SoS"
},
"source": [
"**Step 2 (peak): ensemble TWAS weights.**"
"**Step 2 (peak): multi-context fine-mapping + ensemble TWAS weights (mnm).**"
]
},
{
Expand All @@ -206,14 +203,17 @@
},
"outputs": [],
"source": [
"sos run pipeline/mnm_regression.ipynb ensemble_twas_weight \\\n",
" --name example_10peaks_twas \\\n",
"# Step 2 (10 peaks): multi-context fine-mapping + ensemble TWAS weights (mnm).\n",
"# As above, mnm consumes the Step-1 univariate outputs via --fine-mapping-meta.\n",
"sos run pipeline/mnm_regression.ipynb mnm \\\n",
" --name protocol_example \\\n",
" --cwd output_10peaks \\\n",
" --genoFile input/genotype/protocol_example.genotype.chr22.bed \\\n",
" --phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \\\n",
" --covFile input/covariate/protocol_example.covariates.tsv \\\n",
" --customized-association-windows input/finemapping/protocol_example.association_windows.bed \\\n",
" -j1\n"
" --fine-mapping-meta output_10peaks/fine_mapping_meta.tsv \\\n",
" -j1"
]
},
{
Expand All @@ -224,9 +224,9 @@
"source": [
"## Output Files\n",
"\n",
"Under `--cwd` (e.g. `output_10peaks/` or `output/colocboost/`), in the `twas_weights/` subfolder:\n",
"- `{name}.<region>.univariate_twas_weights.rds` -- Step 1 output, one per region: the SuSiE fine-mapping result plus TWAS weights from each method (e.g. lasso, enet, mrash, susie), variant names, region info and CV performance.\n",
"- `{name}.<region>.ensemble_twas_weights.rds` -- Step 2 output, one per region: the same content with the added ensemble. The ensemble is stored under an `ensemble` entry as a list of `method_coef`, `ensemble_twas_weights`, and `method_performance`.\n",
"Under `--cwd` (e.g. `output_10peaks/` or `output_uni/`):\n",
"- `twas_weights/{name}.<region>.univariate_twas_weights.rds` -- Step 1 (`susie_twas`) output, one per region: the SuSiE fine-mapping result plus TWAS weights from each method (e.g. lasso, enet, mrash, susie), variant names, region info and CV performance.\n",
"- `multivariate_twas_weights/{name}.<region>.multicontext_twas_weights.rds` -- Step 2 (`mnm`) output, one per gene: the multi-context model with the added ensemble. The ensemble is stored under an `ensemble` entry alongside `method_coef`, the ensemble weights, and `method_performance`.\n",
"\n",
"Fine-mapping objects are also written under the `fine_mapping/` subfolder, and per-region `.stdout`/`.stderr` logs accompany each run."
]
Expand All @@ -239,9 +239,9 @@
"source": [
"## Anticipated Results\n",
"\n",
"For the 10-peak run you should get 10 `*.univariate_twas_weights.rds` and 10 `*.ensemble_twas_weights.rds` files (one per peak: C22P107555, 557, 559, 560, 562, 563, 564, 565, 567, 568). Step 2 reports something like `10 items in 10 groups`. For the 1-gene gene-expression run you get one file of each type.\n",
"For the 10-peak run you should get 10 `*.univariate_twas_weights.rds` (one per peak) and 10 `*.multicontext_twas_weights.rds` from the `mnm` step. For the 1-gene gene-expression run you get one file of each type.\n",
"\n",
"You can sanity-check an ensemble in R:\n"
"You can sanity-check a multi-context ensemble weight in R:"
]
},
{
Expand All @@ -252,13 +252,10 @@
},
"outputs": [],
"source": [
"# In R: inspect one ensemble output\n",
"a <- readRDS('output_10peaks/twas_weights/example_10peaks_twas.chr22_C22P107568.ensemble_twas_weights.rds')\n",
"region <- names(a)[1]; pheno <- names(a[[region]])[1]\n",
"en <- a[[region]][[pheno]][['ensemble']]\n",
"names(en) # method_coef, ensemble_twas_weights, method_performance\n",
"length(en[['ensemble_twas_weights']]) # 1995 (one weight per variant)\n",
"length(en[['method_coef']]) # number of contributing methods"
"# In R: inspect one multi-context (mnm) ensemble output\n",
"w <- readRDS('output_uni/multivariate_twas_weights/protocol_example.ENSG00000130538.multicontext_twas_weights.rds')\n",
"names(w) # available genes / contexts\n",
"str(w, max.level = 2) # weights + per-method performance"
]
},
{
Expand Down Expand Up @@ -293,9 +290,10 @@
"\n",
"The same pipeline notebook also provides:\n",
"- `mvsusie` -- multivariate SuSiE with mr.mash TWAS weights (for jointly analyzing multiple phenotypes that share a genotype).\n",
"- `mnm_genes` -- multi-gene multivariate fine-mapping across genes sharing a window.\n",
"- `fsusie` -- functional SuSiE for functional/high-resolution modalities (e.g. methylation), with a `--fsusie-post-processing` option (`TI`, `HMM`, or `none`; `TI` is recommended).\n",
"\n",
"These follow the same `--genoFile / --phenoFile / --covFile` interface; they are documented here for completeness but the verified MWE above uses `susie_twas` + `ensemble_twas_weight`."
"These follow the same `--genoFile` / `--phenoFile` / `--covFile` interface; they are documented here for completeness but the verified MWE above uses `susie_twas` + `mnm`."
]
},
{
Expand All @@ -315,7 +313,7 @@
"source": [
"### Multivariate analysis: mvSuSiE and mr.mash (`mnm`)\n",
"\n",
"Fine-maps multiple molecular phenotypes jointly across a shared window using mvSuSiE with mr.mash priors. This step needs **multiple contexts** (e.g. the same genes measured under two or more conditions). The toy set ships a single context, so for this MWE we add a **second, synthetic context** (`example/colocboost_ctx2.bed.gz`, derived from context 1 with added noise *demo data only, not a real measurement*) and a multi-context phenotype manifest. With that in place the `mnm` step now runs end-to-end on the toy data and produces `multicontext_bvsr.rds`, `multicontext_twas_weights.rds`, and `multicontext_data.rds` per gene."
"Fine-maps multiple molecular phenotypes jointly across a shared window using mvSuSiE with mr.mash priors. This step needs **multiple contexts** (e.g. the same genes measured under two or more conditions). The toy set ships a single context, so for this MWE we add a **second, synthetic context** (`example/colocboost_ctx2.bed.gz`, derived from context 1 with added noise \u2014 *demo data only, not a real measurement*) and a multi-context phenotype manifest. With that in place the `mnm` step now runs end-to-end on the toy data and produces `multicontext_bvsr.rds`, `multicontext_twas_weights.rds`, and `multicontext_data.rds` per gene."
]
},
{
Expand All @@ -334,7 +332,7 @@
"# SAMPLE_001..) so genotype and phenotype/covariate samples match.\n",
"# - Requires the mr.mashr R package (provided by mr.mash.alpha in this env).\n",
"sos run pipeline/mnm_regression.ipynb mnm \\\n",
" --name test_mnm --cwd output_mnm3 \\\n",
" --name protocol_example --cwd output_mnm3 \\\n",
" --genoFile input/colocboost/example.chr22.bed \\\n",
" --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
" --covFile input/colocboost/example_covariates.tsv \\\n",
Expand Down Expand Up @@ -370,7 +368,7 @@
"# On this toy set all regions return a NULL multigene_bvsr (no multi-gene/trans signal),\n",
"# which is the correct, honest result for independent single-gene toy regions.\n",
"sos run pipeline/mnm_regression.ipynb mnm_genes \\\n",
" --name test_mnmgenes --cwd output_mnmgenes \\\n",
" --name protocol_example --cwd output_mnmgenes \\\n",
" --genoFile input/colocboost/example.chr22.bed \\\n",
" --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
" --covFile input/colocboost/example_covariates.tsv \\\n",
Expand Down Expand Up @@ -403,7 +401,7 @@
"# Functional SuSiE (fsusie) - VERIFIED end-to-end on the toy peak data.\n",
"# Run from the toy_example root. Produces *.fsusie_mixture_normal_TI__top_pc_weights.rds.\n",
"sos run pipeline/mnm_regression.ipynb fsusie \\\n",
" --name test_fsusie --cwd output/fsusie \\\n",
" --name protocol_example --cwd output/fsusie \\\n",
" --genoFile input/genotype/protocol_example.genotype.chr22.bed \\\n",
" --phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \\\n",
" --covFile input/covariate/protocol_example.covariates.tsv \\\n",
Expand Down Expand Up @@ -433,7 +431,7 @@
"# Still under development on the toy dataset (the mvfSuSiE R step errors out).\n",
"# Command template:\n",
"sos run pipeline/mnm_regression.ipynb mvfsusie \\\n",
" --name test_mvfsusie --cwd output/mvfsusie \\\n",
" --name protocol_example --cwd output/mvfsusie \\\n",
" --genoFile input/genotype/protocol_example.genotype.chr22.bed \\\n",
" --phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \\\n",
" --covFile input/covariate/protocol_example.covariates.tsv \\\n",
Expand All @@ -452,9 +450,12 @@
"| Step | Problem | Possible Reason | Solution |\n",
"| --- | --- | --- | --- |\n",
"| susie_twas | A region is skipped / no window found | 4th-column `ID` in the association-windows file does not match the phenotype list `ID` | Make the `ID` columns match exactly. |\n",
"| susie_twas | `susie_twas: pass --region-name <gene-id>` | No region selected | Pass `--region-name <gene-id>` (e.g. `ENSG00000130538`) or a `--region-list`. |\n",
"| susie_twas | `ERROR: One of the local workers has been killed` | Out-of-memory: too many regions in one process (`-j4`, or all 10 serially) | Use `-j1`; re-run heavy regions one at a time with a single-region phenotype list. |\n",
"| ensemble_twas_weight | `Missing upstream TWAS weights file` | Step 1 was not run for that region | Run `susie_twas` first so the `*.univariate_twas_weights.rds` exists. |\n",
"| both | Sample IDs do not overlap between genotype/covariates and phenotype | Genotype `.fam` IDs differ from the phenotype matrix sample names | Rename one side so they match (e.g. the gene-expression mode renames genotype/covariate IDs to `SAMPLE_001..049`). |"
"| mnm | `Target unavailable: ...qtl_dataset.rds` | `qtl_dataset_construct` not run first | Chain it: `qtl_dataset_construct+susie_twas`, then `mnm`. |\n",
"| mnm | `--phenotype-manifest missing required column(s): cond` | Single-context manifest used | Use the multi-context manifest (`pheno_manifest_multicontext.tsv`). |\n",
"| both | `No shared samples between phenotype and phenotype-covariate` | Covariates are in QTLtools (transposed) layout | Add `--transpose-covariates`. |\n",
"| both | Sample IDs do not overlap between genotype/covariates and phenotype | Genotype `.fam` IDs differ from the phenotype matrix sample names | Rename one side so they match (e.g. the gene-expression mode renames genotype/covariate IDs to `SAMPLE_001..060`). |"
]
},
{
Expand Down Expand Up @@ -1329,4 +1330,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
Loading
Loading