diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb index d2145a19..e022ec8a 100644 --- a/code/SoS/enrichment/sldsc_enrichment.ipynb +++ b/code/SoS/enrichment/sldsc_enrichment.ipynb @@ -10,7 +10,7 @@ "\n", "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n", "\n", - "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n" + "> **Environment note.** Steps 1\u20132 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3\u20134 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n" ] }, { @@ -137,8 +137,8 @@ "\n", "The pipeline has two computational layers:\n", "\n", - "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. We do not re-implement this.\n", - "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n", + "- **Regression layer** \u2014 the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. We do not re-implement this.\n", + "- **Post-processing layer** \u2014 standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n", "\n", "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n", "\n", @@ -151,7 +151,7 @@ "\n", "#### Reference panel and MAF cutoff\n", "\n", - "All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n", + "All LD-derived quantities \u2014 partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set \u2014 are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n", "\n", "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n", "\n", @@ -159,30 +159,30 @@ "\n", "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. From each polyfun run we obtain, per annotation:\n", "\n", - "- $\\tau_C$ and its standard error — **(polyfun)**.\n", - "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n", - "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n", - "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n", + "- $\\tau_C$ and its standard error \u2014 **(polyfun)**.\n", + "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ \u2014 **(polyfun)**.\n", + "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error \u2014 **(polyfun)**.\n", + "- The p-value of the differential per-SNP heritability test (defined below) \u2014 **(polyfun)**, computed internally with the full coefficient covariance matrix.\n", "\n", "We also obtain, per run:\n", "\n", - "- The total trait heritability $h^2_g$ — **(polyfun)**.\n", - "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n", + "- The total trait heritability $h^2_g$ \u2014 **(polyfun)**.\n", + "- The 200-block jackknife delete-values of $\\tau_C$ \u2014 **(polyfun)**.\n", "\n", "#### Quantities from the post-processing layer (pecotmr)\n", "\n", "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n", "\n", - "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n", - "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n", - "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n", - "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n", - "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n", - "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n", + "- $sd_C$ \u2014 per-annotation standard deviation over MAF $>$ cutoff SNPs \u2014 **(pecotmr: `compute_sldsc_annot_sd`)**.\n", + "- $M_{\\mathrm{ref}}$ \u2014 reference SNP count at the MAF cutoff \u2014 **(pecotmr: `compute_sldsc_M_ref`)**.\n", + "- Whether each annotation is binary or continuous \u2014 **(pecotmr: `is_binary_sldsc_annot`)**.\n", + "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ \u2014 **(pecotmr: `standardize_sldsc_trait`)**.\n", + "- EnrichStat point estimate and its standard error (formula below) \u2014 **(pecotmr: `standardize_sldsc_trait`)**.\n", + "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits \u2014 **(pecotmr: `meta_sldsc_random`)**.\n", "\n", "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n", "\n", - "#### Standardized tau ($\\tau^*$) — (pecotmr)\n", + "#### Standardized tau ($\\tau^*$) \u2014 (pecotmr)\n", "\n", "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)\n", "\n", @@ -194,7 +194,7 @@ "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n", "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n", "\n", - "#### Differential per-SNP heritability (\"EnrichStat\") — (polyfun + pecotmr)\n", + "#### Differential per-SNP heritability (\"EnrichStat\") \u2014 (polyfun + pecotmr)\n", "\n", "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n", "\n", @@ -206,7 +206,7 @@ "\n", "This per-trait point + SE is the input to cross-trait meta-analysis.\n", "\n", - "#### Reporting: binary vs. continuous annotations — (pecotmr)\n", + "#### Reporting: binary vs. continuous annotations \u2014 (pecotmr)\n", "\n", "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n", "\n", @@ -222,17 +222,17 @@ ">\n", "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n", "\n", - "#### Cross-type comparison: always use $\\tau^*_C$ — (pecotmr)\n", + "#### Cross-type comparison: always use $\\tau^*_C$ \u2014 (pecotmr)\n", "\n", - "For an apple-to-apple comparison **across binary and continuous annotations** — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion in the category) and $sd_C = $ empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n", + "For an apple-to-apple comparison **across binary and continuous annotations** \u2014 ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types \u2014 use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion in the category) and $sd_C = $ empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases \u2014 additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n", "\n", "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n", "\n", - "#### Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)\n", + "#### Joint analysis \u2014 (polyfun runs the regression; pecotmr standardizes both modes)\n", "\n", "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n", "\n", - "### Meta-Analysis across Traits (Random Effects) — (pecotmr)\n", + "### Meta-Analysis across Traits (Random Effects) \u2014 (pecotmr)\n", "\n", "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n", "\n", @@ -449,7 +449,7 @@ "to produce per-trait standardized tables and the default random-effects\n", "meta across all traits.\n", "\n", - "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output — e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n" + "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output \u2014 e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n" ] }, { @@ -541,10 +541,10 @@ "\n", "Each workflow writes into its `--cwd`:\n", "\n", - "- **make_annotation_files_ldscore** — polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n", - "- **get_heritability** — per trait and per target directory, the S-LDSC regression outputs `.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_` suffix.\n", - "- **postprocess** — a single `.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).\n", - "- **meta_subset** — a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n" + "- **make_annotation_files_ldscore** \u2014 polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n", + "- **get_heritability** \u2014 per trait and per target directory, the S-LDSC regression outputs `.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_` suffix.\n", + "- **postprocess** \u2014 a single `.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian\u2013Laird random-effects meta tables (tau*, E, EnrichStat).\n", + "- **meta_subset** \u2014 a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n" ] }, { @@ -1092,7 +1092,7 @@ "# index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ... (the 97 baseline annots)\n", "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n", "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n", - "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's\n", + "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header \u2014 ldsc.py's\n", "# \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n", "# column name.) [postprocess] auto-detects the target Category; if you instead pass\n", "# --target-categories, the names must match this column exactly.\n", @@ -1222,7 +1222,7 @@ "outputs": [], "source": [ "[postprocess]\n", - "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n", + "# Post-processing of polyfun outputs via pecotmr::sldscPostprocessingPipeline.\n", "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n", "# single-target and (when present) joint-target runs, computes Gazal-style\n", "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n", @@ -1275,15 +1275,15 @@ " trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n", " }\n", "\n", - " res <- sldsc_postprocessing_pipeline(\n", - " trait_single_prefixes = trait_single_prefixes,\n", - " trait_joint_prefix = trait_joint_prefix,\n", - " target_anno_dir = \"${target_anno_dir}\",\n", - " frqfile_dir = \"${frqfile_dir}\",\n", - " plink_name = \"${plink_name}\",\n", - " maf_cutoff = ${maf_cutoff},\n", - " target_categories = if (length(target_cats) > 0) target_cats else NULL,\n", - " target_labels = if (length(target_lab) > 0) target_lab else NULL\n", + " res <- sldscPostprocessingPipeline(\n", + " traitSinglePrefixes = trait_single_prefixes,\n", + " traitJointPrefix = trait_joint_prefix,\n", + " targetAnnoDir = \"${target_anno_dir}\",\n", + " frqfileDir = \"${frqfile_dir}\",\n", + " plinkName = \"${plink_name}\",\n", + " mafCutoff = ${maf_cutoff},\n", + " targetCategories = if (length(target_cats) > 0) target_cats else NULL,\n", + " targetLabels = if (length(target_lab) > 0) target_lab else NULL\n", " )\n", "\n", " saveRDS(res, \"${_output[0]}\")\n", @@ -1324,15 +1324,15 @@ "\n", " subset_per_trait <- res$per_trait[subset_traits]\n", "\n", - " # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n", - " view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n", - " view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n", + " # Map wide names (tau_star_single/joint) to bare names metaSldscRandom expects.\n", + " view_single <- pecotmr:::.sldscViewForMeta(subset_per_trait, \"single\")\n", + " view_joint <- pecotmr:::.sldscViewForMeta(subset_per_trait, \"joint\")\n", "\n", " out <- list(\n", - " tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"tau_star\")), target_cats),\n", - " tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, \"tau_star\")), target_cats),\n", - " enrichment = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichment\")), target_cats),\n", - " enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichstat\")), target_cats)\n", + " tau_star_single = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"tauStar\")), target_cats),\n", + " tau_star_joint = setNames(lapply(target_cats, function(c) metaSldscRandom(view_joint, c, \"tauStar\")), target_cats),\n", + " enrichment = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"enrichment\")), target_cats),\n", + " enrichstat = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"enrichstat\")), target_cats)\n", " )\n", "\n", " saveRDS(out, \"${_output[0]}\")\n", @@ -1380,4 +1380,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/code/SoS/mnm_analysis/multivariate_fine_mapping_vignette.ipynb b/code/SoS/mnm_analysis/multivariate_fine_mapping_vignette.ipynb index 7bfafb9c..bcab898b 100644 --- a/code/SoS/mnm_analysis/multivariate_fine_mapping_vignette.ipynb +++ b/code/SoS/mnm_analysis/multivariate_fine_mapping_vignette.ipynb @@ -52,7 +52,7 @@ "id": "25e3b8e3-4434-487c-b8ad-eb34905540dc", "metadata": {}, "source": [ - "- `{name}.{chr}_{gene_id}.multicontext_data.rds` — one RDS per gene, the multivariate (mvSuSiE / mr.mash) fine-mapping result that jointly models the effect across all conditions/contexts (e.g. `Ast`, `Exc`, `Inh`, `Mic`).\n\nInspecting the object shows the per-context posterior summaries:\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_old/data/test_mnm.chr11_ENSG00000073921.multicontext_data.rds\"))\nList of 8\n $ residual_Y : num [1:150, 1:6] 0.1241 0.2981 -0.6054 0.0435 0.0388 ...\n ..- attr(*, \"dimnames\")=List of 2\n .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n .. ..$ : chr [1:6] \"Ast\" \"Exc\" \"Inh\" \"Mic\" ...\n $ residual_Y_scalar: num [1:6] 1 1 1 1 1 1\n $ dropped_sample :List of 3\n ..$ X :List of 6\n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n ..$ Y :List of 6\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n ..$ covar:List of 6\n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n $ X : num [1:150, 1:9970] 0 0 1 1 1 0 0 1 1 1 ...\n ..- attr(*, \"dimnames\")=List of 2\n .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n $ maf : Named num [1:9970] 0.33 0.327 0.103 0.343 0.343 ...\n ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n $ chrom : chr \"chr11\"\n $ grange : chr [1:2] \"85957174\" \"86069881\"\n $ X_variance : Named num [1:9970] 0.441 0.443 0.192 0.458 0.458 ...\n ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n```\n\n* `*.multicontext_bvsr.rds`\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_old/multivariate_fine_mapping/test_mnm.chr11_ENSG00000073921.multicontext_bvsr.rds\"),max.level=2)\nList of 1\n $ ENSG00000073921:List of 10\n ..$ mrmash_fitted :List of 14\n .. ..- attr(*, \"class\")= chr [1:2] \"mr.mash\" \"list\"\n ..$ reweighted_mixture_prior :Classes 'MashInitializer', 'R6' \n Public:\n clone: function (deep = FALSE) \n compute_prior_inv: function () \n initialize: function (Ulist, grid, prior_weights = NULL, null_weight = 0, \n n_component: active binding\n n_condition: active binding\n precompute_cov_matrices: function (d, algorithm = c(\"R\", \"cpp\")) \n precomputed: active binding\n prior_variance: active binding\n remove_precomputed: function () \n scale_prior_variance: function (sigma) \n Private:\n inv_mats: NULL\n U: NULL\n xU: list \n ..$ reweighted_mixture_prior_cv:List of 5\n ..$ mvsusie_fitted :List of 28\n .. ..- attr(*, \"class\")= chr \"mvsusie\"\n ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n ..$ analysis_script : chr \"\\nlibrary(pecotmr)\\n\\nphenotype_files = c(\\\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/p\"| __truncated__\n ..$ other_quantities :List of 1\n ..$ context_names : chr [1:6] \"Ast\" \"Exc\" \"Inh\" \"Mic\" ...\n ..$ total_time_elapsed : 'proc_time' Named num [1:5] 104.38 0.653 105.828 0 0\n .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n ..$ region_info :List of 3\n```\n\n* `*.multicontext_twas_weights.rds`\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_old/multivariate_twas_weights/test_mnm.chr11_ENSG00000073921.multicontext_twas_weights.rds\"), max.level=3)\nList of 1\n $ ENSG00000073921:List of 6\n ..$ Ast_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Exc_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Inh_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Mic_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ OPC_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Oli_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n```" + "- `{name}.{chr}_{gene_id}.multicontext_data.rds` \u2014 one RDS per gene, the multivariate (mvSuSiE / mr.mash) fine-mapping result that jointly models the effect across all conditions/contexts (e.g. `Ast`, `Exc`, `Inh`, `Mic`).\n\nInspecting the object shows the per-context posterior summaries:\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_old/data/test_mnm.chr11_ENSG00000073921.multicontext_data.rds\"))\nList of 8\n $ residual_Y : num [1:150, 1:6] 0.1241 0.2981 -0.6054 0.0435 0.0388 ...\n ..- attr(*, \"dimnames\")=List of 2\n .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n .. ..$ : chr [1:6] \"Ast\" \"Exc\" \"Inh\" \"Mic\" ...\n $ residual_Y_scalar: num [1:6] 1 1 1 1 1 1\n $ dropped_sample :List of 3\n ..$ X :List of 6\n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n ..$ Y :List of 6\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n .. ..$ : chr \"strand\"\n ..$ covar:List of 6\n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n .. ..$ : chr(0) \n $ X : num [1:150, 1:9970] 0 0 1 1 1 0 0 1 1 1 ...\n ..- attr(*, \"dimnames\")=List of 2\n .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n $ maf : Named num [1:9970] 0.33 0.327 0.103 0.343 0.343 ...\n ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n $ chrom : chr \"chr11\"\n $ grange : chr [1:2] \"85957174\" \"86069881\"\n $ X_variance : Named num [1:9970] 0.441 0.443 0.192 0.458 0.458 ...\n ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n```\n\n* `*.multicontext_bvsr.rds`\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_old/multivariate_fine_mapping/test_mnm.chr11_ENSG00000073921.multicontext_bvsr.rds\"),max.level=2)\nList of 1\n $ ENSG00000073921:List of 10\n ..$ mrmash_fitted :List of 14\n .. ..- attr(*, \"class\")= chr [1:2] \"mr.mash\" \"list\"\n ..$ reweighted_mixture_prior :Classes 'MashInitializer', 'R6' \n Public:\n clone: function (deep = FALSE) \n compute_prior_inv: function () \n initialize: function (Ulist, grid, prior_weights = NULL, null_weight = 0, \n n_component: active binding\n n_condition: active binding\n precompute_cov_matrices: function (d, algorithm = c(\"R\", \"cpp\")) \n precomputed: active binding\n prior_variance: active binding\n remove_precomputed: function () \n scale_prior_variance: function (sigma) \n Private:\n inv_mats: NULL\n U: NULL\n xU: list \n ..$ reweighted_mixture_prior_cv:List of 5\n ..$ mvsusie_fitted :List of 28\n .. ..- attr(*, \"class\")= chr \"mvsusie\"\n ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n ..$ analysis_script : chr \"\\nlibrary(pecotmr)\\n\\nphenotype_files = c(\\\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/p\"| __truncated__\n ..$ other_quantities :List of 1\n ..$ context_names : chr [1:6] \"Ast\" \"Exc\" \"Inh\" \"Mic\" ...\n ..$ total_time_elapsed : 'proc_time' Named num [1:5] 104.38 0.653 105.828 0 0\n .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n ..$ region_info :List of 3\n```\n\n* `*.multicontext_twas_weights.rds`\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_old/multivariate_twas_weights/test_mnm.chr11_ENSG00000073921.multicontext_twas_weights.rds\"), max.level=3)\nList of 1\n $ ENSG00000073921:List of 6\n ..$ Ast_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Exc_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Inh_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Mic_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ OPC_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n ..$ Oli_ENSG00000073921:List of 6\n .. ..$ twas_weights :List of 2\n .. ..$ twas_predictions :List of 2\n .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n .. ..$ twas_cv_result :List of 4\n .. ..$ total_time_elapsed: 'proc_time' Named num [1:5] 1645.11 3.4 1652.73 0.01 0.02\n .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n .. ..$ region_info :List of 3\n```" ] }, { @@ -78,27 +78,27 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/mnm_regression.ipynb mnm \\\n", - " --name test_mnm --cwd output/mnm \\\n", - " --genoFile output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt \\\n", - " --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " --covFile output/covariate/bulk_rnaseq_tpm_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " output/covariate/bulk_rnaseq_tpm_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " output/covariate/bulk_rnaseq_tpm_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " output/covariate/bulk_rnaseq_tpm_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " output/covariate/bulk_rnaseq_tpm_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " output/covariate/bulk_rnaseq_tpm_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", - " --region-name ENSG00000073921 --save-data --no-skip-twas-weights \\\n", - " --phenotype-names Ast Exc Inh Mic OPC Oli \\\n", - " --mixture_prior output/multivariate_mixture/MWE_ed_bovy.EE.prior.rds \\\n", - " --max_cv_variants 5000 \\\n", - "\t--ld_reference_meta_file data/ld_meta_file_with_bim.tsv " + "# Step 1: build the QtlDataset RDS (one per study).\n", + "sos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n", + " --cwd output/qtl_dataset \\\n", + " --study test_mnm \\\n", + " --genotype-prefix output/genotype_by_chrom/wgs.merged.plink_qc.chr1 \\\n", + " --phenotype-manifest data/test_mnm.pheno_manifest.tsv\n", + "\n", + "# Step 2: per-region mvSuSiE joint fine-mapping.\n", + "sos run pipeline/multivariate_fine_mapping.ipynb multivariate_fine_mapping \\\n", + " --cwd output/mnm \\\n", + " --study test_mnm \\\n", + " --qtl-dataset output/qtl_dataset/test_mnm.qtl_dataset.rds \\\n", + " --regions chr1:1000000-2000000\n", + "\n", + "# Step 3 (optional): mr.mash + mvSuSiE TWAS weights on the same regions.\n", + "sos run pipeline/twas_weights.ipynb twas_weights \\\n", + " --cwd output/mnm \\\n", + " --study test_mnm \\\n", + " --qtl-dataset output/qtl_dataset/test_mnm.qtl_dataset.rds \\\n", + " --regions chr1:1000000-2000000 \\\n", + " --methods mrmash,mvsusie" ] }, { diff --git a/code/SoS/mnm_analysis/multivariate_multigene_fine_mapping_vignette.ipynb b/code/SoS/mnm_analysis/multivariate_multigene_fine_mapping_vignette.ipynb index f467e1b0..46e0ca3f 100644 --- a/code/SoS/mnm_analysis/multivariate_multigene_fine_mapping_vignette.ipynb +++ b/code/SoS/mnm_analysis/multivariate_multigene_fine_mapping_vignette.ipynb @@ -138,7 +138,7 @@ "id": "c6344399", "metadata": {}, "source": [ - "- `{region_name}.{chr}_{region_id}.multigene_bvsr.rds` — one RDS per region, the fitted multivariate (mvSuSiE / mr.mash) Bayesian variable-selection model jointly fine-mapping all genes in that region.\n", + "- `{region_name}.{chr}_{region_id}.multigene_bvsr.rds` \u2014 one RDS per region, the fitted multivariate (mvSuSiE / mr.mash) Bayesian variable-selection model jointly fine-mapping all genes in that region.\n", "\n", "Inspecting the object shows the per-region fit (posterior effect estimates, PVE, credible sets and convergence trace):\n", "\n", @@ -193,27 +193,20 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/mnm_regression.ipynb mnm_genes \\\n", - " --name ROSMAP_Ast_mega_eQTL \\\n", - " --genoFile data/mnm_genes/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \\\n", - " --phenoFile data/mnm_genes/snuc_pseudo_bulk.Ast.mega.normalized.log2cpm.region_list.txt \\\n", - " --covFile data/mnm_genes/snuc_pseudo_bulk.Ast.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \\\n", - " --customized-association-windows data/mnm_genes/extended_TADB.bed \\\n", - " --phenotype-names Ast_mega_eQTL \\\n", - " --max-cv-variants 5000 \\\n", - " --ld_reference_meta_file data/ld_meta_file_with_bim.tsv \\\n", - " --independent_variant_list data/mnm_genes/ld_pruned_variants.txt.gz \\\n", - " --fine_mapping_meta data/mnm_genes/combined_data_updated.tsv \\\n", - " --phenoIDFile data/mnm_genes/phenoIDFile_extended_TADB.bed \\\n", - " --region-name chr11_77324757_82556425 \\\n", - " --skip-analysis-pip-cutoff 0 \\\n", - " --maf 0.01 \\\n", - " --coverage 0.95 \\\n", - " --pheno_id_map_file data/mnm_genes/pheno_id_map_file.txt \\\n", - " --prior-canonical-matrices \\\n", - " --twas-cv-folds 0 \\\n", - " --trans-analysis \\\n", - " --cwd output/mnm_regression/mnm_genes -s build" + "# Step 1: build the QtlDataset RDS for the joint-gene region.\n", + "sos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n", + " --cwd output/qtl_dataset \\\n", + " --study ROSMAP_Ast_mega_eQTL \\\n", + " --genotype-prefix data/mnm_genes/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11 \\\n", + " --phenotype-manifest data/mnm_genes/snuc_pseudo_bulk.Ast.mega.pheno_manifest.tsv\n", + "\n", + "# Step 2: per-region mvSuSiE joint fine-mapping. Use a region span that\n", + "# covers multiple genes (multi-trait joint mvSuSiE).\n", + "sos run pipeline/multivariate_fine_mapping.ipynb multivariate_fine_mapping \\\n", + " --cwd output/mnm_genes \\\n", + " --study ROSMAP_Ast_mega_eQTL \\\n", + " --qtl-dataset output/qtl_dataset/ROSMAP_Ast_mega_eQTL.qtl_dataset.rds \\\n", + " --regions chr11:65000000-67000000" ] }, { @@ -286,4 +279,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb b/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb index 6741f137..b6441d42 100644 --- a/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb +++ b/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb @@ -53,7 +53,7 @@ "id": "6991377f-a6ef-45fe-ae78-20feb19fba9d", "metadata": {}, "source": [ - "- `{prefix}.sumstats.tsv.gz` — the harmonized, RAISS-imputed GWAS summary statistics (z-scores, effect sizes, allele frequencies and LD scores) used as the fine-mapping input.\n- `{name}.univariate_susie_rss.rds` — the SuSiE-RSS fine-mapping result computed from the summary statistics and the LD reference, one per region.\n\nThe summary-statistics table and fitted object look like:\n```\n> less RSS_QC_RAISS_imputed.chr11_84267999_86714492.AD_Bellenguez_2022.sumstats.tsv.gz | head\nchrom pos variant_id A1 A2 Var.x raiss_ld_score.x raiss_R2.x Var.y raiss_ld_score.y raiss_R2.y pvalue effect_allele_frequency odds_ratio ci_lower ci_upper n_case n_control het_isq het_pvalue beta se variants_id_original Var raiss_ld_score z\n11 84267999 11:84267999:T:C C T 0.131232630771865 35.033674731767 0.868767369228135 0.462675463930821\n11 84268181 11:84268181:CTG:C C CTG -1 Inf 0.6264 0.9965 1.039 0.891 1.211 83196 386481 0 0.4727 11:84268181:CTG:C -1 Inf -0.486555697823303\n11 84268207 11:84268207:T:C C T -1 Inf 0.6347 0.9641 0.99 0.948 1.033 85934 401577 52.6 0.0164 11:84268207:T:C -1 Inf 0.475113122171946\n11 84268259 11:84268259:C:T T C -1 Inf 0.5809 6e-04 0.875 0.544 1.407 69576 360279 69.8 0.06861 11:84268259:C:T -1 Inf -0.551980198019802\n11 84268265 11:84268265:C:A A C -1 Inf 0.1207 0.035 0.966 0.924 1.009 85934 401577 0 0.6535 11:84268265:C:A -1 Inf -1.54910714285714\n11 84268309 11:84268309:T:TC TC T -1 Inf 0.425 0.9995 0.833 0.532 1.305 69576 360279 0 0.5788 11:84268309:T:TC -1 Inf 0.79763986013986\n11 84268394 11:84268394:A:T T A -1 Inf 0.8437 0.3875 1.002 0.985 1.018 85934 401577 2.1 0.4233 11:84268394:A:T -1 Inf -0.202380952380952\n11 84268618 11:84268618:C:T T C -1 Inf 0.629 0.0104 1.02 0.941 1.106 85502 400497 0 0.497 11:84268618:C:T -1 Inf 0.483091787439614\n11 84268622 11:84268622:C:T T C -1 Inf 0.6138 0.0356 1.011 0.968 1.056 85934 401577 51.5 0.01969 11:84268622:C:T -1 Inf 0.504504504504505\n```\n\n* `*.univariate_susie_rss.rds`\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rss_analysis/univariate_rss/RSS_QC_RAISS_imputed.chr11_84267999_86714492.univariate_susie_rss.rds\"))\nList of 1\n $ chr11_84267999_86714492:List of 1\n ..$ AD_Bellenguez_2022:List of 2\n .. ..$ susie_rss_RSS_QC_RAISS_imputed:List of 6\n .. .. ..$ variant_names : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. ..$ analysis_script : chr \"library(pecotmr)\\nlibrary(dplyr)\\nlibrary(data.table)\\nskip_region = c(\\\"6:25000000-35000000\\\")\\nstudies = c(\\\"\"| __truncated__\n .. .. ..$ sumstats :List of 1\n .. .. .. ..$ z: num [1:10779] 0.463 -0.487 0.475 -0.552 -1.549 ...\n .. .. ..$ top_loci :'data.frame':\t10 obs. of 6 variables:\n .. .. .. ..$ variant_id : chr [1:10] \"11:86139201:C:T\" \"11:86141937:G:A\" \"11:86144030:ATCT:A\" \"11:86144034:AC:A\" ...\n .. .. .. ..$ z : num [1:10] 12 12 11.6 11.6 11.6 ...\n .. .. .. ..$ pip : num [1:10] 0.2735 0.356 0.0339 0.0336 0.0339 ...\n .. .. .. ..$ cs_coverage_0.95: num [1:10] 2 2 2 0 2 2 2 2 1 1\n .. .. .. ..$ cs_coverage_0.7 : num [1:10] 2 2 0 0 0 0 0 2 1 1\n .. .. .. ..$ cs_coverage_0.5 : num [1:10] 2 2 0 0 0 0 0 0 1 0\n .. .. ..$ susie_result_trimmed:List of 9\n .. .. .. ..$ pip : num [1:10779] 1.63e-10 1.72e-10 1.64e-10 1.65e-10 2.45e-10 ...\n .. .. .. ..$ sets :List of 5\n .. .. .. .. ..$ cs :List of 2\n .. .. .. .. .. ..$ L1: int [1:2] 7797 7798\n .. .. .. .. .. ..$ L2: int [1:7] 7725 7735 7742 7744 7754 7760 7770\n .. .. .. .. ..$ purity :'data.frame':\t2 obs. of 3 variables:\n .. .. .. .. .. ..$ min.abs.corr : num [1:2] 0.994 0.954\n .. .. .. .. .. ..$ mean.abs.corr : num [1:2] 0.994 0.982\n .. .. .. .. .. ..$ median.abs.corr: num [1:2] 0.994 0.99\n .. .. .. .. ..$ cs_index : int [1:2] 1 2\n .. .. .. .. ..$ coverage : num [1:2] 0.992 0.958\n .. .. .. .. ..$ requested_coverage: num 0.95\n .. .. .. ..$ cs_corr : num [1:2, 1:2] 1 0.64 0.64 1\n .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. ..$ sets_secondary:List of 2\n .. .. .. .. ..$ coverage_0.7:List of 2\n .. .. .. .. .. ..$ sets :List of 5\n .. .. .. .. .. .. ..$ cs :List of 2\n .. .. .. .. .. .. .. ..$ L1: int [1:2] 7797 7798\n .. .. .. .. .. .. .. ..$ L2: int [1:3] 7725 7735 7770\n .. .. .. .. .. .. ..$ purity :'data.frame':\t2 obs. of 3 variables:\n .. .. .. .. .. .. .. ..$ min.abs.corr : num [1:2] 0.994 0.954\n .. .. .. .. .. .. .. ..$ mean.abs.corr : num [1:2] 0.994 0.968\n .. .. .. .. .. .. .. ..$ median.abs.corr: num [1:2] 0.994 0.96\n .. .. .. .. .. .. ..$ cs_index : int [1:2] 1 2\n .. .. .. .. .. .. ..$ coverage : num [1:2] 0.992 0.73\n .. .. .. .. .. .. ..$ requested_coverage: num 0.7\n .. .. .. .. .. ..$ cs_corr: num [1:2, 1:2] 1 0.64 0.64 1\n .. .. .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. ..$ coverage_0.5:List of 2\n .. .. .. .. .. ..$ sets :List of 5\n .. .. .. .. .. .. ..$ cs :List of 2\n .. .. .. .. .. .. .. ..$ L1: int 7797\n .. .. .. .. .. .. .. ..$ L2: int [1:2] 7725 7735\n .. .. .. .. .. .. ..$ purity :'data.frame':\t2 obs. of 3 variables:\n .. .. .. .. .. .. .. ..$ min.abs.corr : num [1:2] 1 0.96\n .. .. .. .. .. .. .. ..$ mean.abs.corr : num [1:2] 1 0.96\n .. .. .. .. .. .. .. ..$ median.abs.corr: num [1:2] 1 0.96\n .. .. .. .. .. .. ..$ cs_index : int [1:2] 1 2\n .. .. .. .. .. .. ..$ coverage : num [1:2] 0.558 0.629\n .. .. .. .. .. .. ..$ requested_coverage: num 0.5\n .. .. .. .. .. ..$ cs_corr: num [1:2, 1:2] 1 0.64 0.64 1\n .. .. .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. ..$ alpha : num [1:2, 1:10779] 2.29e-16 1.63e-10 2.09e-16 1.72e-10 2.07e-16 ...\n .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. ..$ : NULL\n .. .. .. .. .. ..$ : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. .. ..$ lbf_variable : num [1:2, 1:10779] -1.97 -1.77 -2.06 -1.72 -2.07 ...\n .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. ..$ : NULL\n .. .. .. .. .. ..$ : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. .. ..$ V : num [1:2] 8.98e-05 5.29e-05\n .. .. .. ..$ niter : int 13\n .. .. .. ..$ max_L : int 5\n .. .. .. ..- attr(*, \"class\")= chr \"susie\"\n .. .. ..$ outlier_number : int 0\n .. ..$ rss_data_analyzed :'data.frame':\t10779 obs. of 26 variables:\n .. .. ..$ chrom : chr [1:10779] \"11\" \"11\" \"11\" \"11\" ...\n .. .. ..$ pos : int [1:10779] 84267999 84268181 84268207 84268259 84268265 84268309 84268394 84268618 84268622 84268650 ...\n .. .. ..$ variant_id : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. ..$ A1 : chr [1:10779] \"C\" \"C\" \"C\" \"T\" ...\n .. .. ..$ A2 : chr [1:10779] \"T\" \"CTG\" \"T\" \"C\" ...\n .. .. ..$ Var.x : num [1:10779] 0.131 NA NA NA NA ...\n .. .. ..$ raiss_ld_score.x : num [1:10779] 35 NA NA NA NA ...\n .. .. ..$ raiss_R2.x : num [1:10779] 0.869 NA NA NA NA ...\n .. .. ..$ Var.y : num [1:10779] NA -1 -1 -1 -1 -1 -1 -1 -1 -1 ...\n .. .. ..$ raiss_ld_score.y : num [1:10779] NA Inf Inf Inf Inf ...\n .. .. ..$ raiss_R2.y : num [1:10779] NA NA NA NA NA NA NA NA NA NA ...\n .. .. ..$ pvalue : num [1:10779] NA 0.626 0.635 0.581 0.121 ...\n .. .. ..$ effect_allele_frequency: num [1:10779] NA 0.9965 0.9641 0.0006 0.035 ...\n .. .. ..$ odds_ratio : num [1:10779] NA 1.039 0.99 0.875 0.966 ...\n .. .. ..$ ci_lower : num [1:10779] NA 0.891 0.948 0.544 0.924 0.532 0.985 0.941 0.968 0.982 ...\n .. .. ..$ ci_upper : num [1:10779] NA 1.21 1.03 1.41 1.01 ...\n .. .. ..$ n_case : num [1:10779] NA 83196 85934 69576 85934 ...\n .. .. ..$ n_control : num [1:10779] NA 386481 401577 360279 401577 ...\n .. .. ..$ het_isq : num [1:10779] NA 0 52.6 69.8 0 0 2.1 0 51.5 0 ...\n .. .. ..$ het_pvalue : num [1:10779] NA 0.4727 0.0164 0.0686 0.6535 ...\n .. .. ..$ beta : num [1:10779] NA NA NA NA NA NA NA NA NA NA ...\n .. .. ..$ se : num [1:10779] NA NA NA NA NA NA NA NA NA NA ...\n .. .. ..$ variants_id_original : chr [1:10779] NA \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. ..$ Var : num [1:10779] NA -1 -1 -1 -1 -1 -1 -1 -1 -1 ...\n .. .. ..$ raiss_ld_score : num [1:10779] NA Inf Inf Inf Inf ...\n .. .. ..$ z : num [1:10779] 0.463 -0.487 0.475 -0.552 -1.549 ..\n \n``` " + "- `{prefix}.sumstats.tsv.gz` \u2014 the harmonized, RAISS-imputed GWAS summary statistics (z-scores, effect sizes, allele frequencies and LD scores) used as the fine-mapping input.\n- `{name}.univariate_susie_rss.rds` \u2014 the SuSiE-RSS fine-mapping result computed from the summary statistics and the LD reference, one per region.\n\nThe summary-statistics table and fitted object look like:\n```\n> less RSS_QC_RAISS_imputed.chr11_84267999_86714492.AD_Bellenguez_2022.sumstats.tsv.gz | head\nchrom pos variant_id A1 A2 Var.x raiss_ld_score.x raiss_R2.x Var.y raiss_ld_score.y raiss_R2.y pvalue effect_allele_frequency odds_ratio ci_lower ci_upper n_case n_control het_isq het_pvalue beta se variants_id_original Var raiss_ld_score z\n11 84267999 11:84267999:T:C C T 0.131232630771865 35.033674731767 0.868767369228135 0.462675463930821\n11 84268181 11:84268181:CTG:C C CTG -1 Inf 0.6264 0.9965 1.039 0.891 1.211 83196 386481 0 0.4727 11:84268181:CTG:C -1 Inf -0.486555697823303\n11 84268207 11:84268207:T:C C T -1 Inf 0.6347 0.9641 0.99 0.948 1.033 85934 401577 52.6 0.0164 11:84268207:T:C -1 Inf 0.475113122171946\n11 84268259 11:84268259:C:T T C -1 Inf 0.5809 6e-04 0.875 0.544 1.407 69576 360279 69.8 0.06861 11:84268259:C:T -1 Inf -0.551980198019802\n11 84268265 11:84268265:C:A A C -1 Inf 0.1207 0.035 0.966 0.924 1.009 85934 401577 0 0.6535 11:84268265:C:A -1 Inf -1.54910714285714\n11 84268309 11:84268309:T:TC TC T -1 Inf 0.425 0.9995 0.833 0.532 1.305 69576 360279 0 0.5788 11:84268309:T:TC -1 Inf 0.79763986013986\n11 84268394 11:84268394:A:T T A -1 Inf 0.8437 0.3875 1.002 0.985 1.018 85934 401577 2.1 0.4233 11:84268394:A:T -1 Inf -0.202380952380952\n11 84268618 11:84268618:C:T T C -1 Inf 0.629 0.0104 1.02 0.941 1.106 85502 400497 0 0.497 11:84268618:C:T -1 Inf 0.483091787439614\n11 84268622 11:84268622:C:T T C -1 Inf 0.6138 0.0356 1.011 0.968 1.056 85934 401577 51.5 0.01969 11:84268622:C:T -1 Inf 0.504504504504505\n```\n\n* `*.univariate_susie_rss.rds`\n```\n> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rss_analysis/univariate_rss/RSS_QC_RAISS_imputed.chr11_84267999_86714492.univariate_susie_rss.rds\"))\nList of 1\n $ chr11_84267999_86714492:List of 1\n ..$ AD_Bellenguez_2022:List of 2\n .. ..$ susie_rss_RSS_QC_RAISS_imputed:List of 6\n .. .. ..$ variant_names : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. ..$ analysis_script : chr \"library(pecotmr)\\nlibrary(dplyr)\\nlibrary(data.table)\\nskip_region = c(\\\"6:25000000-35000000\\\")\\nstudies = c(\\\"\"| __truncated__\n .. .. ..$ sumstats :List of 1\n .. .. .. ..$ z: num [1:10779] 0.463 -0.487 0.475 -0.552 -1.549 ...\n .. .. ..$ top_loci :'data.frame':\t10 obs. of 6 variables:\n .. .. .. ..$ variant_id : chr [1:10] \"11:86139201:C:T\" \"11:86141937:G:A\" \"11:86144030:ATCT:A\" \"11:86144034:AC:A\" ...\n .. .. .. ..$ z : num [1:10] 12 12 11.6 11.6 11.6 ...\n .. .. .. ..$ pip : num [1:10] 0.2735 0.356 0.0339 0.0336 0.0339 ...\n .. .. .. ..$ cs_coverage_0.95: num [1:10] 2 2 2 0 2 2 2 2 1 1\n .. .. .. ..$ cs_coverage_0.7 : num [1:10] 2 2 0 0 0 0 0 2 1 1\n .. .. .. ..$ cs_coverage_0.5 : num [1:10] 2 2 0 0 0 0 0 0 1 0\n .. .. ..$ susie_result_trimmed:List of 9\n .. .. .. ..$ pip : num [1:10779] 1.63e-10 1.72e-10 1.64e-10 1.65e-10 2.45e-10 ...\n .. .. .. ..$ sets :List of 5\n .. .. .. .. ..$ cs :List of 2\n .. .. .. .. .. ..$ L1: int [1:2] 7797 7798\n .. .. .. .. .. ..$ L2: int [1:7] 7725 7735 7742 7744 7754 7760 7770\n .. .. .. .. ..$ purity :'data.frame':\t2 obs. of 3 variables:\n .. .. .. .. .. ..$ min.abs.corr : num [1:2] 0.994 0.954\n .. .. .. .. .. ..$ mean.abs.corr : num [1:2] 0.994 0.982\n .. .. .. .. .. ..$ median.abs.corr: num [1:2] 0.994 0.99\n .. .. .. .. ..$ cs_index : int [1:2] 1 2\n .. .. .. .. ..$ coverage : num [1:2] 0.992 0.958\n .. .. .. .. ..$ requested_coverage: num 0.95\n .. .. .. ..$ cs_corr : num [1:2, 1:2] 1 0.64 0.64 1\n .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. ..$ sets_secondary:List of 2\n .. .. .. .. ..$ coverage_0.7:List of 2\n .. .. .. .. .. ..$ sets :List of 5\n .. .. .. .. .. .. ..$ cs :List of 2\n .. .. .. .. .. .. .. ..$ L1: int [1:2] 7797 7798\n .. .. .. .. .. .. .. ..$ L2: int [1:3] 7725 7735 7770\n .. .. .. .. .. .. ..$ purity :'data.frame':\t2 obs. of 3 variables:\n .. .. .. .. .. .. .. ..$ min.abs.corr : num [1:2] 0.994 0.954\n .. .. .. .. .. .. .. ..$ mean.abs.corr : num [1:2] 0.994 0.968\n .. .. .. .. .. .. .. ..$ median.abs.corr: num [1:2] 0.994 0.96\n .. .. .. .. .. .. ..$ cs_index : int [1:2] 1 2\n .. .. .. .. .. .. ..$ coverage : num [1:2] 0.992 0.73\n .. .. .. .. .. .. ..$ requested_coverage: num 0.7\n .. .. .. .. .. ..$ cs_corr: num [1:2, 1:2] 1 0.64 0.64 1\n .. .. .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. ..$ coverage_0.5:List of 2\n .. .. .. .. .. ..$ sets :List of 5\n .. .. .. .. .. .. ..$ cs :List of 2\n .. .. .. .. .. .. .. ..$ L1: int 7797\n .. .. .. .. .. .. .. ..$ L2: int [1:2] 7725 7735\n .. .. .. .. .. .. ..$ purity :'data.frame':\t2 obs. of 3 variables:\n .. .. .. .. .. .. .. ..$ min.abs.corr : num [1:2] 1 0.96\n .. .. .. .. .. .. .. ..$ mean.abs.corr : num [1:2] 1 0.96\n .. .. .. .. .. .. .. ..$ median.abs.corr: num [1:2] 1 0.96\n .. .. .. .. .. .. ..$ cs_index : int [1:2] 1 2\n .. .. .. .. .. .. ..$ coverage : num [1:2] 0.558 0.629\n .. .. .. .. .. .. ..$ requested_coverage: num 0.5\n .. .. .. .. .. ..$ cs_corr: num [1:2, 1:2] 1 0.64 0.64 1\n .. .. .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. .. .. .. .. ..$ : chr [1:2] \"L1\" \"L2\"\n .. .. .. ..$ alpha : num [1:2, 1:10779] 2.29e-16 1.63e-10 2.09e-16 1.72e-10 2.07e-16 ...\n .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. ..$ : NULL\n .. .. .. .. .. ..$ : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. .. ..$ lbf_variable : num [1:2, 1:10779] -1.97 -1.77 -2.06 -1.72 -2.07 ...\n .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n .. .. .. .. .. ..$ : NULL\n .. .. .. .. .. ..$ : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. .. ..$ V : num [1:2] 8.98e-05 5.29e-05\n .. .. .. ..$ niter : int 13\n .. .. .. ..$ max_L : int 5\n .. .. .. ..- attr(*, \"class\")= chr \"susie\"\n .. .. ..$ outlier_number : int 0\n .. ..$ rss_data_analyzed :'data.frame':\t10779 obs. of 26 variables:\n .. .. ..$ chrom : chr [1:10779] \"11\" \"11\" \"11\" \"11\" ...\n .. .. ..$ pos : int [1:10779] 84267999 84268181 84268207 84268259 84268265 84268309 84268394 84268618 84268622 84268650 ...\n .. .. ..$ variant_id : chr [1:10779] \"11:84267999:T:C\" \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. ..$ A1 : chr [1:10779] \"C\" \"C\" \"C\" \"T\" ...\n .. .. ..$ A2 : chr [1:10779] \"T\" \"CTG\" \"T\" \"C\" ...\n .. .. ..$ Var.x : num [1:10779] 0.131 NA NA NA NA ...\n .. .. ..$ raiss_ld_score.x : num [1:10779] 35 NA NA NA NA ...\n .. .. ..$ raiss_R2.x : num [1:10779] 0.869 NA NA NA NA ...\n .. .. ..$ Var.y : num [1:10779] NA -1 -1 -1 -1 -1 -1 -1 -1 -1 ...\n .. .. ..$ raiss_ld_score.y : num [1:10779] NA Inf Inf Inf Inf ...\n .. .. ..$ raiss_R2.y : num [1:10779] NA NA NA NA NA NA NA NA NA NA ...\n .. .. ..$ pvalue : num [1:10779] NA 0.626 0.635 0.581 0.121 ...\n .. .. ..$ effect_allele_frequency: num [1:10779] NA 0.9965 0.9641 0.0006 0.035 ...\n .. .. ..$ odds_ratio : num [1:10779] NA 1.039 0.99 0.875 0.966 ...\n .. .. ..$ ci_lower : num [1:10779] NA 0.891 0.948 0.544 0.924 0.532 0.985 0.941 0.968 0.982 ...\n .. .. ..$ ci_upper : num [1:10779] NA 1.21 1.03 1.41 1.01 ...\n .. .. ..$ n_case : num [1:10779] NA 83196 85934 69576 85934 ...\n .. .. ..$ n_control : num [1:10779] NA 386481 401577 360279 401577 ...\n .. .. ..$ het_isq : num [1:10779] NA 0 52.6 69.8 0 0 2.1 0 51.5 0 ...\n .. .. ..$ het_pvalue : num [1:10779] NA 0.4727 0.0164 0.0686 0.6535 ...\n .. .. ..$ beta : num [1:10779] NA NA NA NA NA NA NA NA NA NA ...\n .. .. ..$ se : num [1:10779] NA NA NA NA NA NA NA NA NA NA ...\n .. .. ..$ variants_id_original : chr [1:10779] NA \"11:84268181:CTG:C\" \"11:84268207:T:C\" \"11:84268259:C:T\" ...\n .. .. ..$ Var : num [1:10779] NA -1 -1 -1 -1 -1 -1 -1 -1 -1 ...\n .. .. ..$ raiss_ld_score : num [1:10779] NA Inf Inf Inf Inf ...\n .. .. ..$ z : num [1:10779] 0.463 -0.487 0.475 -0.552 -1.549 ..\n \n``` " ] }, { @@ -79,15 +79,15 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/rss_analysis.ipynb univariate_rss \\\n", - " --ld-meta-data data/ld_meta_file_with_bim.tsv \\\n", - " --gwas-meta-data data/mnm_regression/gwas_meta_data.txt \\\n", - " --qc_method \"rss_qc\" --impute \\\n", - " --finemapping_method \"susie_rss\" \\\n", + "# Build per-(study, LD-block) GwasSumStats, run SuSiE-RSS fine-mapping,\n", + "# and render PIP plots end-to-end with the new pecotmr-integration pipeline:\n", + "sos run pipeline/gwas_rss_fine_mapping.ipynb \\\n", + " generate_gwas_sumstats+gwas_fine_mapping+gwas_rss_plot \\\n", " --cwd output/rss_analysis \\\n", - " --skip_analysis_pip_cutoff 0 \\\n", - " --skip_regions 6:25000000-35000000 \\\n", - " --region_name 22:49355984-50799822" + " --gwas-meta data/mnm_regression/gwas_meta_data.txt \\\n", + " --regions chr22:49355984-50799822 \\\n", + " --ld-meta data/ld_meta_file_with_bim.tsv \\\n", + " --qc-method slalom --impute" ] }, { diff --git a/code/SoS/mnm_analysis/univariate_fine_mapping_fsusie_vignette.ipynb b/code/SoS/mnm_analysis/univariate_fine_mapping_fsusie_vignette.ipynb index 33da62df..5a05681e 100644 --- a/code/SoS/mnm_analysis/univariate_fine_mapping_fsusie_vignette.ipynb +++ b/code/SoS/mnm_analysis/univariate_fine_mapping_fsusie_vignette.ipynb @@ -73,8 +73,8 @@ "id": "436a39bb-b182-49c0-9ec9-c4b899c8b684", "metadata": {}, "source": [ - "- `{name}.chr{N}_{start}_{end}.{n}_marks.dataset.rds` — the per-region functional dataset (the epigenomic marks, residualized phenotype and genotype matrices) prepared for fSuSiE.\n", - "- `{name}.chr{N}_..._top_pc_weights.rds` — the fitted fSuSiE result with top-PC TWAS weights, fine-mapping the functional (epigenomic) signal in that region.\n", + "- `{name}.chr{N}_{start}_{end}.{n}_marks.dataset.rds` \u2014 the per-region functional dataset (the epigenomic marks, residualized phenotype and genotype matrices) prepared for fSuSiE.\n", + "- `{name}.chr{N}_..._top_pc_weights.rds` \u2014 the fitted fSuSiE result with top-PC TWAS weights, fine-mapping the functional (epigenomic) signal in that region.\n", "\n", "Inspecting either object shows its internal structure:\n", "```\n", @@ -113,7 +113,7 @@ " .. .. ..- attr(*, \"names\")= chr [1:15857] \"chr7:139302775:AACACACACAC:AACACACACACAC\" \"chr7:139302775:AACACACACAC:AACACACACACACAC\" \"chr7:139304706:G:GGT\" \"chr7:139305695:G:T\" ...\n", " ..$ grange : chr [1:2] \"139293693\" \"145380632\"\n", " ..$ Y_coordinates :List of 1\n", - " .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':\t166 obs. of 3 variables:\n", + " .. ..$ :Classes \u2018tbl_df\u2019, \u2018tbl\u2019 and 'data.frame':\t166 obs. of 3 variables:\n", " .. .. ..$ #chr : chr [1:166] \"chr7\" \"chr7\" \"chr7\" \"chr7\" ...\n", " .. .. ..$ start: num [1:166] 1.39e+08 1.39e+08 1.39e+08 1.39e+08 1.39e+08 ...\n", " .. .. ..$ end : num [1:166] 1.39e+08 1.39e+08 1.39e+08 1.39e+08 1.39e+08 ...\n", @@ -146,7 +146,7 @@ " .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n", " .. ..$ fsusie_result :List of 34\n", " .. .. ..- attr(*, \"class\")= chr \"susiF\"\n", - " .. ..$ Y_coordinates :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':\t166 obs. of 3 variables:\n", + " .. ..$ Y_coordinates :Classes \u2018tbl_df\u2019, \u2018tbl\u2019 and 'data.frame':\t166 obs. of 3 variables:\n", " .. ..$ fsusie_summary :List of 5\n", " .. ..$ region_info :List of 3\n", "```" @@ -175,17 +175,19 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/mnm_regression.ipynb fsusie \\\n", - " --cwd output/fsusie/ \\\n", - " --name Mic \\\n", - " --genoFile data/fsusie/mwe.genotype_by_chrom_files.txt \\\n", - " --phenoFile data/fsusie/mwe.pheno.region_list \\\n", - " --covFile data/fsusie/mwe.chr7_139293693_145380632.Marchenko_PC.anon.gz \\\n", - " --cis-window 0 --max-cv-variants 5000 \\\n", - " --susie_top_pc 0 --phenotype-names ROSMAP_Mic_snATACQTL --maf 0.01 \\\n", - " --save-data \\\n", - " --numThreads 8 \\\n", - " --post_processing \"none\" --small-sample-correction " + "# Step 1: build the QtlDataset RDS for the functional phenotype.\n", + "sos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n", + " --cwd output/qtl_dataset \\\n", + " --study Mic \\\n", + " --genotype-prefix data/fsusie/mwe.chr7 \\\n", + " --phenotype-manifest data/fsusie/mwe.pheno_manifest.tsv\n", + "\n", + "# Step 2: per-region fSuSiE joint fine-mapping over the bins in the LD block.\n", + "sos run pipeline/functional_fine_mapping.ipynb functional_fine_mapping \\\n", + " --cwd output/fsusie \\\n", + " --study Mic \\\n", + " --qtl-dataset output/qtl_dataset/Mic.qtl_dataset.rds \\\n", + " --regions chr7:139293693-145380632" ] }, { @@ -256,4 +258,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/code/SoS/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb b/code/SoS/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb index cf0118e0..6072cd50 100644 --- a/code/SoS/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb +++ b/code/SoS/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb @@ -17,7 +17,7 @@ "\n", "This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS\n", "weights, via the `susie_twas` workflow of `pipeline/mnm_regression.ipynb`. For each\n", - "region the workflow fits SuSiE on the covariate-residualized phenotype × genotype\n", + "region the workflow fits SuSiE on the covariate-residualized phenotype \u00d7 genotype\n", "matrix and then cross-validates five prediction models (**enet**, **lasso**,\n", "**bayes_r**, **mrash**, **susie**) to pick TWAS weights.\n", "\n", @@ -32,10 +32,10 @@ "`start/end`). One `susie_twas` call fits one SuSiE + TWAS model per region.\n", "\n", "* **Gene-based phenotypes.** Manifest row = one gene; window = gene's containing TAD (or\n", - " ± cis-window around the TSS).\n", + " \u00b1 cis-window around the TSS).\n", "* **Peak-based phenotypes (caQTL, mQTL).** Manifest row = one peak/CpG; window = the\n", " peak's containing [extended-TAD](./../reference_data/generalized_TADB.html). A peak is\n", - " 2–5 kb and cannot serve as its own window.\n" + " 2\u20135 kb and cannot serve as its own window.\n" ] }, { @@ -53,9 +53,9 @@ "\n", "The pipeline expects two derived files per experiment:\n", "\n", - "* `pheno_manifest.tsv` — 5 columns (`#chr start end ID path`), one row per region.\n", + "* `pheno_manifest.tsv` \u2014 5 columns (`#chr start end ID path`), one row per region.\n", " `path` points to a bgzipped+tabixed phenotype BED restricted to that region.\n", - "* `association_windows.bed` — 4 columns (`chr start end ID`), one row per region.\n", + "* `association_windows.bed` \u2014 4 columns (`chr start end ID`), one row per region.\n", " `start/end` define the cis search window; `ID` must match a row in the manifest.\n", "\n", "For gene-based phenotypes these files are produced by the standard phenotype\n", @@ -99,18 +99,27 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/mnm_regression.ipynb susie_twas \\\n", - " --name test_susie_twas \\\n", - " --cwd output/mnm_regression/susie_twas \\\n", - " --genoFile output/genotype_by_chrom/wgs.merged.plink_qc.1.bed \\\n", - " --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", - " --phenotype-names test_pheno \\\n", - " --max-cv-variants 5000 \\\n", - " --ld_reference_meta_file data/ld_meta_file_with_bim.tsv \\\n", - " --region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 \\\n", - " --save-data\n" + "# Step 1: build the QtlDataset RDS.\n", + "sos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n", + " --cwd output/qtl_dataset \\\n", + " --study test_susie_twas \\\n", + " --genotype-prefix output/genotype_by_chrom/wgs.merged.plink_qc.chr1 \\\n", + " --phenotype-manifest data/test_susie_twas.pheno_manifest.tsv\n", + "\n", + "# Step 2: per-region univariate SuSiE fine-mapping.\n", + "sos run pipeline/fine_mapping.ipynb fine_mapping \\\n", + " --cwd output/fine_mapping \\\n", + " --study test_susie_twas \\\n", + " --qtl-dataset output/qtl_dataset/test_susie_twas.qtl_dataset.rds \\\n", + " --regions chr1:1000000-2000000\n", + "\n", + "# Step 3: TWAS weights (default per-method panel + SR-TWAS ensemble inline).\n", + "sos run pipeline/twas_weights.ipynb twas_weights \\\n", + " --cwd output/twas_weights \\\n", + " --study test_susie_twas \\\n", + " --qtl-dataset output/qtl_dataset/test_susie_twas.qtl_dataset.rds \\\n", + " --regions chr1:1000000-2000000 \\\n", + " --fine-mapping-result output/fine_mapping/test_susie_twas.chr1_1000000_2000000.finemap.rds" ] }, { @@ -126,16 +135,24 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/mnm_regression.ipynb susie_twas \\\n", - " --name example_10peaks_twas \\\n", - " --cwd output/susie_twas_out \\\n", - " --genoFile input/example.chr22.bed \\\n", - " --phenoFile output/inputs_ready/pheno_manifest.tsv \\\n", - " --covFile input/example_covariates.tsv \\\n", - " --customized-association-windows output/inputs_ready/association_windows.bed \\\n", - " --phenotype-names caQTL \\\n", - " --max-cv-variants 5000 \\\n", - " --save-data\n" + "# Smaller example: same chain, different data.\n", + "sos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n", + " --cwd output/qtl_dataset \\\n", + " --study example_10peaks_twas \\\n", + " --genotype-prefix input/example.chr22 \\\n", + " --phenotype-manifest output/inputs_ready/pheno_manifest.tsv\n", + "\n", + "sos run pipeline/fine_mapping.ipynb fine_mapping \\\n", + " --cwd output/fine_mapping \\\n", + " --study example_10peaks_twas \\\n", + " --qtl-dataset output/qtl_dataset/example_10peaks_twas.qtl_dataset.rds \\\n", + " --regions chr22:15000000-17000000\n", + "\n", + "sos run pipeline/twas_weights.ipynb twas_weights \\\n", + " --cwd output/twas_weights \\\n", + " --study example_10peaks_twas \\\n", + " --qtl-dataset output/qtl_dataset/example_10peaks_twas.qtl_dataset.rds \\\n", + " --regions chr22:15000000-17000000" ] }, { @@ -155,7 +172,7 @@ "...\n", "INFO: susie_twas output: /fine_mapping/._.univariate_bvsr.rds\n", " /twas_weights/._.univariate_twas_weights.rds\n", - " ... (2 × N_regions items)\n", + " ... (2 \u00d7 N_regions items)\n", "INFO: Workflow susie_twas is executed successfully.\n", "```\n" ] @@ -173,7 +190,8 @@ "metadata": {}, "outputs": [], "source": [ - "sos run pipeline/mnm_regression.ipynb susie_twas -h\n" + "sos run pipeline/fine_mapping.ipynb -h\n", + "sos run pipeline/twas_weights.ipynb -h" ] }, { @@ -182,29 +200,29 @@ "source": [ "Flags consumed by `susie_twas`:\n", "\n", - "* `--genoFile` — path to a PLINK1 `.bed` file (pass the `.bed` path, not the prefix).\n", + "* `--genoFile` \u2014 path to a PLINK1 `.bed` file (pass the `.bed` path, not the prefix).\n", " FID/IID in the `.fam` must match the phenotype/covariate sample IDs.\n", - "* `--phenoFile` — 5-column `#chr start end ID path` manifest. One row per region; `path`\n", + "* `--phenoFile` \u2014 5-column `#chr start end ID path` manifest. One row per region; `path`\n", " points at a bgzipped+tabixed phenotype BED restricted to that region. Pass multiple\n", " files for multi-condition analyses.\n", - "* `--covFile` — tab-delimited, literal `ID` header in column 1; covariate names in rows,\n", + "* `--covFile` \u2014 tab-delimited, literal `ID` header in column 1; covariate names in rows,\n", " sample IDs across the header row.\n", - "* `--customized-association-windows` — 4-column `chr start end ID` BED. Required unless\n", + "* `--customized-association-windows` \u2014 4-column `chr start end ID` BED. Required unless\n", " `--cis-window` is non-negative.\n", - "* `--cis-window` — symmetric cis radius (bp) around each region's `start/end`. Default\n", + "* `--cis-window` \u2014 symmetric cis radius (bp) around each region's `start/end`. Default\n", " `-1` (forces the customized-windows path).\n", - "* `--phenotype-names` — condition name(s), one per `--phenoFile`.\n", - "* `--max-cv-variants` — cap on variants used in cross-validation (default 5,000).\n", - "* `--ld_reference_meta_file` — summary-stats / RSS mode only; omit for individual-level.\n", - "* `--region-name` — restrict to one or more region IDs for sanity checks.\n", - "* `--save-data` — also emit `*.univariate_data.rds` with the residualized X/Y matrices.\n" + "* `--phenotype-names` \u2014 condition name(s), one per `--phenoFile`.\n", + "* `--max-cv-variants` \u2014 cap on variants used in cross-validation (default 5,000).\n", + "* `--ld_reference_meta_file` \u2014 summary-stats / RSS mode only; omit for individual-level.\n", + "* `--region-name` \u2014 restrict to one or more region IDs for sanity checks.\n", + "* `--save-data` \u2014 also emit `*.univariate_data.rds` with the residualized X/Y matrices.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Output format\n\nAll results are written under `--cwd` (e.g. `output/mnm_regression/susie_twas/twas_weights/`), one set of files per region:\n\n- `{name}.{chr}_{region_id}.univariate_twas_weights.rds` — the fitted SuSiE fine-mapping result plus the trained TWAS weights for that region/gene; this is the main output consumed by downstream TWAS.\n- a companion fine-mapping RDS keyed by region `ID` and phenotype condition (and a third data RDS when `--save-data` is set), holding the per-region SuSiE posterior summaries and credible sets.\n" + "## Output format\n\nAll results are written under `--cwd` (e.g. `output/mnm_regression/susie_twas/twas_weights/`), one set of files per region:\n\n- `{name}.{chr}_{region_id}.univariate_twas_weights.rds` \u2014 the fitted SuSiE fine-mapping result plus the trained TWAS weights for that region/gene; this is the main output consumed by downstream TWAS.\n- a companion fine-mapping RDS keyed by region `ID` and phenotype condition (and a third data RDS when `--save-data` is set), holding the per-region SuSiE posterior summaries and credible sets.\n" ] }, { @@ -213,7 +231,7 @@ "source": [ "## Anticipated Results\n", "\n", - "Quick sanity check — inspect cross-validation performance for the first region:" + "Quick sanity check \u2014 inspect cross-validation performance for the first region:" ] }, { @@ -222,12 +240,14 @@ "metadata": {}, "outputs": [], "source": [ + "# Inspect the new S4 TwasWeights collection via the accessor surface:\n", "Rscript -e '\n", - "fp <- \"output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr1_ENSG00000049246.univariate_twas_weights.rds\"\n", - "x <- readRDS(fp)\n", - "cat(\"methods:\", names(x[[1]][[1]]$twas_weights), \"\\n\")\n", - "print(x[[1]][[1]]$twas_cv_result$performance)\n", - "'\n" + "suppressPackageStartupMessages(library(pecotmr))\n", + "tw <- readRDS(\"output/twas_weights/test_susie_twas.chr1_1000000_2000000.twas_weights.rds\")\n", + "cat(\"methods present:\", paste(unique(as.character(tw$method)), collapse = \", \"), \"\\n\")\n", + "cat(\"rows:\", nrow(tw), \"\\n\")\n", + "print(getR2(tw)) # per-method CV R^2 for this region\n", + "'" ] }, { @@ -236,16 +256,16 @@ "source": [ "Interpretation:\n", "\n", - "* `twas_weights$` — use for TWAS (`Z ≈ weights · LD · GWAS z-scores`).\n", - "* `twas_cv_result$performance` — pick the method with highest `adj_rsq` before applying.\n", - "* `top_loci` in `univariate_bvsr.rds` — lead variants and credible-set coverages.\n", - "* `susie_result_trimmed$sets$cs` — indices of variants in each 95% credible set.\n", + "* `twas_weights$` \u2014 use for TWAS (`Z \u2248 weights \u00b7 LD \u00b7 GWAS z-scores`).\n", + "* `twas_cv_result$performance` \u2014 pick the method with highest `adj_rsq` before applying.\n", + "* `top_loci` in `univariate_bvsr.rds` \u2014 lead variants and credible-set coverages.\n", + "* `susie_result_trimmed$sets$cs` \u2014 indices of variants in each 95% credible set.\n", "\n", "### Common pitfalls\n", "\n", "* **Sample-ID drift.** The phenotype BED's header columns (5+), the bfile FID/IID, and\n", " the covariate TSV header must all be drawn from the same set. Sos does not warn on\n", - " missing samples — it silently drops them and may report zero overlap.\n", + " missing samples \u2014 it silently drops them and may report zero overlap.\n", "* **Empty regions.** If a region has no variants in its association window (e.g. the\n", " manifest points at a window with no bfile coverage) sos marks it completed with an\n", " empty output. Verify the expected file count in the `INFO: susie_twas output:` line.\n", diff --git a/code/SoS/pecotmr_integration/colocboost.ipynb b/code/SoS/pecotmr_integration/colocboost.ipynb index f0e2e4b0..c66822cf 100644 --- a/code/SoS/pecotmr_integration/colocboost.ipynb +++ b/code/SoS/pecotmr_integration/colocboost.ipynb @@ -37,7 +37,27 @@ }, "outputs": [], "source": [ - "[colocboost]\nif bool(genes) == bool(regions):\n raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\nif (joint_gwas or separate_gwas) and not gwas_sumstats.is_file():\n raise ValueError(\"--joint-gwas / --separate-gwas require --gwas-sumstats pointing at a QC'd GwasSumStats RDS.\")\nif not (xqtl_coloc or joint_gwas or separate_gwas):\n raise ValueError(\"Enable at least one of xqtl_coloc, joint_gwas, separate_gwas.\")\nfanout_items = genes if genes else regions\nfanout_kind = 'gene' if genes else 'region'\ndef _fanout_safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: qtl_dataset, for_each = 'fanout_items'\noutput: f\"{cwd}/{study}.{_fanout_safe(_fanout_items)}.colocboost.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/colocboost.R \\\n --qtl-dataset ${_input} \\\n ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n ${'--gwas-sumstats ' + str(gwas_sumstats) if gwas_sumstats.is_file() else ''} \\\n ${'--no-xqtl-coloc' if not xqtl_coloc else ''} \\\n ${'--joint-gwas' if joint_gwas else ''} \\\n ${'--separate-gwas' if separate_gwas else ''} \\\n --output ${_output}\n" + "[colocboost]\n", + "if bool(genes) == bool(regions):\n", + " raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\n", + "if (joint_gwas or separate_gwas) and not gwas_sumstats.is_file():\n", + " raise ValueError(\"--joint-gwas / --separate-gwas require --gwas-sumstats pointing at a QC'd GwasSumStats RDS.\")\n", + "if not (xqtl_coloc or joint_gwas or separate_gwas):\n", + " raise ValueError(\"Enable at least one of xqtl_coloc, joint_gwas, separate_gwas.\")\n", + "fanout_items = genes if genes else regions\n", + "fanout_kind = 'gene' if genes else 'region'\n", + "input: qtl_dataset, for_each = 'fanout_items'\n", + "output: f\"{cwd}/{study}.{_fanout_items.replace(':', '_').replace('-', '_')}.colocboost.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/colocboost.R \\\n", + " --qtl-dataset ${_input} \\\n", + " ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n", + " ${'--gwas-sumstats ' + str(gwas_sumstats) if gwas_sumstats.is_file() else ''} \\\n", + " ${'--no-xqtl-coloc' if not xqtl_coloc else ''} \\\n", + " ${'--joint-gwas' if joint_gwas else ''} \\\n", + " ${'--separate-gwas' if separate_gwas else ''} \\\n", + " --output ${_output}\n" ] } ], diff --git a/code/SoS/pecotmr_integration/fine_mapping.ipynb b/code/SoS/pecotmr_integration/fine_mapping.ipynb index 31f36f25..9c25c57d 100644 --- a/code/SoS/pecotmr_integration/fine_mapping.ipynb +++ b/code/SoS/pecotmr_integration/fine_mapping.ipynb @@ -5,7 +5,7 @@ "metadata": { "kernel": "SoS" }, - "source": "# SuSiE fine-mapping (QTL or GWAS)\n\n## Description\n\nTwo workflows, both backed by `pecotmr::fineMappingPipeline` (which dispatches on the S4 input class):\n\n- **`fine_mapping`** — per-gene / per-region fine-mapping over a pre-built `QtlDataset` RDS. Fan-out goes by `--genes` (list of trait IDs, gene mode) **or** `--regions` (list of `chr:start-end` strings, region mode); exactly one must be supplied. Each task loads the same QtlDataset RDS and fine-maps one unit.\n- **`fine_mapping_gwas`** — per-block fine-mapping over a list of per-block `GwasSumStats` RDS paths (typically produced by `gwas_sumstats_construct.R` / `gwas_sumstats.ipynb`). One task per RDS; each task produces one `GwasFineMappingResult` RDS. Downstream consumers (e.g. `enloc.ipynb`) concatenate the per-block outputs as needed.\n\n## Inputs\n\n### `fine_mapping` (QTL)\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes ID1 ID2 …` **or** `--regions chr:start-end …` — fan-out targets (mutually exclusive).\n- `--cis-window` — bp window around each gene's TSS (gene mode only). Default 1,000,000.\n- `--methods` — comma-separated fine-mapping method tokens. Default `susie`.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n### `fine_mapping_gwas` (GWAS)\n\n- `--gwas-sumstats-list S1.rds S2.rds …` — per-block GwasSumStats RDS paths to fan out over.\n- `--methods` — comma-separated fine-mapping method tokens. Default `susie`.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n### Both\n\n- `--cwd` — output directory. Default `output`.\n- `--study` — study label used in the output filename.\n- `--modular-script-dir` — directory containing the per-task worker R scripts. Default `code/script`; override when SoS is invoked from a working directory that doesn't contain them.\n\n## Outputs\n\n- `fine_mapping`: `{cwd}/{study}.{gene|region}.finemap.rds` per fan-out unit (region strings are sanitised: `:` and `-` become `_`).\n- `fine_mapping_gwas`: `{cwd}/{study}.{bn-of-gwas-sumstats-rds}.gwas_finemap.rds` per input RDS.\n" + "source": "# SuSiE fine-mapping (QTL or GWAS)\n\n## Description\n\nTwo workflows, both backed by `pecotmr::fineMappingPipeline` (which dispatches on the S4 input class):\n\n- **`fine_mapping`** — per-gene / per-region fine-mapping over a pre-built `QtlDataset` RDS. Fan-out goes by `--genes` (list of trait IDs, gene mode) **or** `--regions` (list of `chr:start-end` strings, region mode); exactly one must be supplied. Each task loads the same QtlDataset RDS and fine-maps one unit.\n- **`fine_mapping_gwas`** — per-block fine-mapping over a list of per-block `GwasSumStats` RDS paths (typically produced by `gwas_sumstats_construct.R` / `gwas_sumstats.ipynb`). One task per RDS; each task produces one `GwasFineMappingResult` RDS. Downstream consumers (e.g. `enloc.ipynb`) concatenate the per-block outputs as needed.\n\n## Inputs\n\n### `fine_mapping` (QTL)\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes ID1 ID2 \u2026` **or** `--regions chr:start-end \u2026` — fan-out targets (mutually exclusive).\n- `--cis-window` — bp window around each gene's TSS (gene mode only). Default 1,000,000.\n- `--methods` — comma-separated fine-mapping method tokens. Default `susie`.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n### `fine_mapping_gwas` (GWAS)\n\n- `--gwas-sumstats-list S1.rds S2.rds \u2026` — per-block GwasSumStats RDS paths to fan out over.\n- `--methods` — comma-separated fine-mapping method tokens. Default `susie`.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n### Both\n\n- `--cwd` — output directory. Default `output`.\n- `--study` — study label used in the output filename.\n- `--modular-script-dir` — directory containing the per-task worker R scripts. Default `code/script`; override when SoS is invoked from a working directory that doesn't contain them.\n\n## Outputs\n\n- `fine_mapping`: `{cwd}/{study}.{gene|region}.finemap.rds` per fan-out unit (region strings are sanitised: `:` and `-` become `_`).\n- `fine_mapping_gwas`: `{cwd}/{study}.{bn-of-gwas-sumstats-rds}.gwas_finemap.rds` per input RDS.\n" }, { "cell_type": "markdown", @@ -30,7 +30,25 @@ "kernel": "SoS" }, "outputs": [], - "source": "[fine_mapping]\nif not qtl_dataset.is_file():\n raise ValueError(\"fine_mapping requires --qtl-dataset to point at an existing QtlDataset RDS.\")\nif bool(genes) == bool(regions):\n raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\nfanout_items = genes if genes else regions\nfanout_kind = 'gene' if genes else 'region'\ndef _fanout_safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: qtl_dataset, for_each = 'fanout_items'\noutput: f\"{cwd}/{study}.{_fanout_safe(_fanout_items)}.finemap.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n --qtl-dataset ${_input} \\\n ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n --methods ${methods} \\\n --coverage ${coverage} \\\n --output ${_output}\n" + "source": [ + "[fine_mapping]\n", + "if not qtl_dataset.is_file():\n", + " raise ValueError(\"fine_mapping requires --qtl-dataset to point at an existing QtlDataset RDS.\")\n", + "if bool(genes) == bool(regions):\n", + " raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\n", + "fanout_items = genes if genes else regions\n", + "fanout_kind = 'gene' if genes else 'region'\n", + "input: qtl_dataset, for_each = 'fanout_items'\n", + "output: f\"{cwd}/{study}.{_fanout_items.replace(':', '_').replace('-', '_')}.finemap.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n", + " --qtl-dataset ${_input} \\\n", + " ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n", + " --methods ${methods} \\\n", + " --coverage ${coverage} \\\n", + " --output ${_output}\n" + ] }, { "cell_type": "code", diff --git a/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb b/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb new file mode 100644 index 00000000..a628ad5b --- /dev/null +++ b/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb @@ -0,0 +1,248 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fine-mapping postprocessing\n", + "\n", + "## Description\n", + "\n", + "Postprocessing utilities for the S4 `QtlFineMappingResult` / `GwasFineMappingResult` RDSes produced by `fine_mapping.ipynb`, `multivariate_fine_mapping.ipynb`, `functional_fine_mapping.ipynb`, or `gwas_rss_fine_mapping.ipynb`. Replaces the legacy `mnm_analysis/mnm_postprocessing.ipynb` (which consumed the older list-shaped RDS layout).\n", + "\n", + "All steps consume FMR RDSes through the pecotmr S4 accessor surface (`getPip`, `getCs`, `getTopLoci`, `getMarginalEffects`, `getVariantIds`) so they work uniformly for QTL and GWAS outputs.\n", + "\n", + "Workflow steps (all independent — pick what you need):\n", + "\n", + "- **`[export_top_loci]`** — concatenate every input RDS's top-loci table into one TSV (one row per variant, with `study / context / trait / region_id / method / source` identifier columns).\n", + "- **`[export_credible_sets]`** — same shape, but the per-row CS membership.\n", + "- **`[export_pip]`** — raw `variant_id / pip` long table across all inputs (filterable with `--signal-cutoff`).\n", + "- **`[export_marginals]`** — per-variant marginal effects (`beta / se / z / p`) across all inputs.\n", + "- **`[pip_landscape_plot]`** — per-RDS PIP-vs-position scatter, one facet per FMR row.\n", + "- **`[upset_cs_overlap]`** — UpSet plot of credible-set variant overlap across one or more RDSes (one set per `(study, context, trait, method)` tuple for QTL; per `(study, method, region_id)` for GWAS).\n", + "\n", + "## Inputs\n", + "\n", + "- `--fmr-list [ ...]` — one or more FineMappingResult RDS files (mix QTL and GWAS freely; identifier columns adapt per class).\n", + "- `--name` — output filename prefix.\n", + "- `--signal-cutoff` — PIP cutoff for `export_top_loci` / `export_pip`. Default 0 (no filter); pass 0.025 to mirror the legacy susie default.\n", + "- `--max-sets` — cap on UpSet set count (default 20).\n", + "- `--cwd` / `--modular-script-dir` — standard SoS infra.\n", + "\n", + "## Outputs\n", + "\n", + "Per-step:\n", + "\n", + "- `{cwd}/{name}.topLoci.tsv.gz`\n", + "- `{cwd}/{name}.cs.tsv.gz`\n", + "- `{cwd}/{name}.pip.tsv.gz`\n", + "- `{cwd}/{name}.marginals.tsv.gz`\n", + "- `{cwd}/plots/{name}.{rds-basename}.pip_landscape.png` (one per input RDS)\n", + "- `{cwd}/plots/{name}.upset.png`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "```bash\n", + "# Top-loci TSV + UpSet plot across every fine-mapping RDS in a run:\n", + "sos run pipeline/fine_mapping_postprocessing.ipynb \\\n", + " export_top_loci+upset_cs_overlap \\\n", + " --cwd output/postprocessing \\\n", + " --modular-script-dir /path/to/code/script \\\n", + " --name myrun \\\n", + " --signal-cutoff 0.025 \\\n", + " --fmr-list output/fine_mapping/*.finemap.rds\n", + "\n", + "# All four TSV exports + per-RDS PIP plot:\n", + "sos run pipeline/fine_mapping_postprocessing.ipynb \\\n", + " export_top_loci+export_credible_sets+export_pip+export_marginals+pip_landscape_plot \\\n", + " --cwd output/postprocessing --name myrun \\\n", + " --fmr-list output/fine_mapping/*.finemap.rds\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: modular_script_dir = path('code/script')\n", + "parameter: name = str\n", + "# One or more FineMappingResult RDS paths (mix QTL + GWAS freely).\n", + "parameter: fmr_list = []\n", + "parameter: signal_cutoff = 0.0\n", + "parameter: max_sets = 20\n", + "parameter: container = ''\n", + "parameter: job_size = 1\n", + "parameter: walltime = '30m'\n", + "parameter: mem = '8G'\n", + "parameter: numThreads = 1\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[export_top_loci]\n", + "stop_if(not fmr_list, '--fmr-list requires at least one RDS path.')\n", + "paths = [str(p) for p in fmr_list]\n", + "input: paths\n", + "output: f\"{cwd}/{name}.topLoci.tsv.gz\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping_export.R \\\n", + " --input ${_input} \\\n", + " --view topLoci \\\n", + " --signal-cutoff ${signal_cutoff} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[export_credible_sets]\n", + "stop_if(not fmr_list, '--fmr-list requires at least one RDS path.')\n", + "paths = [str(p) for p in fmr_list]\n", + "input: paths\n", + "output: f\"{cwd}/{name}.cs.tsv.gz\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping_export.R \\\n", + " --input ${_input} \\\n", + " --view cs \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[export_pip]\n", + "stop_if(not fmr_list, '--fmr-list requires at least one RDS path.')\n", + "paths = [str(p) for p in fmr_list]\n", + "input: paths\n", + "output: f\"{cwd}/{name}.pip.tsv.gz\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping_export.R \\\n", + " --input ${_input} \\\n", + " --view pip \\\n", + " --signal-cutoff ${signal_cutoff} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[export_marginals]\n", + "stop_if(not fmr_list, '--fmr-list requires at least one RDS path.')\n", + "paths = [str(p) for p in fmr_list]\n", + "input: paths\n", + "output: f\"{cwd}/{name}.marginals.tsv.gz\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping_export.R \\\n", + " --input ${_input} \\\n", + " --view marginals \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[pip_landscape_plot]\n", + "stop_if(not fmr_list, '--fmr-list requires at least one RDS path.')\n", + "paths = [str(p) for p in fmr_list]\n", + "# One PNG per input RDS (per-row fan-out).\n", + "input: paths, group_by = 1\n", + "output: f\"{cwd}/plots/{name}.{_input:bn}.pip_landscape.png\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping_pip_plot.R \\\n", + " --input ${_input} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[upset_cs_overlap]\n", + "stop_if(not fmr_list, '--fmr-list requires at least one RDS path.')\n", + "paths = [str(p) for p in fmr_list]\n", + "input: paths\n", + "output: f\"{cwd}/plots/{name}.upset.png\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping_upset.R \\\n", + " --input ${_input} \\\n", + " --max-sets ${max_sets} \\\n", + " --output ${_output}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "pygments_lexer": "python", + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "", + "" + ] + ], + "version": "0.22.4" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/functional_fine_mapping.ipynb b/code/SoS/pecotmr_integration/functional_fine_mapping.ipynb new file mode 100644 index 00000000..31cf4334 --- /dev/null +++ b/code/SoS/pecotmr_integration/functional_fine_mapping.ipynb @@ -0,0 +1,141 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Functional fine-mapping (fSuSiE)\n", + "\n", + "## Description\n", + "\n", + "Per-region functional fine-mapping via `pecotmr::fineMappingPipeline(qtlDataset, methods = \"fsusie\")`. fSuSiE jointly fine-maps multiple correlated phenotypes (typically epigenomic bins within an LD block) in one region, producing one `QtlFineMappingResult` per fan-out unit. Each task loads the same `QtlDataset` RDS and fits one unit.\n", + "\n", + "This replaces the `[fsusie]` and `[mvfsusie]` steps of the legacy `mnm_regression.ipynb`. The same `fine_mapping.R` worker that drives univariate SuSiE drives fSuSiE here — only the `--methods` value differs.\n", + "\n", + "## Inputs\n", + "\n", + "- `--qtl-dataset` — path to the `QtlDataset` RDS produced by `qtl_dataset.ipynb`. Each context's phenotype matrix typically holds multiple correlated bins per region (e.g. epigenomic peaks); fSuSiE fits them jointly.\n", + "- `--regions chr:start-end ...` **or** `--genes ID1 ID2 ...` — fan-out targets (mutually exclusive). Region mode is the standard use.\n", + "- `--cis-window` — bp window (gene mode only). Default 1,000,000.\n", + "- `--methods` — comma-separated method tokens. Default `fsusie`.\n", + "- `--coverage` — credible-set coverage. Default 0.95.\n", + "- `--method-args` — optional JSON object spliced into `fineMappingPipeline()` for fSuSiE-specific knobs (e.g. `prior`, `max_SNP_EM`, `max_scale`, `min_purity`) and other pipeline knobs not exposed as flags.\n", + "- `--cwd` — output directory. Default `output`.\n", + "- `--study` — study label used in the output filename.\n", + "- `--modular-script-dir` — directory holding the per-task R workers. Default `code/script`.\n", + "\n", + "## Outputs\n", + "\n", + "- `{cwd}/{study}.{gene|region}.fn_finemap.rds` — one `QtlFineMappingResult` per fan-out unit. Region strings are sanitised (`:` and `-` become `_`).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "Region mode (canonical, joint multi-bin):\n", + "```bash\n", + "sos run pipeline/functional_fine_mapping.ipynb functional_fine_mapping \\\n", + " --cwd output \\\n", + " --modular-script-dir /path/to/xqtl-protocol/code/script \\\n", + " --study TEST_STUDY \\\n", + " --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n", + " --regions chr7:139293693-145380632\n", + "```\n", + "\n", + "With fSuSiE prior + max-iter overrides:\n", + "```bash\n", + "sos run pipeline/functional_fine_mapping.ipynb functional_fine_mapping \\\n", + " --cwd output --study TEST_STUDY \\\n", + " --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n", + " --regions chr7:139293693-145380632 \\\n", + " --method-args '{\"signalCutoff\":0.01}'\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: study = str\n", + "parameter: modular_script_dir = path('code/script')\n", + "parameter: qtl_dataset = path('.')\n", + "parameter: genes = []\n", + "parameter: regions = []\n", + "parameter: cis_window = 1000000\n", + "parameter: methods = 'fsusie'\n", + "parameter: coverage = 0.95\n", + "# JSON object spliced into fineMappingPipeline() via do.call. Empty disables.\n", + "parameter: method_args = ''\n", + "parameter: container = ''\n", + "parameter: job_size = 1\n", + "parameter: walltime = '1h'\n", + "parameter: mem = '16G'\n", + "parameter: numThreads = 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[functional_fine_mapping]\n", + "if not qtl_dataset.is_file():\n", + " raise ValueError(\"functional_fine_mapping requires --qtl-dataset to point at an existing QtlDataset RDS.\")\n", + "if bool(genes) == bool(regions):\n", + " raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\n", + "fanout_items = genes if genes else regions\n", + "fanout_kind = 'gene' if genes else 'region'\n", + "input: qtl_dataset, for_each = 'fanout_items'\n", + "output: f\"{cwd}/{study}.{_fanout_items.replace(':', '_').replace('-', '_')}.fn_finemap.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n", + " --qtl-dataset ${_input} \\\n", + " ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n", + " --methods ${methods} \\\n", + " --coverage ${coverage} \\\n", + " ${('--method-args ' + repr(method_args)) if method_args else ''} \\\n", + " --output ${_output}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "pygments_lexer": "python", + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "", + "" + ] + ], + "version": "0.22.4" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb b/code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb new file mode 100644 index 00000000..83f397da --- /dev/null +++ b/code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb @@ -0,0 +1,251 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# GWAS RSS fine-mapping (per study \u00d7 per LD block)\n", + "\n", + "## Description\n", + "\n", + "End-to-end SuSiE-RSS fine-mapping over GWAS summary statistics. Four SoS steps:\n", + "\n", + "1. **`[generate_manifest]`** — calls `gwas_rss_manifest.R` to resolve `--gwas-meta` + `--gwas-tsv-list` against `--region-list` + `--regions` into a single TSV manifest (one row per study x region job). All cross-product / dedup / sanitisation logic lives in R.\n", + "2. **`[generate_gwas_sumstats]`** — per (study \u00d7 LD block) fan-out. Calls `gwas_sumstats_construct.R` to build one `GwasSumStats` RDS per study \u00d7 block, with `summaryStatsQc()` (optionally including SLALOM/DENTIST QC and RAISS imputation).\n", + "3. **`[gwas_fine_mapping]`** — per RDS, runs `fine_mapping.R --gwas-sumstats` which dispatches to `pecotmr::fineMappingPipeline(gwasSumStats, methods = ...)` (default `susie`; the SuSiE-RSS variant inside pecotmr).\n", + "4. **`[gwas_rss_plot]`** (optional, opt-out via `--no-plot`) — per fine-mapping RDS, renders a PIP plot via `gwas_rss_plot.R`.\n", + "\n", + "Replaces the legacy `mnm_analysis/mnm_methods/rss_analysis.ipynb` two-step workflow. Reuses `fine_mapping.R` (no new fine-mapping worker required) and the new `--method-args` / `--qc-args` JSON passthroughs for per-method and per-QC tuning.\n", + "\n", + "## Inputs\n", + "\n", + "### Per-study inputs (one or both forms; the workflow fans out over their union)\n", + "\n", + "- `--gwas-meta` — TSV with columns `study_id`, `path` (relative to the meta file or absolute), and optionally `column_mapping` (path to a per-study YAML; see `gwas_sumstats_construct.R --column-mapping`). One row per study.\n", + "- `--gwas-tsv-list S1=path1 S2=path2 ...` — explicit STUDY=PATH pairs (alternative to / additional to `--gwas-meta`).\n", + "\n", + "### Region inputs (one or both; their union is the fan-out target)\n", + "\n", + "- `--region-list` — BED-like TSV (`#chr`, `start`, `end`, ...) listing LD blocks.\n", + "- `--regions chr:start-end ...` — explicit region strings.\n", + "\n", + "### LD reference\n", + "\n", + "- `--ld-meta` — LD-reference meta TSV (`#chr`, `start`, `end`, `path`). One LD block per row, or one row with `start = end = 0` to mean “whole chromosome.”\n", + "\n", + "### QC knobs (forwarded to `gwas_sumstats_construct.R`)\n", + "\n", + "- `--qc-method` — `none` (default), `slalom`, or `dentist`.\n", + "- `--impute` — flag to enable RAISS sumstat imputation.\n", + "- `--qc-args '{\"key\":value, ...}'` — JSON passthrough for any other `summaryStatsQc()` kwarg.\n", + "\n", + "### Fine-mapping knobs (forwarded to `fine_mapping.R --gwas-sumstats`)\n", + "\n", + "- `--methods` — comma-separated method tokens. Default `susie` (= SuSiE-RSS internally for `GwasSumStats` input).\n", + "- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n", + "- `--method-args '{\"susie\":{\"L\":1}}'` — per-method kwargs spliced into `fineMappingPipeline()`.\n", + "\n", + "### Infrastructure\n", + "\n", + "- `--cwd` — output directory (default `output`).\n", + "- `--modular-script-dir` — directory holding the per-task R workers (default `code/script`).\n", + "- `--genome` — genome build label (default `GRCh38`).\n", + "- `--no-plot` — skip the PIP plot step.\n", + "\n", + "## Outputs\n", + "\n", + "- `{cwd}/sumstats/{study}.{chrN_S_E}.gwas_sumstats.rds` — one per (study, LD block).\n", + "- `{cwd}/fine_mapping/{study}.{chrN_S_E}.gwas_finemap.rds` — one `GwasFineMappingResult` per (study, LD block).\n", + "- `{cwd}/plots/{study}.{chrN_S_E}.pip_plot.png` — per (study, LD block) PIP plot (when `--no-plot` is unset).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "Minimal run with explicit inputs:\n", + "```bash\n", + "sos run pipeline/gwas_rss_fine_mapping.ipynb gwas_rss_fine_mapping \\\n", + " --cwd output --modular-script-dir /path/to/xqtl-protocol/code/script \\\n", + " --gwas-tsv-list MyGWAS=input/twas/protocol_example.twas.gwas_sumstats.chr22.tsv.gz \\\n", + " --regions chr22:10516173-17414263 \\\n", + " --ld-meta input/rss_analysis/protocol_example.ld_meta.tsv\n", + "```\n", + "\n", + "With SLALOM QC + RAISS imputation + per-method susie tuning:\n", + "```bash\n", + "sos run pipeline/gwas_rss_fine_mapping.ipynb gwas_rss_fine_mapping \\\n", + " --cwd output --modular-script-dir /path/to/code/script \\\n", + " --gwas-meta input/gwas/gwas_meta.tsv \\\n", + " --region-list input/rss_analysis/ld_blocks.bed \\\n", + " --ld-meta input/rss_analysis/ld_meta.tsv \\\n", + " --qc-method slalom --impute --qc-args '{\"mafCutoff\":0.01}' \\\n", + " --methods susie --method-args '{\"susie\":{\"L\":10}}'\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: modular_script_dir = path('code/script')\n", + "# --- per-study GWAS inputs (use either or both) ---------------------\n", + "parameter: gwas_meta = path('.')\n", + "parameter: gwas_tsv_list = [] # list of STUDY=PATH items\n", + "# --- region inputs (use either or both) -----------------------------\n", + "parameter: region_list = path('.')\n", + "parameter: regions = [] # list of chr:start-end strings\n", + "# --- LD reference + genome ------------------------------------------\n", + "parameter: ld_meta = path\n", + "parameter: genome = 'GRCh38'\n", + "# --- QC knobs (forwarded to gwas_sumstats_construct.R) --------------\n", + "parameter: qc_method = 'none' # none | slalom | dentist\n", + "parameter: impute = False\n", + "parameter: qc_args = '' # JSON object spliced into summaryStatsQc()\n", + "# --- fine-mapping knobs (forwarded to fine_mapping.R) ---------------\n", + "parameter: methods = 'susie'\n", + "parameter: coverage = 0.95\n", + "parameter: method_args = '' # nested per-method JSON object\n", + "# --- plot opt-out ---------------------------------------------------\n", + "parameter: no_plot = False\n", + "# --- output prefix for the manifest file ----------------------------\n", + "parameter: manifest_name = 'gwas_rss'\n", + "# --- infrastructure -------------------------------------------------\n", + "parameter: container = ''\n", + "parameter: job_size = 1\n", + "parameter: walltime = '2h'\n", + "parameter: mem = '16G'\n", + "parameter: numThreads = 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[generate_manifest]\n", + "# Resolve --gwas-meta / --gwas-tsv-list x --region-list / --regions into a\n", + "# single manifest TSV via gwas_rss_manifest.R. Downstream steps fan out\n", + "# over its rows; no Python parsing in this notebook.\n", + "input: None\n", + "output: f\"{cwd}/{manifest_name}.manifest.tsv\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '15m', mem = '2G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_rss_manifest.R \\\n", + " ${('--gwas-meta ' + str(gwas_meta)) if gwas_meta.is_file() else ''} \\\n", + " ${('--gwas-tsv-list ' + ' '.join(str(x) for x in gwas_tsv_list)) if gwas_tsv_list else ''} \\\n", + " ${('--region-list ' + str(region_list)) if region_list.is_file() else ''} \\\n", + " ${('--regions ' + ' '.join(str(x) for x in regions)) if regions else ''} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[generate_gwas_sumstats]\n", + "# Fan out over the manifest's rows: one (study, region) GwasSumStats per\n", + "# row. Manifest columns: study_id, gwas_tsv, column_mapping, chr, start,\n", + "# end, region_id, gwas_tsv_basename.\n", + "import csv\n", + "jobs = list(csv.DictReader(open(f\"{cwd}/{manifest_name}.manifest.tsv\"), delimiter='\\t'))\n", + "input: for_each = 'jobs'\n", + "output: f\"{cwd}/sumstats/{_jobs['study_id']}.{_jobs['region_id']}.gwas_sumstats.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n", + " --study ${_jobs['study_id']} \\\n", + " --gwas-tsv ${_jobs['gwas_tsv']} \\\n", + " --ld-block ${_jobs['chr']}:${_jobs['start']}-${_jobs['end']} \\\n", + " --ld-meta ${ld_meta} \\\n", + " --genome ${genome} \\\n", + " ${('--column-mapping ' + _jobs['column_mapping']) if _jobs['column_mapping'] else ''} \\\n", + " --qc-method ${qc_method} \\\n", + " ${'--impute' if impute else ''} \\\n", + " ${('--qc-args ' + repr(qc_args)) if qc_args else ''} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[gwas_fine_mapping]\n", + "# Per-RDS fan-out: one fine-mapping task per (study, region) GwasSumStats.\n", + "output: f\"{cwd}/fine_mapping/{_input:bnn}.gwas_finemap.rds\", group_by = 1\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n", + " --gwas-sumstats ${_input} \\\n", + " --methods ${methods} \\\n", + " --coverage ${coverage} \\\n", + " ${('--method-args ' + repr(method_args)) if method_args else ''} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[gwas_rss_plot]\n", + "stop_if(no_plot, '--no-plot set; skipping PIP plot step.')\n", + "output: f\"{cwd}/plots/{_input:bnn}.pip_plot.png\", group_by = 1\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = '30m', mem = '4G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_rss_plot.R \\\n", + " --input ${_input} \\\n", + " --output ${_output}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "pygments_lexer": "python", + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "", + "" + ] + ], + "version": "0.22.4" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/gwas_sumstats.ipynb b/code/SoS/pecotmr_integration/gwas_sumstats.ipynb index 9b908b01..cc206ac6 100644 --- a/code/SoS/pecotmr_integration/gwas_sumstats.ipynb +++ b/code/SoS/pecotmr_integration/gwas_sumstats.ipynb @@ -26,7 +26,18 @@ }, "outputs": [], "source": [ - "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: gwas_tsv = path\nparameter: ld_blocks = path\nparameter: ld_meta = path\nparameter: genome = 'hg19'\nparameter: modular_script_dir = path('code/script')\nparameter: container = ''\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: study = str\n", + "parameter: gwas_tsv = path\n", + "parameter: ld_blocks = path\n", + "parameter: ld_meta = path\n", + "parameter: genome = 'hg19'\n", + "parameter: modular_script_dir = path('code/script')\n", + "parameter: container = ''\n", + "parameter: walltime = '30m'\n", + "parameter: mem = '8G'\n", + "parameter: numThreads = 1" ] }, { @@ -37,7 +48,42 @@ }, "outputs": [], "source": [ - "[gwas_sumstats_construct]\n# Parse LD-blocks BED (header line uses 'chr/chrom', 'start', 'stop/end').\ndef _read_blocks(path):\n out = []\n with open(path) as f:\n hdr = next(f).rstrip('\\n').lstrip('#').split('\\t')\n for line in f:\n row = dict(zip(hdr, line.rstrip('\\n').split('\\t')))\n chrom = row.get('chr', row.get('chrom'))\n start = int(row.get('start', row.get('Start', 0)))\n stop = int(row.get('stop', row.get('end', row.get('End', 0))))\n out.append(f'{chrom}:{start}-{stop}')\n return out\nblocks = _read_blocks(ld_blocks)\nif not blocks:\n raise ValueError(f\"No LD blocks found in {ld_blocks}\")\ndef _safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: gwas_tsv, for_each = 'blocks'\noutput: f\"{cwd}/{study}.{_safe(_blocks)}.gwas_sumstats.rds\"\ntask: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n --study ${study} \\\n --gwas-tsv ${_input} \\\n --ld-block ${_blocks} \\\n --ld-meta ${ld_meta} \\\n --genome ${genome} \\\n --output ${_output}\n" + "[generate_manifest]\n", + "# Parse the LD-blocks BED into a (region, region_id) TSV manifest via\n", + "# ld_blocks_to_manifest.R. Downstream fans out over its rows; no Python\n", + "# parsing in this notebook.\n", + "input: ld_blocks\n", + "output: f\"{cwd}/{study}.blocks_manifest.tsv\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '15m', mem = '2G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/ld_blocks_to_manifest.R \\\n", + " --ld-blocks ${_input} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[gwas_sumstats_construct]\n", + "# Fan out over the manifest's rows: one GwasSumStats per LD block.\n", + "import csv\n", + "jobs = list(csv.DictReader(open(f\"{cwd}/{study}.blocks_manifest.tsv\"), delimiter='\\t'))\n", + "input: gwas_tsv, for_each = 'jobs'\n", + "output: f\"{cwd}/{study}.{_jobs['region_id']}.gwas_sumstats.rds\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n", + " --study ${study} \\\n", + " --gwas-tsv ${_input} \\\n", + " --ld-block ${_jobs['region']} \\\n", + " --ld-meta ${ld_meta} \\\n", + " --genome ${genome} \\\n", + " --output ${_output}" ] } ], diff --git a/code/SoS/pecotmr_integration/mash_preprocessing.ipynb b/code/SoS/pecotmr_integration/mash_preprocessing.ipynb new file mode 100644 index 00000000..d57d0133 --- /dev/null +++ b/code/SoS/pecotmr_integration/mash_preprocessing.ipynb @@ -0,0 +1,190 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# MASH preprocessing\n", + "\n", + "## Description\n", + "\n", + "Build MASH (Multivariate Adaptive Shrinkage) inputs from multi-context tensorqtl summary statistics and estimate the mixture-component covariance + weights via `pecotmr::mashPipeline()`. Three SoS steps:\n", + "\n", + "- **`[generate_manifest]`** — calls `mash_manifest.R` to resolve `--region-file` + `--sum-files` + `--conditions` into a single per-region manifest TSV (one row per region, with chrom-matched per-condition tensorqtl paths already resolved). All region parsing / per-chrom matching lives in R.\n", + "- **`[mash_inputs_from_tensorqtl]`** — per-region fan-out. For each region in `--region-file`, calls `mash_sumstats_construct.R` to read the per-condition tensorqtl files, inner-join on `variant_id`, and write a per-region `list(region_id = list(z = ))` RDS.\n", + "- **`[mash_estimation]`** — concatenates every per-region RDS, runs `pecotmr::mashRandNullSample` to derive strong / random / null subsets, wraps each partition as a per-context `QtlSumStats`, and calls `pecotmr::mashPipeline()`. The output RDS carries the MASH `U` (covariance matrices) and `w` (mixture weights).\n", + "\n", + "Replaces the workflow in `multivariate_genome/MASH/mash_preprocessing.ipynb` (which depended on the now-deprecated `load_multitrait_*_sumstat` loaders). The legacy notebook is left in place; new work should use this one.\n", + "\n", + "## Inputs\n", + "\n", + "- `--region-file` — TSV with columns `chr start end gene_id` listing the LD-block regions to process.\n", + "- `--sum-files file1 [file2 ...]` — text files, one per condition; each contains one tensorqtl summary-statistics path per line. The notebook assumes the per-chromosome layout already used by `mash_ran_null_sample` callers (path filename embeds the chromosome, matching the legacy `.CHR.` convention).\n", + "- `--conditions c1,c2,...` — condition labels, one per `--sum-files` entry (same order).\n", + "- `--ld-sketch ` — `GenotypeHandle` RDS (required by `mashPipeline()`'s QC gate; the synthesised `QtlSumStats` carries it through `summaryStatsQc()`).\n", + "- `--n-random / --n-null` — subset sizes for `mashRandNullSample`. Defaults 4000 / 4000.\n", + "- `--exclude-condition c1,c2` — conditions to drop before MASH. Default none.\n", + "- `--alpha` — alpha forwarded to `mashPipeline()`. Default 0 (standard scale).\n", + "- `--seed` — RNG seed. Default 999.\n", + "- `--name` — output filename prefix.\n", + "\n", + "## Outputs\n", + "\n", + "- `{cwd}/{name}_cache/{name}.{gene_id}.mash_input.rds` — one per region.\n", + "- `{cwd}/{name}.mash_result.rds` — final `mashPipeline()` output (`list(U, w)`).\n", + "\n", + "## MWE note\n", + "\n", + "The legacy fixture in `input/finemapping/protocol_example.sumstats_db.rds` was custom-massaged; the new pipeline expects standard per-condition tensorqtl outputs plus a `GenotypeHandle` RDS for the LD reference. Provide both before running." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "```bash\n", + "sos run pipeline/mash_preprocessing.ipynb \\\n", + " mash_inputs_from_tensorqtl+mash_estimation \\\n", + " --cwd output/mash --modular-script-dir /path/to/code/script \\\n", + " --name protocol_example_mash \\\n", + " --region-file input/finemapping/protocol_example.region \\\n", + " --sum-files input/mash/condA.txt input/mash/condB.txt \\\n", + " --conditions condA,condB \\\n", + " --ld-sketch input/ld/protocol_example.ld_sketch.rds \\\n", + " --n-random 4000 --n-null 4000 \\\n", + " --alpha 0 --seed 999\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: modular_script_dir = path('code/script')\n", + "parameter: name = str\n", + "# --- per-region input ----------------------------------------------\n", + "parameter: region_file = path\n", + "parameter: sum_files = paths # one txt per condition; each line is one tensorqtl file path\n", + "parameter: conditions = '' # comma-separated condition labels\n", + "# --- mash estimation knobs ------------------------------------------\n", + "parameter: ld_sketch = path\n", + "parameter: n_random = 4000\n", + "parameter: n_null = 4000\n", + "parameter: exclude_condition = '' # comma-separated\n", + "parameter: alpha = 0.0\n", + "parameter: seed = 999\n", + "# --- infrastructure -------------------------------------------------\n", + "parameter: container = ''\n", + "parameter: job_size = 1\n", + "parameter: walltime = '2h'\n", + "parameter: mem = '16G'\n", + "parameter: numThreads = 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[generate_manifest]\n", + "# Resolve --region-file x --sum-files x --conditions into one manifest\n", + "# TSV via mash_manifest.R. Downstream steps fan out over its rows; no\n", + "# Python parsing in this notebook.\n", + "input: None\n", + "output: f\"{cwd}/{name}.mash_manifest.tsv\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '15m', mem = '2G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/mash_manifest.R \\\n", + " --region-file ${region_file} \\\n", + " --sum-files ${' '.join(str(x) for x in sum_files)} \\\n", + " --conditions ${conditions} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[mash_inputs_from_tensorqtl]\n", + "# Fan out over the manifest's rows; one per-region task each.\n", + "import csv\n", + "jobs = list(csv.DictReader(open(f\"{cwd}/{name}.mash_manifest.tsv\"), delimiter='\\t'))\n", + "input: for_each = 'jobs'\n", + "output: f\"{cwd}/{name}_cache/{name}.{_jobs['gene_id']}.mash_input.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/mash_sumstats_construct.R \\\n", + " --tensorqtl-paths ${_jobs['tensorqtl_paths']} \\\n", + " --conditions ${conditions} \\\n", + " --region ${_jobs['region']} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[mash_estimation]\n", + "output: f\"{cwd}/{name}.mash_result.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/mash.R \\\n", + " --mash-inputs ${_input} \\\n", + " --study ${name} \\\n", + " --ld-sketch ${ld_sketch} \\\n", + " --n-random ${n_random} \\\n", + " --n-null ${n_null} \\\n", + " ${('--exclude-condition ' + exclude_condition) if exclude_condition else ''} \\\n", + " --alpha ${alpha} \\\n", + " --seed ${seed} \\\n", + " --output ${_output}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "pygments_lexer": "python", + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "", + "" + ] + ], + "version": "0.22.4" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/multivariate_fine_mapping.ipynb b/code/SoS/pecotmr_integration/multivariate_fine_mapping.ipynb new file mode 100644 index 00000000..194261c5 --- /dev/null +++ b/code/SoS/pecotmr_integration/multivariate_fine_mapping.ipynb @@ -0,0 +1,143 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Multivariate fine-mapping (mvSuSiE)\n", + "\n", + "## Description\n", + "\n", + "Per-region multivariate fine-mapping via `pecotmr::fineMappingPipeline(qtlDataset, methods = \"mvsusie\")`. mvSuSiE jointly fine-maps across the QtlDataset's traits and contexts in one region, producing one `QtlFineMappingResult` per fan-out unit. Each task loads the same `QtlDataset` RDS and fits one unit.\n", + "\n", + "This replaces the `[mnm]` and `[mnm_genes]` steps of the legacy `mnm_regression.ipynb`. The same `fine_mapping.R` worker that drives univariate SuSiE drives mvSuSiE here \u2014 only the `--methods` value differs.\n", + "\n", + "For mvSuSiE TWAS weights (mr.mash + mvSuSiE), call `twas_weights.ipynb --methods mrmash,mvsusie` afterwards against the same `QtlDataset`.\n", + "\n", + "## Inputs\n", + "\n", + "- `--qtl-dataset` — path to the `QtlDataset` RDS produced by `qtl_dataset.ipynb`. Multi-context / multi-trait shape is what makes mvSuSiE meaningful.\n", + "- `--regions chr:start-end ...` **or** `--genes ID1 ID2 ...` — fan-out targets (mutually exclusive). Region mode (multi-trait joint) is the standard use; gene mode runs the multi-context mvSuSiE variant for one focal trait.\n", + "- `--cis-window` — bp window (gene mode only). Default 1,000,000.\n", + "- `--methods` — comma-separated method tokens. Default `mvsusie`. Pass e.g. `mvsusie,susie` to also run univariate SuSiE alongside.\n", + "- `--coverage` — SuSiE/mvSuSiE credible-set coverage. Default 0.95.\n", + "- `--method-args` — optional JSON object spliced into `fineMappingPipeline()` for knobs not exposed as flags (priors, max-iter, residual options, etc.).\n", + "- `--cwd` — output directory. Default `output`.\n", + "- `--study` — study label used in the output filename.\n", + "- `--modular-script-dir` — directory holding the per-task R workers. Default `code/script`.\n", + "\n", + "## Outputs\n", + "\n", + "- `{cwd}/{study}.{gene|region}.mv_finemap.rds` — one `QtlFineMappingResult` per fan-out unit. Region strings are sanitised (`:` and `-` become `_`).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "Region mode (canonical, multi-trait joint):\n", + "```bash\n", + "sos run pipeline/multivariate_fine_mapping.ipynb multivariate_fine_mapping \\\n", + " --cwd output \\\n", + " --modular-script-dir /path/to/xqtl-protocol/code/script \\\n", + " --study TEST_STUDY \\\n", + " --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n", + " --regions chr12:752578-2752578\n", + "```\n", + "\n", + "With mixture-prior tuning passed through:\n", + "```bash\n", + "sos run pipeline/multivariate_fine_mapping.ipynb multivariate_fine_mapping \\\n", + " --cwd output --study TEST_STUDY \\\n", + " --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n", + " --regions chr12:752578-2752578 \\\n", + " --method-args '{\"addSusieInf\":false,\"signalCutoff\":0.01}'\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: study = str\n", + "parameter: modular_script_dir = path('code/script')\n", + "parameter: qtl_dataset = path('.')\n", + "parameter: genes = []\n", + "parameter: regions = []\n", + "parameter: cis_window = 1000000\n", + "parameter: methods = 'mvsusie'\n", + "parameter: coverage = 0.95\n", + "# JSON object spliced into fineMappingPipeline() via do.call. Empty disables.\n", + "parameter: method_args = ''\n", + "parameter: container = ''\n", + "parameter: job_size = 1\n", + "parameter: walltime = '1h'\n", + "parameter: mem = '16G'\n", + "parameter: numThreads = 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[multivariate_fine_mapping]\n", + "if not qtl_dataset.is_file():\n", + " raise ValueError(\"multivariate_fine_mapping requires --qtl-dataset to point at an existing QtlDataset RDS.\")\n", + "if bool(genes) == bool(regions):\n", + " raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\n", + "fanout_items = genes if genes else regions\n", + "fanout_kind = 'gene' if genes else 'region'\n", + "input: qtl_dataset, for_each = 'fanout_items'\n", + "output: f\"{cwd}/{study}.{_fanout_items.replace(':', '_').replace('-', '_')}.mv_finemap.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n", + " --qtl-dataset ${_input} \\\n", + " ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n", + " --methods ${methods} \\\n", + " --coverage ${coverage} \\\n", + " ${('--method-args ' + repr(method_args)) if method_args else ''} \\\n", + " --output ${_output}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "pygments_lexer": "python", + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "", + "" + ] + ], + "version": "0.22.4" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/twas.ipynb b/code/SoS/pecotmr_integration/twas.ipynb index 9e0b0871..34b16efe 100644 --- a/code/SoS/pecotmr_integration/twas.ipynb +++ b/code/SoS/pecotmr_integration/twas.ipynb @@ -37,7 +37,44 @@ }, "outputs": [], "source": [ - "[twas]\n# Read the per-gene manifest. Header may be canonical (`gene_id`,\n# `twas_weights_rds`, `gwas_sumstats_rds`, `fine_mapping_result_rds`) and\n# may carry a leading `#`. fine_mapping_result_rds is optional per-row\n# (empty cell = skip).\ndef _read_manifest(path):\n rows = []\n with open(path) as f:\n hdr = next(f).rstrip('\\n').lstrip('#').split('\\t')\n required = {'gene_id', 'twas_weights_rds', 'gwas_sumstats_rds'}\n missing = required - set(hdr)\n if missing:\n raise ValueError(\n f\"Manifest missing required column(s): {sorted(missing)}\")\n for line in f:\n parts = line.rstrip('\\n').split('\\t')\n row = dict(zip(hdr, parts))\n if not row.get('gene_id'):\n continue\n rows.append(row)\n if not rows:\n raise ValueError(f\"Manifest {path} is empty.\")\n ids = [r['gene_id'] for r in rows]\n if len(set(ids)) != len(ids):\n raise ValueError(\n \"Manifest has duplicate gene_id values; each row must be unique.\")\n return rows\nrows = _read_manifest(manifest)\ngene_ids = [r['gene_id'] for r in rows]\ntw_paths = [r['twas_weights_rds'] for r in rows]\ngss_paths = [r['gwas_sumstats_rds'] for r in rows]\nfmr_paths = [r.get('fine_mapping_result_rds', '') or '' for r in rows]\ninput: for_each = ['gene_ids', 'tw_paths', 'gss_paths', 'fmr_paths']\nfmr_arg = f\"--fine-mapping-result {_fmr_paths}\" if _fmr_paths else \"\"\noutput: f\"{cwd}/{study}.{_gene_ids}.twas.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/twas.R \\\n --twas-weights ${_tw_paths} \\\n --gwas-sumstats ${_gss_paths} \\\n ${fmr_arg} \\\n --mr-pip-cutoff ${mr_pip_cutoff} \\\n --mr-method ${mr_method} \\\n --output ${_output}\n" + "[validate_manifest]\n", + "# Validate + canonicalise the user-supplied per-gene TWAS manifest via\n", + "# twas_manifest_validate.R. Downstream fans out over its rows; no\n", + "# Python parsing in this notebook.\n", + "input: manifest\n", + "output: f\"{cwd}/{study}.twas_manifest.tsv\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '15m', mem = '2G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/twas_manifest_validate.R \\\n", + " --manifest ${_input} \\\n", + " --output ${_output}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[twas]\n", + "# Fan out over the canonical manifest's rows: one task per gene_id.\n", + "# Manifest columns: gene_id, twas_weights_rds, gwas_sumstats_rds, and\n", + "# optionally fine_mapping_result_rds.\n", + "import csv\n", + "jobs = list(csv.DictReader(open(f\"{cwd}/{study}.twas_manifest.tsv\"), delimiter='\\t'))\n", + "input: for_each = 'jobs'\n", + "output: f\"{cwd}/{study}.{_jobs['gene_id']}.twas.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/twas.R \\\n", + " --twas-weights ${_jobs['twas_weights_rds']} \\\n", + " --gwas-sumstats ${_jobs['gwas_sumstats_rds']} \\\n", + " ${('--fine-mapping-result ' + _jobs.get('fine_mapping_result_rds', '')) if _jobs.get('fine_mapping_result_rds') else ''} \\\n", + " --mr-pip-cutoff ${mr_pip_cutoff} \\\n", + " --mr-method ${mr_method} \\\n", + " --output ${_output}" ] } ], @@ -77,4 +114,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/twas_weights.ipynb b/code/SoS/pecotmr_integration/twas_weights.ipynb index c723922c..d9eb4990 100644 --- a/code/SoS/pecotmr_integration/twas_weights.ipynb +++ b/code/SoS/pecotmr_integration/twas_weights.ipynb @@ -37,7 +37,20 @@ }, "outputs": [], "source": [ - "[twas_weights]\nif bool(genes) == bool(regions):\n raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\nfanout_items = genes if genes else regions\nfanout_kind = 'gene' if genes else 'region'\ndef _fanout_safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: qtl_dataset, for_each = 'fanout_items'\noutput: f\"{cwd}/{study}.{_fanout_safe(_fanout_items)}.twas_weights.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/twas_weights.R \\\n --qtl-dataset ${_input} \\\n ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n --fine-mapping-result ${fine_mapping_result if fine_mapping_result.is_file() else '\"\"'} \\\n --output ${_output}\n" + "[twas_weights]\n", + "if bool(genes) == bool(regions):\n", + " raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\n", + "fanout_items = genes if genes else regions\n", + "fanout_kind = 'gene' if genes else 'region'\n", + "input: qtl_dataset, for_each = 'fanout_items'\n", + "output: f\"{cwd}/{study}.{_fanout_items.replace(':', '_').replace('-', '_')}.twas_weights.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/twas_weights.R \\\n", + " --qtl-dataset ${_input} \\\n", + " ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n", + " --fine-mapping-result ${fine_mapping_result if fine_mapping_result.is_file() else '\"\"'} \\\n", + " --output ${_output}\n" ] } ], diff --git a/code/SoS/reference_data/rss_ld_sketch.ipynb b/code/SoS/reference_data/rss_ld_sketch.ipynb index ef273f0a..c1ac6c9a 100644 --- a/code/SoS/reference_data/rss_ld_sketch.ipynb +++ b/code/SoS/reference_data/rss_ld_sketch.ipynb @@ -149,64 +149,6 @@ " --cwd output/rss_ld_sketch" ] }, - { - "cell_type": "markdown", - "id": "2f5835fb", - "metadata": {}, - "source": [ - "### Step 4. Load the LD sketch data into R\n", - "\n", - "Load the final per-chromosome pgen output into R with `pecotmr::load_LD_matrix()` via a metadata TSV. The metadata TSV has one row per chromosome with `start=0`, `end=0`, and `path` set to the PLINK file prefix (resolved relative to the TSV directory):\n", - "\n", - "```\n", - "#chrom start end path\n", - "22 0 0 protocol_example.chr22\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "id": "d8e24892", - "metadata": {}, - "source": [ - "**Tested output** (ADSP R5 EUR chr22, 10000 pseudo-samples, 2Mb region, 9188 variants):\n", - "\n", - "```\n", - "X dimensions (samples x variants): 10000 9188\n", - "R dimensions: 9188 9188\n", - "```\n", - "\n", - "Use `return_genotype = TRUE` for the `susie_rss(z, X = X)` interface, or `FALSE` for the `susie_rss(z, R = R)` interface. For repeated analyses on the same region, compute `R` once and save it with `saveRDS()` for reuse.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8a05b576", - "metadata": {}, - "outputs": [], - "source": [ - "library(pecotmr)\n", - "\n", - "meta <- \"/path/to/ld_meta_chr22.tsv\"\n", - "region <- \"chr22:30000000-32000000\" # 2Mb region\n", - "\n", - "# Load genotype matrix X (for susie_rss z+X interface)\n", - "system.time({\n", - " res_x <- load_LD_matrix(meta, region, return_genotype = TRUE)\n", - "})\n", - "cat(\"X dimensions (samples x variants):\", dim(res_x$LD_matrix), \"\\n\")\n", - "head(res_x$ref_panel, 3)\n", - "res_x$LD_matrix[1:3, 1:3]\n", - "\n", - "# Load LD correlation matrix R (for susie_rss z+R interface)\n", - "system.time({\n", - " res_r <- load_LD_matrix(meta, region, return_genotype = FALSE)\n", - "})\n", - "cat(\"R dimensions:\", dim(res_r$LD_matrix), \"\\n\")\n", - "res_r$LD_matrix[1:3, 1:3]" - ] - }, { "cell_type": "markdown", "id": "32c022be", diff --git a/code/script/pecotmr_integration/fine_mapping.R b/code/script/pecotmr_integration/fine_mapping.R index a9e81d0a..5148fa0d 100644 --- a/code/script/pecotmr_integration/fine_mapping.R +++ b/code/script/pecotmr_integration/fine_mapping.R @@ -22,6 +22,13 @@ # Shared: # --methods Comma-separated method tokens. Default "susie". # --coverage SuSiE credible-set coverage. Default 0.95. +# --method-args Optional JSON object of per-method kwargs spliced +# into fineMappingPipeline() via its named-list +# methods= argument. Keys are method tokens (must be +# a subset of --methods); values are kwargs lists +# forwarded to the underlying fitter (susieR::susie, +# susieR::susie_rss, mvsusieR::mvsusie, fsusieR::susiF, +# etc.). Example: '{"susie":{"L":1,"refine":false}}'. # --output Output RDS path. suppressPackageStartupMessages({ @@ -29,6 +36,7 @@ suppressPackageStartupMessages({ library(pecotmr) library(GenomicRanges) library(IRanges) + library(jsonlite) }) parser <- arg_parser("SuSiE fine-mapping over a pecotmr S4 input (QtlDataset or GwasSumStats)") @@ -53,10 +61,36 @@ parser <- add_argument(parser, "--methods", parser <- add_argument(parser, "--coverage", help = "SuSiE credible-set coverage", type = "numeric", default = 0.95) +parser <- add_argument(parser, "--method-args", + help = "JSON object {token: {kwarg: value, ...}, ...} for fineMappingPipeline()", + type = "character", default = "") parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) +# Parse --method-args into a nested named list of per-method kwargs. +# Empty / '.' / '{}' all yield NULL (which makes us pass the char-vector +# form of methods= to fineMappingPipeline). +parsed_method_args <- if (nzchar(argv$method_args) && argv$method_args != "." && + argv$method_args != "{}") { + parsed <- tryCatch(jsonlite::fromJSON(argv$method_args, simplifyVector = FALSE), + error = function(e) stop( + "--method-args must be a JSON object string (got: ", + argv$method_args, "). Error: ", conditionMessage(e))) + if (!is.list(parsed) || is.null(names(parsed)) || any(names(parsed) == "")) + stop("--method-args must be a JSON object whose keys are method ", + "tokens, e.g. '{\"susie\":{\"L\":1}}'.") + nonObject <- vapply(parsed, function(x) !is.list(x), logical(1)) + if (any(nonObject)) + stop("--method-args: each value must itself be an object of kwargs ", + "(got non-object for: ", + paste(names(parsed)[nonObject], collapse = ", "), + "). Did you mean '{\"susie\":", argv$method_args, "}'?") + parsed +} else { + NULL +} + parse_region <- function(s) { m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] if (length(m) != 4L) @@ -75,13 +109,33 @@ if (!has_qtl && !has_gwas) methods <- trimws(strsplit(argv$methods, ",", fixed = TRUE)[[1L]]) +# Build the final `methods` argument for fineMappingPipeline. When the +# user supplied --method-args, validate that every key is one of the +# --methods tokens (no silent typos) and build the named-list form +# {token: kwargs} that fineMappingPipeline expects. Every token in +# --methods is represented in the list (with an empty kwargs list when +# the user didn't supply one). When --method-args is empty we just pass +# the character vector, which fineMappingPipeline accepts unchanged. +methods_arg <- if (is.null(parsed_method_args)) { + methods +} else { + unknown <- setdiff(names(parsed_method_args), methods) + if (length(unknown) > 0L) + stop("--method-args has keys not listed in --methods (got '", + paste(unknown, collapse = ", "), + "'; --methods = '", paste(methods, collapse = ", "), "').") + setNames(lapply(methods, function(tk) { + if (tk %in% names(parsed_method_args)) parsed_method_args[[tk]] + else list() + }), methods) +} + if (has_gwas) { # ----- GWAS mode ------------------------------------------------------- gss <- readRDS(argv$gwas_sumstats) - res <- fineMappingPipeline( - gss, - methods = methods, - coverage = argv$coverage) + res <- fineMappingPipeline(gss, + methods = methods_arg, + coverage = argv$coverage) label <- paste0("GwasSumStats '", basename(argv$gwas_sumstats), "'") } else { # ----- QTL mode -------------------------------------------------------- @@ -93,19 +147,15 @@ if (has_gwas) { stop("QTL mode requires --gene-id (with --cis-window) or --region.") qd <- readRDS(argv$qtl_dataset) res <- if (has_region) { - fineMappingPipeline( - qd, - methods = methods, - region = parse_region(argv$region), - cisWindow = argv$cis_window, - coverage = argv$coverage) + fineMappingPipeline(qd, methods = methods_arg, + region = parse_region(argv$region), + cisWindow = argv$cis_window, + coverage = argv$coverage) } else { - fineMappingPipeline( - qd, - methods = methods, - traitId = argv$gene_id, - cisWindow = argv$cis_window, - coverage = argv$coverage) + fineMappingPipeline(qd, methods = methods_arg, + traitId = argv$gene_id, + cisWindow = argv$cis_window, + coverage = argv$coverage) } label <- if (has_region) paste0("region '", argv$region, "'") else paste0("gene '", argv$gene_id, "'") diff --git a/code/script/pecotmr_integration/fine_mapping_export.R b/code/script/pecotmr_integration/fine_mapping_export.R new file mode 100644 index 00000000..0f46eb0b --- /dev/null +++ b/code/script/pecotmr_integration/fine_mapping_export.R @@ -0,0 +1,126 @@ +#!/usr/bin/env Rscript +# fine_mapping_export.R +# +# Export a view (topLoci / credibleSets / pip / marginals) of one or more +# QtlFineMappingResult / GwasFineMappingResult RDSes as a single concatenated +# TSV. Adds identifier columns derived from the FMR row (study, context, +# trait, method for QTL; study, method, region_id for GWAS) so a downstream +# consumer can split the table back per-tuple. +# +# Inputs: +# --input [ ...] One or more FineMappingResult RDS paths. +# --view {topLoci|cs|pip|marginals} Which per-entry view to export. +# Default "topLoci". +# --signal-cutoff PIP threshold for topLoci/pip exports. Default 0 +# (no filter). Pass 0.025 to mirror the legacy +# susie default. +# --output Output TSV path (gzipped if path ends in .gz). + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("Export a FineMappingResult view as a TSV") +parser <- add_argument(parser, "--input", + help = "One or more FineMappingResult RDS paths", + type = "character", nargs = Inf) +parser <- add_argument(parser, "--view", + help = "Which view to export: topLoci | cs | pip | marginals", + type = "character", default = "topLoci") +parser <- add_argument(parser, "--signal-cutoff", + help = "PIP cutoff for topLoci/pip exports", + type = "numeric", default = 0) +parser <- add_argument(parser, "--output", + help = "Output TSV path", type = "character") +argv <- parse_args(parser) + +view <- match.arg(argv$view, c("topLoci", "cs", "pip", "marginals")) +inputs <- as.character(argv$input) +if (length(inputs) == 0L) + stop("--input requires at least one RDS path.") + +# Pull the row's identifier columns regardless of QTL vs GWAS shape; the +# union of slots produces a NA-padded data frame the downstream rbind can +# concatenate across mixed inputs. +.idCols <- function(fmr, i) { + list( + study = as.character(fmr$study)[[i]], + context = if ("context" %in% names(fmr)) as.character(fmr$context)[[i]] else NA_character_, + trait = if ("trait" %in% names(fmr)) as.character(fmr$trait)[[i]] else NA_character_, + region_id = if ("region_id" %in% names(fmr)) as.character(fmr$region_id)[[i]] else NA_character_, + method = as.character(fmr$method)[[i]], + source = basename(attr(fmr, ".source", exact = TRUE) %||% "")) +} +`%||%` <- function(a, b) if (is.null(a)) b else a + +.extract <- function(entry, view, cutoff) { + if (view == "topLoci") { + df <- as.data.frame(getTopLoci(entry, signalCutoff = cutoff)) + } else if (view == "cs") { + df <- as.data.frame(getCs(entry)) + } else if (view == "pip") { + pip <- as.numeric(getPip(entry)) + ids <- as.character(getVariantIds(entry)) + df <- data.frame(variant_id = ids, pip = pip, + stringsAsFactors = FALSE) + if (cutoff > 0) df <- df[df$pip > cutoff, , drop = FALSE] + } else { # marginals + df <- as.data.frame(getMarginalEffects(entry)) + } + if (nrow(df) == 0L) return(NULL) + df +} + +pieces <- list() +for (path in inputs) { + fmr <- readRDS(path) + if (!methods::is(fmr, "FineMappingResultBase")) { + warning("Skipping non-FineMappingResult input: ", path) + next + } + attr(fmr, ".source") <- path + for (i in seq_len(nrow(fmr))) { + entry <- fmr$entry[[i]] + inner <- tryCatch(.extract(entry, view, argv$signal_cutoff), + error = function(e) { + message("Entry ", i, " of ", basename(path), + ": ", conditionMessage(e)) + NULL + }) + if (is.null(inner)) next + ids <- .idCols(fmr, i) + ids$source <- basename(path) + for (k in names(ids)) inner[[k]] <- ids[[k]] + pieces[[length(pieces) + 1L]] <- inner + } +} +if (length(pieces) == 0L) { + message("No rows produced; writing an empty TSV with the id-column ", + "header so downstream consumers don't error on missing files.") + out <- data.frame(study = character(0), context = character(0), + trait = character(0), region_id = character(0), + method = character(0), source = character(0), + stringsAsFactors = FALSE) +} else { + # Pad missing columns across pieces so rbind doesn't lose rows. + all_cols <- unique(unlist(lapply(pieces, names))) + for (k in seq_along(pieces)) { + miss <- setdiff(all_cols, names(pieces[[k]])) + for (m in miss) pieces[[k]][[m]] <- NA + pieces[[k]] <- pieces[[k]][, all_cols, drop = FALSE] + } + out <- do.call(rbind, pieces) +} + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +conn <- if (grepl("\\.gz$", argv$output)) { + gzfile(argv$output, "w") +} else { + file(argv$output, "w") +} +write.table(out, file = conn, sep = "\t", quote = FALSE, + row.names = FALSE, na = "") +close(conn) +cat(sprintf("Wrote %s view (%d rows from %d input RDS file(s)) to %s\n", + view, nrow(out), length(inputs), argv$output)) diff --git a/code/script/pecotmr_integration/fine_mapping_pip_plot.R b/code/script/pecotmr_integration/fine_mapping_pip_plot.R new file mode 100644 index 00000000..b1aba5fe --- /dev/null +++ b/code/script/pecotmr_integration/fine_mapping_pip_plot.R @@ -0,0 +1,108 @@ +#!/usr/bin/env Rscript +# fine_mapping_pip_plot.R +# +# Render a per-region PIP landscape plot from a QtlFineMappingResult or +# GwasFineMappingResult RDS (output of fine_mapping.ipynb / +# multivariate_fine_mapping.ipynb / functional_fine_mapping.ipynb / +# gwas_rss_fine_mapping.ipynb). One PNG per input RDS, one facet panel +# per row in the FMR. +# +# Inputs: +# --input Path to a FineMappingResult RDS +# --output Output PNG path +# --width PNG width (inches). Default 9. +# --height Per-panel PNG height (inches). Default 3. + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(ggplot2) +}) + +parser <- arg_parser("Render a per-region PIP plot from a FineMappingResult RDS") +parser <- add_argument(parser, "--input", + help = "Path to a FineMappingResult RDS", + type = "character") +parser <- add_argument(parser, "--output", + help = "Output PNG path", type = "character") +parser <- add_argument(parser, "--width", + help = "PNG width in inches", + type = "numeric", default = 9) +parser <- add_argument(parser, "--height", + help = "Per-panel PNG height in inches", + type = "numeric", default = 3) +argv <- parse_args(parser) + +fmr <- readRDS(argv$input) +if (!methods::is(fmr, "FineMappingResultBase")) + stop("--input must be a FineMappingResultBase subclass (got '", + class(fmr)[[1L]], "').") + +write_empty <- function(msg) { + message(msg, "; writing an empty plot to ", argv$output) + dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) + png(argv$output, width = argv$width * 100, height = argv$height * 100) + plot.new(); title(main = msg) + dev.off() +} + +if (nrow(fmr) == 0L) { + write_empty("Empty FineMappingResult") + quit(save = "no") +} + +# Build a long tidy table of (panel-id, variant_id, pip) using the S4 +# accessors. The panel label varies by FMR shape: GWAS keeps region_id, +# QTL uses (context, trait). +isGwas <- methods::is(fmr, "GwasFineMappingResult") +panels <- lapply(seq_len(nrow(fmr)), function(i) { + entry <- fmr$entry[[i]] + pip <- as.numeric(getPip(entry)) + ids <- as.character(getVariantIds(entry)) + if (length(pip) == 0L || length(ids) == 0L) return(NULL) + panel <- if (isGwas) { + sprintf("%s | %s | %s", + as.character(fmr$study)[[i]], + as.character(fmr$method)[[i]], + as.character(fmr$region_id)[[i]]) + } else { + sprintf("%s | %s | %s | %s", + as.character(fmr$study)[[i]], + as.character(fmr$context)[[i]], + as.character(fmr$trait)[[i]], + as.character(fmr$method)[[i]]) + } + data.frame(panel = panel, variant_id = ids, pip = pip, + stringsAsFactors = FALSE) +}) +panels <- panels[!vapply(panels, is.null, logical(1))] +if (length(panels) == 0L) { + write_empty("No usable PIP vectors on any entry") + quit(save = "no") +} +df <- do.call(rbind, panels) + +# Best-effort variant-id → position decode; falls back to row index. +pos <- suppressWarnings({ + m <- regmatches(df$variant_id, + regexec("^[^:_]+[:_]([0-9]+)", df$variant_id)) + vapply(m, function(x) if (length(x) >= 2L) as.numeric(x[[2L]]) else NA_real_, + numeric(1L)) +}) +df$pos <- ifelse(is.na(pos), seq_len(nrow(df)), pos) + +g <- ggplot(df, aes(x = pos, y = pip)) + + geom_point(alpha = 0.6, size = 0.8) + + facet_wrap(~ panel, ncol = 1, scales = "free_x") + + scale_y_continuous(limits = c(0, 1)) + + labs(x = "Variant position", y = "PIP", + title = paste0(class(fmr)[[1L]], ": ", basename(argv$input))) + + theme_minimal(base_size = 10) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +ggsave(argv$output, g, + width = argv$width, + height = argv$height * length(unique(df$panel)), + dpi = 150) +cat(sprintf("Wrote PIP plot for %d panel(s) (%d variants) to %s\n", + length(unique(df$panel)), nrow(df), argv$output)) diff --git a/code/script/pecotmr_integration/fine_mapping_upset.R b/code/script/pecotmr_integration/fine_mapping_upset.R new file mode 100644 index 00000000..4e7b5dcc --- /dev/null +++ b/code/script/pecotmr_integration/fine_mapping_upset.R @@ -0,0 +1,110 @@ +#!/usr/bin/env Rscript +# fine_mapping_upset.R +# +# Render an UpSet plot of credible-set variant overlap across one or more +# FineMappingResult RDSes. Each (study, context, trait, method, region_id) +# tuple becomes one set; the set members are the variant IDs in the union +# of that tuple's credible sets. Useful for visualizing which signals are +# shared across studies / contexts / traits. +# +# Inputs: +# --input [ ...] One or more FineMappingResult RDS paths. +# --output Output PNG path. +# --max-sets Cap on the number of sets to render (UpSet +# gets noisy past ~10). Default 20. Sets are +# ranked by their CS size (descending) and +# the top --max-sets are kept. +# --width / --height PNG dimensions in inches. Defaults 12 x 6. + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("UpSet plot of credible-set variant overlap across FineMappingResult RDSes") +parser <- add_argument(parser, "--input", + help = "One or more FineMappingResult RDS paths", + type = "character", nargs = Inf) +parser <- add_argument(parser, "--output", + help = "Output PNG path", type = "character") +parser <- add_argument(parser, "--max-sets", + help = "Maximum number of sets to render", + type = "integer", default = 20L) +parser <- add_argument(parser, "--width", + help = "PNG width (inches)", + type = "numeric", default = 12) +parser <- add_argument(parser, "--height", + help = "PNG height (inches)", + type = "numeric", default = 6) +argv <- parse_args(parser) + +if (!requireNamespace("UpSetR", quietly = TRUE)) + stop("fine_mapping_upset.R requires the UpSetR package: ", + "install.packages('UpSetR')") + +inputs <- as.character(argv$input) +if (length(inputs) == 0L) + stop("--input requires at least one RDS path.") + +# Walk every (FMR-row x CS) combination and build a named list of +# {set-label -> variant_ids} that UpSetR consumes via fromList(). +sets <- list() +for (path in inputs) { + fmr <- readRDS(path) + if (!methods::is(fmr, "FineMappingResultBase")) { + warning("Skipping non-FineMappingResult input: ", path) + next + } + isGwas <- methods::is(fmr, "GwasFineMappingResult") + for (i in seq_len(nrow(fmr))) { + entry <- fmr$entry[[i]] + cs_df <- tryCatch(as.data.frame(getCs(entry)), + error = function(e) NULL) + if (is.null(cs_df) || nrow(cs_df) == 0L) next + if (!"variant_id" %in% names(cs_df)) next + label <- if (isGwas) { + sprintf("%s|%s|%s", + as.character(fmr$study)[[i]], + as.character(fmr$method)[[i]], + as.character(fmr$region_id)[[i]]) + } else { + sprintf("%s|%s|%s|%s", + as.character(fmr$study)[[i]], + as.character(fmr$context)[[i]], + as.character(fmr$trait)[[i]], + as.character(fmr$method)[[i]]) + } + sets[[label]] <- unique(as.character(cs_df$variant_id)) + } +} + +if (length(sets) == 0L) { + message("No credible sets found across any input; writing an empty plot.") + dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) + png(argv$output, width = argv$width * 100, height = argv$height * 100) + plot.new(); title(main = "No credible sets in inputs") + dev.off() + quit(save = "no") +} + +# Rank sets by size descending, cap at --max-sets to keep the plot readable. +sizes <- vapply(sets, length, integer(1L)) +keep <- order(sizes, decreasing = TRUE)[seq_len(min(argv$max_sets, length(sets)))] +sets <- sets[keep] +if (length(sets) < length(sizes)) + message("Rendering top ", argv$max_sets, " set(s) by size; ", + length(sizes) - length(sets), " smaller set(s) dropped.") + +binary <- UpSetR::fromList(sets) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +png(argv$output, width = argv$width * 100, height = argv$height * 100) +print(UpSetR::upset( + binary, + nsets = length(sets), + order.by = "freq", + mainbar.y.label = "CS-variant intersection size", + sets.x.label = "CS variants per (study, context, trait, method)")) +dev.off() +cat(sprintf("Wrote UpSet plot of %d set(s) to %s\n", + length(sets), argv$output)) diff --git a/code/script/pecotmr_integration/gwas_rss_manifest.R b/code/script/pecotmr_integration/gwas_rss_manifest.R new file mode 100644 index 00000000..03ab7d54 --- /dev/null +++ b/code/script/pecotmr_integration/gwas_rss_manifest.R @@ -0,0 +1,177 @@ +#!/usr/bin/env Rscript +# gwas_rss_manifest.R +# +# Resolve `gwas_rss_fine_mapping.ipynb`'s per-study GWAS sources (a +# --gwas-meta TSV and/or --gwas-tsv-list STUDY=PATH items) together with +# its region inputs (a --region-list BED-ish file and/or --regions +# chr:start-end strings) into a single per-job manifest TSV. One row per +# (study x region) cross-product. Downstream SoS steps fan out over this +# TSV's rows without needing any custom Python parsing. +# +# Inputs (zero or more of each input pair; at least one of each pair): +# --gwas-meta Optional. Columns: study_id, path, +# column_mapping (optional). Relative +# paths resolve against the meta file's +# own directory. +# --gwas-tsv-list STUDY=PATH Optional. One or more positional items. +# (zero or more) The notebook may pass each STUDY=PATH item +# as a separate positional arg. +# --region-list Optional. Whitespace-separated #chr start +# end ...; lines starting with '#' or with +# non-numeric start/end are skipped. +# --regions chr:start-end Optional. One or more positional items. +# (zero or more) +# --output Output manifest path. +# +# Output TSV columns: +# study_id, gwas_tsv, column_mapping, chr, start, end, region_id, +# gwas_tsv_basename +# +# `region_id` is the SoS-safe sanitised region label (`:` and `-` replaced +# with `_`); the downstream notebook uses it as the per-task ID. +# `gwas_tsv_basename` is the basename of the TSV without extension — the +# notebook uses it in the per-task output filename for traceability. + +suppressPackageStartupMessages({ + library(argparser) +}) + +parser <- arg_parser("Resolve GWAS sources x regions into a per-task manifest TSV") +parser <- add_argument(parser, "--gwas-meta", + help = "Optional per-study meta TSV", + type = "character", default = "") +parser <- add_argument(parser, "--gwas-tsv-list", + help = "Zero or more STUDY=PATH items", + type = "character", nargs = Inf, default = character(0)) +parser <- add_argument(parser, "--region-list", + help = "Optional BED-like region file", + type = "character", default = "") +parser <- add_argument(parser, "--regions", + help = "Zero or more chr:start-end items", + type = "character", nargs = Inf, default = character(0)) +parser <- add_argument(parser, "--output", + help = "Output manifest TSV path", + type = "character") +argv <- parse_args(parser) + +# ----- Resolve per-study sources ---------------------------------------- +studies <- data.frame(study_id = character(0), + gwas_tsv = character(0), + column_mapping = character(0), + stringsAsFactors = FALSE) +seenStudies <- character(0) + +if (nzchar(argv$gwas_meta) && argv$gwas_meta != ".") { + if (!file.exists(argv$gwas_meta)) + stop("--gwas-meta file not found: ", argv$gwas_meta) + meta <- read.table(argv$gwas_meta, header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") + required <- c("study_id", "path") + missing <- setdiff(required, names(meta)) + if (length(missing) > 0L) + stop("--gwas-meta missing required column(s): ", + paste(missing, collapse = ", "), " (got: ", + paste(names(meta), collapse = ", "), ").") + metaDir <- dirname(normalizePath(argv$gwas_meta)) + hasCm <- "column_mapping" %in% names(meta) + for (i in seq_len(nrow(meta))) { + sid <- as.character(meta$study_id[[i]]) + tsv <- as.character(meta$path[[i]]) + if (!startsWith(tsv, "/")) tsv <- file.path(metaDir, tsv) + cm <- if (hasCm) as.character(meta$column_mapping[[i]]) else "" + if (!is.na(cm) && nzchar(cm) && !startsWith(cm, "/")) + cm <- file.path(metaDir, cm) + if (sid %in% seenStudies) + stop("Duplicate study_id in --gwas-meta: ", sid) + seenStudies <- c(seenStudies, sid) + studies <- rbind(studies, + data.frame(study_id = sid, gwas_tsv = tsv, + column_mapping = if (is.na(cm)) "" else cm, + stringsAsFactors = FALSE)) + } +} + +tsvItems <- as.character(argv$gwas_tsv_list) +tsvItems <- tsvItems[nzchar(tsvItems)] +for (item in tsvItems) { + if (!grepl("=", item, fixed = TRUE)) + stop("--gwas-tsv-list expects STUDY=PATH items (got: ", item, ").") + parts <- regmatches(item, regexec("^([^=]+)=(.+)$", item))[[1L]] + if (length(parts) != 3L) + stop("Cannot parse --gwas-tsv-list item: ", item) + sid <- parts[[2L]]; tsv <- parts[[3L]] + if (sid %in% seenStudies) + stop("Study '", sid, "' appears in both --gwas-meta and ", + "--gwas-tsv-list.") + seenStudies <- c(seenStudies, sid) + studies <- rbind(studies, + data.frame(study_id = sid, gwas_tsv = tsv, + column_mapping = "", stringsAsFactors = FALSE)) +} +if (nrow(studies) == 0L) + stop("No GWAS inputs supplied (give --gwas-meta and/or --gwas-tsv-list).") + +# ----- Resolve regions -------------------------------------------------- +regions <- data.frame(chr = character(0), start = integer(0), + end = integer(0), stringsAsFactors = FALSE) +pushRegion <- function(chr, start, end) { + if (!startsWith(chr, "chr")) chr <- paste0("chr", chr) + start <- as.integer(start); end <- as.integer(end) + if (is.na(start) || is.na(end)) return(invisible(NULL)) + key <- paste(chr, start, end, sep = "|") + prior <- paste(regions$chr, regions$start, regions$end, sep = "|") + if (key %in% prior) return(invisible(NULL)) + regions <<- rbind(regions, + data.frame(chr = chr, start = start, end = end, + stringsAsFactors = FALSE)) +} + +if (nzchar(argv$region_list) && argv$region_list != ".") { + if (!file.exists(argv$region_list)) + stop("--region-list file not found: ", argv$region_list) + rl <- readLines(argv$region_list) + for (line in rl) { + line <- trimws(line) + if (!nzchar(line) || startsWith(line, "#")) next + parts <- strsplit(line, "\\s+")[[1L]] + if (length(parts) < 3L) next + if (is.na(suppressWarnings(as.integer(parts[[2L]])))) next # header + pushRegion(parts[[1L]], parts[[2L]], parts[[3L]]) + } +} + +regionItems <- as.character(argv$regions) +regionItems <- regionItems[nzchar(regionItems)] +for (r in regionItems) { + m <- regmatches(r, regexec("^([^:]+):([0-9]+)-([0-9]+)$", r))[[1L]] + if (length(m) != 4L) + stop("--regions item must be chr:start-end (got: ", r, ").") + pushRegion(m[[2L]], m[[3L]], m[[4L]]) +} +if (nrow(regions) == 0L) + stop("No regions supplied (give --region-list and/or --regions).") + +# ----- Build cross-product manifest ------------------------------------- +manifest <- list() +for (i in seq_len(nrow(studies))) { + for (j in seq_len(nrow(regions))) { + chr <- regions$chr[[j]]; s <- regions$start[[j]]; e <- regions$end[[j]] + region_id <- gsub("[:\\-]", "_", sprintf("%s_%d_%d", chr, s, e)) + manifest[[length(manifest) + 1L]] <- data.frame( + study_id = studies$study_id[[i]], + gwas_tsv = studies$gwas_tsv[[i]], + column_mapping = studies$column_mapping[[i]], + chr = chr, start = s, end = e, + region_id = region_id, + gwas_tsv_basename = tools::file_path_sans_ext( + tools::file_path_sans_ext(basename(studies$gwas_tsv[[i]]))), + stringsAsFactors = FALSE) + } +} +out <- do.call(rbind, manifest) +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +write.table(out, file = argv$output, sep = "\t", quote = FALSE, + row.names = FALSE, na = "") +cat(sprintf("Wrote manifest with %d row(s) (%d studies x %d regions) to %s\n", + nrow(out), nrow(studies), nrow(regions), argv$output)) diff --git a/code/script/pecotmr_integration/gwas_rss_plot.R b/code/script/pecotmr_integration/gwas_rss_plot.R new file mode 100644 index 00000000..aa1f5367 --- /dev/null +++ b/code/script/pecotmr_integration/gwas_rss_plot.R @@ -0,0 +1,99 @@ +#!/usr/bin/env Rscript +# gwas_rss_plot.R +# +# Render a per-region PIP plot from a pecotmr::GwasFineMappingResult RDS +# (output of fine_mapping.R --gwas-sumstats). One PNG per input RDS; +# each row in the FMR produces one panel (study × method × region_id). +# Replaces the legacy pipeline/rss_analysis.ipynb [univariate_plot] step. +# +# Inputs: +# --input Path to a GwasFineMappingResult RDS +# --output Output PNG path +# --width PNG width (inches). Default 9. +# --height PNG height (inches). Default 3 per panel. + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(ggplot2) +}) + +parser <- arg_parser("Render a PIP plot from a GwasFineMappingResult RDS") +parser <- add_argument(parser, "--input", + help = "Path to a GwasFineMappingResult RDS", + type = "character") +parser <- add_argument(parser, "--output", + help = "Output PNG path", type = "character") +parser <- add_argument(parser, "--width", + help = "PNG width in inches", + type = "numeric", default = 9) +parser <- add_argument(parser, "--height", + help = "Per-panel PNG height in inches", + type = "numeric", default = 3) +argv <- parse_args(parser) + +fmr <- readRDS(argv$input) +if (!methods::is(fmr, "GwasFineMappingResult")) + stop("--input must be a GwasFineMappingResult (got '", + class(fmr)[[1L]], "').") +if (nrow(fmr) == 0L) { + message("Empty GwasFineMappingResult; writing an empty plot to ", argv$output) + png(argv$output, width = argv$width * 100, height = argv$height * 100) + plot.new(); title(main = "No fine-mapping rows in input") + dev.off() + quit(save = "no") +} + +# Build a tidy long table of (study, method, region_id, variant_id, pip) +# straight from the S4 accessors; one panel per row of the FMR. +panels <- lapply(seq_len(nrow(fmr)), function(i) { + entry <- fmr$entry[[i]] + pip <- as.numeric(getPip(entry)) + ids <- as.character(getVariantIds(entry)) + if (length(pip) == 0L || length(ids) == 0L) return(NULL) + data.frame( + study = as.character(fmr$study)[[i]], + method = as.character(fmr$method)[[i]], + region_id = as.character(fmr$region_id)[[i]], + variant_id = ids, + pip = pip, + stringsAsFactors = FALSE) +}) +panels <- panels[!vapply(panels, is.null, logical(1))] +if (length(panels) == 0L) { + message("No usable PIP vectors on any entry; writing an empty plot to ", + argv$output) + png(argv$output, width = argv$width * 100, height = argv$height * 100) + plot.new(); title(main = "No PIP vectors in input") + dev.off() + quit(save = "no") +} +df <- do.call(rbind, panels) + +# Extract a numeric pos when variant_id looks like "chrN:pos:..." or +# "chrN_pos_..."; otherwise plot against the row index. This is a +# best-effort cosmetic decode, not a hard requirement. +pos <- suppressWarnings({ + m <- regmatches(df$variant_id, + regexec("^[^:_]+[:_]([0-9]+)", df$variant_id)) + vapply(m, function(x) if (length(x) >= 2L) as.numeric(x[[2L]]) else NA_real_, + numeric(1L)) +}) +df$pos <- ifelse(is.na(pos), seq_len(nrow(df)), pos) +df$panel <- paste(df$study, df$method, df$region_id, sep = " | ") + +g <- ggplot(df, aes(x = pos, y = pip)) + + geom_point(alpha = 0.6, size = 0.8) + + facet_wrap(~ panel, ncol = 1, scales = "free_x") + + scale_y_continuous(limits = c(0, 1)) + + labs(x = "Variant position", y = "PIP", + title = paste0("GWAS RSS fine-mapping: ", basename(argv$input))) + + theme_minimal(base_size = 10) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +ggsave(argv$output, g, + width = argv$width, + height = argv$height * length(unique(df$panel)), + dpi = 150) +cat(sprintf("Wrote PIP plot for %d panel(s) (%d total variants) to %s\n", + length(unique(df$panel)), nrow(df), argv$output)) diff --git a/code/script/pecotmr_integration/gwas_sumstats_construct.R b/code/script/pecotmr_integration/gwas_sumstats_construct.R index 04f505c7..2abdd353 100644 --- a/code/script/pecotmr_integration/gwas_sumstats_construct.R +++ b/code/script/pecotmr_integration/gwas_sumstats_construct.R @@ -10,13 +10,32 @@ # Inputs: # --study Single study identifier # --gwas-tsv Path to a tabix-indexed (or plain) GWAS TSV -# Standard columns (with aliases): +# Standard columns (with hardcoded aliases when no +# --column-mapping is supplied): # #chrom|chrom|chr, pos, variant_id|SNP, # A1, A2, z|Z, n_sample|N # Optional: effect_allele_frequency, p, beta, se +# --column-mapping Optional YAML file mapping standard column names +# to this study's actual column names. Keys are +# the standard names (chrom, pos, variant_id, A1, +# A2, z, n_sample, and optionally beta, se, p, maf, +# info); values are the column name as it appears +# in the TSV. Required entries: chrom, pos, +# variant_id, A1, A2, z, n_sample. Used in place +# of the hardcoded alias fallback. # --ld-block Genomic interval as chr:start-end (one LD block) # --ld-meta LD-meta TSV (#chr, start, end, path) # --genome Genome build label (e.g. "GRCh38"). Default "GRCh38" +# --qc-method LD-mismatch QC method passed to +# summaryStatsQc(zMismatchQc = ...). One of "none" +# (default), "slalom", "dentist". +# --impute Flag: run RAISS sumstat imputation in +# summaryStatsQc(impute = TRUE) against the LD +# sketch. +# --qc-args Optional JSON object of extra named kwargs +# spliced into summaryStatsQc(). Anything in the +# summaryStatsQc() signature is reachable here +# (e.g. '{"mafCutoff":0.01,"imputeOpts":{"rcond":0.005}}'). # --output Output RDS path suppressPackageStartupMessages({ @@ -25,6 +44,8 @@ suppressPackageStartupMessages({ library(GenomicRanges) library(IRanges) library(S4Vectors) + library(jsonlite) + library(yaml) }) parser <- arg_parser("Build a per-LD-block GwasSumStats RDS") @@ -43,6 +64,18 @@ parser <- add_argument(parser, "--ld-meta", parser <- add_argument(parser, "--genome", help = "Genome build label", type = "character", default = "GRCh38") +parser <- add_argument(parser, "--column-mapping", + help = "Optional YAML file mapping standard column names to this study's column names", + type = "character", default = "") +parser <- add_argument(parser, "--qc-method", + help = "LD-mismatch QC: 'none' (default), 'slalom', or 'dentist'", + type = "character", default = "none") +parser <- add_argument(parser, "--impute", + help = "Enable RAISS sumstat imputation in summaryStatsQc()", + flag = TRUE) +parser <- add_argument(parser, "--qc-args", + help = "JSON object of extra named kwargs for summaryStatsQc()", + type = "character", default = "") parser <- add_argument(parser, "--skip-qc", help = "Skip summaryStatsQc() (debug-only; the output GwasSumStats will have qcInfo = list() and fineMappingPipeline / twasWeightsPipeline will reject it)", flag = TRUE) @@ -50,6 +83,43 @@ parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) +# ----- Resolve column-name lookup ------------------------------------------- +# Either pull from a user-supplied YAML mapping (per-study explicit names) +# or fall back to the hardcoded alias lists. The mapping must cover every +# required standard name; optional names (BETA/SE/P/MAF/INFO) pass through +# when present, regardless of source. +column_mapping <- if (nzchar(argv$column_mapping) && argv$column_mapping != ".") { + if (!file.exists(argv$column_mapping)) + stop("--column-mapping file not found: ", argv$column_mapping) + yaml::read_yaml(argv$column_mapping) +} else { + NULL +} + +# ----- Parse --qc-args JSON ------------------------------------------------- +qc_extra <- if (nzchar(argv$qc_args) && argv$qc_args != "." && + argv$qc_args != "{}") { + parsed <- tryCatch(jsonlite::fromJSON(argv$qc_args, simplifyVector = FALSE), + error = function(e) stop( + "--qc-args must be a JSON object string (got: ", + argv$qc_args, "). Error: ", conditionMessage(e))) + if (!is.list(parsed) || is.null(names(parsed)) || any(names(parsed) == "")) + stop("--qc-args must be a JSON object with named fields, e.g. ", + '\'{"mafCutoff":0.01,"removeStrandAmbiguous":false}\'.') + parsed +} else { + list() +} +# Reject collisions between explicit flags and --qc-args (zMismatchQc / +# impute) to avoid silent override surprises. +clash <- intersect(names(qc_extra), c("zMismatchQc", "impute")) +if (length(clash) > 0L) + stop("--qc-args sets ", paste(clash, collapse = ", "), + " which is also controlled by a dedicated flag (--qc-method / ", + "--impute). Pass it via the dedicated flag.") + +qc_method <- match.arg(argv$qc_method, c("none", "slalom", "dentist")) + parse_region <- function(s) { m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] if (length(m) != 4L) @@ -67,17 +137,33 @@ gwas <- read.table(if (grepl("\\.gz$", argv$gwas_tsv)) gzfile(argv$gwas_tsv) comment.char = "") pick <- function(opts, where) intersect(opts, names(where))[1L] -chr_col <- pick(c("#chrom", "chrom", "chr"), gwas) -pos_col <- pick(c("pos", "position", "BP"), gwas) -snp_col <- pick(c("variant_id", "SNP", "rsid"), gwas) -a1_col <- pick(c("A1", "a1"), gwas) -a2_col <- pick(c("A2", "a2"), gwas) -z_col <- pick(c("z", "Z"), gwas) -n_col <- pick(c("n_sample", "N", "n"), gwas) + +# When the user provides a column mapping, use it directly; otherwise +# fall back to the hardcoded alias lists below. +resolve_col <- function(std, fallback) { + if (!is.null(column_mapping) && !is.null(column_mapping[[std]])) { + named <- column_mapping[[std]] + if (!(named %in% names(gwas))) + stop("--column-mapping['", std, "'] = '", named, + "' is not a column in ", argv$gwas_tsv) + return(named) + } + pick(fallback, gwas) +} +chr_col <- resolve_col("chrom", c("#chrom", "chrom", "chr")) +pos_col <- resolve_col("pos", c("pos", "position", "BP")) +snp_col <- resolve_col("variant_id", c("variant_id", "SNP", "rsid")) +a1_col <- resolve_col("A1", c("A1", "a1")) +a2_col <- resolve_col("A2", c("A2", "a2")) +z_col <- resolve_col("z", c("z", "Z")) +n_col <- resolve_col("n_sample", c("n_sample", "N", "n")) if (any(is.na(c(chr_col, pos_col, snp_col, a1_col, a2_col, z_col, n_col)))) stop("--gwas-tsv missing one of required columns ", - "(chrom/pos/variant_id/A1/A2/z/n_sample) in: ", argv$gwas_tsv) + "(chrom/pos/variant_id/A1/A2/z/n_sample) in: ", argv$gwas_tsv, + if (!is.null(column_mapping)) + " — check that every required key is present in --column-mapping" + else "") # Normalise chromosome label to match the LD block's chrom_vals <- as.character(gwas[[chr_col]]) @@ -179,7 +265,11 @@ gss_out <- if (argv$skip_qc) { message("--skip-qc set; serialising raw GwasSumStats without summaryStatsQc().") gss } else { - summaryStatsQc(gss) + qc_call_args <- c(list(gss, + zMismatchQc = qc_method, + impute = isTRUE(argv$impute)), + qc_extra) + do.call(summaryStatsQc, qc_call_args) } dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) diff --git a/code/script/pecotmr_integration/ld_blocks_to_manifest.R b/code/script/pecotmr_integration/ld_blocks_to_manifest.R new file mode 100644 index 00000000..efe5b788 --- /dev/null +++ b/code/script/pecotmr_integration/ld_blocks_to_manifest.R @@ -0,0 +1,70 @@ +#!/usr/bin/env Rscript +# ld_blocks_to_manifest.R +# +# Parse a BED-like LD-block file (`#chr`/`chrom` + `start` + `stop`/`end` +# columns; arbitrary leading-`#` header line) into a 2-column manifest +# TSV (`region`, `region_id`). The `region` is the canonical +# `chr:start-end` string; `region_id` is the SoS-safe sanitised form +# (`:` and `-` replaced with `_`) used as the per-task ID downstream. +# +# Inputs: +# --ld-blocks Path to the LD-blocks BED. +# --output Output manifest TSV path. + +suppressPackageStartupMessages({ + library(argparser) +}) + +parser <- arg_parser("Convert an LD-blocks BED into a region/region_id manifest") +parser <- add_argument(parser, "--ld-blocks", + help = "BED-like LD-block file", + type = "character") +parser <- add_argument(parser, "--output", + help = "Output manifest TSV path", + type = "character") +argv <- parse_args(parser) + +if (!file.exists(argv$ld_blocks)) + stop("--ld-blocks file not found: ", argv$ld_blocks) + +lines <- readLines(argv$ld_blocks) +if (length(lines) == 0L) + stop("Empty --ld-blocks file: ", argv$ld_blocks) + +# Header: strip a leading '#' if present, split on whitespace. +hdr <- strsplit(sub("^#", "", lines[[1L]]), "\\s+", perl = TRUE)[[1L]] +chrCol <- intersect(c("chr", "chrom"), hdr)[1L] +startCol <- intersect(c("start", "Start"), hdr)[1L] +endCol <- intersect(c("stop", "end", "End"), hdr)[1L] +if (is.na(chrCol) || is.na(startCol) || is.na(endCol)) + stop("--ld-blocks header missing chr / start / stop|end columns; got: ", + paste(hdr, collapse = ", ")) +chrIx <- match(chrCol, hdr) +startIx <- match(startCol, hdr) +endIx <- match(endCol, hdr) + +rows <- list() +for (l in lines[-1L]) { + l <- trimws(l) + if (!nzchar(l) || startsWith(l, "#")) next + parts <- strsplit(l, "\\s+", perl = TRUE)[[1L]] + if (length(parts) < max(chrIx, startIx, endIx)) next + chrom <- parts[[chrIx]] + start <- suppressWarnings(as.integer(parts[[startIx]])) + end <- suppressWarnings(as.integer(parts[[endIx]])) + if (is.na(start) || is.na(end)) next + region <- sprintf("%s:%d-%d", chrom, start, end) + region_id <- gsub("[:\\-]", "_", region) + rows[[length(rows) + 1L]] <- data.frame( + region = region, region_id = region_id, + stringsAsFactors = FALSE) +} +if (length(rows) == 0L) + stop("No data rows parsed from --ld-blocks ", argv$ld_blocks) + +out <- do.call(rbind, rows) +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +write.table(out, file = argv$output, sep = "\t", quote = FALSE, + row.names = FALSE, na = "") +cat(sprintf("Wrote LD-block manifest with %d row(s) to %s\n", + nrow(out), argv$output)) diff --git a/code/script/pecotmr_integration/legacy_mash_input_to_sumstatslist.R b/code/script/pecotmr_integration/legacy_mash_input_to_sumstatslist.R new file mode 100644 index 00000000..ee03b1b0 --- /dev/null +++ b/code/script/pecotmr_integration/legacy_mash_input_to_sumstatslist.R @@ -0,0 +1,149 @@ +#!/usr/bin/env Rscript +# legacy_mash_input_to_sumstatslist.R +# +# Standalone one-shot converter: reshape a pre-baked legacy MASH MWE +# input file (`protocol_example.mash_input.rds`, the partitioned output +# of the retired `susie_to_mash` workflow) into the named list of +# QtlSumStats objects that `pecotmr::mashPipeline()` accepts directly. +# +# Not part of the production pipeline. Use this once to materialise a +# `sumStatsList` RDS from the legacy MWE fixture so you can manually +# smoke `mashPipeline()` end-to-end, e.g.: +# +# Rscript legacy_mash_input_to_sumstatslist.R \ +# --legacy-mash-input input/mash_preprocessing/protocol_example.mash_input.rds \ +# --ld-sketch path/to/ld_sketch.rds \ +# --study protocol_example \ +# --output /tmp/sumstatslist.rds +# +# Rscript -e 'suppressPackageStartupMessages(library(pecotmr)); +# saveRDS(mashPipeline(readRDS("/tmp/sumstatslist.rds"), alpha = 0), +# "/tmp/mash_result.rds")' +# +# Input shape (verified against the shipped MWE fixture): +# list( +# strong.z, strong.b, strong.s, # matrices [n_variants_strong x n_conditions] +# random.z, random.b, random.s, # matrices [n_variants_random x n_conditions] +# null.z, null.b, null.s, # matrices [n_variants_null x n_conditions] +# ZtZ) +# +# Output shape (one entry per partition, ready to splice into mashPipeline): +# list( +# strong = QtlSumStats(...), +# random = QtlSumStats(...), +# null = QtlSumStats(...)) # null is optional +# +# Each QtlSumStats has one row per condition; the entry GRanges carries +# the partition's z/beta/se vectors as mcols (with a synthesised +# variant_id when the legacy matrices have no rownames). qcInfo is +# pre-stamped non-empty so mashPipeline()'s QC gate accepts the result +# without re-running summaryStatsQc. +# +# Inputs: +# --legacy-mash-input The pre-baked mash_input.rds. +# --ld-sketch GenotypeHandle RDS (required slot on +# QtlSumStats; never read by mashPipeline, +# only validated). +# --study Study label for the synthesised +# QtlSumStats. Default "study". +# --output Output sumStatsList RDS path. + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(GenomicRanges) + library(IRanges) + library(S4Vectors) +}) + +parser <- arg_parser("Convert legacy pre-baked mash_input.rds to a mashPipeline sumStatsList") +parser <- add_argument(parser, "--legacy-mash-input", + help = "Pre-baked MASH input RDS (strong/random/null .z/.b/.s matrices)", + type = "character") +parser <- add_argument(parser, "--ld-sketch", + help = "GenotypeHandle RDS to attach to the QtlSumStats", + type = "character") +parser <- add_argument(parser, "--study", + help = "Study label", type = "character", default = "study") +parser <- add_argument(parser, "--output", + help = "Output sumStatsList RDS path", type = "character") +argv <- parse_args(parser) + +if (!file.exists(argv$legacy_mash_input)) + stop("--legacy-mash-input file not found: ", argv$legacy_mash_input) +if (!file.exists(argv$ld_sketch)) + stop("--ld-sketch file not found: ", argv$ld_sketch) + +ld_handle <- readRDS(argv$ld_sketch) +if (!methods::is(ld_handle, "GenotypeHandle")) + stop("--ld-sketch must deserialise to a GenotypeHandle (got '", + class(ld_handle)[[1L]], "').") + +mi <- readRDS(argv$legacy_mash_input) +required <- c("strong.z", "random.z") +missing <- setdiff(required, names(mi)) +if (length(missing) > 0L) + stop("Pre-baked MASH input missing required component(s): ", + paste(missing, collapse = ", "), + " (got: ", paste(names(mi), collapse = ", "), ").") + +# Build a single QtlSumStats from one partition's matrices. Each +# condition becomes one row of the collection; its entry GRanges holds +# the partition's variants with Z/BETA/SE/N mcols pulled column-wise. +toQss <- function(partition, role) { + zMat <- partition[["z"]]; bMat <- partition[["b"]]; sMat <- partition[["s"]] + if (is.null(zMat) || nrow(zMat) == 0L || ncol(zMat) == 0L) return(NULL) + vids <- rownames(zMat) + if (is.null(vids)) + vids <- sprintf("var_%s_%05d", role, seq_len(nrow(zMat))) + contexts <- colnames(zMat) + if (is.null(contexts)) + stop("Partition '", role, "' has no column names; cannot infer conditions.") + entries <- lapply(seq_along(contexts), function(j) { + mcols_df <- DataFrame( + SNP = vids, + A1 = rep("A", length(vids)), + A2 = rep("G", length(vids)), + Z = as.numeric(zMat[, j]), + N = rep(1000L, length(vids))) + if (!is.null(bMat)) mcols_df$BETA <- as.numeric(bMat[, j]) + if (!is.null(sMat)) mcols_df$SE <- as.numeric(sMat[, j]) + gr <- GRanges(seqnames = rep("chr1", length(vids)), + ranges = IRanges(start = seq_along(vids), width = 1L)) + mcols(gr) <- mcols_df + gr + }) + QtlSumStats( + study = rep(argv$study, length(contexts)), + context = contexts, + trait = rep("mash", length(contexts)), + entry = entries, + genome = "GRCh38", + ldSketch = ld_handle, + qcInfo = list(source = "legacy_mash_input_to_sumstatslist", + role = role, + entryAudit = vector("list", length(contexts)))) +} + +# Pull the three (or two) partitions out of the flat key naming +# (`.z` / `.b` / `.s`). +partitionOf <- function(role) { + out <- list(z = mi[[paste0(role, ".z")]], + b = mi[[paste0(role, ".b")]], + s = mi[[paste0(role, ".s")]]) + if (is.null(out$z)) NULL else out +} + +sumStatsList <- list( + strong = toQss(partitionOf("strong"), "strong"), + random = toQss(partitionOf("random"), "random")) +nullPart <- partitionOf("null") +if (!is.null(nullPart)) sumStatsList$null <- toQss(nullPart, "null") + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(sumStatsList, argv$output, compress = "xz") +cat(sprintf(paste0("Wrote sumStatsList for mashPipeline (strong=%d, random=%d, ", + "null=%s rows x %d conditions) to %s\n"), + nrow(mi$strong.z), nrow(mi$random.z), + if (is.null(mi$null.z)) "none" else as.character(nrow(mi$null.z)), + ncol(mi$strong.z), argv$output)) diff --git a/code/script/pecotmr_integration/legacy_sumstats_db_to_mash.R b/code/script/pecotmr_integration/legacy_sumstats_db_to_mash.R new file mode 100644 index 00000000..69433c81 --- /dev/null +++ b/code/script/pecotmr_integration/legacy_sumstats_db_to_mash.R @@ -0,0 +1,173 @@ +#!/usr/bin/env Rscript +# legacy_sumstats_db_to_mash.R +# +# Convert the legacy `protocol_example.sumstats_db.rds` MASH fixture +# (nested list keyed by condition -> region -> {variant_names, sumstats}) +# into the inputs the new `pecotmr_integration/mash_preprocessing.ipynb` +# expects: +# +# * per-condition tensorqtl-style TSVs (one per chromosome, gzipped), +# filename embeds the chrom so the notebook's `_match_paths_for_chrom` +# helper finds them. +# * one `/{condition}.sum_list.txt` per condition listing +# its per-chrom TSV paths (one line per chrom). +# * a `/regions.tsv` file with columns `chr start end gene_id` +# (one row per region present in the fixture). +# * a `/ld_sketch.rds` GenotypeHandle built from a supplied +# PLINK2 prefix. +# +# Inputs: +# --legacy-rds Path to protocol_example.sumstats_db.rds. +# --plink2-prefix PLINK2 pgen/pvar/psam prefix (no extension) +# covering the variants in the fixture; the +# new pipeline needs a GenotypeHandle to feed +# mashPipeline()'s QC gate. +# --output-dir Output directory (created if absent). + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("Convert the legacy sumstats_db.rds fixture to new-pipeline MASH inputs") +parser <- add_argument(parser, "--legacy-rds", + help = "Path to the legacy protocol_example.sumstats_db.rds", + type = "character") +parser <- add_argument(parser, "--plink2-prefix", + help = "PLINK2 pgen/pvar/psam prefix for the LD sketch", + type = "character") +parser <- add_argument(parser, "--output-dir", + help = "Output directory", + type = "character") +argv <- parse_args(parser) + +if (!file.exists(argv$legacy_rds)) + stop("--legacy-rds file not found: ", argv$legacy_rds) +for (ext in c(".pgen", ".pvar", ".psam")) { + if (!file.exists(paste0(argv$plink2_prefix, ext))) + stop("PLINK2 file missing: ", paste0(argv$plink2_prefix, ext)) +} +dir.create(argv$output_dir, showWarnings = FALSE, recursive = TRUE) + +db <- readRDS(argv$legacy_rds) +if (!is.list(db) || length(db) == 0L) + stop("Legacy RDS is not a non-empty list (got ", class(db)[[1L]], + ", length ", length(db), ").") + +conditions <- names(db) +if (any(nchar(conditions) == 0L)) + stop("Every condition in the legacy RDS must be named.") + +# ----- Pass 1: enumerate regions present across all conditions ---------- +parseRegion <- function(s) { + m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] + if (length(m) != 4L) + stop("Cannot parse legacy region label '", s, + "'; expected 'chr:start-end'.") + list(chr = m[[2L]], start = as.integer(m[[3L]]), + end = as.integer(m[[4L]])) +} + +regionSet <- list() +for (cond in conditions) { + for (rid in names(db[[cond]])) { + if (is.null(regionSet[[rid]])) regionSet[[rid]] <- parseRegion(rid) + } +} +if (length(regionSet) == 0L) + stop("No regions found across any condition in the legacy RDS.") + +# Pre-compute per-condition variant_id chromosomes so we can group +# per-chrom TSV writes. Variant IDs are of the form 'chr22:15528227:A:G'. +chromOf <- function(vids) { + sub("^([^:]+):.*", "\\1", as.character(vids)) +} + +# ----- Pass 2: write per-condition per-chrom TSVs ----------------------- +sumListPaths <- character(length(conditions)) +names(sumListPaths) <- conditions +for (cond in conditions) { + # Concatenate every region of this condition into a single data frame + # (variant_id, z), then split by chromosome and write one TSV per chrom. + perRegion <- db[[cond]] + pieces <- list() + for (rid in names(perRegion)) { + rd <- perRegion[[rid]] + if (is.null(rd$variant_names) || is.null(rd$sumstats) || + is.null(rd$sumstats$z)) { + warning("Skipping region '", rid, "' in condition '", cond, + "': missing variant_names or sumstats$z.") + next + } + vids <- as.character(rd$variant_names) + z <- as.numeric(rd$sumstats$z) + if (length(vids) != length(z)) + stop("Length mismatch in condition '", cond, + "', region '", rid, "': ", length(vids), " variants vs ", + length(z), " z-scores.") + pieces[[length(pieces) + 1L]] <- data.frame( + variant_id = vids, z = z, stringsAsFactors = FALSE) + } + if (length(pieces) == 0L) { + warning("Condition '", cond, "' produced no rows; skipping.") + sumListPaths[[cond]] <- "" + next + } + df <- do.call(rbind, pieces) + df <- df[!duplicated(df$variant_id), , drop = FALSE] + + chroms <- chromOf(df$variant_id) + uchroms <- sort(unique(chroms)) + perCondPaths <- character(length(uchroms)) + names(perCondPaths) <- uchroms + for (chr in uchroms) { + sub <- df[chroms == chr, , drop = FALSE] + # Embed the bare chromosome number in the filename (matching the + # notebook's `.{chr_number}.` matching convention). + chrNum <- sub("^chr", "", chr) + outPath <- file.path(argv$output_dir, + sprintf("%s.%s.tsv.gz", cond, chrNum)) + gz <- gzfile(outPath, "w") + write.table(sub, file = gz, sep = "\t", quote = FALSE, + row.names = FALSE, na = "") + close(gz) + perCondPaths[[chr]] <- outPath + } + sumListPath <- file.path(argv$output_dir, + sprintf("%s.sum_list.txt", cond)) + writeLines(perCondPaths, sumListPath) + sumListPaths[[cond]] <- sumListPath + cat(sprintf("Wrote %d TSV(s) for condition '%s' (%d total variants)\n", + length(uchroms), cond, nrow(df))) +} + +# ----- Region TSV ------------------------------------------------------- +regionsTsv <- file.path(argv$output_dir, "regions.tsv") +regionRows <- do.call(rbind, lapply(names(regionSet), function(rid) { + r <- regionSet[[rid]] + # Synthesise a stable gene_id from the region label so the notebook can + # use it as a per-region filename token. + geneId <- gsub("[:\\-]", "_", rid) + data.frame(chr = r$chr, start = r$start, end = r$end, + gene_id = geneId, stringsAsFactors = FALSE) +})) +write.table(regionRows, file = regionsTsv, sep = "\t", + quote = FALSE, row.names = FALSE) +cat(sprintf("Wrote regions TSV (%d rows) to %s\n", + nrow(regionRows), regionsTsv)) + +# ----- LD-sketch GenotypeHandle ---------------------------------------- +ldHandle <- GenotypeHandle(plink2Prefix = argv$plink2_prefix) +ldOut <- file.path(argv$output_dir, "ld_sketch.rds") +saveRDS(ldHandle, ldOut) +cat(sprintf("Wrote GenotypeHandle (path '%s') to %s\n", + argv$plink2_prefix, ldOut)) + +# Print a tiny manifest summary so the user knows what to pass downstream. +cat("\n--- Conversion summary ---\n") +cat("output dir: ", argv$output_dir, "\n", sep = "") +cat("conditions: ", paste(conditions, collapse = ", "), "\n", sep = "") +cat("--conditions arg: ", paste(conditions, collapse = ","), "\n", sep = "") +cat("--sum-files args: ", paste(unname(sumListPaths), collapse = " "), "\n", sep = "") +cat("--region-file arg: ", regionsTsv, "\n", sep = "") +cat("--ld-sketch arg: ", ldOut, "\n", sep = "") diff --git a/code/script/pecotmr_integration/mash.R b/code/script/pecotmr_integration/mash.R new file mode 100644 index 00000000..975466cc --- /dev/null +++ b/code/script/pecotmr_integration/mash.R @@ -0,0 +1,241 @@ +#!/usr/bin/env Rscript +# mash.R +# +# Estimate the MASH mixture-component covariance + weights via +# pecotmr::mashPipeline(). Consumes the per-region RDSes produced by +# mash_sumstats_construct.R (one per LD-block region), partitions into +# strong / random / null subsets via pecotmr::mashRandNullSample, wraps +# each partition as a per-context QtlSumStats collection (with a +# pass-through QC record), and calls mashPipeline. +# +# Inputs: +# --mash-inputs f1.rds [f2.rds ...] Per-region RDSes (each is +# list(region_id = list(z, region))). +# --study Study label for the synthesised +# QtlSumStats. Default "study". +# --ld-sketch GenotypeHandle RDS to embed in +# the synthesised QtlSumStats +# (required by mashPipeline). +# --n-random / --n-null Random / null subset sizes +# passed to mashRandNullSample. +# Defaults 4000 / 4000. +# --exclude-condition c1,c2,... Conditions to drop from the +# per-context QtlSumStats before +# running MASH. Default none. +# --alpha alpha argument to mashPipeline(). +# Default 0 (standard scale). +# --seed RNG seed. Default 999. +# --vhat-rds Optional pre-computed residual +# correlation matrix (vhat). When +# supplied, replaces +# estimate_null_correlation_simple() +# inside mashPipeline; the wrapper +# readRDS()s the file and passes +# the matrix as an in-memory R +# object (no I/O inside pecotmr). +# --prior-rds Optional pre-computed prior +# covariance matrices. The RDS is +# expected to contain either a +# named list of square matrices +# directly, OR a list with a `$U` +# slot holding that list (the +# legacy `protocol_example.EE.prior.rds` +# shape). The wrapper extracts and +# forwards the U list to +# mashPipeline's `priorCovariances` +# argument. +# --output MASH result RDS (U + w + meta). + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(GenomicRanges) + library(IRanges) + library(S4Vectors) +}) + +parser <- arg_parser("Run pecotmr::mashPipeline on per-region MASH inputs") +parser <- add_argument(parser, "--mash-inputs", + help = "Per-region MASH input RDSes", + type = "character", nargs = Inf) +parser <- add_argument(parser, "--study", + help = "Study label for the synthesised QtlSumStats", + type = "character", default = "study") +parser <- add_argument(parser, "--ld-sketch", + help = "GenotypeHandle RDS (required by mashPipeline's QC gate)", + type = "character", default = "") +parser <- add_argument(parser, "--n-random", + help = "Random subset size for mashRandNullSample", + type = "integer", default = 4000L) +parser <- add_argument(parser, "--n-null", + help = "Null subset size for mashRandNullSample", + type = "integer", default = 4000L) +parser <- add_argument(parser, "--exclude-condition", + help = "Comma-separated conditions to drop", + type = "character", default = "") +parser <- add_argument(parser, "--alpha", + help = "alpha argument to mashPipeline()", + type = "numeric", default = 0) +parser <- add_argument(parser, "--seed", + help = "RNG seed", type = "integer", default = 999L) +parser <- add_argument(parser, "--vhat-rds", + help = "Optional pre-computed residual correlation matrix RDS", + type = "character", default = "") +parser <- add_argument(parser, "--prior-rds", + help = "Optional pre-computed prior-covariance RDS (matrix list or list with $U)", + type = "character", default = "") +parser <- add_argument(parser, "--output", + help = "Output MASH-result RDS path", + type = "character") +argv <- parse_args(parser) + +# Load the optional pre-computed mash artefacts at the I/O boundary so +# the pecotmr-side call stays purely in-memory. +vhat_obj <- NULL +if (nzchar(argv$vhat_rds)) { + if (!file.exists(argv$vhat_rds)) + stop("--vhat-rds file not found: ", argv$vhat_rds) + vhat_obj <- readRDS(argv$vhat_rds) + if (!is.matrix(vhat_obj) || !is.numeric(vhat_obj)) + stop("--vhat-rds must deserialise to a numeric matrix (got '", + class(vhat_obj)[[1L]], "').") + if (nrow(vhat_obj) != ncol(vhat_obj)) + stop("--vhat-rds matrix must be square; got ", + nrow(vhat_obj), " x ", ncol(vhat_obj), ".") +} +prior_U <- NULL +if (nzchar(argv$prior_rds)) { + if (!file.exists(argv$prior_rds)) + stop("--prior-rds file not found: ", argv$prior_rds) + raw <- readRDS(argv$prior_rds) + # Accept either a named list of matrices OR a list with $U. + prior_U <- if (is.list(raw) && !is.null(raw$U)) raw$U else raw + if (!is.list(prior_U) || length(prior_U) == 0L || + is.null(names(prior_U)) || any(names(prior_U) == "")) + stop("--prior-rds must hold (or have a $U slot containing) a non-empty ", + "named list of square matrices.") +} + +inputs <- as.character(argv$mash_inputs) +if (length(inputs) == 0L) + stop("--mash-inputs requires at least one per-region RDS.") +if (!nzchar(argv$ld_sketch) || !file.exists(argv$ld_sketch)) + stop("--ld-sketch is required and must point at a GenotypeHandle RDS ", + "(mashPipeline gates on QC'd sumstats, which carry the ldSketch).") +ld_handle <- readRDS(argv$ld_sketch) +if (!methods::is(ld_handle, "GenotypeHandle")) + stop("--ld-sketch must deserialise to a GenotypeHandle (got '", + class(ld_handle)[[1L]], "').") + +exclude <- if (nzchar(argv$exclude_condition)) { + trimws(strsplit(argv$exclude_condition, ",", fixed = TRUE)[[1L]]) +} else character(0) + +# Load and concatenate per-region inputs. Each entry is +# list(region_id = list(z = matrix, region = chr:start-end)) +dat <- list() +for (path in inputs) { + x <- readRDS(path) + if (!is.list(x) || length(x) == 0L) + stop(path, " is not a non-empty list (expected per-region MASH input).") + for (rid in names(x)) { + if (rid %in% names(dat)) + stop("Duplicate region_id '", rid, "' across --mash-inputs (", + path, "); regions must be unique.") + dat[[rid]] <- x[[rid]] + } +} + +# Build a single concatenated `dat` for mashRandNullSample. We require +# every region to share the same condition set (so column-binding the +# z-matrices stays meaningful for MASH). +conditions <- colnames(dat[[1L]]$z) +for (rid in names(dat)) { + if (!identical(colnames(dat[[rid]]$z), conditions)) + stop("Region '", rid, + "' has a different condition set than the first region '", + names(dat)[[1L]], + "'; mashRandNullSample requires a common set of conditions.") +} +zStack <- do.call(rbind, lapply(dat, function(x) x$z)) +cat(sprintf("Stacked %d region(s) -> %d variants x %d conditions\n", + length(dat), nrow(zStack), ncol(zStack))) + +# Partition into strong / random / null. Strong is the input itself +# (max(|z|) >= threshold variants are downstream-filtered by mash); +# random / null are the subsets mashRandNullSample emits. +partition <- mashRandNullSample( + list(z = zStack), + nRandom = argv$n_random, + nNull = argv$n_null, + excludeCondition = exclude, + seed = argv$seed) +randomZ <- partition$random$z +nullZ <- partition$null$z +if (is.null(randomZ) || nrow(randomZ) == 0L) + stop("mashRandNullSample returned an empty random subset; ", + "increase --n-random or check the input z-matrix.") +strongZ <- zStack +# Drop excluded conditions if mashRandNullSample applied them. +if (length(exclude) > 0L) { + keepCols <- setdiff(colnames(strongZ), exclude) + strongZ <- strongZ[, keepCols, drop = FALSE] +} + +# Wrap each partition as a QtlSumStats collection: one row per +# (study, context, trait) with a single GRanges entry of the partition's +# variants. Use synthetic positions when variant IDs don't parse as +# "chr:pos:..." (mashPipeline only reads Z out of mcols). +.toQss <- function(zMat, role) { + vids <- rownames(zMat) + if (is.null(vids)) vids <- paste0("var", seq_len(nrow(zMat))) + # Try chr:pos:... decode; otherwise synthesise chr1 positions. + m <- regmatches(vids, regexec("^([^:_]+)[:_]([0-9]+)", vids)) + chrom <- vapply(m, function(x) if (length(x) >= 2L) x[[2L]] else "chr1", + character(1L)) + pos <- suppressWarnings(vapply(m, + function(x) if (length(x) >= 3L) as.integer(x[[3L]]) else NA_integer_, + integer(1L))) + pos[is.na(pos)] <- seq_along(pos)[is.na(pos)] + # One per-context entry. + entries <- lapply(seq_len(ncol(zMat)), function(j) { + gr <- GRanges(seqnames = chrom, + ranges = IRanges(start = pos, width = 1L)) + mcols(gr) <- DataFrame( + SNP = vids, A1 = rep("A", length(vids)), A2 = rep("G", length(vids)), + Z = as.numeric(zMat[, j]), + N = rep(1000L, length(vids))) + gr + }) + ctxs <- colnames(zMat) + qss <- QtlSumStats( + study = rep(argv$study, ncol(zMat)), + context = ctxs, + trait = rep("mash", ncol(zMat)), + entry = entries, + genome = "GRCh38", + ldSketch = ld_handle, + qcInfo = list(role = role, + entryAudit = vector("list", ncol(zMat)))) + qss +} + +sumStatsList <- list(strong = .toQss(strongZ, "strong"), + random = .toQss(randomZ, "random")) +if (!is.null(nullZ) && nrow(nullZ) > 0L) + sumStatsList$null <- .toQss(nullZ, "null") + +cat(sprintf("Built sumStatsList: strong=%d, random=%d, null=%s\n", + nrow(strongZ), nrow(randomZ), + if (is.null(nullZ)) "(none)" else as.character(nrow(nullZ)))) + +res <- mashPipeline( + sumStatsList = sumStatsList, + alpha = argv$alpha, + residualCorrelation = vhat_obj, + priorCovariances = prior_U, + setSeed = argv$seed) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(res, argv$output, compress = "xz") +cat(sprintf("Wrote MASH result (U + w) to %s\n", argv$output)) diff --git a/code/script/pecotmr_integration/mash_manifest.R b/code/script/pecotmr_integration/mash_manifest.R new file mode 100644 index 00000000..6850b28d --- /dev/null +++ b/code/script/pecotmr_integration/mash_manifest.R @@ -0,0 +1,122 @@ +#!/usr/bin/env Rscript +# mash_manifest.R +# +# Resolve `mash_preprocessing.ipynb`'s --region-file + --sum-files + --conditions +# into a single per-region manifest TSV. Each row carries the chrom-matched +# per-condition tensorqtl paths so the downstream `[mash_inputs_from_tensorqtl]` +# step can fan out one task per region without any Python parsing. +# +# Inputs: +# --region-file Region list with columns `chr start end gene_id`. +# Header rows (start/end not numeric) are skipped. +# --sum-files [...] One text file per condition; each line lists +# one tensorqtl summary-stats path. The file +# names embed the chromosome (legacy `.CHR.` +# convention) so we can pick the right path +# per region. +# --conditions c1,c2,... Condition labels matching --sum-files order. +# --output Output manifest TSV path. +# +# Output TSV columns: +# gene_id, chr, start, end, region, tensorqtl_paths +# where `tensorqtl_paths` is a space-separated list (one per condition, same +# order as --conditions) of the per-condition tensorqtl files for this region. + +suppressPackageStartupMessages({ + library(argparser) +}) + +parser <- arg_parser("Build a per-region manifest TSV for mash_preprocessing.ipynb") +parser <- add_argument(parser, "--region-file", + help = "Region list TSV (chr/start/end/gene_id)", + type = "character") +parser <- add_argument(parser, "--sum-files", + help = "One text file per condition", + type = "character", nargs = Inf) +parser <- add_argument(parser, "--conditions", + help = "Comma-separated condition labels (one per --sum-files)", + type = "character") +parser <- add_argument(parser, "--output", + help = "Output manifest TSV path", + type = "character") +argv <- parse_args(parser) + +if (!file.exists(argv$region_file)) + stop("--region-file not found: ", argv$region_file) +sumFiles <- as.character(argv$sum_files) +conditions <- trimws(strsplit(argv$conditions, ",", fixed = TRUE)[[1L]]) +if (length(sumFiles) != length(conditions)) + stop("--sum-files length (", length(sumFiles), ") must match --conditions ", + "length (", length(conditions), ").") +for (sf in sumFiles) if (!file.exists(sf)) + stop("--sum-files entry not found: ", sf) + +# Parse the region file (skip headers + non-numeric rows). +parseRegions <- function(path) { + lines <- readLines(path) + out <- list() + for (l in lines) { + l <- trimws(l) + if (!nzchar(l) || startsWith(l, "#")) next + parts <- strsplit(l, "\\s+")[[1L]] + if (length(parts) < 4L) next + sNum <- suppressWarnings(as.integer(parts[[2L]])) + eNum <- suppressWarnings(as.integer(parts[[3L]])) + if (is.na(sNum) || is.na(eNum)) next + chr <- if (startsWith(parts[[1L]], "chr")) parts[[1L]] else paste0("chr", parts[[1L]]) + out[[length(out) + 1L]] <- data.frame( + chr = chr, start = sNum, end = eNum, + gene_id = parts[[4L]], stringsAsFactors = FALSE) + } + if (length(out) == 0L) + stop("No data rows parsed from --region-file ", path) + do.call(rbind, out) +} +regions <- parseRegions(argv$region_file) + +# For each sum_files entry, read every line once and keep a per-chrom lookup +# (legacy `.CHR.` naming convention). +matchPerChrom <- function(sumFile) { + lines <- readLines(sumFile) + perChrom <- list() + for (line in lines) { + line <- trimws(line) + if (!nzchar(line)) next + m <- regmatches(line, regexec("\\.([0-9XYM]+)\\.", line))[[1L]] + if (length(m) < 2L) next + chrNum <- m[[2L]] + perChrom[[chrNum]] <- c(perChrom[[chrNum]], line) + } + perChrom +} +perCondLookup <- lapply(sumFiles, matchPerChrom) + +rows <- list() +for (i in seq_len(nrow(regions))) { + chrNum <- sub("^chr", "", regions$chr[[i]]) + paths <- character(length(conditions)) + for (j in seq_along(conditions)) { + hits <- perCondLookup[[j]][[chrNum]] + if (is.null(hits) || length(hits) == 0L) + stop("No path matching .", chrNum, ". in ", sumFiles[[j]], + " (needed for condition '", conditions[[j]], + "', region ", regions$chr[[i]], ":", regions$start[[i]], + "-", regions$end[[i]], ").") + paths[[j]] <- hits[[1L]] + } + rows[[length(rows) + 1L]] <- data.frame( + gene_id = regions$gene_id[[i]], + chr = regions$chr[[i]], + start = regions$start[[i]], + end = regions$end[[i]], + region = sprintf("%s:%d-%d", regions$chr[[i]], + regions$start[[i]], regions$end[[i]]), + tensorqtl_paths = paste(paths, collapse = " "), + stringsAsFactors = FALSE) +} +out <- do.call(rbind, rows) +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +write.table(out, file = argv$output, sep = "\t", quote = FALSE, + row.names = FALSE, na = "") +cat(sprintf("Wrote MASH manifest with %d row(s) (%d regions x %d conditions) to %s\n", + nrow(out), nrow(regions), length(conditions), argv$output)) diff --git a/code/script/pecotmr_integration/mash_sumstats_construct.R b/code/script/pecotmr_integration/mash_sumstats_construct.R new file mode 100644 index 00000000..f241f655 --- /dev/null +++ b/code/script/pecotmr_integration/mash_sumstats_construct.R @@ -0,0 +1,101 @@ +#!/usr/bin/env Rscript +# mash_sumstats_construct.R +# +# Per-region MASH input builder. Reads one or more multi-context tensorqtl +# summary-statistics files (one per condition / trait combination) for a +# single LD-block region and writes a per-region RDS in the legacy "dat" +# shape that mash.R consumes: +# +# list( = list(z = )) +# +# The downstream mash.R worker concatenates per-region RDSes, partitions +# into strong / random / null subsets via pecotmr::mashRandNullSample, +# wraps each partition as a QtlSumStats, runs summaryStatsQc, and calls +# pecotmr::mashPipeline. +# +# Inputs: +# --tensorqtl-paths file1 [file2 ...] Per-condition tensorqtl +# summary-stats files for THIS +# region. Each file must carry +# a `variant_id` column and +# either `z` or `tstat`. +# --conditions c1,c2,... Condition labels, same order +# and length as --tensorqtl-paths. +# --region chr:start-end Genomic interval label (used +# as the region_id key in the +# output RDS). +# --output Output path. + +suppressPackageStartupMessages({ + library(argparser) +}) + +parser <- arg_parser("Build a per-region MASH input RDS from per-condition tensorqtl outputs") +parser <- add_argument(parser, "--tensorqtl-paths", + help = "Per-condition tensorqtl sumstat files", + type = "character", nargs = Inf) +parser <- add_argument(parser, "--conditions", + help = "Comma-separated condition labels (one per path)", + type = "character") +parser <- add_argument(parser, "--region", + help = "Genomic interval as chr:start-end (used as region_id)", + type = "character") +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +paths <- as.character(argv$tensorqtl_paths) +conditions <- trimws(strsplit(argv$conditions, ",", fixed = TRUE)[[1L]]) +if (length(paths) == 0L) + stop("--tensorqtl-paths requires at least one path.") +if (length(paths) != length(conditions)) + stop("--tensorqtl-paths length (", length(paths), + ") must match --conditions length (", length(conditions), ").") + +# Read one file → return data.frame(variant_id, z) +read_one <- function(path) { + open_fn <- if (grepl("\\.gz$", path)) function(p) gzfile(p) + else function(p) p + df <- read.table(open_fn(path), header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") + pick <- function(opts) intersect(opts, names(df))[1L] + vid <- pick(c("variant_id", "SNP", "snp", "rsid")) + zcol <- pick(c("z", "Z")) + tcol <- pick(c("tstat", "tstatistic", "tstat_val")) + if (is.na(vid)) + stop(path, " is missing a variant_id column (looked for variant_id / SNP).") + if (is.na(zcol) && is.na(tcol)) + stop(path, " has neither z nor tstat column.") + z <- if (!is.na(zcol)) as.numeric(df[[zcol]]) else as.numeric(df[[tcol]]) + data.frame(variant_id = as.character(df[[vid]]), + z = z, stringsAsFactors = FALSE) +} + +# Read every condition, then inner-join on variant_id so the z-matrix is +# rectangular (variants × conditions). This mirrors what the legacy +# `load_multitrait_*_sumstat` helpers did before they were retired. +perCondition <- lapply(paths, read_one) +common <- Reduce(intersect, lapply(perCondition, `[[`, "variant_id")) +if (length(common) == 0L) + stop("No variant overlaps across the supplied --tensorqtl-paths for region ", + argv$region, ".") +zMat <- matrix(NA_real_, nrow = length(common), ncol = length(conditions), + dimnames = list(common, conditions)) +for (i in seq_along(perCondition)) { + d <- perCondition[[i]] + idx <- match(common, d$variant_id) + zMat[, i] <- d$z[idx] +} +zMat <- zMat[stats::complete.cases(zMat), , drop = FALSE] +if (nrow(zMat) == 0L) + stop("No variants left after dropping rows with NA in any condition for region ", + argv$region, ".") + +out <- list() +out[[argv$region]] <- list(z = zMat, region = argv$region) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(out, argv$output, compress = "xz") +cat(sprintf("Wrote MASH input for region %s: %d variants x %d conditions to %s\n", + argv$region, nrow(zMat), ncol(zMat), argv$output)) diff --git a/code/script/pecotmr_integration/twas_manifest_validate.R b/code/script/pecotmr_integration/twas_manifest_validate.R new file mode 100644 index 00000000..db77d54f --- /dev/null +++ b/code/script/pecotmr_integration/twas_manifest_validate.R @@ -0,0 +1,81 @@ +#!/usr/bin/env Rscript +# twas_manifest_validate.R +# +# Validate the per-gene TWAS manifest used by twas.ipynb and emit a +# canonicalised copy (header has no leading '#'; required columns +# present; gene_id values unique; optional fine_mapping_result_rds +# column normalised). Downstream SoS steps fan out over the canonical +# copy via csv.DictReader without needing any Python parsing logic. +# +# Inputs: +# --manifest Per-gene manifest with at least the columns +# `gene_id`, `twas_weights_rds`, `gwas_sumstats_rds` +# (and optionally `fine_mapping_result_rds`). +# A leading '#' on the header line is tolerated +# and stripped. +# --output Canonical manifest TSV path. + +suppressPackageStartupMessages({ + library(argparser) +}) + +parser <- arg_parser("Validate + canonicalise the per-gene TWAS manifest") +parser <- add_argument(parser, "--manifest", + help = "Per-gene TWAS manifest TSV", + type = "character") +parser <- add_argument(parser, "--output", + help = "Canonical manifest TSV path", + type = "character") +argv <- parse_args(parser) + +if (!file.exists(argv$manifest)) + stop("--manifest file not found: ", argv$manifest) + +lines <- readLines(argv$manifest) +if (length(lines) == 0L) + stop("Empty manifest: ", argv$manifest) + +# Tolerate (and strip) a leading '#' on the header line. +hdr <- strsplit(sub("^#", "", lines[[1L]]), "\t", fixed = TRUE)[[1L]] +required <- c("gene_id", "twas_weights_rds", "gwas_sumstats_rds") +missing <- setdiff(required, hdr) +if (length(missing) > 0L) + stop("Manifest missing required column(s): ", + paste(missing, collapse = ", "), + " (header was: ", paste(hdr, collapse = ", "), ").") +hasFmr <- "fine_mapping_result_rds" %in% hdr + +rows <- list() +for (line in lines[-1L]) { + if (!nzchar(line)) next + parts <- strsplit(line, "\t", fixed = TRUE)[[1L]] + # Pad short rows with empty strings so column alignment stays sane. + if (length(parts) < length(hdr)) + parts <- c(parts, rep("", length(hdr) - length(parts))) + row <- setNames(as.list(parts[seq_along(hdr)]), hdr) + if (!nzchar(row$gene_id)) next + rows[[length(rows) + 1L]] <- row +} +if (length(rows) == 0L) + stop("Manifest has no data rows: ", argv$manifest) + +geneIds <- vapply(rows, `[[`, character(1), "gene_id") +if (anyDuplicated(geneIds)) + stop("Manifest has duplicate gene_id values: ", + paste(unique(geneIds[duplicated(geneIds)]), collapse = ", ")) + +outCols <- c("gene_id", "twas_weights_rds", "gwas_sumstats_rds", + if (hasFmr) "fine_mapping_result_rds" else NULL) +outDf <- do.call(rbind, lapply(rows, function(r) { + as.data.frame(setNames(lapply(outCols, function(k) { + v <- r[[k]]; if (is.null(v)) "" else as.character(v) + }), outCols), stringsAsFactors = FALSE) +})) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +write.table(outDf, file = argv$output, sep = "\t", quote = FALSE, + row.names = FALSE, na = "") +cat(sprintf("Canonicalised manifest (%d rows, %s) to %s\n", + nrow(outDf), + if (hasFmr) "with fine_mapping_result_rds" + else "no fine_mapping_result_rds column", argv$output)) diff --git a/code/script/pecotmr_integration/twas_weights.R b/code/script/pecotmr_integration/twas_weights.R index 4fe72878..618d1b61 100644 --- a/code/script/pecotmr_integration/twas_weights.R +++ b/code/script/pecotmr_integration/twas_weights.R @@ -15,10 +15,21 @@ # --gene-id (gene mode) trait identifier # --cis-window (gene mode) cis-window in bp # --region (region mode) chr:start-end string +# --methods Comma-separated method tokens (default +# "default"; "default" expands inside pecotmr). +# Pass e.g. "mrmash,mvsusie" for multivariate +# TWAS weights. # --fine-mapping-result Optional pre-fit FineMappingResult RDS; # SuSiE-family methods reuse the cached fits # via the fineMappingResult cache. Pass "" or # "." to skip. +# --method-args Optional JSON object of per-method kwargs +# spliced into twasWeightsPipeline() via its +# named-list methods= argument. Keys are +# method tokens (must be a subset of --methods); +# values are kwargs lists forwarded to the +# underlying per-method learner. Example: +# '{"lasso":{"nfolds":10}}'. # --output Output RDS path (one TwasWeights) suppressPackageStartupMessages({ @@ -26,6 +37,7 @@ suppressPackageStartupMessages({ library(pecotmr) library(GenomicRanges) library(IRanges) + library(jsonlite) }) parser <- arg_parser("Per-gene or per-region default-preset TWAS weights over a pre-built QtlDataset") @@ -41,13 +53,64 @@ parser <- add_argument(parser, "--cis-window", parser <- add_argument(parser, "--region", help = "Genomic region as chr:start-end (region mode); mutually exclusive with --gene-id", type = "character", default = "") +parser <- add_argument(parser, "--methods", + help = "Comma-separated TWAS method tokens (default 'default')", + type = "character", default = "default") parser <- add_argument(parser, "--fine-mapping-result", help = "Optional pre-fit FineMappingResult RDS", type = "character", default = "") +parser <- add_argument(parser, "--method-args", + help = "JSON object {token: {kwarg: value, ...}, ...} for twasWeightsPipeline()", + type = "character", default = "") parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) +# Parse --method-args into a nested named list of per-method kwargs. +parsed_method_args <- if (nzchar(argv$method_args) && argv$method_args != "." && + argv$method_args != "{}") { + parsed <- tryCatch(jsonlite::fromJSON(argv$method_args, simplifyVector = FALSE), + error = function(e) stop( + "--method-args must be a JSON object string (got: ", + argv$method_args, "). Error: ", conditionMessage(e))) + if (!is.list(parsed) || is.null(names(parsed)) || any(names(parsed) == "")) + stop("--method-args must be a JSON object whose keys are method ", + "tokens, e.g. '{\"lasso\":{\"nfolds\":10}}'.") + nonObject <- vapply(parsed, function(x) !is.list(x), logical(1)) + if (any(nonObject)) + stop("--method-args: each value must itself be an object of kwargs ", + "(got non-object for: ", + paste(names(parsed)[nonObject], collapse = ", "), ").") + parsed +} else { + NULL +} + +# Normalize --methods into the form twasWeightsPipeline expects: a +# character vector of method tokens, or "default" (the preset). When +# --method-args is supplied, build the named-list form {token: kwargs} +# that twasWeightsPipeline accepts directly. The "default" preset can't +# carry per-method kwargs (it expands inside pecotmr to a fixed token +# set we can't see here) — explicit --methods is required in that case. +methods <- trimws(strsplit(argv$methods, ",", fixed = TRUE)[[1L]]) +methods_arg <- if (is.null(parsed_method_args)) { + if (length(methods) == 1L && methods == "default") "default" else methods +} else { + if (length(methods) == 1L && methods == "default") + stop("--method-args is only supported with explicit --methods (got ", + "--methods 'default'); list the method tokens you want to tune ", + "explicitly, e.g. --methods lasso,enet --method-args '{\"lasso\":{\"nfolds\":10}}'.") + unknown <- setdiff(names(parsed_method_args), methods) + if (length(unknown) > 0L) + stop("--method-args has keys not listed in --methods (got '", + paste(unknown, collapse = ", "), + "'; --methods = '", paste(methods, collapse = ", "), "').") + setNames(lapply(methods, function(tk) { + if (tk %in% names(parsed_method_args)) parsed_method_args[[tk]] + else list() + }), methods) +} + parse_region <- function(s) { m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] if (length(m) != 4L) @@ -74,19 +137,15 @@ fmr <- if (nzchar(fmr_path) && fmr_path != "." && file.exists(fmr_path)) { } res <- if (has_region) { - twasWeightsPipeline( - qd, - methods = NULL, - region = parse_region(argv$region), - cisWindow = argv$cis_window, - fineMappingResult = fmr) + twasWeightsPipeline(qd, methods = methods_arg, + region = parse_region(argv$region), + cisWindow = argv$cis_window, + fineMappingResult = fmr) } else { - twasWeightsPipeline( - qd, - methods = NULL, - traitId = argv$gene_id, - cisWindow = argv$cis_window, - fineMappingResult = fmr) + twasWeightsPipeline(qd, methods = methods_arg, + traitId = argv$gene_id, + cisWindow = argv$cis_window, + fineMappingResult = fmr) } dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE)