From 4062750494c5dcd217fa4bd603163d2e42e05c4c Mon Sep 17 00:00:00 2001 From: Daniel Nachun Date: Mon, 22 Jun 2026 16:30:03 -0700 Subject: [PATCH] new S4 updates --- code/SoS/pecotmr_integration/colocboost.ipynb | 8 +- code/SoS/pecotmr_integration/ctwas.ipynb | 102 ++++++ code/SoS/pecotmr_integration/enloc.ipynb | 89 +++++ .../pecotmr_integration/fine_mapping.ipynb | 23 +- .../pecotmr_integration/gwas_sumstats.ipynb | 80 ++++ .../SoS/pecotmr_integration/qtl_dataset.ipynb | 8 +- code/SoS/pecotmr_integration/twas.ipynb | 80 ++++ .../pecotmr_integration/twas_weights.ipynb | 8 +- code/script/pecotmr_integration/coloc.R | 96 +++++ code/script/pecotmr_integration/colocboost.R | 128 ++++++- .../pecotmr_integration/ctwas_assemble.R | 130 +++++++ code/script/pecotmr_integration/ctwas_est.R | 78 ++++ .../pecotmr_integration/ctwas_finemap.R | 87 +++++ .../script/pecotmr_integration/fine_mapping.R | 116 ++++-- .../gwas_sumstats_construct.R | 189 ++++++++++ .../legacy_ctwas_weights_to_s4.R | 157 ++++++++ .../legacy_twas_weights_to_s4.R | 206 +++++++++++ .../qtl_dataset_construct.R | 168 ++++++--- .../pecotmr_integration/qtl_enrichment.R | 68 ++++ code/script/pecotmr_integration/twas.R | 62 ++++ .../script/pecotmr_integration/twas_weights.R | 79 ++-- dev/pecotmr_migration_survey.md | 87 +++++ dev/python-replacement-scope.md | 174 +++++++++ dev/s4_migration_plan.md | 345 ++++++++++++++++++ 24 files changed, 2419 insertions(+), 149 deletions(-) create mode 100644 code/SoS/pecotmr_integration/ctwas.ipynb create mode 100644 code/SoS/pecotmr_integration/enloc.ipynb create mode 100644 code/SoS/pecotmr_integration/gwas_sumstats.ipynb create mode 100644 code/SoS/pecotmr_integration/twas.ipynb create mode 100644 code/script/pecotmr_integration/coloc.R create mode 100644 code/script/pecotmr_integration/ctwas_assemble.R create mode 100644 code/script/pecotmr_integration/ctwas_est.R create mode 100644 code/script/pecotmr_integration/ctwas_finemap.R create mode 100644 code/script/pecotmr_integration/gwas_sumstats_construct.R create mode 100644 code/script/pecotmr_integration/legacy_ctwas_weights_to_s4.R create mode 100644 code/script/pecotmr_integration/legacy_twas_weights_to_s4.R create mode 100644 code/script/pecotmr_integration/qtl_enrichment.R create mode 100644 code/script/pecotmr_integration/twas.R create mode 100644 dev/pecotmr_migration_survey.md create mode 100644 dev/python-replacement-scope.md create mode 100644 dev/s4_migration_plan.md diff --git a/code/SoS/pecotmr_integration/colocboost.ipynb b/code/SoS/pecotmr_integration/colocboost.ipynb index a5d94d90..f0e2e4b0 100644 --- a/code/SoS/pecotmr_integration/colocboost.ipynb +++ b/code/SoS/pecotmr_integration/colocboost.ipynb @@ -6,7 +6,7 @@ "kernel": "SoS" }, "source": [ - "# Per-gene colocboost across contexts in a single study\n\n## Description\n\nFor a single study's `QtlDataset` RDS, run `pecotmr::colocboostPipeline()` per focal gene. Each task colocalizes the focal gene's signal across the QtlDataset's contexts (the within-study, cross-context case \u2014 no GWAS input). Gene-level parallelization is via SoS task fan-out: every task loads the same RDS and runs one focal gene.\n\n## Inputs\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes` — explicit gene-ID list (one focal gene per task).\n- `--cis-window` — bp window around each focal gene's TSS for variant selection. Default 1,000,000.\n\n## Output\n\n- `{cwd}/{study}.{gene}.colocboost.rds` per gene.\n" + "# Per-gene / per-region colocboost across contexts (and optionally GWAS) in a single study\n\n## Description\n\nFor a single study's `QtlDataset` RDS, run `pecotmr::colocboostPipeline()` per fan-out unit (gene or region). Three pipeline variants can be enabled independently:\n\n- `--xqtl-coloc` (default ON; opt out with `--no-xqtl-coloc=true` or `xqtl_coloc=False` in SoS): within-study cross-context QTL coloc.\n- `--joint-gwas` (default OFF): xQTL + GWAS joint coloc with the focal outcome from the GWAS. Requires `--gwas-sumstats`.\n- `--separate-gwas` (default OFF): one focal-GWAS coloc per GWAS study. Requires `--gwas-sumstats`.\n\nWhen `--gwas-sumstats` is supplied it must point at a QC'd `GwasSumStats` RDS (output of `summaryStatsQc()` upstream).\n\n## Inputs\n\n- `--qtl-dataset` — QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes ID1 ID2 \u2026` **or** `--regions chr:start-end \u2026` (mutually exclusive).\n- `--cis-window` — bp window (gene mode default 1e6).\n- `--gwas-sumstats` — QC'd GwasSumStats RDS path (required for joint/separate variants).\n- `--xqtl-coloc` / `--joint-gwas` / `--separate-gwas` — pipeline-variant toggles.\n- `--modular-script-dir` — directory containing the per-gene worker R scripts. Default `code/script`.\n\n## Output\n\n- `{cwd}/{study}.{gene|region}.colocboost.rds` per fan-out unit. The RDS holds the colocboostPipeline list with slots `xqtl_coloc`, `joint_gwas`, `separate_gwas`, `computing_time`; unused-variant slots are NULL.\n" ] }, { @@ -15,7 +15,7 @@ "kernel": "SoS" }, "source": [ - "## Example\n\n```bash\nsos run pipeline/colocboost.ipynb colocboost \\\n --cwd output \\\n --study ROSMAP_DLPFC \\\n --qtl-dataset output/ROSMAP_DLPFC.qtl_dataset.rds \\\n --genes ENSG00000060237 ENSG00000234593 \\\n --cis-window 1000000\n```\n" + "## Example\n\n```bash\n# xQTL-only (default)\nsos run pipeline/colocboost.ipynb colocboost \\\n --cwd output --modular-script-dir /path/to/code/script \\\n --study TEST_STUDY \\\n --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n --genes ENSG00000060237\n\n# xQTL + joint-GWAS\nsos run pipeline/colocboost.ipynb colocboost \\\n --cwd output --modular-script-dir /path/to/code/script \\\n --study TEST_STUDY \\\n --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n --gwas-sumstats output/GWAS.qcd.rds \\\n --joint-gwas True \\\n --genes ENSG00000060237\n```\n" ] }, { @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: qtl_dataset = path\nparameter: genes = []\nparameter: cis_window = 1000000\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '1h'\nparameter: mem = '16G'\nparameter: numThreads = 1\n" + "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: qtl_dataset = path\nparameter: modular_script_dir = path('code/script')\n# Mutually exclusive fan-out sources.\nparameter: genes = []\nparameter: regions = []\nparameter: cis_window = 1000000\n# GWAS-side input: a QC'd GwasSumStats RDS (build it upstream).\nparameter: gwas_sumstats = path('.')\n# Pipeline variants.\nparameter: xqtl_coloc = True # within-study cross-context (default ON)\nparameter: joint_gwas = False # xQTL + GWAS joint (requires gwas_sumstats)\nparameter: separate_gwas = False # per-GWAS-study separate-focal (requires gwas_sumstats)\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" ] }, { @@ -37,7 +37,7 @@ }, "outputs": [], "source": [ - "[colocboost]\ninput: qtl_dataset, for_each = 'genes'\noutput: f\"{cwd}/{study}.{_genes}.colocboost.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${path(\"code/script/pecotmr_integration/colocboost.R\", absolute=True)} \\\n --qtl-dataset ${_input} \\\n --gene-id ${_genes} \\\n --cis-window ${cis_window} \\\n --output ${_output}\n" + "[colocboost]\nif bool(genes) == bool(regions):\n raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\nif (joint_gwas or separate_gwas) and not gwas_sumstats.is_file():\n raise ValueError(\"--joint-gwas / --separate-gwas require --gwas-sumstats pointing at a QC'd GwasSumStats RDS.\")\nif not (xqtl_coloc or joint_gwas or separate_gwas):\n raise ValueError(\"Enable at least one of xqtl_coloc, joint_gwas, separate_gwas.\")\nfanout_items = genes if genes else regions\nfanout_kind = 'gene' if genes else 'region'\ndef _fanout_safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: qtl_dataset, for_each = 'fanout_items'\noutput: f\"{cwd}/{study}.{_fanout_safe(_fanout_items)}.colocboost.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/colocboost.R \\\n --qtl-dataset ${_input} \\\n ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n ${'--gwas-sumstats ' + str(gwas_sumstats) if gwas_sumstats.is_file() else ''} \\\n ${'--no-xqtl-coloc' if not xqtl_coloc else ''} \\\n ${'--joint-gwas' if joint_gwas else ''} \\\n ${'--separate-gwas' if separate_gwas else ''} \\\n --output ${_output}\n" ] } ], diff --git a/code/SoS/pecotmr_integration/ctwas.ipynb b/code/SoS/pecotmr_integration/ctwas.ipynb new file mode 100644 index 00000000..4a338241 --- /dev/null +++ b/code/SoS/pecotmr_integration/ctwas.ipynb @@ -0,0 +1,102 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# cTWAS via the three-step pecotmr API\n\n## Description\n\nDrives a multi-LD-block cTWAS analysis through `pecotmr`'s split pipeline:\n\n- `[ctwas_1]` — `assembleCtwasInputs`: consume the per-region manifest, load each block's `GwasSumStats` + per-gene `TwasWeights` RDS files, produce the assembled ctwas inputs (`z_snp`, `weights`, `region_info`, `snp_map`, `LD_map`, LD/snpInfo loader closures).\n- `[ctwas_2]` — `estCtwasParam`: estimate the joint group prior + prior variance (`ctwas::est_param`). `--fallback-to-prefit` recovers from accurate-EM NaN divergence by re-running the prefit step only (mirrors the legacy ctwas_2 workaround).\n- `[ctwas_3]` — `screenCtwasRegions` + `finemapCtwasRegions`: screen regions and fine-map causal genes; optional `--param-override` RDS lets the caller substitute hand-tuned priors before screening.\n\nThe three steps chain via intermediate RDS files (`{name}.ctwas_inputs.rds`, `{name}.ctwas_est.rds`, `{name}.ctwas_finemap.rds`) so individual stages can be re-run independently.\n\n## Inputs\n\n- `--manifest` — per-region manifest TSV with columns:\n - `region_id` (unique string)\n - `gwas_sumstats_rds` (path to per-block `GwasSumStats` RDS from `gwas_sumstats_construct.R`)\n - `twas_weights_rds` (comma-separated per-gene `TwasWeights` RDS paths for that block; may be empty for SNP-only blocks)\n - `fine_mapping_result_rds` (optional, comma-separated)\n- Optional weight pre-filters (`--twas-weight-cutoff`, `--cs-min-cor`, `--min-pip-cutoff`, `--max-num-variants`).\n- Optional `--twas-z` precomputed TWAS-Z `GRanges` RDS (output of `twas.ipynb`).\n- Step 2 knobs: `--thin`, `--niter-prefit`, `--niter`, `--min-group-size`, `--min-p-single-effect`, `--fallback-to-prefit`.\n- Step 3 knobs: `--L`, `--no-filter-L`, `--min-nonsnp-pip`, `--param-override`.\n\n## Output\n\n- `{cwd}/{name}.ctwas_inputs.rds` (step 1)\n- `{cwd}/{name}.ctwas_est.rds` (step 2)\n- `{cwd}/{name}.ctwas_finemap.rds` (step 3; ctwas_sumstats-shape result)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Example\n\n```bash\nsos run pipeline/ctwas.ipynb ctwas_3 \\\n --cwd output --modular-script-dir /path/to/code/script \\\n --name protocol_example_twas_chr22 \\\n --manifest /tmp/gwas_smoke/ctwas_manifest_blessed.tsv \\\n --fallback-to-prefit \\\n --min-group-size 1\n```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\nparameter: cwd = path('output')\nparameter: name = str\nparameter: manifest = path\n# --- step 1 (assemble) -----------------------------------------------\nparameter: method = ''\nparameter: twas_z = path('.')\nparameter: twas_weight_cutoff = 0.0\nparameter: cs_min_cor = 0.8\nparameter: min_pip_cutoff = 0.0\nparameter: max_num_variants = 'Inf'\n# --- step 2 (est_param) ----------------------------------------------\nparameter: thin = 0.1\nparameter: niter_prefit = 3\nparameter: niter = 30\nparameter: group_prior_var_structure = 'shared_type'\nparameter: min_group_size = 100\nparameter: min_p_single_effect = 0.8\nparameter: fallback_to_prefit = False\n# --- step 3 (screen + finemap) ---------------------------------------\nparameter: L = 5\nparameter: no_filter_L = False\nparameter: min_nonsnp_pip = 0.5\nparameter: param_override = path('.')\n# --- infrastructure --------------------------------------------------\nparameter: modular_script_dir = path('code/script')\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '1h'\nparameter: mem = '16G'\nparameter: numThreads = 1\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[ctwas_1 (assemble inputs)]\ninput: manifest\noutput: f\"{cwd}/{name}.ctwas_inputs.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/ctwas_assemble.R \\\n --manifest ${_input} \\\n ${'--method ' + method if method else ''} \\\n ${'--twas-z ' + str(twas_z) if twas_z.is_file() else ''} \\\n --twas-weight-cutoff ${twas_weight_cutoff} \\\n --cs-min-cor ${cs_min_cor} \\\n --min-pip-cutoff ${min_pip_cutoff} \\\n --max-num-variants ${max_num_variants} \\\n --output ${_output}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[ctwas_2 (estimate priors)]\noutput: f\"{cwd}/{name}.ctwas_est.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/ctwas_est.R \\\n --inputs ${_input} \\\n --thin ${thin} \\\n --niter-prefit ${niter_prefit} \\\n --niter ${niter} \\\n --group-prior-var-structure ${group_prior_var_structure} \\\n --min-group-size ${min_group_size} \\\n --min-p-single-effect ${min_p_single_effect} \\\n ${'--fallback-to-prefit' if fallback_to_prefit else ''} \\\n --ncore ${numThreads} \\\n --output ${_output}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[ctwas_3 (screen + finemap)]\noutput: f\"{cwd}/{name}.ctwas_finemap.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/ctwas_finemap.R \\\n --est ${_input} \\\n ${'--param-override ' + str(param_override) if param_override.is_file() else ''} \\\n --L ${L} \\\n ${'--no-filter-L' if no_filter_L else ''} \\\n --min-nonsnp-pip ${min_nonsnp_pip} \\\n --ncore ${numThreads} \\\n --output ${_output}\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "version": "0.24.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/SoS/pecotmr_integration/enloc.ipynb b/code/SoS/pecotmr_integration/enloc.ipynb new file mode 100644 index 00000000..26d58dcd --- /dev/null +++ b/code/SoS/pecotmr_integration/enloc.ipynb @@ -0,0 +1,89 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# xQTL–GWAS enrichment + colocalization\n\n## Description\n\nReplaces the legacy `SuSiE_enloc.ipynb` two-step `[xqtl_gwas_enrichment]` / `[susie_coloc]` flow with two thin SoS steps backed by the new pecotmr S4 API. Pairs an xQTL-side `QtlFineMappingResult` with a GWAS-side `GwasFineMappingResult` (or `GwasSumStats`) and runs the two pipelines back-to-back:\n\n- `[enrichment]` — `qtlEnrichmentPipeline(gwasFineMappingResult, qtlFineMappingResult, ...)` produces a per-(gwasStudy, qtlContext) enrichment data.frame.\n- `[coloc]` — `colocPipeline(qtlFineMappingResult, gwasInput, ..., enrichment = )` runs `coloc::coloc.bf_bf` per (QTL tuple, GWAS tuple, CS pair) with enrichment-scaled `p12` priors (the \"enloc\" variant). Pass `--skip-enrichment` to run plain coloc with the default p12.\n\nBecause the S4 FMR collections already encapsulate every (study, context, trait, method) tuple across regions, the workflow is two single-call steps — no per-gene fan-out, no Python overlap helpers, no manifest. Per-region fan-out (and the per-region coloc loop) lives inside `colocPipeline` itself.\n\n## Inputs\n\n- `--qtl-fine-mapping` — path to S4 `QtlFineMappingResult` RDS (output of `fine_mapping.ipynb` / `fineMappingPipeline`).\n- `--gwas-fine-mapping` — path to S4 `GwasFineMappingResult` RDS (also a `fineMappingPipeline` output). `colocPipeline` will also accept a `GwasSumStats` RDS and inline-fine-map it.\n- `--name` — identifier used in the output filenames.\n- Optional enrichment knobs (`--num-gwas`, `--pi-qtl`, `--lambda`, `--imp-n`).\n- Optional coloc knobs (`--p1`, `--p2`, `--p12`, `--p12-max`, `--filter-lbf-cs`, `--filter-lbf-cs-secondary`, `--no-adjust-pips`).\n- `--skip-enrichment` — skip step 1 and run coloc with the default p12 (the non-enloc variant).\n\n## Outputs\n\n- `{cwd}/{name}.enrichment.rds` — data.frame, one row per (gwasStudy, qtlContext) pair (omitted when `--skip-enrichment`).\n- `{cwd}/{name}.coloc.rds` — data.frame, one row per (QTL tuple, GWAS tuple, CS pair).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": "## Example\n\n```bash\nsos run pipeline/enloc.ipynb susie_enloc \\\n --cwd output --modular-script-dir /path/to/code/script \\\n --name protocol_example_enloc \\\n --qtl-fine-mapping output/fine_mapping/protocol_example.qtl_finemap.rds \\\n --gwas-fine-mapping output/fine_mapping/protocol_example.gwas_finemap.rds\n```\n" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\nparameter: cwd = path('output')\nparameter: name = str\nparameter: qtl_fine_mapping = path\nparameter: gwas_fine_mapping = path\n# --- step 1 (enrichment) --------------------------------------------\nparameter: skip_enrichment = False\nparameter: num_gwas = -1.0 # negative → NULL pass-through\nparameter: pi_qtl = -1.0 # negative → NULL pass-through\nparameter: lambda_ = 1.0\nparameter: imp_n = 25\n# --- step 2 (coloc) -------------------------------------------------\nparameter: p1 = 1e-4\nparameter: p2 = 1e-4\nparameter: p12 = 5e-6\nparameter: p12_max = 1e-3\nparameter: filter_lbf_cs = False\nparameter: filter_lbf_cs_secondary = -1.0 # negative → NULL\nparameter: no_adjust_pips = False\nparameter: finemapping_methods = 'susie'\n# --- infrastructure -------------------------------------------------\nparameter: modular_script_dir = path('code/script')\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '16G'\nparameter: numThreads = 1\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[susie_enloc_1 (enrichment)]\nstop_if(skip_enrichment, '--skip-enrichment set; skipping qtlEnrichmentPipeline.')\ninput: qtl_fine_mapping, gwas_fine_mapping\noutput: f\"{cwd}/{name}.enrichment.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/qtl_enrichment.R \\\n --qtl-fine-mapping ${qtl_fine_mapping} \\\n --gwas-fine-mapping ${gwas_fine_mapping} \\\n ${'--num-gwas ' + str(num_gwas) if num_gwas >= 0 else ''} \\\n ${'--pi-qtl ' + str(pi_qtl) if pi_qtl >= 0 else ''} \\\n --lambda ${lambda_} \\\n --imp-n ${imp_n} \\\n --ncore ${numThreads} \\\n --output ${_output}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[susie_enloc_2 (coloc)]\nenrich_path = f\"{cwd}/{name}.enrichment.rds\"\nenrich_arg = f\"--enrichment {enrich_path}\" if not skip_enrichment else ''\ninput: qtl_fine_mapping, gwas_fine_mapping\noutput: f\"{cwd}/{name}.coloc.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/coloc.R \\\n --qtl-fine-mapping ${qtl_fine_mapping} \\\n --gwas-input ${gwas_fine_mapping} \\\n ${enrich_arg} \\\n ${'--filter-lbf-cs' if filter_lbf_cs else ''} \\\n ${'--filter-lbf-cs-secondary ' + str(filter_lbf_cs_secondary) if filter_lbf_cs_secondary >= 0 else ''} \\\n --p1 ${p1} \\\n --p2 ${p2} \\\n --p12 ${p12} \\\n --p12-max ${p12_max} \\\n ${'--no-adjust-pips' if no_adjust_pips else ''} \\\n --finemapping-methods ${finemapping_methods} \\\n --output ${_output}\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "version": "0.24.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/fine_mapping.ipynb b/code/SoS/pecotmr_integration/fine_mapping.ipynb index 9886e42d..31f36f25 100644 --- a/code/SoS/pecotmr_integration/fine_mapping.ipynb +++ b/code/SoS/pecotmr_integration/fine_mapping.ipynb @@ -5,18 +5,14 @@ "metadata": { "kernel": "SoS" }, - "source": [ - "# Per-gene fine-mapping against a pre-built QtlDataset\n\n## Description\n\nFor a single study's `QtlDataset` RDS, run `pecotmr::fineMappingPipeline(methods = \"susie\", ...)` per gene. Gene-level parallelization is via SoS task fan-out: each task loads the same RDS and fine-maps one gene. No pecotmr-internal accessors are called from this notebook \u2014 the wrapper hands the QtlDataset and a single `--gene-id` to the pipeline and writes one RDS per gene.\n\n## Inputs\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes` — explicit gene-ID list (space-separated CLI args).\n- `--cis-window` — bp window around each trait's TSS for variant selection. Default 1,000,000.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n## Output\n\n- `{cwd}/{study}.{gene}.finemap.rds` per gene.\n" - ] + "source": "# SuSiE fine-mapping (QTL or GWAS)\n\n## Description\n\nTwo workflows, both backed by `pecotmr::fineMappingPipeline` (which dispatches on the S4 input class):\n\n- **`fine_mapping`** — per-gene / per-region fine-mapping over a pre-built `QtlDataset` RDS. Fan-out goes by `--genes` (list of trait IDs, gene mode) **or** `--regions` (list of `chr:start-end` strings, region mode); exactly one must be supplied. Each task loads the same QtlDataset RDS and fine-maps one unit.\n- **`fine_mapping_gwas`** — per-block fine-mapping over a list of per-block `GwasSumStats` RDS paths (typically produced by `gwas_sumstats_construct.R` / `gwas_sumstats.ipynb`). One task per RDS; each task produces one `GwasFineMappingResult` RDS. Downstream consumers (e.g. `enloc.ipynb`) concatenate the per-block outputs as needed.\n\n## Inputs\n\n### `fine_mapping` (QTL)\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes ID1 ID2 …` **or** `--regions chr:start-end …` — fan-out targets (mutually exclusive).\n- `--cis-window` — bp window around each gene's TSS (gene mode only). Default 1,000,000.\n- `--methods` — comma-separated fine-mapping method tokens. Default `susie`.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n### `fine_mapping_gwas` (GWAS)\n\n- `--gwas-sumstats-list S1.rds S2.rds …` — per-block GwasSumStats RDS paths to fan out over.\n- `--methods` — comma-separated fine-mapping method tokens. Default `susie`.\n- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n\n### Both\n\n- `--cwd` — output directory. Default `output`.\n- `--study` — study label used in the output filename.\n- `--modular-script-dir` — directory containing the per-task worker R scripts. Default `code/script`; override when SoS is invoked from a working directory that doesn't contain them.\n\n## Outputs\n\n- `fine_mapping`: `{cwd}/{study}.{gene|region}.finemap.rds` per fan-out unit (region strings are sanitised: `:` and `-` become `_`).\n- `fine_mapping_gwas`: `{cwd}/{study}.{bn-of-gwas-sumstats-rds}.gwas_finemap.rds` per input RDS.\n" }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, - "source": [ - "## Example\n\n```bash\nsos run pipeline/fine_mapping.ipynb fine_mapping \\\n --cwd output \\\n --study ROSMAP_DLPFC \\\n --qtl-dataset output/ROSMAP_DLPFC.qtl_dataset.rds \\\n --genes ENSG00000060237 ENSG00000234593 ENSG00000168487 \\\n --cis-window 1000000\n```\n" - ] + "source": "## Example\n\nQTL (per-gene):\n```bash\nsos run pipeline/fine_mapping.ipynb fine_mapping \\\n --cwd output \\\n --modular-script-dir /path/to/xqtl-protocol/code/script \\\n --study TEST_STUDY \\\n --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n --genes ENSG00000060237 ENSG00000234593\n```\n\nGWAS (per-block, list of GwasSumStats RDS):\n```bash\nsos run pipeline/fine_mapping.ipynb fine_mapping_gwas \\\n --cwd output \\\n --modular-script-dir /path/to/xqtl-protocol/code/script \\\n --study TEST_STUDY \\\n --gwas-sumstats-list output/TEST_STUDY.block1.gwas_sumstats.rds \\\n output/TEST_STUDY.block2.gwas_sumstats.rds\n```\n" }, { "cell_type": "code", @@ -25,9 +21,7 @@ "kernel": "SoS" }, "outputs": [], - "source": [ - "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: qtl_dataset = path\n# Explicit gene list; for large fan-outs use the standard SoS @file.txt\n# CLI sugar (one gene per line in the file).\nparameter: genes = []\nparameter: cis_window = 1000000\nparameter: coverage = 0.95\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" - ] + "source": "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: modular_script_dir = path('code/script')\n# --- fine_mapping (QTL) ----------------------------------------------\nparameter: qtl_dataset = path('.')\nparameter: genes = []\nparameter: regions = []\nparameter: cis_window = 1000000\n# --- fine_mapping_gwas -----------------------------------------------\nparameter: gwas_sumstats_list = []\n# --- shared ----------------------------------------------------------\nparameter: methods = 'susie'\nparameter: coverage = 0.95\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" }, { "cell_type": "code", @@ -36,9 +30,14 @@ "kernel": "SoS" }, "outputs": [], - "source": [ - "[fine_mapping]\n# One task per gene; every task loads the same QtlDataset RDS once and\n# runs fineMappingPipeline against one traitId.\ninput: qtl_dataset, for_each = 'genes'\noutput: f\"{cwd}/{study}.{_genes}.finemap.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${path(\"code/script/pecotmr_integration/fine_mapping.R\", absolute=True)} \\\n --qtl-dataset ${_input} \\\n --gene-id ${_genes} \\\n --cis-window ${cis_window} \\\n --coverage ${coverage} \\\n --output ${_output}\n" - ] + "source": "[fine_mapping]\nif not qtl_dataset.is_file():\n raise ValueError(\"fine_mapping requires --qtl-dataset to point at an existing QtlDataset RDS.\")\nif bool(genes) == bool(regions):\n raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\nfanout_items = genes if genes else regions\nfanout_kind = 'gene' if genes else 'region'\ndef _fanout_safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: qtl_dataset, for_each = 'fanout_items'\noutput: f\"{cwd}/{study}.{_fanout_safe(_fanout_items)}.finemap.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n --qtl-dataset ${_input} \\\n ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n --methods ${methods} \\\n --coverage ${coverage} \\\n --output ${_output}\n" + }, + { + "cell_type": "code", + "source": "[fine_mapping_gwas]\nif not gwas_sumstats_list:\n raise ValueError(\"fine_mapping_gwas requires --gwas-sumstats-list pointing at one or more per-block GwasSumStats RDS files.\")\ngss_paths = [str(p) for p in gwas_sumstats_list]\ninput: gss_paths, group_by = 1\noutput: f\"{cwd}/{study}.{_input:bnn}.gwas_finemap.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n --gwas-sumstats ${_input} \\\n --methods ${methods} \\\n --coverage ${coverage} \\\n --output ${_output}\n", + "metadata": {}, + "execution_count": null, + "outputs": [] } ], "metadata": { diff --git a/code/SoS/pecotmr_integration/gwas_sumstats.ipynb b/code/SoS/pecotmr_integration/gwas_sumstats.ipynb new file mode 100644 index 00000000..9b908b01 --- /dev/null +++ b/code/SoS/pecotmr_integration/gwas_sumstats.ipynb @@ -0,0 +1,80 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Per-LD-block GwasSumStats construction\n\n## Description\n\nFor one study's GWAS TSV, fan out per LD block, build a `pecotmr::GwasSumStats` (including `summaryStatsQc()`), and serialize one RDS per `(study, block)`. The RDS files become the per-block GWAS inputs for `twas.ipynb` and `ctwas.ipynb`.\n\n## Inputs\n\n- `--study` — study identifier.\n- `--gwas-tsv` — path to the GWAS summary-statistics TSV.\n- `--ld-blocks` — BED file of LD-block intervals (`chr/chrom`, `start`, `stop/end`).\n- `--ld-meta` — LD-meta TSV pointing at per-region LD references.\n- `--genome` — genome build label. Default `hg19`.\n- `--modular-script-dir` — directory containing the worker R scripts.\n\n## Output\n\n- `{cwd}/{study}.{block_id}.gwas_sumstats.rds` per LD block (`block_id` sanitises `:`\u2192`_`, `-`\u2192`_`).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Example\n\n```bash\nsos run pipeline/gwas_sumstats.ipynb gwas_sumstats_construct \\\n --cwd output --modular-script-dir /path/to/code/script \\\n --study TEST_GWAS \\\n --gwas-tsv input/twas/protocol_example.twas.gwas_sumstats.chr22.tsv.gz \\\n --ld-blocks input/twas/protocol_example.twas.LD_blocks.chr22.bed \\\n --ld-meta input/ld_reference/protocol_example.ld_meta_file.tsv\n```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: gwas_tsv = path\nparameter: ld_blocks = path\nparameter: ld_meta = path\nparameter: genome = 'hg19'\nparameter: modular_script_dir = path('code/script')\nparameter: container = ''\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[gwas_sumstats_construct]\n# Parse LD-blocks BED (header line uses 'chr/chrom', 'start', 'stop/end').\ndef _read_blocks(path):\n out = []\n with open(path) as f:\n hdr = next(f).rstrip('\\n').lstrip('#').split('\\t')\n for line in f:\n row = dict(zip(hdr, line.rstrip('\\n').split('\\t')))\n chrom = row.get('chr', row.get('chrom'))\n start = int(row.get('start', row.get('Start', 0)))\n stop = int(row.get('stop', row.get('end', row.get('End', 0))))\n out.append(f'{chrom}:{start}-{stop}')\n return out\nblocks = _read_blocks(ld_blocks)\nif not blocks:\n raise ValueError(f\"No LD blocks found in {ld_blocks}\")\ndef _safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: gwas_tsv, for_each = 'blocks'\noutput: f\"{cwd}/{study}.{_safe(_blocks)}.gwas_sumstats.rds\"\ntask: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n --study ${study} \\\n --gwas-tsv ${_input} \\\n --ld-block ${_blocks} \\\n --ld-meta ${ld_meta} \\\n --genome ${genome} \\\n --output ${_output}\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "version": "0.24.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/pecotmr_integration/qtl_dataset.ipynb b/code/SoS/pecotmr_integration/qtl_dataset.ipynb index 9e091099..55e976ce 100644 --- a/code/SoS/pecotmr_integration/qtl_dataset.ipynb +++ b/code/SoS/pecotmr_integration/qtl_dataset.ipynb @@ -6,7 +6,7 @@ "kernel": "SoS" }, "source": [ - "# Build a QtlDataset per study\n\n## Description\n\nConstruct a single pecotmr `QtlDataset` from one study's genotype + per-context phenotype + per-context phenotype-PC files (plus an optional uniform genotype-PC file), and serialize it to a single RDS. The resulting object is the upstream dependency for every per-gene fineMapping / TWAS / colocboost task \u2014 gene-level parallelization happens against this single object via per-gene `traitId` selectors inside the downstream pipelines.\n\nThis step does NOT iterate over genes or regions. Build the QtlDataset once per study; downstream notebooks fan out per-gene tasks against the resulting RDS.\n\n## Inputs\n\n- `--study` — study identifier.\n- `--genotype-prefix` — PLINK2 pgen/pvar/psam prefix (no extension).\n- `--phenotype CONTEXT=PATH ...` — one entry per QTL context; PATH is a bgzipped BED of phenotype values (columns `#chr`, `start`, `end`, `gene_id`, then per-sample columns).\n- `--phenotype-covariates CONTEXT=PATH ...` — optional per-context molecular-trait PC TSVs (samples as rows, PCs as columns).\n- `--genotype-covariates PATH` — optional TSV of genotype-derived covariates (e.g. ancestry PCs) applied uniformly across all contexts (same shape as the phenotype-PC files).\n- `--maf-cutoff` / `--xvar-cutoff` — pass-through to `QtlDataset()`. These are lazy filters applied inside the accessors at fit time, not at construction.\n\n## Output\n\n- `{cwd}/{study}.qtl_dataset.rds`\n" + "# Build a QtlDataset per study\n\n## Description\n\nConstruct a single pecotmr `QtlDataset` from one study's genotype + per-context phenotype + per-context covariate files (plus an optional uniform genotype-PC file), and serialize it to a single RDS. The resulting object is the upstream dependency for every per-gene fineMapping / TWAS / colocboost task \u2014 gene-level parallelization happens against this single object via per-gene `traitId` selectors inside the downstream pipelines.\n\nThis step does NOT iterate over genes or regions. Build the QtlDataset once per study; downstream notebooks fan out per-gene tasks against the resulting RDS.\n\n## Inputs\n\n- `--study` — study identifier.\n- `--genotype-prefix` — PLINK1 bed/bim/fam prefix (no extension).\n- `--phenotype-manifest` — TSV with columns `#chr, start, end, ID, path, cond` (and optionally `cov_path`). One row per (region, context). Relative `path` / `cov_path` entries resolve against the manifest's own directory.\n- `--genotype-covariates` — optional TSV of genotype-derived covariates (e.g. ancestry PCs) applied uniformly across all contexts.\n- `--transpose-covariates` — when set, transposes every covariate TSV (phenotype + genotype) after reading. Use this for QTLtools-format inputs where rows are covariate names and columns are samples.\n- `--maf-cutoff` / `--xvar-cutoff` — pass-through to `QtlDataset()`. Lazy filters applied inside the accessors at fit time.\n\n## Output\n\n- `{cwd}/{study}.qtl_dataset.rds`\n" ] }, { @@ -15,7 +15,7 @@ "kernel": "SoS" }, "source": [ - "## Example\n\n```bash\nsos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n --cwd output \\\n --study ROSMAP_DLPFC \\\n --genotype-prefix input/genotype/chr22 \\\n --phenotype DLPFC=input/pheno/DLPFC.bed.gz AC=input/pheno/AC.bed.gz \\\n --phenotype-covariates DLPFC=input/cov/DLPFC.pcs.tsv AC=input/cov/AC.pcs.tsv \\\n --genotype-covariates input/cov/genotype_pcs.tsv\n```\n" + "## Example\n\n```bash\nsos run pipeline/qtl_dataset.ipynb qtl_dataset_construct \\\n --cwd output \\\n --study TEST_STUDY \\\n --genotype-prefix input/colocboost/example.chr22 \\\n --phenotype-manifest input/colocboost/pheno_manifest_multicontext.tsv \\\n --transpose-covariates\n```\n" ] }, { @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: genotype_prefix = path\n# Per-context CONTEXT=PATH entries; SoS parses repeated CLI args as lists.\nparameter: phenotype = []\nparameter: phenotype_covariates = []\nparameter: genotype_covariates = path('.')\nparameter: maf_cutoff = 0.0\nparameter: xvar_cutoff = 0.0\nparameter: container = ''\nparameter: walltime = '15m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" + "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: genotype_prefix = path\nparameter: phenotype_manifest = path\nparameter: genotype_covariates = path('.')\n# Set when covariate TSVs are in QTLtools format (rows = covariates, cols = samples).\nparameter: transpose_covariates = False\nparameter: maf_cutoff = 0.0\nparameter: xvar_cutoff = 0.0\n# Directory holding code/script of the xqtl-protocol checkout; override\n# when SoS is invoked from a working dir that doesn't contain the scripts.\nparameter: modular_script_dir = path('code/script')\nparameter: container = ''\nparameter: walltime = '15m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" ] }, { @@ -37,7 +37,7 @@ }, "outputs": [], "source": [ - "[qtl_dataset_construct]\n# Build the QtlDataset once. No fan-out; downstream notebooks load this\n# RDS and parallelize per gene.\noutput: f\"{cwd}/{study}.qtl_dataset.rds\"\ntask: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${path(\"code/script/pecotmr_integration/qtl_dataset_construct.R\", absolute=True)} \\\n --study ${study} \\\n --genotype-prefix ${genotype_prefix} \\\n --phenotype \"${','.join(phenotype)}\" \\\n --phenotype-covariates \"${','.join(phenotype_covariates)}\" \\\n --genotype-covariates ${genotype_covariates if genotype_covariates.is_file() else '\"\"'} \\\n --maf-cutoff ${maf_cutoff} \\\n --xvar-cutoff ${xvar_cutoff} \\\n --output ${_output}\n" + "[qtl_dataset_construct]\n# Build the QtlDataset once. No fan-out; downstream notebooks load this\n# RDS and parallelize per gene.\noutput: f\"{cwd}/{study}.qtl_dataset.rds\"\ntask: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/qtl_dataset_construct.R \\\n --study ${study} \\\n --genotype-prefix ${genotype_prefix} \\\n --phenotype-manifest ${phenotype_manifest} \\\n --genotype-covariates ${genotype_covariates if genotype_covariates.is_file() else '\"\"'} \\\n ${'--transpose-covariates' if transpose_covariates else ''} \\\n --maf-cutoff ${maf_cutoff} \\\n --xvar-cutoff ${xvar_cutoff} \\\n --output ${_output}\n" ] } ], diff --git a/code/SoS/pecotmr_integration/twas.ipynb b/code/SoS/pecotmr_integration/twas.ipynb new file mode 100644 index 00000000..9e0b0871 --- /dev/null +++ b/code/SoS/pecotmr_integration/twas.ipynb @@ -0,0 +1,80 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Per-gene TWAS Z + Mendelian Randomization\n\n## Description\n\nFan out per row of a per-gene manifest. Each row pairs a gene's `TwasWeights` RDS with the matching `GwasSumStats` RDS for that gene's home LD block; the SoS step runs `twas.R` per row, which calls `pecotmr::causalInferencePipeline()` to compute the gene's TWAS Z + (optional) MR statistics. Optional per-gene `FineMappingResult` is wired through too.\n\n## Inputs\n\n- `--manifest` — per-gene manifest TSV with columns:\n - `gene_id` — gene identifier (unique; used in the output filename).\n - `twas_weights_rds` — path to the per-gene `TwasWeights` RDS (single TwasWeightsEntry).\n - `gwas_sumstats_rds` — path to the per-LD-block `GwasSumStats` RDS that covers the gene's home block.\n - `fine_mapping_result_rds` — optional; path to the matching per-gene `FineMappingResult` RDS (leave blank to skip).\n- `--study` — study identifier (used in output filenames).\n- `--mr-pip-cutoff` — pass-through (default 0.5).\n- `--mr-method` — pass-through (`\"ivwPerVariant\"` or `\"csAware\"`; default `\"ivwPerVariant\"`).\n\n## Output\n\n- `{cwd}/{study}.{gene_id}.twas.rds` per gene.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Example\n\n```bash\nsos run pipeline/twas.ipynb twas \\\n --cwd output --modular-script-dir /path/to/code/script \\\n --study protocol_example_chr22 \\\n --manifest /tmp/gwas_smoke/twas_manifest.tsv\n```\n\nManifest example:\n```\ngene_id\ttwas_weights_rds\tgwas_sumstats_rds\tfine_mapping_result_rds\nENSG00000130538\t/tmp/gwas_smoke/blessed_tw.rds\t/tmp/gwas_smoke/blocks/chr22_10516173_17414263.rds\t\n```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: manifest = path\nparameter: mr_pip_cutoff = 0.5\nparameter: mr_method = 'ivwPerVariant'\nparameter: modular_script_dir = path('code/script')\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[twas]\n# Read the per-gene manifest. Header may be canonical (`gene_id`,\n# `twas_weights_rds`, `gwas_sumstats_rds`, `fine_mapping_result_rds`) and\n# may carry a leading `#`. fine_mapping_result_rds is optional per-row\n# (empty cell = skip).\ndef _read_manifest(path):\n rows = []\n with open(path) as f:\n hdr = next(f).rstrip('\\n').lstrip('#').split('\\t')\n required = {'gene_id', 'twas_weights_rds', 'gwas_sumstats_rds'}\n missing = required - set(hdr)\n if missing:\n raise ValueError(\n f\"Manifest missing required column(s): {sorted(missing)}\")\n for line in f:\n parts = line.rstrip('\\n').split('\\t')\n row = dict(zip(hdr, parts))\n if not row.get('gene_id'):\n continue\n rows.append(row)\n if not rows:\n raise ValueError(f\"Manifest {path} is empty.\")\n ids = [r['gene_id'] for r in rows]\n if len(set(ids)) != len(ids):\n raise ValueError(\n \"Manifest has duplicate gene_id values; each row must be unique.\")\n return rows\nrows = _read_manifest(manifest)\ngene_ids = [r['gene_id'] for r in rows]\ntw_paths = [r['twas_weights_rds'] for r in rows]\ngss_paths = [r['gwas_sumstats_rds'] for r in rows]\nfmr_paths = [r.get('fine_mapping_result_rds', '') or '' for r in rows]\ninput: for_each = ['gene_ids', 'tw_paths', 'gss_paths', 'fmr_paths']\nfmr_arg = f\"--fine-mapping-result {_fmr_paths}\" if _fmr_paths else \"\"\noutput: f\"{cwd}/{study}.{_gene_ids}.twas.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/twas.R \\\n --twas-weights ${_tw_paths} \\\n --gwas-sumstats ${_gss_paths} \\\n ${fmr_arg} \\\n --mr-pip-cutoff ${mr_pip_cutoff} \\\n --mr-method ${mr_method} \\\n --output ${_output}\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "version": "0.24.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/SoS/pecotmr_integration/twas_weights.ipynb b/code/SoS/pecotmr_integration/twas_weights.ipynb index 99c817e5..c723922c 100644 --- a/code/SoS/pecotmr_integration/twas_weights.ipynb +++ b/code/SoS/pecotmr_integration/twas_weights.ipynb @@ -6,7 +6,7 @@ "kernel": "SoS" }, "source": [ - "# Per-gene TWAS weights against a pre-built QtlDataset\n\n## Description\n\nFor a single study's `QtlDataset` RDS, run `pecotmr::twasWeightsPipeline(methods = \"default\", ...)` per gene. Gene-level parallelization is via SoS task fan-out: each task loads the same RDS and computes TWAS weights for one gene. The default method preset gives the standard univariate methods (SuSiE / lasso / elastic-net / etc. as defined by `pecotmr::.twasMethodLookup(\"default\")`).\n\nOptionally pass a pre-fit `--fine-mapping-result` RDS; SuSiE-family TWAS methods will reuse those fits instead of refitting.\n\n## Inputs\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes` — explicit gene-ID list.\n- `--cis-window` — bp window. Default 1,000,000.\n- `--fine-mapping-result` — optional pre-fit FineMappingResult RDS (e.g. the matching output of `fine_mapping.ipynb`); SuSiE-family TWAS methods will reuse the fits.\n\n## Output\n\n- `{cwd}/{study}.{gene}.twas_weights.rds` per gene.\n" + "# Per-gene / per-region TWAS weights against a pre-built QtlDataset\n\n## Description\n\nFor a single study's `QtlDataset` RDS, run `pecotmr::twasWeightsPipeline(methods = \"default\", ...)` per fan-out unit (gene or region). The default method preset gives the standard univariate methods (SuSiE / lasso / elastic-net / etc. as defined by `pecotmr::.twasMethodLookup(\"default\")`).\n\nOptionally pass a pre-fit `--fine-mapping-result` RDS; SuSiE-family TWAS methods reuse those fits via the `fineMappingResult` cache.\n\n## Inputs\n\n- `--qtl-dataset` — path to the QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes ID1 ID2 \u2026` **or** `--regions chr:start-end \u2026` (mutually exclusive).\n- `--cis-window` — bp window (gene mode). Default 1,000,000.\n- `--fine-mapping-result` — optional pre-fit FineMappingResult RDS.\n- `--modular-script-dir` — directory containing the per-gene worker R scripts. Default `code/script`.\n\n## Output\n\n- `{cwd}/{study}.{gene|region}.twas_weights.rds` per fan-out unit.\n" ] }, { @@ -15,7 +15,7 @@ "kernel": "SoS" }, "source": [ - "## Example\n\n```bash\nsos run pipeline/twas_weights.ipynb twas_weights \\\n --cwd output \\\n --study ROSMAP_DLPFC \\\n --qtl-dataset output/ROSMAP_DLPFC.qtl_dataset.rds \\\n --genes ENSG00000060237 ENSG00000234593 \\\n --cis-window 1000000\n```\n" + "## Example\n\n```bash\nsos run pipeline/twas_weights.ipynb twas_weights \\\n --cwd output \\\n --modular-script-dir /path/to/xqtl-protocol/code/script \\\n --study TEST_STUDY \\\n --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n --genes ENSG00000060237 ENSG00000234593\n```\n" ] }, { @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: qtl_dataset = path\nparameter: genes = []\nparameter: cis_window = 1000000\nparameter: fine_mapping_result = path('.')\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '1h'\nparameter: mem = '16G'\nparameter: numThreads = 1\n" + "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: qtl_dataset = path\nparameter: modular_script_dir = path('code/script')\n# Mutually exclusive fan-out sources: pass one as a list, leave the other empty.\nparameter: genes = []\nparameter: regions = []\nparameter: cis_window = 1000000\nparameter: fine_mapping_result = path('.')\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n" ] }, { @@ -37,7 +37,7 @@ }, "outputs": [], "source": [ - "[twas_weights]\ninput: qtl_dataset, for_each = 'genes'\noutput: f\"{cwd}/{study}.{_genes}.twas_weights.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${path(\"code/script/pecotmr_integration/twas_weights.R\", absolute=True)} \\\n --qtl-dataset ${_input} \\\n --gene-id ${_genes} \\\n --cis-window ${cis_window} \\\n --fine-mapping-result ${fine_mapping_result if fine_mapping_result.is_file() else '\"\"'} \\\n --output ${_output}\n" + "[twas_weights]\nif bool(genes) == bool(regions):\n raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\nfanout_items = genes if genes else regions\nfanout_kind = 'gene' if genes else 'region'\ndef _fanout_safe(s):\n return s.replace(':', '_').replace('-', '_')\ninput: qtl_dataset, for_each = 'fanout_items'\noutput: f\"{cwd}/{study}.{_fanout_safe(_fanout_items)}.twas_weights.rds\"\ntask: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\nbash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n Rscript ${modular_script_dir}/pecotmr_integration/twas_weights.R \\\n --qtl-dataset ${_input} \\\n ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n --fine-mapping-result ${fine_mapping_result if fine_mapping_result.is_file() else '\"\"'} \\\n --output ${_output}\n" ] } ], diff --git a/code/script/pecotmr_integration/coloc.R b/code/script/pecotmr_integration/coloc.R new file mode 100644 index 00000000..55aebc7c --- /dev/null +++ b/code/script/pecotmr_integration/coloc.R @@ -0,0 +1,96 @@ +#!/usr/bin/env Rscript +# coloc.R +# +# Colocalization worker. Loads an S4 `QtlFineMappingResult` and either +# an S4 `GwasFineMappingResult` or a `GwasSumStats` (colocPipeline does +# inline fine-mapping when given sumstats), and calls +# `pecotmr::colocPipeline()`. When `--enrichment` points at the output +# of `qtl_enrichment.R`, the per-pair `p12` prior is scaled by +# `(1 + enrichment)` (capped at `--p12-max`), the "enloc" variant of +# coloc. +# +# Inputs: +# --qtl-fine-mapping Path to S4 QtlFineMappingResult RDS +# --gwas-input Path to S4 GwasFineMappingResult RDS, or +# S4 GwasSumStats RDS (colocPipeline dispatches) +# --enrichment Optional enrichment data.frame RDS (output of +# qtl_enrichment.R) +# --filter-lbf-cs Flag: keep only effects that produced a CS +# --filter-lbf-cs-secondary Secondary coverage for CS-concentration filter (default NULL) +# --p1 Per-variant QTL prior (default 1e-4) +# --p2 Per-variant GWAS prior (default 1e-4) +# --p12 Per-variant shared prior (default 5e-6) +# --p12-max Cap on enrichment-adjusted p12 (default 1e-3) +# --no-adjust-pips Disable adjustPips (default: PIPs renormalized to overlap) +# --finemapping-methods Comma-separated list for inline GWAS FM when --gwas-input is GwasSumStats (default 'susie') +# --output Output RDS path (data.frame of coloc results) + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(coloc) +}) + +parser <- arg_parser("Colocalization via colocPipeline()") +parser <- add_argument(parser, "--qtl-fine-mapping", + help = "Path to S4 QtlFineMappingResult RDS", + type = "character") +parser <- add_argument(parser, "--gwas-input", + help = "Path to S4 GwasFineMappingResult OR GwasSumStats RDS", + type = "character") +parser <- add_argument(parser, "--enrichment", + help = "Optional enrichment data.frame RDS (from qtl_enrichment.R)", + type = "character", default = "") +parser <- add_argument(parser, "--filter-lbf-cs", + help = "Keep only effects that produced a CS", + flag = TRUE) +parser <- add_argument(parser, "--filter-lbf-cs-secondary", + help = "Secondary coverage for CS-concentration filter (default NA = off)", + type = "numeric", default = NA) +parser <- add_argument(parser, "--p1", + help = "Per-variant QTL prior", + type = "numeric", default = 1e-4) +parser <- add_argument(parser, "--p2", + help = "Per-variant GWAS prior", + type = "numeric", default = 1e-4) +parser <- add_argument(parser, "--p12", + help = "Per-variant shared prior", + type = "numeric", default = 5e-6) +parser <- add_argument(parser, "--p12-max", + help = "Cap on enrichment-adjusted p12", + type = "numeric", default = 1e-3) +parser <- add_argument(parser, "--no-adjust-pips", + help = "Disable adjustPips (use FMRs as supplied)", + flag = TRUE) +parser <- add_argument(parser, "--finemapping-methods", + help = "Comma-separated methods for inline GWAS FM (when --gwas-input is GwasSumStats)", + type = "character", default = "susie") +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +qtlFmr <- readRDS(argv$qtl_fine_mapping) +gwasIn <- readRDS(argv$gwas_input) +enrich <- if (nzchar(argv$enrichment) && argv$enrichment != "." && + file.exists(argv$enrichment)) readRDS(argv$enrichment) else NULL + +methods <- trimws(strsplit(argv$finemapping_methods, ",", fixed = TRUE)[[1L]]) + +res <- colocPipeline( + qtlFineMappingResult = qtlFmr, + gwasInput = gwasIn, + filterLbfCs = argv$filter_lbf_cs, + filterLbfCsSecondary = if (is.na(argv$filter_lbf_cs_secondary)) NULL + else argv$filter_lbf_cs_secondary, + p1 = argv$p1, + p2 = argv$p2, + p12 = argv$p12, + finemappingMethods = methods, + enrichment = enrich, + p12Max = argv$p12_max, + adjustPips = !argv$no_adjust_pips) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(res, argv$output) +cat(sprintf("Wrote colocPipeline result (%d row(s)) to %s\n", + nrow(res), argv$output)) diff --git a/code/script/pecotmr_integration/colocboost.R b/code/script/pecotmr_integration/colocboost.R index 2508cb57..ba40c6da 100644 --- a/code/script/pecotmr_integration/colocboost.R +++ b/code/script/pecotmr_integration/colocboost.R @@ -1,48 +1,136 @@ #!/usr/bin/env Rscript # colocboost.R # -# Per-gene colocboost worker. Loads a pre-built pecotmr::QtlDataset RDS -# and calls colocboostPipeline() for a single focal trait, colocalizing -# its signal across the QtlDataset's contexts. Designed to be invoked -# once per gene by the SoS fan-out in colocboost.ipynb. +# Per-region colocboost worker. Loads a pre-built pecotmr::QtlDataset RDS +# (and optionally a QC'd GwasSumStats RDS) and calls colocboostPipeline() +# for either a single focal trait (gene mode) or a single genomic region +# (region mode). Designed to be invoked once per fan-out unit by the SoS +# step in colocboost.ipynb. +# +# Modes (mutually exclusive): +# gene : --gene-id ENSG... --cis-window 1000000 +# traitId = gene-id, focalTrait = gene-id. +# region : --region chr22:15000000-16000000 +# Pipeline receives region = GRanges(...). No focal trait. +# +# Pipeline variants (any combination; at least one must be active): +# --xqtl-coloc ON by default; within-study cross-context QTL coloc. +# Pass --no-xqtl-coloc to disable. +# --joint-gwas Off by default; xQTL+GWAS joint coloc. Requires +# --gwas-sumstats. +# --separate-gwas Off by default; one focal-GWAS coloc per GWAS study. +# Requires --gwas-sumstats. # # Inputs: # --qtl-dataset Path to a QtlDataset RDS -# --gene-id Focal trait identifier (and the only traitId passed -# to the pipeline; colocboost colocalizes the focal -# trait's signal across the QtlDataset's contexts) +# --gene-id (gene mode) focal trait identifier # --cis-window cis-window in bp around the trait's TSS +# --region (region mode) chr:start-end string +# --gwas-sumstats Optional path to a QC'd GwasSumStats RDS +# --no-xqtl-coloc Flag: disable xQTL-only cross-context coloc +# --joint-gwas Flag: run xQTL+GWAS joint coloc +# --separate-gwas Flag: run per-GWAS-study separate-focal coloc # --output Output RDS path (one colocboost-pipeline list) suppressPackageStartupMessages({ library(argparser) library(pecotmr) + library(GenomicRanges) + library(IRanges) }) -parser <- arg_parser("Per-gene cross-context colocboost over a pre-built QtlDataset") +parser <- arg_parser("Per-gene or per-region colocboost over a pre-built QtlDataset") parser <- add_argument(parser, "--qtl-dataset", help = "Path to a QtlDataset RDS", type = "character") parser <- add_argument(parser, "--gene-id", - help = "Focal trait identifier", - type = "character") + help = "Focal trait identifier (gene mode); mutually exclusive with --region", + type = "character", default = "") parser <- add_argument(parser, "--cis-window", - help = "cis-window in bp around the trait's TSS", + help = "cis-window in bp around the focal trait's TSS", type = "integer", default = 1000000L) +parser <- add_argument(parser, "--region", + help = "Genomic region as chr:start-end (region mode); mutually exclusive with --gene-id", + type = "character", default = "") +parser <- add_argument(parser, "--gwas-sumstats", + help = "Path to a QC'd GwasSumStats RDS (required when --joint-gwas or --separate-gwas is set)", + type = "character", default = "") +parser <- add_argument(parser, "--no-xqtl-coloc", + help = "Disable the within-study cross-context xQTL coloc", + flag = TRUE) +parser <- add_argument(parser, "--joint-gwas", + help = "Run xQTL+GWAS joint colocboost (requires --gwas-sumstats)", + flag = TRUE) +parser <- add_argument(parser, "--separate-gwas", + help = "Run per-GWAS-study separate-focal coloc (requires --gwas-sumstats)", + flag = TRUE) parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) -qd <- readRDS(argv$qtl_dataset) +parse_region <- function(s) { + m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] + if (length(m) != 4L) + stop("--region must be in chr:start-end format (got: ", s, ")") + GRanges(seqnames = m[[2L]], + ranges = IRanges(start = as.integer(m[[3L]]), + end = as.integer(m[[4L]]))) +} + +# Mode validation +has_gene <- nzchar(argv$gene_id) +has_region <- nzchar(argv$region) +if (has_gene && has_region) + stop("--gene-id and --region are mutually exclusive; pass exactly one.") +if (!has_gene && !has_region) + stop("Specify either --gene-id (with --cis-window) or --region.") + +xqtl_coloc <- !argv$no_xqtl_coloc +joint_gwas <- argv$joint_gwas +separate_gwas <- argv$separate_gwas -res <- colocboostPipeline( - qd, - traitId = argv$gene_id, - cisWindow = argv$cis_window, - focalTrait = argv$gene_id, - xqtlColoc = TRUE) +if (!xqtl_coloc && !joint_gwas && !separate_gwas) + stop("All colocboost variants disabled. Enable at least one of xQTL coloc, --joint-gwas, --separate-gwas.") + +gwas_path <- argv$gwas_sumstats +has_gwas <- nzchar(gwas_path) && gwas_path != "." && file.exists(gwas_path) +if ((joint_gwas || separate_gwas) && !has_gwas) + stop("--joint-gwas / --separate-gwas require --gwas-sumstats pointing at a QC'd GwasSumStats RDS.") + +qd <- readRDS(argv$qtl_dataset) +gss <- if (has_gwas) readRDS(gwas_path) else NULL + +res <- if (has_region) { + colocboostPipeline( + qd, + gwasSumStats = gss, + region = parse_region(argv$region), + cisWindow = argv$cis_window, + xqtlColoc = xqtl_coloc, + jointGwas = joint_gwas, + separateGwas = separate_gwas) +} else { + colocboostPipeline( + qd, + gwasSumStats = gss, + traitId = argv$gene_id, + cisWindow = argv$cis_window, + focalTrait = argv$gene_id, + xqtlColoc = xqtl_coloc, + jointGwas = joint_gwas, + separateGwas = separate_gwas) +} dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) saveRDS(res, argv$output) -cat(sprintf("Wrote colocboost result for gene '%s' to %s\n", - argv$gene_id, argv$output)) + +# Brief per-variant summary +variants <- c( + if (xqtl_coloc) "xqtl-coloc" else NULL, + if (joint_gwas) "joint-gwas" else NULL, + if (separate_gwas) "separate-gwas" else NULL) +cat(sprintf("Wrote colocboost result for %s [variants: %s] to %s\n", + if (has_region) paste0("region '", argv$region, "'") + else paste0("gene '", argv$gene_id, "'"), + paste(variants, collapse = ", "), + argv$output)) diff --git a/code/script/pecotmr_integration/ctwas_assemble.R b/code/script/pecotmr_integration/ctwas_assemble.R new file mode 100644 index 00000000..747238b1 --- /dev/null +++ b/code/script/pecotmr_integration/ctwas_assemble.R @@ -0,0 +1,130 @@ +#!/usr/bin/env Rscript +# ctwas_assemble.R +# +# cTWAS step 1: per-region manifest → assembled cTWAS inputs. +# Reads the per-region manifest, loads each block's GwasSumStats / +# TwasWeights RDS, and calls pecotmr::assembleCtwasInputs() to build +# the named list of ctwas-shape inputs (z_snp, weights, region_info, +# snp_map, LD_map, LD/snpInfo loader closures). The result is saved +# to a single RDS that ctwas_est.R consumes downstream. +# +# Inputs: +# --manifest Per-region manifest TSV with columns: +# region_id (string, unique) +# gwas_sumstats_rds (per-block GwasSumStats RDS) +# twas_weights_rds (comma-sep per-gene TwasWeights RDS; may be empty) +# fine_mapping_result_rds (optional, comma-sep) +# --method Which TWAS method to feed into ctwas +# (default: NULL — resolves to 'ensemble' if +# present, or sole method, or errors) +# --twas-z Optional TWAS-Z GRanges RDS +# --twas-weight-cutoff Pass-through (default 0) +# --cs-min-cor Pass-through (default 0.8) +# --min-pip-cutoff Pass-through (default 0) +# --max-num-variants Pass-through (default Inf) +# --output Output RDS path (the assembled `inputs` list) + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(S4Vectors) +}) + +parser <- arg_parser("cTWAS step 1: assemble inputs from per-region manifest") +parser <- add_argument(parser, "--manifest", + help = "Per-region manifest TSV (region_id, gwas_sumstats_rds, twas_weights_rds[, fine_mapping_result_rds])", + type = "character") +parser <- add_argument(parser, "--method", + help = "Which TWAS method to feed into ctwas (defaults to 'ensemble' if present, or sole method)", + type = "character", default = "") +parser <- add_argument(parser, "--twas-z", + help = "Optional TWAS-Z GRanges RDS", + type = "character", default = "") +parser <- add_argument(parser, "--twas-weight-cutoff", + help = "Drop weight variants with |w| below this", + type = "numeric", default = 0) +parser <- add_argument(parser, "--cs-min-cor", + help = "CS purity floor for must-keep rescue", + type = "numeric", default = 0.8) +parser <- add_argument(parser, "--min-pip-cutoff", + help = "PIP rescue threshold", + type = "numeric", default = 0) +parser <- add_argument(parser, "--max-num-variants", + help = "Per-gene variant cap", + type = "numeric", default = Inf) +parser <- add_argument(parser, "--output", + help = "Output RDS path (assembled cTWAS inputs)", + type = "character") +argv <- parse_args(parser) + +split_paths <- function(s) { + if (is.na(s) || !nzchar(s)) return(character(0)) + trimws(strsplit(s, ",", fixed = TRUE)[[1L]]) +} + +manifest <- read.table(argv$manifest, header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") +required <- c("region_id", "gwas_sumstats_rds", "twas_weights_rds") +missing <- setdiff(required, names(manifest)) +if (length(missing) > 0L) + stop("Manifest missing required column(s): ", + paste(missing, collapse = ", ")) +if (anyDuplicated(manifest$region_id)) + stop("Manifest has duplicate region_id values.") +if (nrow(manifest) < 2L) + stop("Manifest must list at least two LD blocks. cTWAS's EM cannot ", + "converge on a single region.") + +gwasSumStatsByRegion <- list() +twasWeightsByRegion <- list() +fmrByRegion <- list() +for (i in seq_len(nrow(manifest))) { + rid <- manifest$region_id[[i]] + gwasSumStatsByRegion[[rid]] <- readRDS(manifest$gwas_sumstats_rds[[i]]) + tw_paths <- split_paths(manifest$twas_weights_rds[[i]]) + if (length(tw_paths) > 0L) { + tw_list <- lapply(tw_paths, readRDS) + twasWeightsByRegion[[rid]] <- if (length(tw_list) == 1L) tw_list[[1L]] + else Reduce(function(a, b) + pecotmr:::.rbindTwasWeights(a, b), + tw_list) + } + if ("fine_mapping_result_rds" %in% names(manifest)) { + fmr_paths <- split_paths(manifest$fine_mapping_result_rds[[i]]) + if (length(fmr_paths) > 0L) { + fmr_list <- lapply(fmr_paths, readRDS) + fmrByRegion[[rid]] <- if (length(fmr_list) == 1L) fmr_list[[1L]] + else Reduce(function(a, b) + pecotmr:::.rbindFineMappingResult(a, b), + fmr_list) + } + } +} +if (length(twasWeightsByRegion) == 0L) + stop("Manifest yielded zero TwasWeights across all blocks.") + +# Optional precomputed TWAS-Z and fineMappingResult (single-object). +tz <- if (nzchar(argv$twas_z) && argv$twas_z != "." && file.exists(argv$twas_z)) + readRDS(argv$twas_z) else NULL +fmr <- if (length(fmrByRegion) > 0L) { + if (length(fmrByRegion) == 1L) fmrByRegion[[1L]] + else Reduce(function(a, b) pecotmr:::.rbindFineMappingResult(a, b), + unname(fmrByRegion)) +} else NULL + +inputs <- assembleCtwasInputs( + gwasSumStats = gwasSumStatsByRegion, + twasWeights = twasWeightsByRegion, + twasZ = tz, + fineMappingResult = fmr, + method = if (nzchar(argv$method)) argv$method else NULL, + twasWeightCutoff = argv$twas_weight_cutoff, + csMinCor = argv$cs_min_cor, + minPipCutoff = argv$min_pip_cutoff, + maxNumVariants = argv$max_num_variants) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(inputs, argv$output) +cat(sprintf("Wrote assembled cTWAS inputs (%d regions, %d weights) to %s\n", + nrow(inputs$region_info), length(inputs$weights), argv$output)) diff --git a/code/script/pecotmr_integration/ctwas_est.R b/code/script/pecotmr_integration/ctwas_est.R new file mode 100644 index 00000000..2f21218e --- /dev/null +++ b/code/script/pecotmr_integration/ctwas_est.R @@ -0,0 +1,78 @@ +#!/usr/bin/env Rscript +# ctwas_est.R +# +# cTWAS step 2: estimate group prior + prior variance. +# Reads the assembled inputs from ctwas_assemble.R and calls +# pecotmr::estCtwasParam() (which wraps ctwas::assemble_region_data + +# ctwas::est_param). Saves the augmented state — `inputs` + `region_data` +# + `boundary_genes` + `z_gene` + `param` — for ctwas_finemap.R to +# consume downstream. +# +# Inputs: +# --inputs RDS produced by ctwas_assemble.R +# --thin Pass-through (default 0.1) +# --niter-prefit Pass-through (default 3) +# --niter Pass-through (default 30) +# --group-prior-var-structure Pass-through (default 'shared_type') +# --min-group-size Pass-through (default 100) +# --min-p-single-effect Pass-through (default 0.8) +# --fallback-to-prefit Flag: on accurate-EM NaN, recover via prefit-only +# --ncore Pass-through (default 1) +# --output Output RDS path (the est state list) + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("cTWAS step 2: estimate group prior + prior variance") +parser <- add_argument(parser, "--inputs", + help = "RDS produced by ctwas_assemble.R", + type = "character") +parser <- add_argument(parser, "--thin", + help = "Thin", type = "numeric", default = 0.1) +parser <- add_argument(parser, "--niter-prefit", + help = "Prefit EM iterations", + type = "integer", default = 3L) +parser <- add_argument(parser, "--niter", + help = "Accurate EM iterations", + type = "integer", default = 30L) +parser <- add_argument(parser, "--group-prior-var-structure", + help = "Pass-through (default 'shared_type')", + type = "character", default = "shared_type") +parser <- add_argument(parser, "--min-group-size", + help = "Minimum genes per ctwas group", + type = "integer", default = 100L) +parser <- add_argument(parser, "--min-p-single-effect", + help = "Minimum p(single effect) for accurate EM region retention", + type = "numeric", default = 0.8) +parser <- add_argument(parser, "--fallback-to-prefit", + help = "On accurate-EM NaN, fall back to prefit estimates", + flag = TRUE) +parser <- add_argument(parser, "--ncore", + help = "Number of cores", + type = "integer", default = 1L) +parser <- add_argument(parser, "--output", + help = "Output RDS path (est state)", + type = "character") +argv <- parse_args(parser) + +inputs <- readRDS(argv$inputs) +est <- estCtwasParam( + inputs, + thin = argv$thin, + niterPrefit = argv$niter_prefit, + niter = argv$niter, + groupPriorVarStructure = argv$group_prior_var_structure, + ncore = argv$ncore, + fallbackToPrefit = argv$fallback_to_prefit, + min_group_size = argv$min_group_size, + min_p_single_effect = argv$min_p_single_effect) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(est, argv$output) +cat(sprintf("Wrote estCtwasParam result to %s\n", argv$output)) +cat("group_prior: ", + paste(format(est$param$group_prior, digits = 4), collapse = " | "), "\n") +cat("group_prior_var:", + paste(format(est$param$group_prior_var, digits = 4), collapse = " | "), "\n") diff --git a/code/script/pecotmr_integration/ctwas_finemap.R b/code/script/pecotmr_integration/ctwas_finemap.R new file mode 100644 index 00000000..f3779c4f --- /dev/null +++ b/code/script/pecotmr_integration/ctwas_finemap.R @@ -0,0 +1,87 @@ +#!/usr/bin/env Rscript +# ctwas_finemap.R +# +# cTWAS step 3: screen regions + fine-map. +# Reads the est state from ctwas_est.R and calls +# pecotmr::screenCtwasRegions() then pecotmr::finemapCtwasRegions(). +# Optionally accepts a user-supplied param override RDS so the caller +# can swap in hand-tuned priors when the accurate EM diverged and the +# defaults aren't usable. Saves the final ctwas_sumstats-shape result. +# +# Inputs: +# --est RDS produced by ctwas_est.R +# --param-override Optional RDS with $group_prior and +# $group_prior_var to substitute before +# screen/finemap. Mirrors the legacy ctwas_3 +# escape hatch for NaN-on-iter-2 EM divergence. +# --L Pass-through (default 5) +# --no-filter-L Flag: disable ctwas's internal L >= 1 region screen +# --min-nonsnp-pip Pass-through (default 0.5) +# --ncore Pass-through (default 1) +# --output Output RDS path (ctwas_sumstats-shape result) + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("cTWAS step 3: screen regions + fine-map") +parser <- add_argument(parser, "--est", + help = "RDS produced by ctwas_est.R", + type = "character") +parser <- add_argument(parser, "--param-override", + help = "Optional param-override RDS (overrides $param before screen/finemap)", + type = "character", default = "") +parser <- add_argument(parser, "--L", + help = "Max number of credible sets per region", + type = "integer", default = 5L) +parser <- add_argument(parser, "--no-filter-L", + help = "Disable ctwas's internal L >= 1 region screen (toy data)", + flag = TRUE) +parser <- add_argument(parser, "--min-nonsnp-pip", + help = "min_nonSNP_PIP threshold for screen_regions", + type = "numeric", default = 0.5) +parser <- add_argument(parser, "--ncore", + help = "Number of cores", + type = "integer", default = 1L) +parser <- add_argument(parser, "--output", + help = "Output RDS path (ctwas_sumstats-shape result)", + type = "character") +argv <- parse_args(parser) + +est <- readRDS(argv$est) + +# Optional caller-supplied param override (NaN escape hatch). +if (nzchar(argv$param_override) && argv$param_override != "." && + file.exists(argv$param_override)) { + override <- readRDS(argv$param_override) + if (!is.null(override$group_prior)) + est$param$group_prior <- override$group_prior + if (!is.null(override$group_prior_var)) + est$param$group_prior_var <- override$group_prior_var + cat("Applied param override from ", argv$param_override, "\n", sep = "") +} + +screened <- screenCtwasRegions( + est, + L = argv$L, + ncore = argv$ncore, + filter_L = !argv$no_filter_L, + min_nonSNP_PIP = argv$min_nonsnp_pip) + +final <- finemapCtwasRegions( + screened, + L = argv$L, + ncore = argv$ncore) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(final, argv$output) +cat(sprintf("Wrote finemapCtwasRegions result to %s\n", argv$output)) +fm <- final$finemap_res +if (!is.null(fm) && nrow(fm) > 0L) { + g <- fm[fm$type != "SNP", , drop = FALSE] + cat(sprintf(" finemap_res rows: %d (%d gene-level, %d SNP-level)\n", + nrow(fm), nrow(g), nrow(fm) - nrow(g))) +} else { + cat(" finemap_res: NULL (no regions surviving filter_L >= 1)\n") +} diff --git a/code/script/pecotmr_integration/fine_mapping.R b/code/script/pecotmr_integration/fine_mapping.R index 17c0c5f1..a9e81d0a 100644 --- a/code/script/pecotmr_integration/fine_mapping.R +++ b/code/script/pecotmr_integration/fine_mapping.R @@ -1,32 +1,55 @@ #!/usr/bin/env Rscript # fine_mapping.R # -# Per-gene fine-mapping worker. Loads a pre-built pecotmr::QtlDataset RDS -# and calls fineMappingPipeline() for a single trait. Designed to be -# invoked once per gene by the SoS fan-out in fine_mapping.ipynb. +# SuSiE fine-mapping worker for either QTL (per-gene / per-region over a +# pre-built QtlDataset) or GWAS (per-block over a GwasSumStats RDS). +# Designed to be invoked once per fan-out unit by the SoS step in +# fine_mapping.ipynb. `pecotmr::fineMappingPipeline` dispatches on the +# input class. # -# Inputs: -# --qtl-dataset Path to a QtlDataset RDS (output of qtl_dataset_construct.R) -# --gene-id Single trait identifier to fine-map -# --cis-window cis-window in bp around the trait's TSS -# --coverage SuSiE credible-set coverage (default 0.95) -# --output Output RDS path (one QtlFineMappingResult) +# Input modes (exactly one of --qtl-dataset / --gwas-sumstats): +# +# QTL — fan-out per gene or per region inside a single QtlDataset: +# --qtl-dataset pecotmr::QtlDataset (from qtl_dataset.ipynb) +# --gene-id ENSG... --cis-window 1000000 (gene mode) +# --region chr22:15000000-16000000 (region mode) +# +# GWAS — one call per per-block GwasSumStats RDS (each carrying its own +# z-scores + LD sketch; no gene/region concept): +# --gwas-sumstats pecotmr::GwasSumStats (per LD block, +# typically from gwas_sumstats_construct.R) +# +# Shared: +# --methods Comma-separated method tokens. Default "susie". +# --coverage SuSiE credible-set coverage. Default 0.95. +# --output Output RDS path. suppressPackageStartupMessages({ library(argparser) library(pecotmr) + library(GenomicRanges) + library(IRanges) }) -parser <- arg_parser("Per-gene SuSiE fine-mapping over a pre-built QtlDataset") +parser <- arg_parser("SuSiE fine-mapping over a pecotmr S4 input (QtlDataset or GwasSumStats)") parser <- add_argument(parser, "--qtl-dataset", - help = "Path to a QtlDataset RDS", - type = "character") + help = "Path to a QtlDataset RDS (QTL mode)", + type = "character", default = "") +parser <- add_argument(parser, "--gwas-sumstats", + help = "Path to a GwasSumStats RDS (GWAS mode)", + type = "character", default = "") parser <- add_argument(parser, "--gene-id", - help = "Single trait identifier to fine-map", - type = "character") + help = "Trait identifier (QTL gene mode); mutually exclusive with --region", + type = "character", default = "") parser <- add_argument(parser, "--cis-window", - help = "cis-window in bp around the trait's TSS", + help = "cis-window in bp around the trait's TSS (QTL gene mode)", type = "integer", default = 1000000L) +parser <- add_argument(parser, "--region", + help = "Genomic region as chr:start-end (QTL region mode)", + type = "character", default = "") +parser <- add_argument(parser, "--methods", + help = "Comma-separated fine-mapping method tokens", + type = "character", default = "susie") parser <- add_argument(parser, "--coverage", help = "SuSiE credible-set coverage", type = "numeric", default = 0.95) @@ -34,16 +57,61 @@ parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) -qd <- readRDS(argv$qtl_dataset) +parse_region <- function(s) { + m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] + if (length(m) != 4L) + stop("--region must be in chr:start-end format (got: ", s, ")") + GRanges(seqnames = m[[2L]], + ranges = IRanges(start = as.integer(m[[3L]]), + end = as.integer(m[[4L]]))) +} + +has_qtl <- nzchar(argv$qtl_dataset) +has_gwas <- nzchar(argv$gwas_sumstats) +if (has_qtl && has_gwas) + stop("--qtl-dataset and --gwas-sumstats are mutually exclusive; pass exactly one.") +if (!has_qtl && !has_gwas) + stop("Specify either --qtl-dataset (QTL mode) or --gwas-sumstats (GWAS mode).") + +methods <- trimws(strsplit(argv$methods, ",", fixed = TRUE)[[1L]]) -res <- fineMappingPipeline( - qd, - methods = "susie", - traitId = argv$gene_id, - cisWindow = argv$cis_window, - coverage = argv$coverage) +if (has_gwas) { + # ----- GWAS mode ------------------------------------------------------- + gss <- readRDS(argv$gwas_sumstats) + res <- fineMappingPipeline( + gss, + methods = methods, + coverage = argv$coverage) + label <- paste0("GwasSumStats '", basename(argv$gwas_sumstats), "'") +} else { + # ----- QTL mode -------------------------------------------------------- + has_gene <- nzchar(argv$gene_id) + has_region <- nzchar(argv$region) + if (has_gene && has_region) + stop("--gene-id and --region are mutually exclusive (QTL mode); pass exactly one.") + if (!has_gene && !has_region) + stop("QTL mode requires --gene-id (with --cis-window) or --region.") + qd <- readRDS(argv$qtl_dataset) + res <- if (has_region) { + fineMappingPipeline( + qd, + methods = methods, + region = parse_region(argv$region), + cisWindow = argv$cis_window, + coverage = argv$coverage) + } else { + fineMappingPipeline( + qd, + methods = methods, + traitId = argv$gene_id, + cisWindow = argv$cis_window, + coverage = argv$coverage) + } + label <- if (has_region) paste0("region '", argv$region, "'") + else paste0("gene '", argv$gene_id, "'") +} dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) saveRDS(res, argv$output) -cat(sprintf("Wrote fineMapping result for gene '%s' (%d row(s)) to %s\n", - argv$gene_id, nrow(res), argv$output)) +cat(sprintf("Wrote fineMapping result for %s (%d row(s)) to %s\n", + label, nrow(res), argv$output)) diff --git a/code/script/pecotmr_integration/gwas_sumstats_construct.R b/code/script/pecotmr_integration/gwas_sumstats_construct.R new file mode 100644 index 00000000..04f505c7 --- /dev/null +++ b/code/script/pecotmr_integration/gwas_sumstats_construct.R @@ -0,0 +1,189 @@ +#!/usr/bin/env Rscript +# gwas_sumstats_construct.R +# +# Build one pecotmr::GwasSumStats over a single LD block from a GWAS +# summary-statistics TSV and an LD-reference meta file, run +# summaryStatsQc(), and serialize to RDS. The resulting RDS is the +# per-(study, block) input to twasWeightsPipeline downstream consumers, +# causalInferencePipeline, and ctwasPipeline. +# +# Inputs: +# --study Single study identifier +# --gwas-tsv Path to a tabix-indexed (or plain) GWAS TSV +# Standard columns (with aliases): +# #chrom|chrom|chr, pos, variant_id|SNP, +# A1, A2, z|Z, n_sample|N +# Optional: effect_allele_frequency, p, beta, se +# --ld-block Genomic interval as chr:start-end (one LD block) +# --ld-meta LD-meta TSV (#chr, start, end, path) +# --genome Genome build label (e.g. "GRCh38"). Default "GRCh38" +# --output Output RDS path + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) + library(GenomicRanges) + library(IRanges) + library(S4Vectors) +}) + +parser <- arg_parser("Build a per-LD-block GwasSumStats RDS") +parser <- add_argument(parser, "--study", + help = "Single study identifier", + type = "character") +parser <- add_argument(parser, "--gwas-tsv", + help = "Path to GWAS summary-statistics TSV", + type = "character") +parser <- add_argument(parser, "--ld-block", + help = "LD block as chr:start-end", + type = "character") +parser <- add_argument(parser, "--ld-meta", + help = "Path to LD-meta TSV", + type = "character") +parser <- add_argument(parser, "--genome", + help = "Genome build label", + type = "character", default = "GRCh38") +parser <- add_argument(parser, "--skip-qc", + help = "Skip summaryStatsQc() (debug-only; the output GwasSumStats will have qcInfo = list() and fineMappingPipeline / twasWeightsPipeline will reject it)", + flag = TRUE) +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +parse_region <- function(s) { + m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] + if (length(m) != 4L) + stop("--ld-block must be in chr:start-end format (got: ", s, ")") + list(chr = m[[2L]], start = as.integer(m[[3L]]), end = as.integer(m[[4L]])) +} + +block <- parse_region(argv$ld_block) + +# ----- Read GWAS TSV and subset to the LD block ----------------------------- +gwas <- read.table(if (grepl("\\.gz$", argv$gwas_tsv)) gzfile(argv$gwas_tsv) + else argv$gwas_tsv, + header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") + +pick <- function(opts, where) intersect(opts, names(where))[1L] +chr_col <- pick(c("#chrom", "chrom", "chr"), gwas) +pos_col <- pick(c("pos", "position", "BP"), gwas) +snp_col <- pick(c("variant_id", "SNP", "rsid"), gwas) +a1_col <- pick(c("A1", "a1"), gwas) +a2_col <- pick(c("A2", "a2"), gwas) +z_col <- pick(c("z", "Z"), gwas) +n_col <- pick(c("n_sample", "N", "n"), gwas) + +if (any(is.na(c(chr_col, pos_col, snp_col, a1_col, a2_col, z_col, n_col)))) + stop("--gwas-tsv missing one of required columns ", + "(chrom/pos/variant_id/A1/A2/z/n_sample) in: ", argv$gwas_tsv) + +# Normalise chromosome label to match the LD block's +chrom_vals <- as.character(gwas[[chr_col]]) +if (!startsWith(chrom_vals[[1L]], "chr")) + chrom_vals <- paste0("chr", chrom_vals) + +pos_vals <- as.integer(gwas[[pos_col]]) +keep <- chrom_vals == block$chr & pos_vals >= block$start & pos_vals <= block$end +sub <- gwas[keep, , drop = FALSE] +if (nrow(sub) == 0L) + stop("No GWAS variants fall in LD block ", argv$ld_block, + " (after chromosome normalisation).") + +# ----- Build the LD-reference GenotypeHandle for the block ------------------ +# Resolve the LD-meta row covering this block directly: pecotmr's +# GenotypeHandle(ldMeta=…, region=…) currently round-trips the meta file +# path as the data path (an upstream bug), so we read the meta TSV +# ourselves and call the appropriate GenotypeHandle constructor. +ld_meta_dir <- dirname(normalizePath(argv$ld_meta)) +ld_meta_df <- read.table(argv$ld_meta, header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") +# Header may be "#chr", "#chrom", "chr", or "chrom". +ld_chr_col <- intersect(c("#chr", "#chrom", "chr", "chrom"), + names(ld_meta_df))[1L] +if (is.na(ld_chr_col)) + stop("Could not find a chromosome column in ", argv$ld_meta, + " (expected one of '#chr' / '#chrom' / 'chr' / 'chrom'); got: ", + paste(names(ld_meta_df), collapse = ", ")) +# Normalize both sides to the bare chromosome label (no "chr" prefix). +ld_chr_norm <- sub("^chr", "", + as.character(ld_meta_df[[ld_chr_col]]), + ignore.case = TRUE) +block_chr_norm <- sub("^chr", "", block$chr, ignore.case = TRUE) +# `start = 0, end = 0` is the meta convention for "whole chromosome" — any +# block on that chromosome is covered. +ld_start <- suppressWarnings(as.integer(ld_meta_df$start)) +ld_end <- suppressWarnings(as.integer(ld_meta_df$end)) +whole_chrom <- !is.na(ld_start) & !is.na(ld_end) & + ld_start == 0L & ld_end == 0L +covers <- which(ld_chr_norm == block_chr_norm & + (whole_chrom | + (ld_start <= block$start & ld_end >= block$end))) +if (length(covers) == 0L) + stop("No LD-meta row in ", argv$ld_meta, " fully covers ", argv$ld_block) +if (length(covers) > 1L) + stop("Multiple LD-meta rows cover ", argv$ld_block, + "; restrict to a single LD block.") +ld_prefix <- ld_meta_df$path[covers[[1L]]] +if (!startsWith(ld_prefix, "/")) + ld_prefix <- file.path(ld_meta_dir, ld_prefix) + +# Detect data format from companion file extensions +if (file.exists(paste0(ld_prefix, ".pgen"))) { + ld_handle <- GenotypeHandle(plink2Prefix = ld_prefix) +} else if (file.exists(paste0(ld_prefix, ".bed"))) { + ld_handle <- GenotypeHandle(plink1Prefix = ld_prefix) +} else if (file.exists(paste0(ld_prefix, ".gds"))) { + ld_handle <- GenotypeHandle(path = paste0(ld_prefix, ".gds")) +} else if (file.exists(paste0(ld_prefix, ".vcf.gz"))) { + ld_handle <- GenotypeHandle(path = paste0(ld_prefix, ".vcf.gz")) +} else { + stop("Could not find a recognised genotype payload at LD-meta prefix: ", ld_prefix, + " (looked for .pgen / .bed / .gds / .vcf.gz)") +} + +# ----- Build GwasSumStats entry GRanges ------------------------------------- +entry_gr <- GRanges( + seqnames = chrom_vals[keep], + ranges = IRanges(start = pos_vals[keep], width = 1L)) +mcols(entry_gr) <- DataFrame( + SNP = as.character(sub[[snp_col]]), + A1 = as.character(sub[[a1_col]]), + A2 = as.character(sub[[a2_col]]), + Z = as.numeric(sub[[z_col]]), + N = as.integer(sub[[n_col]])) +# Optional columns when present +opt_cols <- c(beta = "beta", se = "se", p = c("p", "pvalue"), + maf = c("effect_allele_frequency", "maf", "MAF"), + info = c("info", "INFO")) +for (slot in c("BETA", "SE", "P", "MAF", "INFO")) { + src <- intersect(c(slot, tolower(slot), + if (slot == "MAF") c("effect_allele_frequency"), + if (slot == "P") c("pvalue", "p")), + names(sub))[1L] + if (!is.na(src)) { + mcols(entry_gr)[[slot]] <- if (slot == "N") as.integer(sub[[src]]) + else as.numeric(sub[[src]]) + } +} + +# ----- Construct + QC + save ------------------------------------------------ +gss <- GwasSumStats( + study = argv$study, + entry = list(entry_gr), + genome = argv$genome, + ldSketch = ld_handle) +gss_out <- if (argv$skip_qc) { + message("--skip-qc set; serialising raw GwasSumStats without summaryStatsQc().") + gss +} else { + summaryStatsQc(gss) +} + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(gss_out, argv$output) +cat(sprintf("Wrote %sGwasSumStats for study '%s' over %s (%d variants in) to %s\n", + if (argv$skip_qc) "(skip-QC) " else "QC'd ", + argv$study, argv$ld_block, length(entry_gr), argv$output)) diff --git a/code/script/pecotmr_integration/legacy_ctwas_weights_to_s4.R b/code/script/pecotmr_integration/legacy_ctwas_weights_to_s4.R new file mode 100644 index 00000000..9394f93f --- /dev/null +++ b/code/script/pecotmr_integration/legacy_ctwas_weights_to_s4.R @@ -0,0 +1,157 @@ +#!/usr/bin/env Rscript +# legacy_ctwas_weights_to_s4.R +# +# One-shot extractor: legacy `*.ctwas_weights.*.rds` (the OUTPUT of the +# legacy [ctwas_1] step) into the new S4 `pecotmr::TwasWeights` shape. +# +# Why this exists: the legacy MWE ships +# `protocol_example.ctwas_weights.protocol_example_twas_chr22.chr22.rds` +# with strong weights (max|w| ≈ 1.35) that demonstrably drive ctwas to +# the documented gene-Z = 5.46 result, BUT those weights are not +# reproducible from the upstream `.reshaped_toy.rds` via the documented +# pipeline (the legacy code path produces ~1e-5 values, not ~1e-0). +# The legacy `ctwas_weights.rds` therefore appears to come from a +# stronger upstream fit that isn't shipped. To smoke-test our new +# `ctwasPipeline` end-to-end with default thresholds, we extract those +# already-on-correlation-scale weights into an S4 TwasWeights so the +# pipeline sees the same numerical input the legacy ctwas_3 step did. +# +# The extracted weights are STANDARDIZED (already on the correlation +# scale — the legacy multiplied by sqrt(variance) when building the +# file), so the resulting TwasWeightsEntry carries `standardized = TRUE` +# and our `.ctwasBuildWeights` will skip the sqrt(variance) scaling. +# +# Usage: +# Rscript legacy_ctwas_weights_to_s4.R \ +# --legacy \ +# --study \ +# --method (default "susie") +# --ld-meta --ld-block (or --ld-prefix) +# --output + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("Convert legacy ctwas_weights.rds to S4 TwasWeights") +parser <- add_argument(parser, "--legacy", + help = "Path to legacy *.ctwas_weights.*.rds", + type = "character") +parser <- add_argument(parser, "--study", + help = "Study label to stamp on every entry", + type = "character") +parser <- add_argument(parser, "--method", + help = "Method label to stamp on every entry (default 'susie')", + type = "character", default = "susie") +parser <- add_argument(parser, "--ld-meta", + help = "LD-meta TSV path (used with --ld-block)", + type = "character", default = "") +parser <- add_argument(parser, "--ld-block", + help = "LD block as chr:start-end (with --ld-meta)", + type = "character", default = "") +parser <- add_argument(parser, "--ld-prefix", + help = "Explicit LD prefix (bypasses LD-meta lookup)", + type = "character", default = "") +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +resolve_ld_prefix <- function(meta_path, block_str) { + m <- regmatches(block_str, regexec("^([^:]+):([0-9]+)-([0-9]+)$", block_str))[[1L]] + if (length(m) != 4L) + stop("--ld-block must be chr:start-end (got: ", block_str, ")") + block <- list(chr = m[[2L]], start = as.integer(m[[3L]]), + end = as.integer(m[[4L]])) + meta_dir <- dirname(normalizePath(meta_path)) + meta <- read.table(meta_path, header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") + chr_col <- intersect(c("#chr", "#chrom", "chr", "chrom"), names(meta))[1L] + chr_norm <- sub("^chr", "", as.character(meta[[chr_col]]), ignore.case = TRUE) + block_norm <- sub("^chr", "", block$chr, ignore.case = TRUE) + s <- suppressWarnings(as.integer(meta$start)) + e <- suppressWarnings(as.integer(meta$end)) + whole <- !is.na(s) & !is.na(e) & s == 0L & e == 0L + hit <- which(chr_norm == block_norm & + (whole | (s <= block$start & e >= block$end))) + if (length(hit) != 1L) + stop("LD-meta lookup for ", block_str, " returned ", length(hit), " rows.") + pfx <- meta$path[hit] + if (!startsWith(pfx, "/")) pfx <- file.path(meta_dir, pfx) + pfx +} + +open_handle <- function(prefix) { + if (file.exists(paste0(prefix, ".pgen"))) GenotypeHandle(plink2Prefix = prefix) + else if (file.exists(paste0(prefix, ".bed"))) GenotypeHandle(plink1Prefix = prefix) + else if (file.exists(paste0(prefix, ".gds"))) GenotypeHandle(path = paste0(prefix, ".gds")) + else if (file.exists(paste0(prefix, ".vcf.gz"))) GenotypeHandle(path = paste0(prefix, ".vcf.gz")) + else stop("No genotype payload at LD prefix: ", prefix) +} + +ld_handle <- if (nzchar(argv$ld_prefix)) { + open_handle(argv$ld_prefix) +} else if (nzchar(argv$ld_meta) && nzchar(argv$ld_block)) { + open_handle(resolve_ld_prefix(argv$ld_meta, argv$ld_block)) +} else { + stop("Provide either --ld-prefix or (--ld-meta AND --ld-block).") +} + +# --- Walk the legacy structure and build TwasWeightsEntry per gene ---- +legacy <- readRDS(argv$legacy) +if (!is.list(legacy) || length(legacy) == 0L) + stop("Legacy ctwas_weights RDS is empty or not a list: ", argv$legacy) + +studies <- character(0) +contexts <- character(0) +traits <- character(0) +methods <- character(0) +entries <- list() + +# Legacy keys: "|_"; values are a list with +# wgt (variants × 1 matrix), molecular_id, type, context, etc. +for (gene_key in names(legacy)) { + entry_obj <- legacy[[gene_key]] + if (!is.list(entry_obj) || is.null(entry_obj$wgt)) next + wgt_mat <- entry_obj$wgt + if (nrow(wgt_mat) == 0L) next + + vids <- rownames(wgt_mat) + wvec <- as.numeric(wgt_mat[, 1L]) + + trait <- entry_obj$molecular_id %||% sub("\\|.*$", "", gene_key) + ctx_label <- entry_obj$context %||% sub("^.*\\|", "", gene_key) + + weights_one_col <- matrix(wvec, ncol = 1L, + dimnames = list(vids, argv$method)) + + tw_entry <- TwasWeightsEntry( + variantIds = vids, + weights = weights_one_col, + # Already on the correlation scale (legacy multiplied by + # sqrt(variance)); skip the sqrt(variance) step downstream. + standardized = TRUE) + + studies <- c(studies, argv$study) + contexts <- c(contexts, ctx_label) + traits <- c(traits, trait) + methods <- c(methods, argv$method) + entries <- c(entries, list(tw_entry)) +} + +if (length(entries) == 0L) + stop("Converter produced 0 entries from ", argv$legacy) + +tw <- TwasWeights( + study = studies, + context = contexts, + trait = traits, + method = methods, + entry = entries, + ldSketch = ld_handle) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(tw, argv$output) +cat(sprintf("Wrote S4 TwasWeights (%d entries; %d unique traits) to %s\n", + length(entries), length(unique(traits)), argv$output)) diff --git a/code/script/pecotmr_integration/legacy_twas_weights_to_s4.R b/code/script/pecotmr_integration/legacy_twas_weights_to_s4.R new file mode 100644 index 00000000..f224636c --- /dev/null +++ b/code/script/pecotmr_integration/legacy_twas_weights_to_s4.R @@ -0,0 +1,206 @@ +#!/usr/bin/env Rscript +# legacy_twas_weights_to_s4.R +# +# One-shot converter: legacy `*.univariate_twas_weights.rds` (top-level +# gene IDs -> context names -> `twas_weights` + `twas_cv_result` etc.) +# into the new S4 `pecotmr::TwasWeights` collection. +# +# Usage: +# Rscript legacy_twas_weights_to_s4.R \ +# --legacy \ +# --study \ +# --ld-meta \ +# --ld-block (or --ld-prefix ) \ +# --output +# +# The LD-block / LD-prefix is required because TwasWeights carries an +# `ldSketch` slot that must match the GwasSumStats' LD sketch downstream. + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("Convert legacy TwasWeights RDS to S4 TwasWeights") +parser <- add_argument(parser, "--legacy", + help = "Path to legacy *.univariate_twas_weights.rds", + type = "character") +parser <- add_argument(parser, "--study", + help = "Study label to stamp on every entry", + type = "character") +parser <- add_argument(parser, "--ld-meta", + help = "Path to LD-meta TSV (used when --ld-block is given)", + type = "character", default = "") +parser <- add_argument(parser, "--ld-block", + help = "LD block as chr:start-end (resolved against --ld-meta)", + type = "character", default = "") +parser <- add_argument(parser, "--ld-prefix", + help = "Explicit PLINK2/PLINK1/GDS/VCF prefix; bypasses LD-meta lookup", + type = "character", default = "") +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +# --- Build an LD GenotypeHandle (same logic as gwas_sumstats_construct.R) ---- +resolve_ld_prefix <- function(meta_path, block_str) { + m <- regmatches(block_str, regexec("^([^:]+):([0-9]+)-([0-9]+)$", block_str))[[1L]] + if (length(m) != 4L) + stop("--ld-block must be chr:start-end (got: ", block_str, ")") + block <- list(chr = m[[2L]], start = as.integer(m[[3L]]), + end = as.integer(m[[4L]])) + meta_dir <- dirname(normalizePath(meta_path)) + meta <- read.table(meta_path, header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") + chr_col <- intersect(c("#chr", "#chrom", "chr", "chrom"), names(meta))[1L] + if (is.na(chr_col)) + stop("Could not find a chromosome column in ", meta_path) + chr_norm <- sub("^chr", "", as.character(meta[[chr_col]]), ignore.case = TRUE) + block_norm <- sub("^chr", "", block$chr, ignore.case = TRUE) + s <- suppressWarnings(as.integer(meta$start)) + e <- suppressWarnings(as.integer(meta$end)) + whole <- !is.na(s) & !is.na(e) & s == 0L & e == 0L + hit <- which(chr_norm == block_norm & + (whole | (s <= block$start & e >= block$end))) + if (length(hit) != 1L) + stop("LD-meta lookup for ", block_str, " returned ", length(hit), " rows.") + pfx <- meta$path[hit] + if (!startsWith(pfx, "/")) pfx <- file.path(meta_dir, pfx) + pfx +} + +open_handle <- function(prefix) { + if (file.exists(paste0(prefix, ".pgen"))) GenotypeHandle(plink2Prefix = prefix) + else if (file.exists(paste0(prefix, ".bed"))) GenotypeHandle(plink1Prefix = prefix) + else if (file.exists(paste0(prefix, ".gds"))) GenotypeHandle(path = paste0(prefix, ".gds")) + else if (file.exists(paste0(prefix, ".vcf.gz"))) GenotypeHandle(path = paste0(prefix, ".vcf.gz")) + else stop("No recognised genotype payload at LD prefix: ", prefix) +} + +ld_handle <- if (nzchar(argv$ld_prefix)) { + open_handle(argv$ld_prefix) +} else if (nzchar(argv$ld_meta) && nzchar(argv$ld_block)) { + open_handle(resolve_ld_prefix(argv$ld_meta, argv$ld_block)) +} else { + stop("Provide either --ld-prefix or (--ld-meta AND --ld-block).") +} + +# --- Walk the legacy structure and build TwasWeightsEntry per (gene, context) ---- +legacy <- readRDS(argv$legacy) +if (!is.list(legacy) || length(legacy) == 0L) + stop("Legacy RDS is empty or not a list: ", argv$legacy) + +studies <- character(0) +contexts <- character(0) +traits <- character(0) +methods <- character(0) +entries <- list() + +for (gene in names(legacy)) { + gene_obj <- legacy[[gene]] + if (!is.list(gene_obj)) next + for (ctx_name in names(gene_obj)) { + ctx_obj <- gene_obj[[ctx_name]] + if (!is.list(ctx_obj) || + is.null(ctx_obj$twas_weights) || + is.null(ctx_obj$twas_cv_result)) next + wts_list <- ctx_obj$twas_weights + if (length(wts_list) == 0L) next + + # Variant ids: rownames of the first matrix-shaped weight column. Some + # legacy fields are length-N numeric vectors (e.g. `ensemble_weights`); + # coerce them to single-column matrices below. + first_mat <- Find(is.matrix, wts_list) + if (is.null(first_mat)) + stop("Legacy weights for ", gene, "/", ctx_name, + " contain no matrix-shaped entry; cannot recover variant IDs.") + vids <- rownames(first_mat) + method_names <- sub("_weights$", "", names(wts_list)) + + # CV performance per method (subset of methods that actually had CV). + perf_list <- ctx_obj$twas_cv_result$performance + perf_method_names <- sub("_performance$", "", names(perf_list)) + cv_lookup <- setNames(perf_list, perf_method_names) + + # context_name follows the legacy convention "_" — strip + # the trailing trait suffix so the new schema's context is a clean label. + context_label <- sub(paste0("_", gene, "$"), "", ctx_name) + + # Legacy susie_weights_intermediate carries the SuSiE posterior + # tensor (mu, lbf_variable, X_column_scale_factors, etc.) that + # ctwasPipeline's alpha renormalization needs when variants get + # dropped downstream of the original fit. Attach it on susie-method + # entries via the TwasWeightsEntry@fits slot. + legacy_intermediate <- ctx_obj$susie_weights_intermediate + + # One TwasWeightsEntry per method (TwasWeights schema requires one row + # per (study, context, trait, method) tuple). + for (mi in seq_along(method_names)) { + m <- method_names[[mi]] + wm <- wts_list[[mi]] + wv <- if (is.matrix(wm)) as.numeric(wm[, 1L]) else as.numeric(wm) + # Skip a method when every weight is zero or NA — it would never + # contribute a TWAS-Z and downstream would silently drop it anyway. + if (all(is.na(wv)) || all(wv == 0, na.rm = TRUE)) next + + weights_mat <- matrix(wv, ncol = 1L, + dimnames = list(vids, m)) + + perf_row <- cv_lookup[[m]] + cv_df <- if (!is.null(perf_row)) { + df <- as.data.frame(perf_row, stringsAsFactors = FALSE) + names(df)[names(df) == "adj_rsq"] <- "adjusted_rsq" + df$method <- m + df <- df[, intersect(c("method", "rsq", "adjusted_rsq", "pval", + "corr", "RMSE", "MAE"), + names(df)), drop = FALSE] + rownames(df) <- NULL + df + } else { + data.frame(method = m, rsq = NA_real_, adjusted_rsq = NA_real_, + pval = NA_real_, stringsAsFactors = FALSE) + } + + # Attach SuSiE intermediate on susie-family methods so + # ctwasPipeline's renormalization branch can recompute weights + # over a filtered variant set. + entry_fits <- if (m %in% c("susie", "susie_inf", "susie_ash") && + !is.null(legacy_intermediate)) { + legacy_intermediate + } else { + NULL + } + + entry <- TwasWeightsEntry( + variantIds = vids, + weights = weights_mat, + fits = entry_fits, + cvPerformance = cv_df, + standardized = FALSE) + + studies <- c(studies, argv$study) + contexts <- c(contexts, context_label) + traits <- c(traits, gene) + methods <- c(methods, m) + entries <- c(entries, list(entry)) + } + } +} + +if (length(entries) == 0L) + stop("Converter produced 0 entries from ", argv$legacy) + +tw <- TwasWeights( + study = studies, + context = contexts, + trait = traits, + method = methods, + entry = entries, + ldSketch = ld_handle) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(tw, argv$output) +cat(sprintf("Wrote S4 TwasWeights (%d entries; %d unique traits) to %s\n", + length(entries), + length(unique(traits)), + argv$output)) diff --git a/code/script/pecotmr_integration/qtl_dataset_construct.R b/code/script/pecotmr_integration/qtl_dataset_construct.R index de430907..4f879819 100644 --- a/code/script/pecotmr_integration/qtl_dataset_construct.R +++ b/code/script/pecotmr_integration/qtl_dataset_construct.R @@ -2,26 +2,36 @@ # qtl_dataset_construct.R # # Build one pecotmr::QtlDataset from a single study's data and serialize -# to RDS. The resulting RDS becomes the upstream dependency for every -# per-gene fineMappingPipeline / twasWeightsPipeline / colocboostPipeline -# task; gene-level parallelization happens against this single object. +# to RDS. The resulting RDS is the upstream dependency for every per-gene +# fineMappingPipeline / twasWeightsPipeline / colocboostPipeline task; +# gene-level parallelization happens against this single object. # -# Inputs (per-context phenotype + phenotype-covariate files use -# CONTEXT=PATH pairs joined with commas, e.g. -# --phenotype DLPFC=foo.bed.gz,AC=bar.bed.gz): -# --study Study identifier -# --genotype-prefix PLINK2 pgen/pvar/psam prefix -# --phenotype CONTEXT=PATH list of BED.GZ phenotype files -# --phenotype-covariates CONTEXT=PATH list of per-context molecular-trait -# PC TSVs (samples as rows, PCs as columns — -# same shape as the genotype-PC TSV). -# Optional; default none. -# --genotype-covariates TSV of genotype-derived covariates (e.g. ancestry -# PCs) applied uniformly across all contexts -# (samples as rows, PCs as columns). Optional. -# --maf-cutoff Pass-through to QtlDataset(mafCutoff = ...) -# --xvar-cutoff Pass-through to QtlDataset(xvarCutoff = ...) -# --output Output RDS path +# Inputs: +# --study Study identifier (length-1 character). +# --genotype-prefix PLINK1 bed/bim/fam prefix (no extension). +# --phenotype-manifest Path to a TSV manifest. Required columns: +# ID trait identifier +# path bgzipped BED phenotype file +# cond context name +# Optional column: +# cov_path per-context covariate TSV +# One row per (region, context). Phenotype / +# covariate paths may be absolute or relative to +# the manifest's own directory. Multiple rows can +# share the same context; per-context the +# phenotype/cov path must agree. +# --genotype-covariates Optional TSV of genotype-derived covariates +# (e.g. ancestry PCs) applied uniformly across all +# contexts. Same shape as the per-context covariate +# files. +# --transpose-covariates When set, transposes every covariate TSV +# (phenotype + genotype) before treating it as +# samples-as-rows. Use this for QTLtools-format +# inputs where rows are covariate names and +# columns are samples. +# --maf-cutoff Pass-through MAF cutoff for QtlDataset(). +# --xvar-cutoff Pass-through variance cutoff for QtlDataset(). +# --output Output RDS path. suppressPackageStartupMessages({ library(argparser) @@ -35,44 +45,26 @@ parser <- arg_parser("Build a pecotmr QtlDataset for one study and save to RDS") parser <- add_argument(parser, "--study", help = "Study identifier", type = "character") parser <- add_argument(parser, "--genotype-prefix", - help = "PLINK2 pgen/pvar/psam prefix", + help = "PLINK1 bed/bim/fam prefix", type = "character") +parser <- add_argument(parser, "--phenotype-manifest", + help = "TSV manifest path (ID, path, cond, cov_path)", type = "character") -parser <- add_argument(parser, "--phenotype", - help = "Comma-joined CONTEXT=PATH phenotype BED files", - type = "character") -parser <- add_argument(parser, "--phenotype-covariates", - help = "Comma-joined CONTEXT=PATH PC TSV files", - type = "character", default = "") parser <- add_argument(parser, "--genotype-covariates", help = "TSV of uniformly-applied genotype PCs", type = "character", default = "") +parser <- add_argument(parser, "--transpose-covariates", + help = "Transpose covariate TSVs (QTLtools format)", + flag = TRUE) parser <- add_argument(parser, "--maf-cutoff", - help = "Pass-through MAF cutoff for QtlDataset()", + help = "QtlDataset() MAF cutoff", type = "numeric", default = 0) parser <- add_argument(parser, "--xvar-cutoff", - help = "Pass-through variance cutoff for QtlDataset()", + help = "QtlDataset() variance cutoff", type = "numeric", default = 0) parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) -# ----- CONTEXT=PATH list parser --------------------------------------------- -parse_kv <- function(s) { - if (!nzchar(s)) return(list()) - out <- list() - for (p in strsplit(s, ",", fixed = TRUE)[[1L]]) { - kv <- strsplit(p, "=", fixed = TRUE)[[1L]] - if (length(kv) != 2L) stop("Expected CONTEXT=PATH, got: ", p) - out[[trimws(kv[[1L]])]] <- trimws(kv[[2L]]) - } - out -} - -phenotype_files <- parse_kv(argv$phenotype) -pheno_cov_files <- parse_kv(argv$phenotype_covariates) -contexts <- names(phenotype_files) -if (length(contexts) == 0L) stop("--phenotype must list at least one context") - # ----- File readers --------------------------------------------------------- read_pheno_bed <- function(path) { read.table(gzfile(path), header = TRUE, sep = "\t", @@ -80,15 +72,68 @@ read_pheno_bed <- function(path) { comment.char = "") } -# Both --phenotype-covariates and --genotype-covariates files share this -# shape: TSV with a sample-id column followed by one column per PC. The -# returned matrix has samples as rows and PCs as columns. -read_pcs_tsv <- function(path) { - as.matrix(read.table(path, header = TRUE, sep = "\t", row.names = 1, - check.names = FALSE, comment.char = "")) +# Read a covariate TSV. Default expects samples-as-rows + PC columns. +# With transpose = TRUE, expects QTLtools-format input (covariates as rows, +# samples as columns) and transposes to samples-as-rows. +read_pcs_tsv <- function(path, transpose = FALSE) { + raw <- read.table(path, header = TRUE, sep = "\t", row.names = 1, + check.names = FALSE, comment.char = "") + m <- as.matrix(raw) + if (transpose) m <- t(m) + m +} + +# ----- Manifest ingestion --------------------------------------------------- +manifest <- read.table(argv$phenotype_manifest, header = TRUE, sep = "\t", + stringsAsFactors = FALSE, check.names = FALSE, + comment.char = "") +required <- c("ID", "path", "cond") +missingCols <- setdiff(required, names(manifest)) +if (length(missingCols) > 0L) { + stop("--phenotype-manifest missing required column(s): ", + paste(missingCols, collapse = ", ")) +} +has_cov_col <- "cov_path" %in% names(manifest) + +# Resolve manifest-relative paths against the manifest's own directory. +manifest_dir <- dirname(normalizePath(argv$phenotype_manifest)) +resolve_path <- function(p) { + if (is.na(p) || !nzchar(p)) return("") + if (startsWith(p, "/")) return(p) + file.path(manifest_dir, p) +} + +# Collapse to one (phenotype, cov_path) pair per context. Multiple rows per +# context (different traits) must agree on the file paths. +contexts <- unique(manifest$cond) +phenotype_files <- list() +pheno_cov_files <- list() +for (cx in contexts) { + sub <- manifest[manifest$cond == cx, , drop = FALSE] + paths_here <- unique(sub$path) + if (length(paths_here) > 1L) { + stop(sprintf( + "Context '%s' references multiple phenotype paths: %s.", + cx, paste(paths_here, collapse = ", "))) + } + phenotype_files[[cx]] <- paths_here[[1L]] + if (has_cov_col) { + covs_here <- unique(sub$cov_path[!is.na(sub$cov_path) & + nzchar(sub$cov_path)]) + if (length(covs_here) > 1L) { + stop(sprintf( + "Context '%s' references multiple cov_path values: %s.", + cx, paste(covs_here, collapse = ", "))) + } + pheno_cov_files[[cx]] <- if (length(covs_here) == 1L) covs_here[[1L]] + else "" + } else { + pheno_cov_files[[cx]] <- "" + } } -build_se <- function(bed_path, pcov_path) { +# ----- Per-context SummarizedExperiment builder ----------------------------- +build_se <- function(bed_path, pcov_path, transpose_cov) { bed <- read_pheno_bed(bed_path) chr_col <- intersect(c("#chr", "chrom", "chr"), names(bed))[1L] start_col <- intersect(c("start", "Start"), names(bed))[1L] @@ -113,7 +158,7 @@ build_se <- function(bed_path, pcov_path) { names(rr) <- meta$gene_id cd <- if (nzchar(pcov_path)) { - pcov <- read_pcs_tsv(pcov_path) + pcov <- read_pcs_tsv(pcov_path, transpose = transpose_cov) common <- intersect(rownames(pcov), colnames(expr)) if (length(common) == 0L) stop("No shared samples between phenotype and phenotype-covariate: ", @@ -130,18 +175,19 @@ build_se <- function(bed_path, pcov_path) { phenotypes <- setNames( lapply(contexts, function(cx) - build_se(phenotype_files[[cx]], - if (!is.null(pheno_cov_files[[cx]])) pheno_cov_files[[cx]] else "")), + build_se(resolve_path(phenotype_files[[cx]]), + resolve_path(pheno_cov_files[[cx]]), + argv$transpose_covariates)), contexts) -# ----- Genotype handle (PLINK2) + uniform genotype covariates --------------- -geno_handle <- GenotypeHandle(plink2Prefix = argv$genotype_prefix) +# ----- Genotype handle (PLINK1) + uniform genotype covariates --------------- +geno_handle <- GenotypeHandle(plink1Prefix = argv$genotype_prefix) geno_cov_path <- argv$genotype_covariates has_geno_cov <- nzchar(geno_cov_path) && geno_cov_path != "." && file.exists(geno_cov_path) genoCov <- if (has_geno_cov) { - read_pcs_tsv(geno_cov_path) + read_pcs_tsv(geno_cov_path, transpose = argv$transpose_covariates) } else { matrix(numeric(0), nrow = 0, ncol = 0) } @@ -157,5 +203,7 @@ qd <- QtlDataset( dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) saveRDS(qd, argv$output) -cat(sprintf("Wrote QtlDataset for study '%s' (%d contexts) to %s\n", - argv$study, length(phenotypes), argv$output)) +cat(sprintf("Wrote QtlDataset for study '%s' (%d contexts: %s) to %s\n", + argv$study, length(phenotypes), + paste(names(phenotypes), collapse = ", "), + argv$output)) diff --git a/code/script/pecotmr_integration/qtl_enrichment.R b/code/script/pecotmr_integration/qtl_enrichment.R new file mode 100644 index 00000000..aea5acfb --- /dev/null +++ b/code/script/pecotmr_integration/qtl_enrichment.R @@ -0,0 +1,68 @@ +#!/usr/bin/env Rscript +# qtl_enrichment.R +# +# xQTL-GWAS enrichment worker. Loads an S4 `QtlFineMappingResult` and +# an S4 `GwasFineMappingResult`, calls +# `pecotmr::qtlEnrichmentPipeline()`, and saves the resulting +# per-(gwasStudy, qtlContext) enrichment data.frame. The output feeds +# downstream into `coloc.R` (via colocPipeline's `enrichment` arg) as +# the prior-adjustment factor for colocalization. +# +# Inputs: +# --qtl-fine-mapping Path to S4 QtlFineMappingResult RDS +# (output of fine_mapping.ipynb / fineMappingPipeline) +# --gwas-fine-mapping Path to S4 GwasFineMappingResult RDS +# --num-gwas Number of GWAS variants per study (pass-through) +# --pi-qtl Optional QTL pi value (pass-through) +# --lambda Pass-through (default 1) +# --imp-n Pass-through (default 25) +# --ncore Number of threads (default 1) +# --output Output RDS path (per-pair enrichment data.frame) + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("xQTL-GWAS enrichment via qtlEnrichmentPipeline()") +parser <- add_argument(parser, "--qtl-fine-mapping", + help = "Path to S4 QtlFineMappingResult RDS", + type = "character") +parser <- add_argument(parser, "--gwas-fine-mapping", + help = "Path to S4 GwasFineMappingResult RDS", + type = "character") +parser <- add_argument(parser, "--num-gwas", + help = "Number of GWAS variants (per study; pass-through)", + type = "numeric", default = NA) +parser <- add_argument(parser, "--pi-qtl", + help = "Optional QTL pi value (pass-through)", + type = "numeric", default = NA) +parser <- add_argument(parser, "--lambda", + help = "Pass-through lambda", + type = "numeric", default = 1) +parser <- add_argument(parser, "--imp-n", + help = "Pass-through impN", + type = "integer", default = 25L) +parser <- add_argument(parser, "--ncore", + help = "Pass-through ncore", + type = "integer", default = 1L) +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +qtlFmr <- readRDS(argv$qtl_fine_mapping) +gwasFmr <- readRDS(argv$gwas_fine_mapping) + +res <- qtlEnrichmentPipeline( + gwasFineMappingResult = gwasFmr, + qtlFineMappingResult = qtlFmr, + numGwas = if (is.na(argv$num_gwas)) NULL else argv$num_gwas, + piQtl = if (is.na(argv$pi_qtl)) NULL else argv$pi_qtl, + lambda = argv$lambda, + impN = argv$imp_n, + numThreads = argv$ncore) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(res, argv$output) +cat(sprintf("Wrote qtlEnrichmentPipeline result (%d (gwasStudy, qtlContext) pair(s)) to %s\n", + nrow(res), argv$output)) diff --git a/code/script/pecotmr_integration/twas.R b/code/script/pecotmr_integration/twas.R new file mode 100644 index 00000000..bbffce9c --- /dev/null +++ b/code/script/pecotmr_integration/twas.R @@ -0,0 +1,62 @@ +#!/usr/bin/env Rscript +# twas.R +# +# Per-gene TWAS Z-score + Mendelian-Randomization worker. Loads a +# pre-computed per-gene TwasWeights RDS and a per-LD-block GwasSumStats +# RDS, optionally a matching per-gene FineMappingResult, and calls +# pecotmr::causalInferencePipeline(). +# +# Inputs: +# --twas-weights Per-gene TwasWeights RDS (output of twas_weights.ipynb) +# --gwas-sumstats Per-LD-block GwasSumStats RDS +# --fine-mapping-result Optional per-gene FineMappingResult RDS +# --mr-pip-cutoff Pass-through (default 0.5) +# --mr-method Pass-through; "ivwPerVariant" or "csAware". +# Default "ivwPerVariant". +# --output Output RDS path + +suppressPackageStartupMessages({ + library(argparser) + library(pecotmr) +}) + +parser <- arg_parser("Per-gene TWAS Z + MR via causalInferencePipeline()") +parser <- add_argument(parser, "--twas-weights", + help = "Per-gene TwasWeights RDS", + type = "character") +parser <- add_argument(parser, "--gwas-sumstats", + help = "Per-LD-block GwasSumStats RDS", + type = "character") +parser <- add_argument(parser, "--fine-mapping-result", + help = "Optional per-gene FineMappingResult RDS", + type = "character", default = "") +parser <- add_argument(parser, "--mr-pip-cutoff", + help = "Pass-through PIP cutoff for MR", + type = "numeric", default = 0.5) +parser <- add_argument(parser, "--mr-method", + help = "MR method: ivwPerVariant or csAware", + type = "character", default = "ivwPerVariant") +parser <- add_argument(parser, "--output", + help = "Output RDS path", type = "character") +argv <- parse_args(parser) + +tw <- readRDS(argv$twas_weights) +gss <- readRDS(argv$gwas_sumstats) + +fmr_path <- argv$fine_mapping_result +fmr <- if (nzchar(fmr_path) && fmr_path != "." && file.exists(fmr_path)) { + readRDS(fmr_path) +} else { + NULL +} + +res <- causalInferencePipeline( + gwasSumStats = gss, + twasWeights = tw, + fineMappingResult = fmr, + mrPipCutoff = argv$mr_pip_cutoff, + mrMethod = argv$mr_method) + +dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) +saveRDS(res, argv$output) +cat(sprintf("Wrote causalInferencePipeline result to %s\n", argv$output)) diff --git a/code/script/pecotmr_integration/twas_weights.R b/code/script/pecotmr_integration/twas_weights.R index 749149ce..4fe72878 100644 --- a/code/script/pecotmr_integration/twas_weights.R +++ b/code/script/pecotmr_integration/twas_weights.R @@ -1,36 +1,46 @@ #!/usr/bin/env Rscript # twas_weights.R # -# Per-gene TWAS-weights worker. Loads a pre-built pecotmr::QtlDataset RDS -# and calls twasWeightsPipeline() for a single trait. Designed to be -# invoked once per gene by the SoS fan-out in twas_weights.ipynb. +# Per-region TWAS-weights worker. Loads a pre-built pecotmr::QtlDataset +# RDS and calls twasWeightsPipeline() for either a single trait (gene +# mode) or a single genomic region (region mode). Designed to be +# invoked once per fan-out unit by the SoS step in twas_weights.ipynb. +# +# Modes (mutually exclusive): +# gene : --gene-id ENSG... --cis-window 1000000 +# region : --region chr22:15000000-16000000 # # Inputs: # --qtl-dataset Path to a QtlDataset RDS -# --gene-id Single trait identifier -# --cis-window cis-window in bp around the trait's TSS -# --fine-mapping-result Optional pre-fit FineMappingResult RDS; when -# supplied, SuSiE-family TWAS methods reuse the -# fits via the `fineMappingResult` cache instead -# of refitting. Pass an empty string ("") or +# --gene-id (gene mode) trait identifier +# --cis-window (gene mode) cis-window in bp +# --region (region mode) chr:start-end string +# --fine-mapping-result Optional pre-fit FineMappingResult RDS; +# SuSiE-family methods reuse the cached fits +# via the fineMappingResult cache. Pass "" or # "." to skip. # --output Output RDS path (one TwasWeights) suppressPackageStartupMessages({ library(argparser) library(pecotmr) + library(GenomicRanges) + library(IRanges) }) -parser <- arg_parser("Per-gene default-preset TWAS weights over a pre-built QtlDataset") +parser <- arg_parser("Per-gene or per-region default-preset TWAS weights over a pre-built QtlDataset") parser <- add_argument(parser, "--qtl-dataset", help = "Path to a QtlDataset RDS", type = "character") parser <- add_argument(parser, "--gene-id", - help = "Single trait identifier", - type = "character") + help = "Trait identifier (gene mode); mutually exclusive with --region", + type = "character", default = "") parser <- add_argument(parser, "--cis-window", - help = "cis-window in bp around the trait's TSS", + help = "cis-window in bp around the trait's TSS (gene mode)", type = "integer", default = 1000000L) +parser <- add_argument(parser, "--region", + help = "Genomic region as chr:start-end (region mode); mutually exclusive with --gene-id", + type = "character", default = "") parser <- add_argument(parser, "--fine-mapping-result", help = "Optional pre-fit FineMappingResult RDS", type = "character", default = "") @@ -38,6 +48,22 @@ parser <- add_argument(parser, "--output", help = "Output RDS path", type = "character") argv <- parse_args(parser) +parse_region <- function(s) { + m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]] + if (length(m) != 4L) + stop("--region must be in chr:start-end format (got: ", s, ")") + GRanges(seqnames = m[[2L]], + ranges = IRanges(start = as.integer(m[[3L]]), + end = as.integer(m[[4L]]))) +} + +has_gene <- nzchar(argv$gene_id) +has_region <- nzchar(argv$region) +if (has_gene && has_region) + stop("--gene-id and --region are mutually exclusive; pass exactly one.") +if (!has_gene && !has_region) + stop("Specify either --gene-id (with --cis-window) or --region.") + qd <- readRDS(argv$qtl_dataset) fmr_path <- argv$fine_mapping_result @@ -47,14 +73,25 @@ fmr <- if (nzchar(fmr_path) && fmr_path != "." && file.exists(fmr_path)) { NULL } -res <- twasWeightsPipeline( - qd, - methods = "default", - traitId = argv$gene_id, - cisWindow = argv$cis_window, - fineMappingResult = fmr) +res <- if (has_region) { + twasWeightsPipeline( + qd, + methods = NULL, + region = parse_region(argv$region), + cisWindow = argv$cis_window, + fineMappingResult = fmr) +} else { + twasWeightsPipeline( + qd, + methods = NULL, + traitId = argv$gene_id, + cisWindow = argv$cis_window, + fineMappingResult = fmr) +} dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE) saveRDS(res, argv$output) -cat(sprintf("Wrote TWAS weights for gene '%s' (%d row(s)) to %s\n", - argv$gene_id, nrow(res), argv$output)) +cat(sprintf("Wrote TWAS weights for %s (%d row(s)) to %s\n", + if (has_region) paste0("region '", argv$region, "'") + else paste0("gene '", argv$gene_id, "'"), + nrow(res), argv$output)) diff --git a/dev/pecotmr_migration_survey.md b/dev/pecotmr_migration_survey.md new file mode 100644 index 00000000..c8ace96f --- /dev/null +++ b/dev/pecotmr_migration_survey.md @@ -0,0 +1,87 @@ +# pecotmr API migration — survey of legacy callers in xqtl-protocol + +MWE inspected: `/Users/danielnachun/Downloads/fungen_xqtl/xqtl-protocol`. +Out of scope for this round: the four new wrappers under `code/SoS/pecotmr_integration/` (`qtl_dataset`, `fine_mapping`, `twas_weights`, `colocboost`). + +## Notebooks examined + +User-specified subset: + +| Notebook | SoS steps with pecotmr work | Migration shape | +|---|---|---| +| `code/SoS/association_scan/quantile_models/qr_and_twas.ipynb` | `[quantile_qtl_twas_weight]` — inline R | new wrapper | +| `code/SoS/mnm_analysis/mnm_methods/mnm_regression.ipynb` | `[susie_twas]` only (per user directive) — invokes `code/script/mnm_analysis/mnm_methods/susie_twas.R` | replace step + script | +| `code/SoS/mnm_analysis/mnm_methods/colocboost.ipynb` | `[colocboost]` — inline R, calls `library(colocboost)` + `library(pecotmr)` (plus the deprecated multi-task loader pattern via the SoS `[get_analysis_regions]` step) | replace with new colocboost.ipynb pattern | +| `code/SoS/mnm_analysis/mnm_methods/rss_analysis.ipynb` | `[univariate_rss]` — invokes `code/script/pecotmr_integration/univariate_rss.R` | replace script + step | +| `code/SoS/pecotmr_integration/twas_ctwas.ipynb` | `[twas]`, `[ctwas_1]`, `[ctwas_2]`, `[ctwas_3]`, `[quantile_twas]` — all inline R | refactor; needs design | +| `code/SoS/pecotmr_integration/SuSiE_enloc.ipynb` | `[xqtl_gwas_enrichment]`, `[susie_coloc]` — both inline R | refactor; needs design | +| `code/SoS/pecotmr_integration/intact.ipynb` | `[intact]` — inline R (uses `fastenloc`, `run_intact`, etc.) | might be downstream-only (no pecotmr fitting) — needs check | + +## Deprecated pecotmr symbols referenced per step + +(Searched both snake_case and camelCase forms; both are in pecotmr's `R/deprecated.R`.) + +- `qr_and_twas.ipynb [quantile_qtl_twas_weight]`: **`load_regional_association_data`** + quantile-regression-specific helpers. +- `mnm_regression.ipynb [susie_twas]` → `susie_twas.R`: **`load_regional_univariate_data`**, **`univariate_analysis_pipeline`**. The legacy script is ~567 lines and wraps both data loading and pipeline fitting into one CLI. +- `mnm_regression.ipynb [ensemble_twas_weight]`: **`ensemble_weights`** (+ `load_regional_univariate_data` for resume-cache reads). (Out of scope per user directive but logged here for context.) +- `colocboost.ipynb [colocboost]`: uses `library(colocboost)` and `library(pecotmr)`; relies on the SoS `[get_analysis_regions]` upstream which uses the deprecated multi-task data-loader convention (manifest-driven). The R block itself likely calls `pecotmr` helpers for data assembly and the `colocboost::colocboost()` solver directly. +- `rss_analysis.ipynb [univariate_rss]` → `univariate_rss.R`: **`load_LD_matrix`** (deprecated → `loadLdMatrix`). The script is ~265 lines. +- `twas_ctwas.ipynb`: + - `[twas]` — **`twas_pipeline`**, **`load_twas_weights`**, **`batch_load_twas_weights`**, **`twas_z`** + various `mr_*` MR helpers. + - `[ctwas_1]` — **`get_ctwas_meta_data`**, **`harmonize_gwas`**, **`load_LD_matrix`**, **`twas_analysis`**, plus internal `ctwas_*` data structures. + - `[ctwas_2]` — internal `ctwas_param_*` flow. + - `[ctwas_3]` — **`ctwas_bimfile_loader`**, **`get_ctwas_meta_data`**, plus `ctwas_*` plotting/postprocess helpers. + - `[quantile_twas]` — `twas_pipeline`, `mr_result`. +- `SuSiE_enloc.ipynb`: + - `[xqtl_gwas_enrichment]` — **`xqtl_enrichment_wrapper`** (deprecated → `qtlEnrichmentPipeline`). + - `[susie_coloc]` — calls `pecotmr:::get_nested_element` (internal helper) on a per-context fine-mapped RDS; the per-context coloc loop is built around `library(coloc)` directly. +- `intact.ipynb [intact]` — `fastenloc`, `run_intact`, `intact_pip`, `intact_res`. No direct pecotmr-pipeline call observed; mostly a downstream synthesizer. Needs deeper check before concluding. + +## MWE data conventions discovered + +- **Genotype**: BOTH PLINK1 (e.g. `input/colocboost/example.chr22.bed/.bim/.fam`) AND VCF / bgzipped formats (`input/genotype/protocol_example.genotype.chr*.vcf.gz`, `*.bgz`). **No PLINK2 (pgen/pvar/psam) files present** in the MWE. +- **Phenotype BED**: standard QTLtools format with `#chr, start, end, ID, SAMPLE_001, SAMPLE_002, ...`. Matches what our new `qtl_dataset_construct.R` parses. +- **Covariate file**: **QTLtools convention** — `#id` header + rows like `sex`, `age`, `PC1`, ... with sample columns. **This is the opposite orientation from what the new wrapper expects** (we updated the script to take samples-as-rows since the user described `--phenotype-covariates` as "molecular trait PCs, same shape as genotype PCs"). +- **Per-region manifest TSVs** (`fine_mapping_meta.tsv`, `pheno_manifest_multicontext.tsv`): columns `#chr, start, end, ID, path, cond, cov_path` — one row per (region, context). The same `cov_path` is reused across contexts in the MWE examples. +- **xQTL meta TSVs** (TWAS / cTWAS / enloc inputs): point at pre-computed per-(study, gene) `*.univariate_susie_twas_weights.rds` files. These workflows do NOT recompute fits; they consume the pre-built RDS files. +- **GWAS inputs** (`input/rss_analysis/`, `input/twas/`): tabix-indexed `.tsv.gz` plus a YAML column-mapping file. + +## Migration sketch by notebook + +1. **`mnm_regression.ipynb` `[susie_twas]` step**: delete the step and `susie_twas.R`; replace with the existing `fine_mapping.ipynb` + `twas_weights.ipynb` against a pre-built QtlDataset RDS. +2. **`rss_analysis.ipynb`**: new wrapper `gwas_finemap.R` that takes a QC'd `GwasSumStats` RDS and calls `fineMappingPipeline(methods = "susie")`. The legacy `[univariate_rss]` step's inputs (sumstats TSV + LD meta) feed a prior `summaryStatsQc` step that builds the GwasSumStats RDS. +3. **`colocboost.ipynb`** (mnm_analysis): the existing notebook covers individual-level cross-context coloc with `colocboost::colocboost()` invoked through pecotmr's old data assembly. The new `pecotmr_integration/colocboost.ipynb` we wrote covers the QtlDataset focal-gene case; multi-condition / GWAS-jointed colocboost can be added similarly via `colocboostPipeline(jointGwas = TRUE, ...)`. +4. **`qr_and_twas.ipynb`**: quantile-regression TWAS. Need to confirm whether pecotmr's S4 refactor exposes a new entry point for quantile TWAS or whether the legacy `load_regional_association_data` + a `quantile_*` worker is still the only path. +5. **`twas_ctwas.ipynb`**: maps to the new `ctwasPipeline` + `causalInferencePipeline` in pecotmr. Each step (`twas`, `ctwas_1/2/3`, `quantile_twas`) becomes a thin wrapper that takes the pre-computed TwasWeights RDS + GWAS RDS + LD meta and calls one pipeline function. The data flow stays the same; only the inline R block is replaced. +6. **`SuSiE_enloc.ipynb`**: + - `[xqtl_gwas_enrichment]` → `qtlEnrichmentPipeline(gwasFineMappingResult, qtlFineMappingResult)`. Input meta TSV stays the same; the wrapper just calls the new pipeline. + - `[susie_coloc]` → `colocPipeline(qtlFineMappingResult, gwasFineMappingResult)` or similar. The new pipeline handles the per-(study, region) coloc loop internally; no need to hand-write the loop in the SoS notebook. +7. **`intact.ipynb`**: downstream of TWAS + enloc; uses `fastenloc` and `run_intact`. These look like INTACT-specific functions, not pecotmr S4 pipeline targets. Likely the inline R only consumes RDS outputs from earlier steps and a thin wrapper that calls the same `run_intact` is enough. Confirm whether `run_intact` is a pecotmr export or an INTACT-package function. + +## Status update (2026-06-22) + +**colocboost extension done.** `colocboost.R` + `colocboost.ipynb` now accept `--gwas-sumstats ` plus the three pipeline-variant flags (`--no-xqtl-coloc`, `--joint-gwas`, `--separate-gwas`), with the matching SoS parameter validations. xqtl-only regression and the GWAS-variant validations all smoke-tested against the MWE. Joint/separate-gwas end-to-end testing still needs a QC'd GwasSumStats RDS. + +**twas_ctwas.ipynb migration — wrappers + notebooks written.** Three new pairs under `code/SoS/pecotmr_integration/` + `code/script/pecotmr_integration/`: + +- `gwas_sumstats_construct.R` + `gwas_sumstats.ipynb` — per-LD-block GwasSumStats builder. Reads a GWAS TSV, subsets to the block, resolves the LD-meta row, builds a PLINK1/PLINK2/GDS/VCF `GenotypeHandle`, calls `GwasSumStats(...)`. Has a `--skip-qc` flag escape hatch because `summaryStatsQc()` triggers MungeSumstats which requires the `SNPlocs.Hsapiens.dbSNP155.GRCh38` BioConductor package. **Smoke-tested against the MWE** — produces a valid `GwasSumStats` over chr22:10516173-17414263 (112 variants) with the PLINK2 LD sketch attached. +- `twas.R` + `twas.ipynb` — per-gene `causalInferencePipeline`. Notebook is now manifest-driven (rows: `gene_id`, `twas_weights_rds`, `gwas_sumstats_rds`, optional `fine_mapping_result_rds`), matching the `ctwas.ipynb` pattern. Per-row fan-out, one `twas.R` call per gene. Output: `{cwd}/{study}.{gene_id}.twas.rds`. +- `fine_mapping.R` + `fine_mapping.ipynb` — supports BOTH QTL (`fine_mapping` workflow, `--qtl-dataset` + `--genes`/`--regions` fan-out) AND GWAS (`fine_mapping_gwas` workflow, `--gwas-sumstats-list ...` per-block fan-out). One wrapper script dispatches on which S4 input was supplied; `pecotmr::fineMappingPipeline` dispatches on input class (QtlDataset / GwasSumStats). Smoked end-to-end on the MWE: produces `QtlFineMappingResult` for ENSG00000130538 and `GwasFineMappingResult` for all 20 chr22 LD blocks. +- `ctwas_assemble.R` + `ctwas_est.R` + `ctwas_finemap.R` + `ctwas.ipynb` — three-step cTWAS pipeline mirroring the legacy `[ctwas_1/2/3]`. Each step calls one of pecotmr's exported sub-pipelines (`assembleCtwasInputs` / `estCtwasParam` / `screenCtwasRegions` + `finemapCtwasRegions`). The SoS notebook chains them via intermediate RDS files so analysts can re-run only the screen/finemap step when tuning knobs. + +**Bugs surfaced in installed pecotmr 0.5.3 during testing**: +1. `.twasNormalizeMethods("default")` echoes the literal `"default"` as the only token (skips the `methodList`-based expansion that the `is.null` branch uses). Workaround: wrappers pass `methods = NULL`. +2. `pecotmr:::getRegionalLdMeta` returns the LD-meta file path as the *value* with the data path as the *name* of a single-element character vector, so `GenotypeHandle(ldMeta=…, region=…)` ends up checking the meta file's `.tsv` extension instead of the genotype payload's. Workaround: `gwas_sumstats_construct.R` resolves LD-meta rows directly and calls `GenotypeHandle(plink2Prefix=…)` (or the matching constructor) itself. +3. `cisWindow` is required in `fineMappingPipeline(... region = ...)` despite the doc saying it's required only with `traitId`. Wrappers pass `cisWindow` in both modes. + +## Outstanding questions for the user + +1. **Covariate orientation**: MWE files are QTLtools format (covariates × samples). Our new `qtl_dataset_construct.R` expects samples × PCs (transposed). Options: (a) transpose inside the wrapper, (b) make orientation a flag, (c) keep current and require pre-transposition. Which? +2. **Genotype format**: MWE has PLINK1 + VCF, no PLINK2. Do you want to (a) convert MWE inputs to PLINK2 once, (b) extend `qtl_dataset_construct.R` to accept PLINK1 / VCF, or (c) leave the wrapper PLINK2-only? +3. **Meta TSV consumption**: the legacy workflow ingests per-region manifest TSVs (`fine_mapping_meta.tsv`) and expands them into per-region data dicts via `[get_analysis_regions]`. Under the new design, where does the meta TSV live? Options: (a) supply CLI args directly to `qtl_dataset_construct.R` (current), (b) add a CLI flag that reads a meta TSV and builds the args internally, (c) leave the TSV→args translation as a one-off pre-step. +4. **`--region` vs `--window`**: legacy CLI distinguished a per-gene fine-mapping region from a larger association window for variant selection. The new wrappers only expose `--cis-window` (per-trait). Confirm this consolidation is intentional and the wider association-window concept is gone. +5. **Quantile TWAS**: does pecotmr's new S4 API have a quantile-TWAS pipeline (`quantileTwasPipeline` or similar), or is quantile TWAS still bolted on top of the legacy `load_regional_association_data` flow? +6. **Pre-computed TWAS weights inputs**: `twas_ctwas` / `SuSiE_enloc` / `intact` all consume pre-computed `*.univariate_susie_twas_weights.rds` files. After the migration of `twas_weights.ipynb`, the RDS schema is the new `TwasWeights` S4 collection — which should be a drop-in replacement everywhere the old `.rds` was consumed via `loadTwasWeights` / `batchLoadTwasWeights`. Confirm: are we OK requiring downstream notebooks to take new-format `TwasWeights` RDS files, or do we need a back-compat shim? +7. **`ensemble_weights` (mnm_regression `[ensemble_twas_weight]`)**: still wants to be a callable. Has it been moved to a camelCase `ensembleWeights` (still exported), or removed in the S4 refactor? Affects whether the ensemble step is a 1-line wrapper or needs reimplementation. +8. **Scope confirmation**: per user directive, `mnm_regression.ipynb` is treated as just `[susie_twas]` for this round; `[ensemble_twas_weight]`, `[mnm]`, `[mnm_genes]`, `[fsusie]`, `[mvfsusie]` are deferred. Confirm. +9. **`intact.ipynb`**: does the `intact` step call into pecotmr (deprecated or otherwise) or is it purely the INTACT R package + RDS file munging? Determines whether this notebook needs migration at all. diff --git a/dev/python-replacement-scope.md b/dev/python-replacement-scope.md new file mode 100644 index 00000000..e22ea5fe --- /dev/null +++ b/dev/python-replacement-scope.md @@ -0,0 +1,174 @@ +# Python replacement scope + +Inventory of all Python code in xqtl-protocol, categorized by what we plan to +do with it. Companion to the pecotmr refactor (see +`StatFunGen/pecotmr/dev/refactor-design.md`) — most of the "internal Python" +absorption happens inside pecotmr constructors and pipelines after that +refactor lands. + +## Out of scope + +The following Python is **not** considered for replacement and is excluded +from every category below. + +- **SoS framework scaffolding** (~3,065 LOC across all SoS notebooks): the + `[step_name]` headers, `parameter:`, `input:`, `output:`, `task:`, and + `depends:` directives, plus action-block headers (`R:`, `bash:`, + `python:`, etc.). These use Python syntax but are SoS framework + declarations, not Python you write. Stays implicitly because SoS itself + stays. +- **Snakemake** (`code/snakemake/` directory in its entirety, including + `compat/python/sitecustomize.py`, `Snakefile`, and `rules/*`). Out of + scope for this work. + +## Categories + +There are four buckets of Python in xqtl-protocol. Categories 1 contains +code we keep; categories 2–4 contain code we plan to replace. + +### Category 1 — Vendored Python scripts and external tools (KEEP) + +Vendored scripts and external tool wrappers we do not plan to replace. +These are either copies of upstream tools or wrappers around large +external libraries (ML / GPU frameworks) with no clean R equivalent. They +are invoked from the workflow layer via shell-out (`bash:` blocks or +`subprocess`), so the rest of the codebase doesn't care about their +language. + +| File | LOC | Origin / purpose | +|---|---|---| +| `code/SoS/molecular_phenotypes/calling/apa/DaPars_Extract_Anno.py` | 152 | Vendored from the [DaPars](https://github.com/ZhengXia/DaPars) APA-calling project. Extracts 3'UTR annotation from gene BED + symbol map. **Python 2.** | +| `code/SoS/molecular_phenotypes/calling/apa/Dapars2_Multi_Sample.py` | 448 | Vendored DaPars2 multi-sample driver. Same project family as `DaPars_Extract_Anno.py`. **Python 2.** | +| `code/SoS/molecular_phenotypes/calling/apa/gtf2bed12.py` | 209 | Vendored GTF→BED12 converter (Xudong Zou, 2021). Preprocessing step for APA workflows. **Python 2.** | +| `code/SoS/xqtl_modifier_score/gems_pipeline.py` | 678 | GEMS (Generalized Expression Modifier Scores) ML pipeline. Uses sklearn, torch, xgboost, optuna. Self-contained CLI with `train` and `predict` subcommands; invoked from `ems_training.ipynb` / `ems_prediction.ipynb` via `bash:` blocks. | +| `code/script/association_scan/TensorQTL/TensorQTL.py` | 693 | Wrapper around the `tensorqtl` Python library for QTL association testing (cis/trans, nominal + permutation). | +| **Total** | **2,180** | | + +### Category 2 — SoS notebooks with step-body Python (REPLACE) + +Notebooks running under the SoS kernel that contain Python written +**outside** the `R:` / `bash:` action blocks. This includes both: + +- Step-body Python at the SoS step level (manifest assembly, region + parsing, file-path resolution, conditional logic, output-filename + templating) +- `python:` / `python3:` action block bodies (Python executed as the + step's action) + +After the pecotmr refactor, the data-construction slice of this Python +(region parsing, GWAS-meta TSV reading, manifest assembly, per-trait / +per-context path resolution) is absorbed into pecotmr constructors. The +remaining step-body Python — per-job argument shaping, output-filename +templating, control flow — is what we want to replace with bash + R. + +**82 SoS notebooks have some step-body Python.** The 16 heavy hitters +(≥100 LOC) are where the bulk of the work lives: + +| Notebook | Step-body LOC | +|---|---| +| `code/SoS/pecotmr_integration/twas_ctwas.ipynb` | 673 | +| `code/SoS/mnm_analysis/mnm_methods/colocboost.ipynb` | 589 | +| `code/SoS/mnm_analysis/mnm_methods/mnm_regression.ipynb` | 389 | +| `code/SoS/commands_generator/eQTL_analysis_commands.ipynb` | 267 | +| `code/SoS/mnm_analysis/mnm_postprocessing.ipynb` | 247 | +| `code/SoS/association_scan/quantile_models/qr_and_twas.ipynb` | 243 | +| `code/SoS/molecular_phenotypes/QC/pseudobulk_expression_aggregation_QC_norm.ipynb` | 243 *(mixed-kernel — see Category 3)* | +| `code/SoS/pecotmr_integration/SuSiE_enloc.ipynb` | 239 | +| `code/SoS/enrichment/sldsc_enrichment.ipynb` | 211 | +| `code/SoS/molecular_phenotypes/calling/RNA_calling.ipynb` | 182 | +| `code/SoS/data_preprocessing/genotype/genotype_formatting.ipynb` | 151 | +| `code/SoS/data_preprocessing/genotype/GWAS_QC.ipynb` | 137 | +| `code/SoS/molecular_phenotypes/QC/pseudobulk_mega_expression_QC_and_normalization.ipynb` | 130 *(mixed-kernel — see Category 3)* | +| `code/SoS/mnm_analysis/mnm_methods/rss_analysis.ipynb` | 129 | +| `code/SoS/association_scan/TensorQTL/TensorQTL.ipynb` | 112 | +| `code/SoS/data_preprocessing/genotype/PCA.ipynb` | 104 | + +**32 more notebooks have 20–99 LOC of step-body Python** (medium size; mostly +smaller workflows and graveyard files). **34 more have <20 LOC** (incidental +and cleanable in passing). + +Aggregate Python in SoS code cells across all SoS notebooks: + +- **5,655 LOC** of step-body Python (outside action blocks) +- **1,011 LOC** of `python:` action block bodies +- **106 LOC** of helper `def`/`class` definitions + +Total Category 2 LOC: **~6,772**. + +### Category 3 — Notebooks with Python kernels (REPLACE) + +Notebooks whose notebook-level `kernelspec` is Python (i.e., not SoS +workflows — standalone Python notebooks). These are mostly analysis, +plotting, and ML-driver notebooks. + +| Notebook | Code LOC | Notes | +|---|---|---| +| `code/SoS/cv2f/notebooks/analyze_selection_criteria.ipynb` | 738 | Analysis of cross-validation selection in fine-mapping | +| `code/SoS/reference_data/rss_ld_sketch.ipynb` | 419 | **Mixed-kernel.** Notebook-level kernel is Python, but most cells are R-implicit (R code in cells with no per-cell kernel). True Python content is small. | +| `code/SoS/molecular_phenotypes/QC/pseudobulk_expression_aggregation_QC_norm.ipynb` | 282 | **Mixed-kernel.** Also appears in Category 2 with 243 LOC of SoS step-body Python. | +| `code/SoS/molecular_phenotypes/QC/pseudobulk_mega_expression_QC_and_normalization.ipynb` | 149 | **Mixed-kernel.** Also appears in Category 2 with 130 LOC of SoS step-body Python. | +| `code/SoS/reference_data/ld_reference_generation.ipynb` | 146 | LD reference data prep | +| `code/SoS/xqtl_modifier_score/ems_prediction.ipynb` | 117 | Driver for `gems_pipeline.py predict` (Category 1). Mostly shell-out invocations. | +| `code/SoS/xqtl_modifier_score/ems_training.ipynb` | 76 | Driver for `gems_pipeline.py train`. Mostly shell-out invocations. | +| `code/SoS/mnm_analysis/mnm_miniprotocol.ipynb` | 62 | Mini-protocol tutorial | +| `code/SoS/mnm_analysis/multivariate_multigene_fine_mapping_vignette.ipynb` | 21 | Vignette | +| `code/SoS/mnm_analysis/multivariate_fine_mapping_vignette.ipynb` | 21 | Vignette | +| `code/SoS/mnm_analysis/univariate_fine_mapping_fsusie_vignette.ipynb` | 11 | Vignette | +| `code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb` | 9 | Vignette | +| `code/SoS/pecotmr_integration/twas_vignette.ipynb` | 0 | Markdown-only | +| **Total** | **~2,051** | | + +Two additional notes: + +- `website/nature_protocol/conversion_notebook.ipynb` (516 LOC) is a + Python-kernel notebook used for the protocol website's doc generation. + Listed for completeness but is not analysis code; treat separately if / + when website automation gets touched. +- The three **mixed-kernel notebooks** (`rss_ld_sketch.ipynb`, + `pseudobulk_expression_aggregation_QC_norm.ipynb`, + `pseudobulk_mega_expression_QC_and_normalization.ipynb`) appear in both + Category 2 (their SoS-cell step-body Python) and Category 3 (their + notebook-level Python kernel). When tackling these, the work spans + both flavors. + +### Category 4 — Standalone Python scripts to replace (REPLACE) + +`.py` files outside of the vendored / external set, mostly under +`code/script/`. These are real production code that wraps existing R-side +libraries (edgeR, limma, tabix, plink) or does data manipulation that +ports cleanly to R or bash. + +| File | LOC | What it does | Replacement target | +|---|---|---|---| +| `code/script/association_scan/TensorQTL/TensorQTL.py` | 693 | *(Category 1 — kept as-is; listed here only to flag that we are not replacing it)* | — | +| `code/script/data_preprocessing/phenotype/gene_annotation.py` | 589 | Maps leafcutter / psichomics splicing clusters and isoforms to genes. Pandas-heavy. | R | +| `code/script/data_preprocessing/phenotype/phenotype_formatting.py` | 325 | Phenotype file format conversions (BED, tabix prep). | Bash + R | +| `code/script/molecular_phenotypes/calling/RNA_calling.py` | 316 | Merges STAR / RSEM GCT outputs into a combined expression matrix. | R | +| `code/script/molecular_phenotypes/QC/bulk_expression_normalization.py` | 256 | TMM / CPM / voom / quantile normalization. Already wraps edgeR / limma via rpy2. | Native R port | +| `code/script/data_preprocessing/genotype/genotype_formatting.py` | 247 | VCF / PLINK format conversions and splitting. Mostly `subprocess` calls. | Bash | +| **Total to replace** | **~1,733** | | | + +(Note: `TensorQTL.py` is shown in the table for completeness but is part of +Category 1 — we are explicitly not replacing it. Excluding it, the +replacement target in Category 4 is ~1,733 LOC.) + +## Summary + +| Category | Disposition | Approx. LOC | File / notebook count | +|---|---|---|---| +| 1. Vendored Python / external tools | Keep | 2,180 | 5 files | +| 2. SoS notebooks with step-body Python | Replace | 6,772 | 82 notebooks (16 heavy hitters) | +| 3. Notebooks with Python kernels | Replace | 2,051 | 13 notebooks (3 mixed-kernel) | +| 4. Standalone Python scripts (excl. TensorQTL) | Replace | 1,733 | 5 files | +| **Replacement target** | | **10,556** | | +| **Out of scope** (SoS scaffolding) | Implicit keep | ~3,065 | — | +| **Out of scope** (Snakemake) | Excluded | ~125+ | `code/snakemake/` tree | + +After the pecotmr refactor absorbs the data-construction slice of Category 2 +(rough estimate: 2,000–3,000 LOC of region parsing / manifest assembly / +GWAS-meta TSV reading), the remaining replacement target is approximately +**7,500–8,500 LOC** of step-body orchestration Python, Python-kernel analysis +notebooks, and standalone scripts. Roughly 80% of this is straightforwardly +portable to R or bash; the other 20% is per-step orchestration that needs to +be rewritten in the SoS notebooks (as inline bash + R, replacing the +imperative Python that's there today). diff --git a/dev/s4_migration_plan.md b/dev/s4_migration_plan.md new file mode 100644 index 00000000..ee82e25e --- /dev/null +++ b/dev/s4_migration_plan.md @@ -0,0 +1,345 @@ +# S4 migration plan: pecotmr ↔ xqtl-protocol + +Working document. Captures the design conversation from 2026-06-13/15 covering the +S4 migration of pecotmr, the conversion of xqtl-protocol to thin wrappers, and +the architectural constraints we agreed on. + +## Constraints + +1. **xqtl-protocol R code must never use `@` or `slot()`** on a pecotmr object. Only + accessor functions (`getX()`) are allowed. +2. **xqtl-protocol R code should rarely use any accessor either.** A per-region + script should be: parse args → loader → pipeline → save. If a script needs to + call more than a couple of accessors, pecotmr should encapsulate that work. +3. **Every loader returns an S4** of an appropriate class. +4. **Every per-region pipeline accepts that S4 directly** via S4 method dispatch. + The per-condition loop lives **inside pecotmr**, never in xqtl-protocol. +5. **pecotmr's own code follows the Bioconductor rule**: `@slot` is allowed only + inside accessor method bodies (`setMethod("getX", "Class", ...)`), in + `new(...)` / constructor bodies, in `slot<-` assignment, and in validity + functions. Everywhere else, slot access must go through accessors. +6. Genome-aggregating analyses (mash, S-LDSC) stay two-stage. That is a method- + level fact, not a smell. + +## Where we are today + +### Snake → camel rename (done) + +68 mechanical function-name renames applied across `code/SoS/` notebooks and +`code/script/` R files. Scoped to `name(` and `pecotmr::name` patterns inside R +contexts only; skipped Python contexts in SoS notebooks; skipped local-collision +cases (`filter_relatedness` in GWAS_QC.R, `meta_analysis_per_cell` / +`update_mash_model_cov` in mash_posterior.ipynb). + +### Mechanical S4 access conversion (done) + +63 substitutions across 7 files (`fsusie.R`, `susie_twas.R`, `mnm.R`, +`mnm_regression.ipynb`, `qr_and_twas.ipynb`, `twas_ctwas.ipynb`, +`rss_ld_sketch.ipynb`): + +- `fdat$residual_X[[r]]` → `getResidualX(fdat, r)` +- `fdat$residual_Y[[r]]` → `getResidualY(fdat, r)` +- `fdat$residual_X_scalar[[r]]` → `getResidualXScalar(fdat, r)` +- `fdat$residual_Y_scalar[[r]]` → `getResidualYScalar(fdat, r)` +- `fdat$X_variance[[r]]` → `getXVariance(fdat, r)` +- `fdat$X` → `getGenotypeMatrix(fdat)` (both RegionalData and Multivariate) +- `fdat$Y[[r]]` on RegionalData → `getPhenotypes(fdat)[[r]]` +- `fdat$Y` on MultivariateRegionalData → `getY(fdat)` +- `fdat$maf[[r]]` → `getMaf(fdat)[[r]]` (will work once getMaf for RegionalData lands) +- `fdat$covar[[r]]` → `getCovariates(fdat)[[r]]` +- `length/names/seq_along(fdat$residual_Y)` → `length/names/seq_along(getPhenotypes(fdat))` + (count- and name-equivalent by construction; not "iterate residuals" semantically, + but functionally correct) +- LdData: `$LD_variants` → `getVariantIds`, `$ref_panel` → `getRefPanel`, + `$LD_matrix` → `getCorrelation` or `getGenotypes` (depending on `returnGenotype` + flag at the call site) + +**Reassignments skipped** (would have produced invalid R syntax): +- `fdat$residual_X[[r]] <- NA` and `fdat$residual_Y[[r]] <- NA` (intent was + memory-clearing; moot under lazy residual computation — should be deleted in + cleanup pass). +- `LD_list$LD_matrix <- LD_list$LD_matrix[-dup_idx, -dup_idx]` (the whole + dedup block should be deleted — pecotmr deduplicates internally now). + +**Zero `@` introduced** in any of the 63 substitutions. + +### Pending S4 access points + +30 remaining; after subtracting false positives (`opt$maf` is optparse, not +pecotmr; `qr_results$maf` is a local list), the real remaining work is ~22 sites +in four categories: + +1. **Slot renames** (~12 sites): `fdat$dropped_samples` → `@droppedSamples`; + `fdat$Y_coordinates` / `phenotype_coordiates` → `@coordinates`. These cannot + be mechanically rewritten without accessors — see Phase 0 below. +2. **Reassignments** (~5 sites): delete the lines. +3. **Wrong access on Multivariate** (~2 sites): `$residual_Y` on + `MultivariateRegionalData` → `getY()` (no residualization on that class). +4. **Judgment calls** (~3 sites): `$X_data`, `$X_variance` without index, + `$residual_Y` matrix indexing in `mnm_regression.ipynb`. Need surrounding code + review. + +### Architectural realization + +`fdat` is purely a variable name in xqtl-protocol — it doesn't appear in +pecotmr. With S4, the type is exposed by the loader name and the class name; the +variable name is doing zero work. + +More importantly, **xqtl-protocol is reaching into low-level pecotmr internals**. +The current shape of `fsusie.R` / `susie_twas.R` / `mnm.R` is a per-condition +loop that unpacks `fdat` into pieces and feeds them to pipeline functions that +were designed for the old list-based API. Two smoking guns: + +1. `susie_post_processor` (called from fsusie.R) was **removed from pecotmr in + March 2024** (commit d85aca2). It doesn't exist anywhere — the script would + fail at runtime. The replacement (`postprocessFinemappingFits`) is meant to be + called *inside* `univariateAnalysisPipeline`, not by external callers. +2. `univariateAnalysisPipeline`, `multivariateAnalysisPipeline`, + `twasWeightsPipeline`, `fsusieWrapper` all still take loose pieces (X, Y, + maf, ...). With S4, this signature is a smell: the caller is forced to ask + the object for its slots and feed them back. + +## Target architecture + +### pecotmr public API: S4 in, structured out + +Each per-region pipeline accepts the loader's S4 directly. No loose-piece public +signatures. Loose-piece logic survives as private helpers. + +| Pipeline | Public signature | Returns | +|---|---|---| +| `univariateAnalysisPipeline` | `(regional = RegionalData, ...)` | named list per condition | +| `multivariateAnalysisPipeline` | `(regional = MultivariateRegionalData, ...)` | single result | +| `twasWeightsPipeline` | `(regional = RegionalData, ...)` | named list of per-condition `TwasWeights` | +| `fsusieWrapper` | `(regional = RegionalData, ...)` | named list of per-condition fSuSiE results | +| `functionalAnalysisPipeline` *(new)* | `(regional = RegionalData, ...)` | named list of per-condition composites (top-PC + SuSiE-on-PC + TWAS + fSuSiE) | +| `quantileAnalysisPipeline` *(new)* | `(regional = RegionalData, ...)` | named list of per-condition results | +| `colocboostPipeline` | `(regional = MultitaskRegionalData, focalTrait, ...)` | single result | +| `rssAnalysisPipeline` | unchanged — already takes `LdData` | unchanged | +| `susieRssPipeline` | `(qcResult = QcResult, ldData = LdData, ...)` (tighten from loose) | single result | +| `twasPipeline` (cTWAS) | unchanged — per-LD-block by design | unchanged | +| `mashPipeline`, `sldscPostprocessingPipeline` | unchanged — genome-scope by design | unchanged | + +### Per-gene SoS parallelization compatibility + +Confirmed: pecotmr's per-region pipelines are scoped exactly right for "one SoS +job per gene". Three caveats: + +- **cTWAS** is per-LD-block (`twasPipeline(twasWeightsData, ldMetaFilePath, + gwasMetaFile, regionBlock, ...)`) — joint inference across genes sharing an LD + region. `twas_ctwas.ipynb` already drives this correctly. +- **mash / S-LDSC** are genome-aggregating; the SoS pipeline is two-stage + (per-gene fan-out → genome gather). `mash_preprocessing` → `mash_fit` → + `mash_posterior` already has this shape. +- **`loadMultitaskRegionalData` currently returns a plain `list`**, not S4. + Needs S4-ification along with a `MultitaskRegionalData` class. + +### xqtl-protocol target shape + +Each per-region R script collapses to ~10 lines: + +```r +opt <- parse_args(...) +regional <- loadRegional*Data( + genotype = opt$genotype, phenotype = phenotype_files, + covariate = covariate_files, region = opt$region, + conditions = conditions, ... +) +result <- xxxAnalysisPipeline(regional, ...) +result$region_info <- list( + region_coord = parseRegion(opt$region), + grange = parseRegion(opt$window), + region_name = opt[["region-name"]] +) +saveRDS(result, opt$output) +``` + +No `fdat$...`. No `@`. No accessors in the success path. No per-condition loop. + +## Plan + +### Phase 0 — pecotmr accessor foundation + +**Must come first** because every later phase needs internal pecotmr code to +read slots via accessors (Bioconductor rule). + +#### Phase 0A — Accessor completion (~25-30 new accessors) + +Add missing `setGeneric` + `setMethod("get*", ...)` for every slot read anywhere +in pecotmr code outside of accessor bodies / constructors / validity. + +**`RegionalData`** (4 new): +- `getScaleResiduals(x)` → `x@scaleResiduals` +- `getMaf(x)` for `RegionalData` (mirror of the `MultivariateRegionalData` method) +- `getRegion(x)` → `x@region` (the `GRanges` directly; complementary to existing + derived `getChrom`/`getGrange`) +- `getDroppedSamples(x)` → `x@droppedSamples` +- `getCoordinates(x)` → `x@coordinates` + +**`MultivariateRegionalData`** (3 new, same generics): +- `getDroppedSamples`, `getRegion`, `getCoordinates` + +**`LdData`** (2 new): +- `getSnpIdx(x)` → `x@snpIdx` +- `getNRef(x)` → `x@nRef` + +**`GenotypeHandle`** (5 new): +- `getPath`, `getFormat`, `getSnpInfo`, `getNSamples`, `getSampleIds` +- (`getPgenPtr` only if any non-method code reads it) + +**`GwasSumStats`** (3 new): +- `getSumstats`, `getGenome`, `getTraitName` + +**`FineMappingResult`** (2 new): +- `getMethod`, `getSumstats` + +**`H2Estimate`, `LdBlocks`, `LdEigen`, `LdScore`, `AnnotationMatrix`** (~10 +more across the heritability / LD-statistic infrastructure, exact set after +internal audit) + +All exported. Each gets a roxygen page and a test against a fixture object. + +#### Phase 0B — Internal `@` → accessor refactor + +Scripted pass across all of `R/`. For every `@` read outside of: + +- `setMethod("get*", "Class", ...)` bodies +- `new("Class", ...)` and constructor function bodies +- `validity = function(object) { ... }` bodies (Bioconductor permits `@` here + because the object is being constructed) +- `slot(x, "name") <- value` assignment + +…replace with the corresponding accessor call. No behavior change. All tests +must pass. Baseline for `R CMD BiocCheck` compliance. + +Existing scope: ~357 `@` access sites across 43 files. Bulk in `file_utils.R`, +`twas.R`, `univariate_pipeline.R`, `multivariate_pipeline.R`, `LD.R`, +`ld_loader.R`, `mash_wrapper.R`, `sumstats_qc.R`. + +### Phase 1 — Pipelines accept only S4 + +**pecotmr PR A**: Refactor `univariateAnalysisPipeline`. Rename current body to +private `.univariateAnalysisOneCondition` (still uses accessors internally — no +`@`). New `univariateAnalysisPipeline(regional, ...)` requires `RegionalData`, +iterates conditions via `seq_along(getPhenotypes(regional))`, calls the private +helper. Update tests + roxygen examples to construct `RegionalData`. + +**pecotmr PR B**: Same refactor for `multivariateAnalysisPipeline`, +`twasWeightsPipeline`, `fsusieWrapper`. + +**pecotmr PR C**: New `MultitaskRegionalData` S4 class. Update +`loadMultitaskRegionalData` to return it. Update `regionDataToIndInput`, +`regionDataToRssInput`, `regionDataToColocboostInput` to accept it. +`colocboostPipeline(regional = MultitaskRegionalData, ...)`. + +**pecotmr PR D**: New `functionalAnalysisPipeline(regional, ...)` — absorbs +the per-condition composite from `fsusie.R` (top-PC + SuSiE-on-PC + TWAS + +fSuSiE). + +**pecotmr PR E**: New `quantileAnalysisPipeline(regional, ...)` — absorbs the +quantile-regression logic from `qr_and_twas.ipynb`. + +**pecotmr PR F**: Tighten `susieRssPipeline` to take `QcResult` + `LdData`. + +**pecotmr PR G**: Documentation pass. `NEWS.md`: "Per-region pipelines accept +S4 inputs only. Loose-piece signatures are no longer public. Construct a +`RegionalData` via the loader or the constructor." pkgdown index updated. + +#### Open call: hard vs. soft deprecation + +If the only real consumer of pecotmr is xqtl-protocol + this team's local +scripts, **hard break** is fine and saves a release cycle. If there are +external workflows, **soft path** with `.Deprecated()` for a release is safer. +Decide before Phase 1. + +### Phase 2 — xqtl-protocol thin wrappers + +Each per-region script becomes ~10 lines (template above). Per file: + +- `code/script/mnm_analysis/mnm_methods/susie_twas.R` (~440 lines → ~15) +- `code/script/mnm_analysis/mnm_methods/fsusie.R` (~250 lines → ~15) +- `code/script/mnm_analysis/mnm_methods/mnm.R` (~180 lines → ~15) +- `code/SoS/mnm_analysis/mnm_methods/mnm_regression.ipynb` (R-in-SoS blocks) +- `code/SoS/association_scan/quantile_models/qr_and_twas.ipynb` +- `code/SoS/mnm_analysis/mnm_methods/colocboost.ipynb` (R-block only — the + Python side stays for now) + +One file at a time. Toy-dataset end-to-end verification between each. + +### Phase 3 — Cleanup + +- Delete `filter_fdat_except_specific_names` from fsusie.R (replaced by + `minMarkers` arg in `loadRegionalFunctionalData`). +- Delete the cTWAS dedup block in `twas_ctwas.ipynb` (pecotmr's `loadLdMatrix` + already deduplicates internally). +- Delete `fdat$residual_X[[r]] <- NA` and `fdat$residual_Y[[r]] <- NA` + reassignments (memory-clearing is moot under lazy compute). +- Audit thin-wrapper outputs vs. pre-refactor outputs to confirm no behavioral + drift. + +## Sequencing summary + +| Order | Work | Why | +|---|---|---| +| 1 | Phase 0A — add missing accessors | Pure addition, no risk. Unblocks 0B and Phase 1. | +| 2 | Phase 0B — internal `@` → accessor refactor | Establishes Bioconductor compliance. No behavior change. | +| 3 | Phase 1 PRs A–G — pipelines accept S4 only | Sequentially; each PR is one pipeline family. | +| 4 | Phase 2 — xqtl-protocol thin wrappers | One file at a time, verified against toy. | +| 5 | Phase 3 — cleanup | After thin wrappers are confirmed working. | + +## Verification strategy + +For each Phase 1 pipeline PR: + +```r +test_that("univariateAnalysisPipeline.RegionalData matches one-condition path", { + regional <- makeToyRegionalData() + out_via_helper <- .univariateAnalysisOneCondition( + X = getResidualX(regional, 1), + Y = getResidualY(regional, 1), + maf = getMaf(regional)[[1]], + ... + ) + out_via_public <- univariateAnalysisPipeline(regional, ...) + expect_equal(out_via_public[[1]], out_via_helper) +}) +``` + +For each xqtl-protocol thin wrapper: + +1. Run pre-refactor script on toy dataset; capture output RDS as fixture. +2. Refactor to thin wrapper. +3. Run new wrapper on same toy dataset. +4. Diff RDS structures and key numerics (PIPs, z-scores, weights). + +A wrapper is "done" only when its output matches the fixture, not when it +compiles. + +## Per-condition iteration semantic note + +For non-indexed `length()` / `names()` / `seq_along()` wrappers around +`fdat$residual_Y`, the mechanical conversion used `getPhenotypes(fdat)` rather +than materializing a residuals list. This is functionally equivalent because +`getResidualY(fdat, r)` returns a residual for each phenotype condition with the +same name. The conversion is correct for count- and name-based iteration; a +purist would precompute the residual list once outside the loop and index it +inside, which is both clearer and faster but a structural change beyond the +mechanical scope. + +## Files modified by the mechanical passes (committed state TBD) + +- `code/SoS/association_scan/quantile_models/qr_and_twas.ipynb` +- `code/SoS/enrichment/sldsc_enrichment.ipynb` +- `code/SoS/mnm_analysis/mnm_methods/colocboost.ipynb` +- `code/SoS/mnm_analysis/mnm_methods/mnm_regression.ipynb` +- `code/SoS/mnm_analysis/mnm_postprocessing.ipynb` +- `code/SoS/multivariate_genome/MASH/mash_preprocessing.ipynb` +- `code/SoS/pecotmr_integration/SuSiE_enloc.ipynb` +- `code/SoS/pecotmr_integration/twas_ctwas.ipynb` +- `code/SoS/reference_data/rss_ld_sketch.ipynb` +- `code/script/mnm_analysis/mnm_methods/fsusie.R` +- `code/script/mnm_analysis/mnm_methods/mnm.R` +- `code/script/mnm_analysis/mnm_methods/susie_twas.R` + +`pipeline/*.ipynb` are symlinks into `code/SoS/`, so they pick up the changes +transparently.