From b6a7786edff63918769232b757fdaec663ce4a2c Mon Sep 17 00:00:00 2001 From: Yining97 Date: Tue, 23 Jun 2026 23:36:39 -0400 Subject: [PATCH] Make rss_analysis.ipynb the GWAS RSS fine-mapping workflow Ports the 4-step manifest -> construct -> fine_mapping -> plot workflow into rss_analysis.ipynb (replacing the legacy univariate_rss two-step) and removes the duplicate gwas_rss_fine_mapping.ipynb. Repoints the remaining doc references. Co-Authored-By: Claude Opus 4.8 --- .../mnm_methods/rss_analysis.ipynb | 617 +++++------------- .../summary_stats_finemapping_vignette.ipynb | 2 +- .../fine_mapping_postprocessing.ipynb | 2 +- .../gwas_rss_fine_mapping.ipynb | 263 -------- .../fine_mapping_pip_plot.R | 2 +- .../pecotmr_integration/gwas_rss_manifest.R | 2 +- 6 files changed, 179 insertions(+), 709 deletions(-) delete mode 100644 code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb diff --git a/code/SoS/mnm_analysis/mnm_methods/rss_analysis.ipynb b/code/SoS/mnm_analysis/mnm_methods/rss_analysis.ipynb index 6089a1a1..49d8a89a 100644 --- a/code/SoS/mnm_analysis/mnm_methods/rss_analysis.ipynb +++ b/code/SoS/mnm_analysis/mnm_methods/rss_analysis.ipynb @@ -2,502 +2,237 @@ "cells": [ { "cell_type": "markdown", - "id": "5b903d91", - "metadata": { - "kernel": "SoS" - }, + "metadata": {}, "source": [ "# RSS Fine-mapping with GWAS Summary Statistics\n", "\n", - "## Description\n", + "End-to-end SuSiE-RSS fine-mapping over GWAS summary statistics, per study \u00d7 per LD block, using\n", + "the current pecotmr S4 API (`GwasSumStats` / `summaryStatsQc()` / `fineMappingPipeline()`). Four SoS\n", + "steps:\n", "\n", - "This notebook performs high-dimensional regression with GWAS summary statistics using the **Regression with Summary Statistics (RSS)** model. Given a list of GWAS summary-statistic files and an LD-reference panel, it fine-maps causal variants region by region with the SuSiE-RSS model (and supports conditional-regression variants).\n", + "1. **`[generate_manifest]`** \u2014 `gwas_rss_manifest.R` resolves `--gwas-meta` + `--gwas-tsv-list` against\n", + " `--region-list` + `--regions` into one manifest TSV (one row per study \u00d7 region).\n", + "2. **`[generate_gwas_sumstats]`** \u2014 `gwas_sumstats_construct.R` builds one `GwasSumStats` RDS per\n", + " study \u00d7 LD block, running `summaryStatsQc()` (optional SLALOM/DENTIST QC + RAISS imputation;\n", + " harmonization re-keys variants to the LD-panel id convention).\n", + "3. **`[gwas_fine_mapping]`** \u2014 `fine_mapping.R` dispatches to\n", + " `pecotmr::fineMappingPipeline(gwasSumStats, methods = ...)` (SuSiE-RSS).\n", + "4. **`[gwas_rss_plot]`** (opt-out via `--no-plot`) \u2014 `gwas_rss_plot.R` renders a PIP plot.\n", "\n", - "The workflow has three steps: `get_analysis_regions` builds the list of genomic regions to analyze (from the LD-reference metadata and/or user-supplied regions); `univariate_rss` runs allele-quality control, optional summary-statistic imputation (RAISS), and the chosen fine-mapping method on each region; `univariate_plot` renders a PIP plot for a result.\n", + "This replaces the legacy two-step workflow (`get_analysis_regions` + `univariate_rss`, which called the\n", + "removed `rss_analysis_pipeline()`). The LD reference must be **genotype-backed** (PLINK2/PLINK1/GDS/VCF);\n", + "a precomputed `.cor.xz` LD matrix is not supported for the `GwasSumStats` path.\n", "\n", - "**Synthetic-data note:** the minimal working example below uses a small, clearly-labelled toy GWAS summary-statistic file (`protocol_example.gwas_sumstats.chr22.tsv.gz`) derived from a public Alzheimer's GWAS, restricted to a single chr22 region, together with the toy PLINK LD reference under `input/ld_reference/`. No access-controlled individual-level genotype data is used.\n", + "## Inputs\n", "\n", - "## LD reference and the LD sketch\n", + "- `--gwas-meta` \u2014 TSV with columns `study_id`, `path` (relative to the meta file or absolute), and\n", + " optional `column_mapping` (per-study YAML). One row per study.\n", + "- `--gwas-tsv-list S1=path1 ...` \u2014 explicit STUDY=PATH pairs (alternative to `--gwas-meta`; no\n", + " per-study column mapping).\n", + "- `--region-list` \u2014 BED-like TSV of LD blocks; and/or `--regions chr:start-end ...`.\n", + "- `--ld-meta` \u2014 LD-reference meta TSV (`#chr`, `start`, `end`, `path` -> genotype prefix). One LD block\n", + " per row (or one row with `start=end=0` for a whole chromosome).\n", "\n", - "RSS fine-mapping relies on a **linkage-disequilibrium (LD) reference** that describes the correlation structure among variants in each region. For large cohorts the full per-region LD matrices are big, so the protocol uses an **LD sketch**: a compact, region-partitioned representation of the LD reference (low-rank / block-sparse summaries plus per-region metadata) that is precomputed by the LD reference processing module. `get_analysis_regions` reads this LD-sketch metadata to enumerate the regions to fine-map, and `univariate_rss` loads the matching per-region LD block for allele QC and the SuSiE-RSS fit.\n", + "## QC knobs (forwarded to `gwas_sumstats_construct.R` -> `summaryStatsQc()`)\n", "\n", - "For this minimal working example the toy PLINK LD reference under `input/ld_reference/` already plays the role of the LD sketch for the single chr22 region analyzed here; on real data you would point `--ld-reference-meta` at the LD-sketch metadata produced for your cohort." - ] - }, - { - "cell_type": "markdown", - "id": "16f8457c", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Input\n", + "- `--qc-method` \u2014 `none` (default), `slalom`, or `dentist`.\n", + "- `--impute` \u2014 enable RAISS sumstat imputation.\n", + "- `--qc-args '{\"key\":value,...}'` \u2014 JSON passthrough for any other `summaryStatsQc()` kwarg (e.g.\n", + " `{\"mafCutoff\":0.01,\"nCutoff\":0,\"alleleFlipKriging\":true}`).\n", "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `--gwas-meta-data` | TSV listing each GWAS study: columns `study_id`, `chrom` (0 = genome-wide), `file_path` (summary-statistics file), `column_mapping_file` (YAML mapping standard column names), `n_sample`, `n_case`, `n_control`. Paths are resolved relative to this file's directory or the project root. |\n", - "| `--ld-meta-data` | TSV describing the LD reference: columns `#chrom`, `start`, `end`, `path`. Two formats are supported: pre-computed LD blocks (`.cor.xz` with `ld_file,bim_file`) or per-chromosome PLINK genotype prefixes (`.pgen/.pvar/.psam` or `.bed/.bim/.fam`) with `start=0`, `end=0`. |\n", - "| column-mapping YAML | Maps standard names (`chrom`, `pos`, `A1`, `A2`, `z` or `beta`+`se`, `pvalue`, `maf`, `n_case`, `n_control`, `n_sample`) to the original column names in the summary-statistics file. |\n", - "| `--region-name` / `--region-list` | One or more regions in `chr:start-end` format, or a file listing them. If neither is given, all regions in the LD-reference metadata are analyzed. |\n", + "## Fine-mapping knobs (forwarded to `fine_mapping.R`)\n", "\n", - "Key parameters: `--qc-method` (`slalom` or `dentist`), `--impute` (impute summary stats for SNPs in LD but missing from sumstats via RAISS), `--finemapping-method` (`single_effect`, `susie_rss`, or `bayesian_conditional_regression`), `--L` (max number of causal effects), `--maf` / `--imiss` filtering thresholds, `--skip-analysis-pip-cutoff`." - ] - }, - { - "cell_type": "markdown", - "id": "de914f2f", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Output\n", + "- `--methods` (default `susie`), `--coverage` (0.95), `--secondary-coverage` (`0.7,0.5`).\n", + "- `--min-abs-corr` (0.8), `--median-abs-corr` (off; OR-logic purity), `--pip-cutoff` (0.025).\n", + "- `--L` (20) / `--L-greedy` (5) \u2014 SuSiE fit defaults; `--method-args '{\"susie\":{...}}'` overrides per key.\n", "\n", - "All results are written under `--cwd`, one set of files per analyzed LD-block region:\n", + "## Outputs\n", "\n", - "- `{qc_method}_{imputed}.chr{N}_{start}_{end}.univariate_susie_rss.rds` — the SuSiE-RSS fine-mapping result for that block (e.g. `SLALOM_RAISS_imputed.chr22_49355984_50799822.univariate_susie_rss.rds`), holding the `single_effect_regression`, `noqc`, `qc_only`, `qc_impute` and (where applicable) `conditional_regression_*` objects plus the QC'd / imputed `sumstats_*` tables; an extra diagnostic table is added when `--diagnostics` is set.\n", - "- `{prefix}.sumstats.tsv.gz` — a companion harmonized summary-statistics table accompanying each RDS for convenience.\n", - "- `{step_name}/*.pip_plot.png`, plus per-region `.stderr` / `.stdout` logs — the posterior-inclusion-probability plot and run logs.\n" + "- `{cwd}/{manifest_name}.manifest.tsv`\n", + "- `{cwd}/sumstats/{study}.{chrN_S_E}.gwas_sumstats.rds`\n", + "- `{cwd}/fine_mapping/{study}.{chrN_S_E}.gwas_finemap.rds` \u2014 one `GwasFineMappingResult` per (study, block).\n", + "- `{cwd}/plots/{study}.{chrN_S_E}.pip_plot.png` (unless `--no-plot`).\n" ] }, { "cell_type": "markdown", - "id": "b7dcb2b6", - "metadata": { - "kernel": "SoS" - }, + "metadata": {}, "source": [ - "## Steps\n", - "\n", - "**Step 1.** Build the list of analysis regions from the LD-reference metadata. This step is normally invoked automatically as a dependency of `univariate_rss`, but it can be run on its own to inspect the regions that will be analyzed." - ] - }, - { - "cell_type": "markdown", - "id": "03e0c0e9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "**Timing:** ~3-5 min on typical compute infrastructure." + "## Minimal working example\n", + "\n", + "Genotype-LD MWE (AD_Bellenguez chr22 toy data under `input/`). There is no single combined workflow\n", + "name \u2014 chain the four steps with `+`:\n", + "\n", + "```bash\n", + "sos run pipeline/rss_analysis.ipynb \\\n", + " generate_manifest+generate_gwas_sumstats+gwas_fine_mapping+gwas_rss_plot \\\n", + " --cwd output/rss_analysis --modular-script-dir code/script \\\n", + " --gwas-meta input/rss_analysis/protocol_example.rss_mwe.gwas_meta.tsv \\\n", + " --regions chr22:49355984-50799822 \\\n", + " --ld-meta input/ld_reference/protocol_example.ld_meta_file.tsv\n", + "```\n", + "\n", + "This fine-maps one study \u00d7 one LD block and writes the manifest, the QC'd `GwasSumStats`, the\n", + "`GwasFineMappingResult`, and a PIP plot under `output/rss_analysis/`.\n", + "\n", + "With SLALOM QC + RAISS imputation + per-method tuning:\n", + "\n", + "```bash\n", + "sos run pipeline/rss_analysis.ipynb \\\n", + " generate_manifest+generate_gwas_sumstats+gwas_fine_mapping+gwas_rss_plot \\\n", + " --cwd output/rss_analysis --modular-script-dir code/script \\\n", + " --gwas-meta input/rss_analysis/protocol_example.rss_mwe.gwas_meta.tsv \\\n", + " --regions chr22:49355984-50799822 \\\n", + " --ld-meta input/ld_reference/protocol_example.ld_meta_file.tsv \\\n", + " --qc-method slalom --impute --qc-args '{\"mafCutoff\":0.01}' \\\n", + " --min-abs-corr 0.5 --method-args '{\"susie\":{\"L\":10}}'\n", + "```\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "6f00aba2", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/rss_analysis.ipynb get_analysis_regions \\\n", - " --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \\\n", - " --gwas-meta-data input/rss_analysis/protocol_example.gwas_meta_data.tsv \\\n", - " --region-name 22:49355984-50799822 \\\n", - " --cwd output/rss_analysis" - ] - }, - { - "cell_type": "markdown", - "id": "aa33b614", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "**Step 2.** Fine-map the region with the RSS model. This runs allele QC (`--qc-method slalom`), imputes missing summary statistics against the LD reference (`--impute`, RAISS), and fits SuSiE-RSS (`--finemapping-method susie_rss`), writing one `.rds` result per region." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "af86b919", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/rss_analysis.ipynb univariate_rss \\\n", - " --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \\\n", - " --gwas-meta-data input/rss_analysis/protocol_example.gwas_meta_data.tsv \\\n", - " --qc-method slalom --impute \\\n", - " --finemapping-method susie_rss \\\n", - " --skip-analysis-pip-cutoff 0 \\\n", - " --region-name 22:49355984-50799822 \\\n", - " --cwd output/rss_analysis" - ] - }, - { - "cell_type": "markdown", - "id": "c7ad4bf1", "metadata": { "kernel": "SoS" }, - "source": [ - "**Step 3.** Render a PIP (posterior inclusion probability) plot from a fine-mapping result `.rds` produced in Step 2. Point `--region-list` / inputs at the result of interest; the plot is written as a PNG under `--cwd`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b96be300", - "metadata": { - "kernel": "Bash" - }, "outputs": [], "source": [ - "sos run pipeline/rss_analysis.ipynb univariate_plot \\\n", - " --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \\\n", - " --gwas-meta-data input/rss_analysis/protocol_example.gwas_meta_data.tsv \\\n", - " --region-name 22:49355984-50799822 \\\n", - " --cwd output/rss_analysis \\\n", - " --input output/rss_analysis/univariate_rss/SLALOM_RAISS_imputed.chr22_49355984_50799822.univariate_susie_rss.rds" - ] - }, - { - "cell_type": "markdown", - "id": "fed7c15a", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface\n", - "\n", - "Run the cell below to list every workflow and its options." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6efec7af", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/rss_analysis.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "id": "a117442f", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Workflow implementation\n", - "\n", - "The cells below contain the original SoS workflow definitions (`[global]`, `[get_analysis_regions]`, `[univariate_rss]`, `[univariate_plot]`) preserved verbatim." + "[global]\n", + "parameter: cwd = path('output')\n", + "parameter: modular_script_dir = path('code/script')\n", + "# --- per-study GWAS inputs (use either or both) ---------------------\n", + "parameter: gwas_meta = path('.')\n", + "parameter: gwas_tsv_list = [] # list of STUDY=PATH items\n", + "# --- region inputs (use either or both) -----------------------------\n", + "parameter: region_list = path('.')\n", + "parameter: regions = [] # list of chr:start-end strings\n", + "# --- LD reference + genome ------------------------------------------\n", + "parameter: ld_meta = path\n", + "parameter: genome = 'GRCh38'\n", + "# --- QC knobs (forwarded to gwas_sumstats_construct.R) --------------\n", + "parameter: qc_method = 'none' # none | slalom | dentist\n", + "parameter: impute = False\n", + "parameter: qc_args = '' # JSON object spliced into summaryStatsQc()\n", + "# --- fine-mapping knobs (forwarded to fine_mapping.R) ---------------\n", + "parameter: methods = 'susie'\n", + "parameter: coverage = 0.95\n", + "parameter: secondary_coverage = '0.7,0.5'\n", + "parameter: min_abs_corr = 0.8\n", + "parameter: median_abs_corr = '' # empty -> off (OR-logic purity; needs pecotmr step-1)\n", + "parameter: pip_cutoff = 0.025\n", + "parameter: L = 20\n", + "parameter: L_greedy = 5\n", + "parameter: method_args = '' # nested per-method JSON object\n", + "# --- plot opt-out ---------------------------------------------------\n", + "parameter: no_plot = False\n", + "# --- output prefix for the manifest file ----------------------------\n", + "parameter: manifest_name = 'gwas_rss'\n", + "# --- infrastructure -------------------------------------------------\n", + "parameter: container = ''\n", + "parameter: job_size = 1\n", + "parameter: walltime = '2h'\n", + "parameter: mem = '16G'\n", + "parameter: numThreads = 1" ] }, { "cell_type": "code", "execution_count": null, - "id": "6af07265", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ - "[global]\n", - "parameter: modular_script_dir = path('code/script') # override with --modular-script-dir\n", - "parameter: cwd = path(\"output/\")\n", - "parameter: gwas_meta_data = path()\n", - "parameter: ld_meta_data = path()\n", - "parameter: gwas_name = []\n", - "parameter: gwas_data = []\n", - "parameter: column_mapping = []\n", - "parameter: region_list = path()\n", - "parameter: region_name = []\n", - "parameter: ld_reference_size = 10000\n", - "parameter: ld_mismatch_correction = False\n", - "parameter: container = ''\n", - "parameter: skip_regions = []\n", - "import re\n", - "parameter: entrypoint= ('micromamba run -a \"\" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])) if container else \"\"\n", - "parameter: job_size = 5\n", - "parameter: walltime = \"10h\"\n", - "parameter: mem = \"16G\"\n", - "parameter: numThreads = 1\n", - "def group_by_region(lst, partition):\n", - " # from itertools import accumulate\n", - " # partition = [len(x) for x in partition]\n", - " # Compute the cumulative sums once\n", - " # cumsum_vector = list(accumulate(partition))\n", - " # Use slicing based on the cumulative sums\n", - " # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]\n", - " return partition\n", - "import os\n", - "if (not os.path.isfile(region_list)) and len(region_name) == 0:\n", - " region_list = ld_meta_data" + "[generate_manifest]\n", + "# Resolve --gwas-meta / --gwas-tsv-list x --region-list / --regions into a\n", + "# single manifest TSV via gwas_rss_manifest.R. Downstream steps fan out\n", + "# over its rows; no Python parsing in this notebook.\n", + "input: None\n", + "output: f\"{cwd}/{manifest_name}.manifest.tsv\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '15m', mem = '2G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_rss_manifest.R \\\n", + " ${('--gwas-meta ' + str(gwas_meta)) if gwas_meta.is_file() else ''} \\\n", + " ${('--gwas-tsv-list ' + ' '.join(str(x) for x in gwas_tsv_list)) if gwas_tsv_list else ''} \\\n", + " ${('--region-list ' + str(region_list)) if region_list.is_file() else ''} \\\n", + " ${('--regions ' + ' '.join(str(x) for x in regions)) if regions else ''} \\\n", + " --output ${_output}" ] }, { "cell_type": "code", "execution_count": null, - "id": "38034c12", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ - "[get_analysis_regions: shared = \"regional_data\"]\n", - "import os\n", - "import pandas as pd\n", - "from collections import OrderedDict\n", - "\n", - "def file_exists(file_path, relative_path=None):\n", - " \"\"\"Check if a file exists at the given path or relative to a specified path.\"\"\"\n", - " if os.path.exists(file_path) and os.path.isfile(file_path):\n", - " return True\n", - " elif relative_path:\n", - " relative_file_path = os.path.join(relative_path, file_path)\n", - " return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)\n", - " return False\n", - "\n", - "def check_required_columns(df, required_columns):\n", - " \"\"\"Check if the required columns are present in the dataframe.\"\"\"\n", - " missing_columns = [col for col in required_columns if col not in df.columns]\n", - " if missing_columns:\n", - " raise ValueError(f\"Missing required columns: {', '.join(missing_columns)}\")\n", - "\n", - "def parse_region(region):\n", - " \"\"\"Parse a region string in 'chr:start-end' format into a list [chr, start, end].\"\"\"\n", - " chrom, rest = region.split(':')\n", - " start, end = rest.split('-')\n", - " return [int(chrom), int(start), int(end)]\n", - "\n", - "def load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name=None, region_list=None):\n", - " \"\"\"\n", - " Extracts data from GWAS metadata files and additional GWAS data provided. \n", - " Optionally filters data based on specified regions.\n", - "\n", - " Args:\n", - " - gwas_meta_data (str): File path to the GWAS metadata file.\n", - " - gwas_name (list): Vector of GWAS study names.\n", - " - gwas_data (list): Vector of GWAS data.\n", - " - column_mapping (list, optional): Vector of column mapping files.\n", - " - region_name (list, optional): List of region names in 'chr:start-end' format.\n", - " - region_list (str, optional): File path to a file containing regions.\n", - "\n", - " Returns:\n", - " - GWAS Dictionary: Maps study IDs to a list containing chromosome number, \n", - " GWAS file path, and optional column mapping file path.\n", - " - Region Dictionary: Maps region names to lists [chr, start, end].\n", - "\n", - " Raises:\n", - " - FileNotFoundError: If any specified file path does not exist.\n", - " - ValueError: If required columns are missing in the input files or vector lengths mismatch.\n", - " \"\"\"\n", - " # Check vector lengths\n", - " if len(gwas_name) != len(gwas_data):\n", - " raise ValueError(\"gwas_name and gwas_data must be of equal length\")\n", - " \n", - " if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):\n", - " raise ValueError(\"If column_mapping is provided, it must be of the same length as gwas_name and gwas_data\")\n", - "\n", - " # Required columns for GWAS file type\n", - " required_gwas_columns = ['study_id', 'chrom', 'file_path']\n", - "\n", - " # Base directory of the metadata files\n", - " gwas_base_dir = os.path.dirname(gwas_meta_data)\n", - " \n", - " # Reading the GWAS metadata file\n", - " gwas_df = pd.read_csv(gwas_meta_data, sep=\"\\t\")\n", - " check_required_columns(gwas_df, required_gwas_columns)\n", - " gwas_dict = OrderedDict()\n", - "\n", - " # Process additional GWAS data from vectors\n", - " for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):\n", - " gwas_dict[name] = {0: [data, mapping]}\n", - "\n", - " for _, row in gwas_df.iterrows():\n", - " file_path = row['file_path']\n", - " mapping_file = row.get('column_mapping_file')\n", - " n_sample = row.get('n_sample')\n", - " n_case = row.get('n_case')\n", - " n_control = row.get('n_control')\n", - "\n", - " # Check if the file and optional mapping file exist\n", - " if not file_exists(file_path, gwas_base_dir) or (mapping_file and not file_exists(mapping_file, gwas_base_dir)):\n", - " raise FileNotFoundError(f\"File {file_path} not found for {row['study_id']}\")\n", - " \n", - " # Adjust paths if necessary\n", - " file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)\n", - " if mapping_file:\n", - " mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)\n", - " \n", - " # Create or update the entry for the study_id\n", - " if row['study_id'] not in gwas_dict:\n", - " gwas_dict[row['study_id']] = {}\n", - "\n", - " # Expand chrom 0 to chrom 1-22 or use the specified chrom\n", - " chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]\n", - " for chrom in chrom_range:\n", - " if chrom in gwas_dict[row['study_id']]:\n", - " existing_entry = gwas_dict[row['study_id']][chrom]\n", - " raise ValueError(f\"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. \"\n", - " f\"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}\")\n", - " gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control, row.get('ld_meta_data', str(ld_meta_data))]\n", - "\n", - " # Process region_list and region_name\n", - " region_dict = dict()\n", - " if region_list and os.path.isfile(region_list):\n", - " with open(region_list, 'r') as file:\n", - " for line in file:\n", - " # Skip empty lines\n", - " if not line.strip():\n", - " continue\n", - " if line.startswith(\"#\"):\n", - " continue\n", - " parts = line.strip().split()\n", - " if len(parts) == 1:\n", - " region = parse_region(parts[0])\n", - " elif len(parts) == 3:\n", - " region = [int(parts[0].replace(\"chr\", \"\")), int(parts[1]), int(parts[2])]\n", - " elif len(parts) >= 4 and region_list != ld_meta_data : # for eQTL where chr:start:end:gene_id:gene_name, and path if LD_meta are used.\n", - " region = [int(parts[0].replace(\"chr\", \"\")), int(parts[1]), int(parts[2]),parts[3]]\n", - " else:\n", - " raise ValueError(\"Invalid region format in region_list\")\n", - " \n", - " region_dict[f\"{region[0]}:{region[1]}_{region[2]}\"] = region\n", - "\n", - " if region_name:\n", - " for region in region_name:\n", - " parsed_region = parse_region(region)\n", - " region_key = f\"{parsed_region[0]}:{parsed_region[1]}_{parsed_region[2]}\"\n", - " if region_key not in region_dict:\n", - " region_dict[region_key] = parsed_region\n", - "\n", - " return gwas_dict, region_dict\n", - "\n", - "gwas_dict, region_dict = load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name, region_list)\n", - "regional_data = dict([(\"GWAS\", gwas_dict), (\"regions\", region_dict)])" + "[generate_gwas_sumstats]\n", + "# Fan out over the manifest's rows: one (study, region) GwasSumStats per\n", + "# row. Manifest columns: study_id, gwas_tsv, column_mapping, chr, start,\n", + "# end, region_id, gwas_tsv_basename.\n", + "import csv\n", + "jobs = list(csv.DictReader(open(f\"{cwd}/{manifest_name}.manifest.tsv\"), delimiter='\\t'))\n", + "input: for_each = 'jobs'\n", + "output: f\"{cwd}/sumstats/{_jobs['study_id']}.{_jobs['region_id']}.gwas_sumstats.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n", + " --study ${_jobs['study_id']} \\\n", + " --gwas-tsv ${_jobs['gwas_tsv']} \\\n", + " --ld-block ${_jobs['chr']}:${_jobs['start']}-${_jobs['end']} \\\n", + " --ld-meta ${ld_meta} \\\n", + " --genome ${genome} \\\n", + " ${('--column-mapping ' + _jobs['column_mapping']) if _jobs['column_mapping'] else ''} \\\n", + " --qc-method ${qc_method} \\\n", + " ${'--impute' if impute else ''} \\\n", + " ${('--qc-args ' + repr(qc_args)) if qc_args else ''} \\\n", + " --output ${_output}" ] }, { "cell_type": "code", "execution_count": null, - "id": "73360a75", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ - "[univariate_rss]\n", - "parameter: L = 15\n", - "parameter: L_greedy = 5\n", - "# If available the column that indicates sample size within the sumstats\n", - "# filtering threshold for raiss imputation\n", - "parameter: rcond = 0.01\n", - "parameter: lamb = 0.01\n", - "parameter: R2_threshold = 0.6\n", - "parameter: pip_cutoff = 0.025\n", - "parameter: skip_analysis_pip_cutoff = 0.025\n", - "parameter: minimum_ld = 5\n", - "parameter: coverage = [0.95, 0.7, 0.5]\n", - "# Whether to impute the sumstat for all the snp in LD but not in sumstat.\n", - "parameter: impute = False \n", - "# summary stats QC methods: slalom, dentist\n", - "parameter: qc_method = \"\"\n", - "# analysis method: single_effect, susie_rss, bayesian_conditional_regression\n", - "parameter: finemapping_method = \"single_effect\"\n", - "parameter: extract_region_name = \"NULL\"\n", - "parameter: region_name_col = \"NULL\"\n", - "# remove a variant if it has more than imiss missing individual level data\n", - "parameter: imiss = 1.0\n", - "# MAF cutoff\n", - "parameter: maf = 0.0025\n", - "parameter: comment_string = \"NULL\"\n", - "parameter: diagnostics = False\n", - "depends: sos_variable(\"regional_data\")\n", - "regions = list(regional_data['regions'].keys())\n", - "input: for_each = \"regions\"\n", - "output: f'{cwd:a}/{step_name}/{\"%s\" % qc_method.upper() if qc_method else \"noQC\"}{\"_RAISS_imputed\" if impute else \"\"}.chr{_regions.replace(\":\", \"_\")}.univariate{\"_%s\" % finemapping_method if finemapping_method else \"\"}.rds' if not diagnostics else f'{cwd:a}/{step_name}/Automated_QC.chr{_regions.replace(\":\", \"_\")}.univariate.rds'\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'\n", - "bash: expand= \"${ }\", stderr = f\"{_output:n}.stderr\", stdout = f\"{_output:n}.stdout\", container = container, entrypoint = entrypoint\n", - " Rscript ${modular_script_dir}/pecotmr_integration/univariate_rss.R \\\n", - " --ld-meta-data ${ld_meta_data} \\\n", - " --studies \"${\",\".join(regional_data[\"GWAS\"].keys())}\" \\\n", - " --sumstat-paths \"${\",\".join([regional_data[\"GWAS\"][x][regional_data[\"regions\"][_regions][0]][0] for x in regional_data[\"GWAS\"].keys()])}\" \\\n", - " --column-file-paths \"${\",\".join([str(regional_data[\"GWAS\"][x][regional_data[\"regions\"][_regions][0]][1]) for x in regional_data[\"GWAS\"].keys()])}\" \\\n", - " --n-samples \"${\",\".join([str(regional_data[\"GWAS\"][x][regional_data[\"regions\"][_regions][0]][2]) for x in regional_data[\"GWAS\"].keys()])}\" \\\n", - " --n-cases \"${\",\".join([str(regional_data[\"GWAS\"][x][regional_data[\"regions\"][_regions][0]][3]) for x in regional_data[\"GWAS\"].keys()])}\" \\\n", - " --n-controls \"${\",\".join([str(regional_data[\"GWAS\"][x][regional_data[\"regions\"][_regions][0]][4]) for x in regional_data[\"GWAS\"].keys()])}\" \\\n", - " --ld-meta-data-paths \"${\",\".join([str(regional_data[\"GWAS\"][x][regional_data[\"regions\"][_regions][0]][5]) for x in regional_data[\"GWAS\"].keys()])}\" \\\n", - " --region \"chr${_regions.replace(\":\", \"_\")}\" \\\n", - " ${\"--skip-regions \" + \",\".join(skip_regions) if skip_regions else \"\"} \\\n", - " ${\"--extract-region-name \" + extract_region_name if extract_region_name != \"NULL\" else \"\"} \\\n", - " ${\"--region-name-col \" + region_name_col if region_name_col != \"NULL\" else \"\"} \\\n", - " --imiss ${imiss} \\\n", - " --maf ${maf} \\\n", + "[gwas_fine_mapping]\n", + "# Per-RDS fan-out: one fine-mapping task per (study, region) GwasSumStats.\n", + "output: f\"{cwd}/fine_mapping/{_input:bnn}.gwas_finemap.rds\", group_by = 1\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n", + " --gwas-sumstats ${_input} \\\n", + " --methods ${methods} \\\n", + " --coverage ${coverage} \\\n", + " --secondary-coverage ${secondary_coverage} \\\n", + " --min-abs-corr ${min_abs_corr} \\\n", + " --pip-cutoff ${pip_cutoff} \\\n", " --L ${L} \\\n", " --L-greedy ${L_greedy} \\\n", - " --pip-cutoff ${pip_cutoff} \\\n", - " --skip-analysis-pip-cutoff ${skip_analysis_pip_cutoff} \\\n", - " --coverage \"${\",\".join([str(x) for x in coverage])}\" \\\n", - " --ld-reference-size ${ld_reference_size} \\\n", - " ${\"--ld-mismatch-correction\" if ld_mismatch_correction else \"\"} \\\n", - " ${\"--finemapping-method \" + finemapping_method if finemapping_method else \"\"} \\\n", - " ${\"--impute\" if impute else \"\"} \\\n", - " --rcond ${rcond} \\\n", - " --lamb ${lamb} \\\n", - " --r2-threshold ${R2_threshold} \\\n", - " --minimum-ld ${minimum_ld} \\\n", - " ${\"--qc-method \" + qc_method if qc_method else \"\"} \\\n", - " ${\"--comment-string \" + comment_string if comment_string != \"NULL\" else \"\"} \\\n", - " ${\"--diagnostics\" if diagnostics else \"\"} \\\n", - " --output-prefix ${_output:nn} \\\n", - " --output ${_output}\n" + " ${('--median-abs-corr ' + str(median_abs_corr)) if str(median_abs_corr) != '' else ''} \\\n", + " ${('--method-args ' + repr(method_args)) if method_args else ''} \\\n", + " --output ${_output}" ] }, { "cell_type": "code", "execution_count": null, - "id": "7263f706", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ - "[univariate_plot]\n", - "parameter: input = path\n", - "input: input\n", - "output: pip_plot = f\"{cwd}/{_input:bn}.png\"\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '20G', cores = numThreads, tags = f'{step_name}_{_output:bn}'\n", - "bash: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint\n", - " Rscript ${modular_script_dir}/pecotmr_integration/univariate_plot.R \\\n", - " --input \"${_input}\" \\\n", - " --output \"${_output[0]}\"\n" - ] - }, - { - "cell_type": "markdown", - "id": "c7aefa8d", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Troubleshooting\n", - "\n", - "| Error | Cause | Solution |\n", - "|-------|-------|----------|\n", - "| `'arg' should be one of \"slalom\", \"dentist\"` | An unsupported value was passed to `--qc-method`. | Use `--qc-method slalom` or `--qc-method dentist` (or omit it to skip QC). |\n", - "| `ERROR: Incorrect workflow name` | The workflow name was omitted, so SoS read the first `--param` value as a workflow. | Always name the workflow explicitly, e.g. `sos run pipeline/rss_analysis.ipynb univariate_rss --...`. |\n", - "| `Failed to locate pipeline/...ipynb.sos` | The `pipeline/` symlink is missing or points to the wrong relative target. | Ensure `pipeline/rss_analysis.ipynb` -> `../code/mnm_analysis/mnm_methods/rss_analysis.ipynb`. |\n", - "| `Remaining variants after filter: 0` / empty result | The region has too few well-matched variants after MAF / LD-score / R2 filtering (common for small toy cohorts). | Relax `--maf`, `--minimum-ld`, or `--R2-threshold`, choose a denser region, or supply a larger LD reference. |\n", - "| No summary statistics found for region | The region name does not overlap any rows in the summary-statistics file, or the LD-block path is wrong. | Confirm `--region-name chr:start-end` overlaps the data and that the LD-reference `path` resolves relative to the LD-metadata file. |" - ] - }, - { - "cell_type": "markdown", - "id": "8923c002", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Anticipated Results\n", - "\n", - "Produces `.susie_rss.rds` files with SuSiE-RSS fine-mapping posteriors for each GWAS locus against the LD reference." + "[gwas_rss_plot]\n", + "stop_if(no_plot, '--no-plot set; skipping PIP plot step.')\n", + "output: f\"{cwd}/plots/{_input:bnn}.pip_plot.png\", group_by = 1\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = '30m', mem = '4G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", + "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", + " Rscript ${modular_script_dir}/pecotmr_integration/gwas_rss_plot.R \\\n", + " --input ${_input} \\\n", + " --output ${_output}" ] } ], @@ -512,22 +247,20 @@ "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", - "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", - "pygments_lexer": "sos" - }, - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "sos", - "", - "" - ] - ], - "version": "" + "pygments_lexer": "python", + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "", + "" + ] + ], + "version": "0.22.4" + } } }, "nbformat": 4, - "nbformat_minor": 5 -} + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb b/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb index b6441d42..8e62d106 100644 --- a/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb +++ b/code/SoS/mnm_analysis/summary_stats_finemapping_vignette.ipynb @@ -81,7 +81,7 @@ "source": [ "# Build per-(study, LD-block) GwasSumStats, run SuSiE-RSS fine-mapping,\n", "# and render PIP plots end-to-end with the new pecotmr-integration pipeline:\n", - "sos run pipeline/gwas_rss_fine_mapping.ipynb \\\n", + "sos run pipeline/rss_analysis.ipynb \\\n", " generate_gwas_sumstats+gwas_fine_mapping+gwas_rss_plot \\\n", " --cwd output/rss_analysis \\\n", " --gwas-meta data/mnm_regression/gwas_meta_data.txt \\\n", diff --git a/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb b/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb index a628ad5b..d0a6a4db 100644 --- a/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb +++ b/code/SoS/pecotmr_integration/fine_mapping_postprocessing.ipynb @@ -8,7 +8,7 @@ "\n", "## Description\n", "\n", - "Postprocessing utilities for the S4 `QtlFineMappingResult` / `GwasFineMappingResult` RDSes produced by `fine_mapping.ipynb`, `multivariate_fine_mapping.ipynb`, `functional_fine_mapping.ipynb`, or `gwas_rss_fine_mapping.ipynb`. Replaces the legacy `mnm_analysis/mnm_postprocessing.ipynb` (which consumed the older list-shaped RDS layout).\n", + "Postprocessing utilities for the S4 `QtlFineMappingResult` / `GwasFineMappingResult` RDSes produced by `fine_mapping.ipynb`, `multivariate_fine_mapping.ipynb`, `functional_fine_mapping.ipynb`, or `rss_analysis.ipynb`. Replaces the legacy `mnm_analysis/mnm_postprocessing.ipynb` (which consumed the older list-shaped RDS layout).\n", "\n", "All steps consume FMR RDSes through the pecotmr S4 accessor surface (`getPip`, `getCs`, `getTopLoci`, `getMarginalEffects`, `getVariantIds`) so they work uniformly for QTL and GWAS outputs.\n", "\n", diff --git a/code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb b/code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb deleted file mode 100644 index f0c0e85d..00000000 --- a/code/SoS/pecotmr_integration/gwas_rss_fine_mapping.ipynb +++ /dev/null @@ -1,263 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# GWAS RSS fine-mapping (per study \u00d7 per LD block)\n", - "\n", - "## Description\n", - "\n", - "End-to-end SuSiE-RSS fine-mapping over GWAS summary statistics. Four SoS steps:\n", - "\n", - "1. **`[generate_manifest]`** — calls `gwas_rss_manifest.R` to resolve `--gwas-meta` + `--gwas-tsv-list` against `--region-list` + `--regions` into a single TSV manifest (one row per study x region job). All cross-product / dedup / sanitisation logic lives in R.\n", - "2. **`[generate_gwas_sumstats]`** — per (study \u00d7 LD block) fan-out. Calls `gwas_sumstats_construct.R` to build one `GwasSumStats` RDS per study \u00d7 block, with `summaryStatsQc()` (optionally including SLALOM/DENTIST QC and RAISS imputation).\n", - "3. **`[gwas_fine_mapping]`** — per RDS, runs `fine_mapping.R --gwas-sumstats` which dispatches to `pecotmr::fineMappingPipeline(gwasSumStats, methods = ...)` (default `susie`; the SuSiE-RSS variant inside pecotmr).\n", - "4. **`[gwas_rss_plot]`** (optional, opt-out via `--no-plot`) — per fine-mapping RDS, renders a PIP plot via `gwas_rss_plot.R`.\n", - "\n", - "Replaces the legacy `mnm_analysis/mnm_methods/rss_analysis.ipynb` two-step workflow. Reuses `fine_mapping.R` (no new fine-mapping worker required) and the new `--method-args` / `--qc-args` JSON passthroughs for per-method and per-QC tuning.\n", - "\n", - "## Inputs\n", - "\n", - "### Per-study inputs (one or both forms; the workflow fans out over their union)\n", - "\n", - "- `--gwas-meta` — TSV with columns `study_id`, `path` (relative to the meta file or absolute), and optionally `column_mapping` (path to a per-study YAML; see `gwas_sumstats_construct.R --column-mapping`). One row per study.\n", - "- `--gwas-tsv-list S1=path1 S2=path2 ...` — explicit STUDY=PATH pairs (alternative to / additional to `--gwas-meta`).\n", - "\n", - "### Region inputs (one or both; their union is the fan-out target)\n", - "\n", - "- `--region-list` — BED-like TSV (`#chr`, `start`, `end`, ...) listing LD blocks.\n", - "- `--regions chr:start-end ...` — explicit region strings.\n", - "\n", - "### LD reference\n", - "\n", - "- `--ld-meta` — LD-reference meta TSV (`#chr`, `start`, `end`, `path`). One LD block per row, or one row with `start = end = 0` to mean “whole chromosome.”\n", - "\n", - "### QC knobs (forwarded to `gwas_sumstats_construct.R`)\n", - "\n", - "- `--qc-method` — `none` (default), `slalom`, or `dentist`.\n", - "- `--impute` — flag to enable RAISS sumstat imputation.\n", - "- `--qc-args '{\"key\":value, ...}'` — JSON passthrough for any other `summaryStatsQc()` kwarg.\n", - "\n", - "### Fine-mapping knobs (forwarded to `fine_mapping.R --gwas-sumstats`)\n", - "\n", - "- `--methods` — comma-separated method tokens. Default `susie` (= SuSiE-RSS internally for `GwasSumStats` input).\n", - "- `--coverage` — SuSiE credible-set coverage. Default 0.95.\n", - "- `--method-args '{\"susie\":{\"L\":1}}'` — per-method kwargs spliced into `fineMappingPipeline()`.\n", - "\n", - "### Infrastructure\n", - "\n", - "- `--cwd` — output directory (default `output`).\n", - "- `--modular-script-dir` — directory holding the per-task R workers (default `code/script`).\n", - "- `--genome` — genome build label (default `GRCh38`).\n", - "- `--no-plot` — skip the PIP plot step.\n", - "\n", - "## Outputs\n", - "\n", - "- `{cwd}/sumstats/{study}.{chrN_S_E}.gwas_sumstats.rds` — one per (study, LD block).\n", - "- `{cwd}/fine_mapping/{study}.{chrN_S_E}.gwas_finemap.rds` — one `GwasFineMappingResult` per (study, LD block).\n", - "- `{cwd}/plots/{study}.{chrN_S_E}.pip_plot.png` — per (study, LD block) PIP plot (when `--no-plot` is unset).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example\n", - "\n", - "Minimal run with explicit inputs:\n", - "```bash\n", - "sos run pipeline/gwas_rss_fine_mapping.ipynb gwas_rss_fine_mapping \\\n", - " --cwd output --modular-script-dir /path/to/xqtl-protocol/code/script \\\n", - " --gwas-tsv-list MyGWAS=input/twas/protocol_example.twas.gwas_sumstats.chr22.tsv.gz \\\n", - " --regions chr22:10516173-17414263 \\\n", - " --ld-meta input/rss_analysis/protocol_example.ld_meta.tsv\n", - "```\n", - "\n", - "With SLALOM QC + RAISS imputation + per-method susie tuning:\n", - "```bash\n", - "sos run pipeline/gwas_rss_fine_mapping.ipynb gwas_rss_fine_mapping \\\n", - " --cwd output --modular-script-dir /path/to/code/script \\\n", - " --gwas-meta input/gwas/gwas_meta.tsv \\\n", - " --region-list input/rss_analysis/ld_blocks.bed \\\n", - " --ld-meta input/rss_analysis/ld_meta.tsv \\\n", - " --qc-method slalom --impute --qc-args '{\"mafCutoff\":0.01}' \\\n", - " --methods susie --method-args '{\"susie\":{\"L\":10}}'\n", - "```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "parameter: cwd = path('output')\n", - "parameter: modular_script_dir = path('code/script')\n", - "# --- per-study GWAS inputs (use either or both) ---------------------\n", - "parameter: gwas_meta = path('.')\n", - "parameter: gwas_tsv_list = [] # list of STUDY=PATH items\n", - "# --- region inputs (use either or both) -----------------------------\n", - "parameter: region_list = path('.')\n", - "parameter: regions = [] # list of chr:start-end strings\n", - "# --- LD reference + genome ------------------------------------------\n", - "parameter: ld_meta = path\n", - "parameter: genome = 'GRCh38'\n", - "# --- QC knobs (forwarded to gwas_sumstats_construct.R) --------------\n", - "parameter: qc_method = 'none' # none | slalom | dentist\n", - "parameter: impute = False\n", - "parameter: qc_args = '' # JSON object spliced into summaryStatsQc()\n", - "# --- fine-mapping knobs (forwarded to fine_mapping.R) ---------------\n", - "parameter: methods = 'susie'\n", - "parameter: coverage = 0.95\n", - "parameter: secondary_coverage = '0.7,0.5'\n", - "parameter: min_abs_corr = 0.8\n", - "parameter: median_abs_corr = '' # empty -> off (OR-logic purity; needs pecotmr step-1)\n", - "parameter: pip_cutoff = 0.025\n", - "parameter: L = 20\n", - "parameter: L_greedy = 5\n", - "parameter: method_args = '' # nested per-method JSON object\n", - "# --- plot opt-out ---------------------------------------------------\n", - "parameter: no_plot = False\n", - "# --- output prefix for the manifest file ----------------------------\n", - "parameter: manifest_name = 'gwas_rss'\n", - "# --- infrastructure -------------------------------------------------\n", - "parameter: container = ''\n", - "parameter: job_size = 1\n", - "parameter: walltime = '2h'\n", - "parameter: mem = '16G'\n", - "parameter: numThreads = 1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[generate_manifest]\n", - "# Resolve --gwas-meta / --gwas-tsv-list x --region-list / --regions into a\n", - "# single manifest TSV via gwas_rss_manifest.R. Downstream steps fan out\n", - "# over its rows; no Python parsing in this notebook.\n", - "input: None\n", - "output: f\"{cwd}/{manifest_name}.manifest.tsv\"\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '15m', mem = '2G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", - "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", - " Rscript ${modular_script_dir}/pecotmr_integration/gwas_rss_manifest.R \\\n", - " ${('--gwas-meta ' + str(gwas_meta)) if gwas_meta.is_file() else ''} \\\n", - " ${('--gwas-tsv-list ' + ' '.join(str(x) for x in gwas_tsv_list)) if gwas_tsv_list else ''} \\\n", - " ${('--region-list ' + str(region_list)) if region_list.is_file() else ''} \\\n", - " ${('--regions ' + ' '.join(str(x) for x in regions)) if regions else ''} \\\n", - " --output ${_output}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[generate_gwas_sumstats]\n", - "# Fan out over the manifest's rows: one (study, region) GwasSumStats per\n", - "# row. Manifest columns: study_id, gwas_tsv, column_mapping, chr, start,\n", - "# end, region_id, gwas_tsv_basename.\n", - "import csv\n", - "jobs = list(csv.DictReader(open(f\"{cwd}/{manifest_name}.manifest.tsv\"), delimiter='\\t'))\n", - "input: for_each = 'jobs'\n", - "output: f\"{cwd}/sumstats/{_jobs['study_id']}.{_jobs['region_id']}.gwas_sumstats.rds\"\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", - "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", - " Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n", - " --study ${_jobs['study_id']} \\\n", - " --gwas-tsv ${_jobs['gwas_tsv']} \\\n", - " --ld-block ${_jobs['chr']}:${_jobs['start']}-${_jobs['end']} \\\n", - " --ld-meta ${ld_meta} \\\n", - " --genome ${genome} \\\n", - " ${('--column-mapping ' + _jobs['column_mapping']) if _jobs['column_mapping'] else ''} \\\n", - " --qc-method ${qc_method} \\\n", - " ${'--impute' if impute else ''} \\\n", - " ${('--qc-args ' + repr(qc_args)) if qc_args else ''} \\\n", - " --output ${_output}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[gwas_fine_mapping]\n", - "# Per-RDS fan-out: one fine-mapping task per (study, region) GwasSumStats.\n", - "output: f\"{cwd}/fine_mapping/{_input:bnn}.gwas_finemap.rds\", group_by = 1\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n", - "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", - " Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \\\n", - " --gwas-sumstats ${_input} \\\n", - " --methods ${methods} \\\n", - " --coverage ${coverage} \\\n", - " --secondary-coverage ${secondary_coverage} \\\n", - " --min-abs-corr ${min_abs_corr} \\\n", - " --pip-cutoff ${pip_cutoff} \\\n", - " --L ${L} \\\n", - " --L-greedy ${L_greedy} \\\n", - " ${('--median-abs-corr ' + str(median_abs_corr)) if str(median_abs_corr) != '' else ''} \\\n", - " ${('--method-args ' + repr(method_args)) if method_args else ''} \\\n", - " --output ${_output}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[gwas_rss_plot]\n", - "stop_if(no_plot, '--no-plot set; skipping PIP plot step.')\n", - "output: f\"{cwd}/plots/{_input:bnn}.pip_plot.png\", group_by = 1\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = '30m', mem = '4G', cores = 1, tags = f\"{step_name}_{_output:bn}\"\n", - "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n", - " Rscript ${modular_script_dir}/pecotmr_integration/gwas_rss_plot.R \\\n", - " --input ${_input} \\\n", - " --output ${_output}" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "SoS", - "language": "sos", - "name": "sos" - }, - "language_info": { - "codemirror_mode": "sos", - "file_extension": ".sos", - "mimetype": "text/x-sos", - "name": "sos", - "pygments_lexer": "python", - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "", - "" - ] - ], - "version": "0.22.4" - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/code/script/pecotmr_integration/fine_mapping_pip_plot.R b/code/script/pecotmr_integration/fine_mapping_pip_plot.R index b1aba5fe..fa180106 100644 --- a/code/script/pecotmr_integration/fine_mapping_pip_plot.R +++ b/code/script/pecotmr_integration/fine_mapping_pip_plot.R @@ -4,7 +4,7 @@ # Render a per-region PIP landscape plot from a QtlFineMappingResult or # GwasFineMappingResult RDS (output of fine_mapping.ipynb / # multivariate_fine_mapping.ipynb / functional_fine_mapping.ipynb / -# gwas_rss_fine_mapping.ipynb). One PNG per input RDS, one facet panel +# rss_analysis.ipynb). One PNG per input RDS, one facet panel # per row in the FMR. # # Inputs: diff --git a/code/script/pecotmr_integration/gwas_rss_manifest.R b/code/script/pecotmr_integration/gwas_rss_manifest.R index 03ab7d54..904a0dec 100644 --- a/code/script/pecotmr_integration/gwas_rss_manifest.R +++ b/code/script/pecotmr_integration/gwas_rss_manifest.R @@ -1,7 +1,7 @@ #!/usr/bin/env Rscript # gwas_rss_manifest.R # -# Resolve `gwas_rss_fine_mapping.ipynb`'s per-study GWAS sources (a +# Resolve `rss_analysis.ipynb`'s per-study GWAS sources (a # --gwas-meta TSV and/or --gwas-tsv-list STUDY=PATH items) together with # its region inputs (a --region-list BED-ish file and/or --regions # chr:start-end strings) into a single per-job manifest TSV. One row per