From 305d38cf578bb2adb79d5b6482b772c1aaaf19e7 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 23 Jun 2026 11:59:38 -0400 Subject: [PATCH 1/6] Delete code/SoS/reference_data/rss_ld_sketch.ipynb --- code/SoS/reference_data/rss_ld_sketch.ipynb | 738 -------------------- 1 file changed, 738 deletions(-) delete mode 100644 code/SoS/reference_data/rss_ld_sketch.ipynb diff --git a/code/SoS/reference_data/rss_ld_sketch.ipynb b/code/SoS/reference_data/rss_ld_sketch.ipynb deleted file mode 100644 index c1ac6c9a..00000000 --- a/code/SoS/reference_data/rss_ld_sketch.ipynb +++ /dev/null @@ -1,738 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "8bdb623a", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# RSS LD Sketch Pipeline" - ] - }, - { - "cell_type": "markdown", - "id": "4b8d670a", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Description\n", - "\n", - "This pipeline generates a stochastic genotype sample **U = W\u1d40G** from whole-genome sequencing VCF files and stores it as a PLINK2 pgen file for use as an LD reference panel with SuSiE-RSS fine-mapping.\n", - "\n", - "**Key idea:** Rather than storing the full genotype matrix G (n \u00d7 p), we compute U = W\u1d40G (B \u00d7 p) using a random projection matrix $W \\sim N(0, 1/\\sqrt{n})$. The approximate LD matrix $R = U^T U / B \\approx G^T G / n$ by the Johnson\u2013Lindenstrauss lemma. 
G is never stored.\n", - "\n", - "**Matrix dimensions:**\n", - "- G : (n \u00d7 p) \u2014 n individuals \u00d7 p variants\n", - "- W : (n \u00d7 B) \u2014 projection matrix, generated once per cohort\n", - "- U : (B \u00d7 p) \u2014 stochastic genotype sample = W\u1d40G, stored in pgen\n", - "- $\\hat{R}$ : (p \u00d7 p) \u2014 approximate LD matrix, computed on-the-fly by SuSiE-RSS from U\n", - "\n", - "The workflow has three steps run in order: `generate_W` (build the projection matrix), `process_block` (read VCF per LD block and write per-block dosage sketches), and `merge_chrom` (merge per-block dosages into one per-chromosome pgen).\n", - "\n", - "**Note on data:** This example runs on a clearly-labeled synthetic toy dataset \u2014 a chr22 VCF (`protocol_example.genotype.chr22.bgz`, 60 individuals) and a 3-block LD-block BED (`protocol_example.ld_blocks.bed`). No access-controlled individual-level human genomic data is used." - ] - }, - { - "cell_type": "markdown", - "id": "ffdf8d92", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Input\n", - "\n", - "- **LD-block BED** (`--ld-block-file`): tab-separated file with columns `chr`, `start`, `end` (0-based half-open) defining the regions to sketch. Toy file: `input/rss_ld_sketch/protocol_example.ld_blocks.bed` (3 chr22 blocks).\n", - "- **VCF directory** (`--vcf-base`) and **prefix** (`--vcf-prefix`): bgzipped (`.bgz`) + tabix-indexed VCF(s) named `{vcf_prefix}{chr}.*.bgz`. Toy file: `input/rss_ld_sketch/protocol_example.genotype.chr22.bgz` (60 individuals), discovered with `--vcf-base input/rss_ld_sketch --vcf-prefix protocol_example.genotype.`\n", - "- **`--n-samples`**: number of individuals in the VCF (here 60). Must match the VCF sample count.\n", - "- **`--B`**: number of sketch (pseudo-)samples / projection dimension (the toy uses a small B for speed; production uses ~10000).\n", - "- **`--chrom`**: chromosome to process (e.g. 
22; 0 = all autosomes found).\n", - "- **`--cohort-id`**: label used to name output files.\n", - "\n", - "Filter thresholds (defaults shown in the implementation): `--maf-min 0.0005`, `--mac-min 5`, `--msng-min 0.05`." - ] - }, - { - "cell_type": "markdown", - "id": "dce33bc7", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Steps\n", - "\n", - "Run the three workflows in order. `generate_W` builds the shared projection matrix once; `process_block` sketches each LD block; `merge_chrom` assembles the per-chromosome pgen." - ] - }, - { - "cell_type": "markdown", - "id": "4633fe31", - "metadata": {}, - "source": [ - "**Timing:** ~10-20 min (chr22) on typical compute infrastructure." - ] - }, - { - "cell_type": "markdown", - "id": "6ffb31a4-7ad4-479d-8955-ba598a16ef07", - "metadata": {}, - "source": [ - "### Step 1. Generate the projection matrix W (run once per cohort; `--n-samples` must equal the VCF sample count)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "03612385", - "metadata": {}, - "outputs": [], - "source": [ - "sos run pipeline/rss_ld_sketch.ipynb generate_W \\\n", - " --n-samples 60 \\\n", - " --output-dir output/rss_ld_sketch \\\n", - " --B 50 \\\n", - " --seed 123 \\\n", - " --cwd output/rss_ld_sketch" - ] - }, - { - "cell_type": "markdown", - "id": "b62d37b8-da5d-4d5f-a3b7-7633ff5ff70f", - "metadata": {}, - "source": [ - "### Step 2. Process all LD blocks for the chromosome \u2014 read the VCF, filter variants, and write per-block dosage sketches U = W\u1d40G.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d2eccffd", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/rss_ld_sketch.ipynb process_block \\\n", - " --ld-block-file input/rss_ld_sketch/protocol_example.ld_blocks.bed \\\n", - " --chrom 22 \\\n", - " --vcf-base input/rss_ld_sketch \\\n", - " --vcf-prefix protocol_example.genotype. 
\\\n", - " --output-dir output/rss_ld_sketch \\\n", - " --W-matrix output/rss_ld_sketch/W_B50.npy \\\n", - " --B 50 \\\n", - " --cohort-id protocol_example. \\\n", - " --cwd output/rss_ld_sketch" - ] - }, - { - "cell_type": "markdown", - "id": "452c348f-487f-44e4-96c1-75fe118cbc9a", - "metadata": {}, - "source": [ - "### Step 3. Merge the per-block dosage sketches into one per-chromosome PLINK2 pgen." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "81f28809", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/rss_ld_sketch.ipynb merge_chrom \\\n", - " --output-dir output/rss_ld_sketch \\\n", - " --cohort-id protocol_example. \\\n", - " --chrom 22 \\\n", - " --cwd output/rss_ld_sketch" - ] - }, - { - "cell_type": "markdown", - "id": "32c022be", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface\n", - "\n", - "List every workflow and its parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f3569a70", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/rss_ld_sketch.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "id": "ac50d174", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Workflow implementation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a7886e46", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "parameter: cwd = path(\"output\")\n", - "parameter: job_size = 1\n", - "parameter: walltime = \"24:00:00\"\n", - "parameter: mem = \"32G\"\n", - "parameter: numThreads = 8\n", - "\n", - "cwd = path(f'{cwd:a}')\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c321bef5", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[generate_W]\n", - "# Generate projection matrix $W \\sim N(0, 1/\\sqrt{n})$, shape (n x B).\n", - "# Run ONCE before processing any chromosome.\n", - "#\n", - "# W 
depends only on n (total sample size) and B -- not on any variant data.\n", - "# n_samples is passed directly as a parameter; no VCF reading is needed.\n", - "# All 22 chromosomes reuse the same W so that per-chromosome stochastic\n", - "# genotype samples can be arithmetically merged for meta-analysis.\n", - "parameter: n_samples = int\n", - "parameter: output_dir = str\n", - "parameter: B = 10000\n", - "parameter: seed = 123\n", - "\n", - "import os\n", - "input: []\n", - "output: f'{output_dir}/W_B{B}.npy'\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '00:05:00', mem = '4G', cores = 1\n", - "python: expand = \"${ }\", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'\n", - "\n", - " import numpy as np\n", - " import os\n", - "\n", - " n = ${n_samples}\n", - " B = ${B}\n", - " seed = ${seed}\n", - " W_out = \"${_output}\"\n", - "\n", - " # -- Generate $W \\sim N(0, 1/\\sqrt{n})$ -----------------------------\n", - " # Convention: W = np.random.normal(0, 1/np.sqrt(n), size=(n, B))\n", - " # W is shared across all chromosomes -- do not regenerate per chromosome.\n", - " print(f\"Generating W ~ N(0, 1/sqrt({n})), shape ({n}, {B}), seed={seed}\")\n", - " np.random.seed(seed)\n", - " W = np.random.normal(0, 1.0 / np.sqrt(n), size=(n, B)).astype(np.float32)\n", - "\n", - " os.makedirs(os.path.dirname(os.path.abspath(W_out)), exist_ok=True)\n", - " np.save(W_out, W)\n", - " print(f\"Saved: {W_out}\")\n", - " print(f\"Shape: {W.shape}, size: {os.path.getsize(W_out)/1e9:.2f} GB\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68a93ed9", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[process_block]\n", - "parameter: ld_block_file = str\n", - "parameter: chrom = 0\n", - "parameter: vcf_base = str\n", - "parameter: vcf_prefix = str\n", - "parameter: cohort_id = \"ADSP.R5.EUR\"\n", - "parameter: output_dir = str\n", - "parameter: W_matrix = str\n", - "parameter: B = 10000\n", - "parameter: maf_min 
= 0.0005\n", - "parameter: mac_min = 5\n", - "parameter: msng_min = 0.05\n", - "parameter: sample_list = \"\"\n", - "\n", - "import os\n", - "\n", - "def _read_blocks(bed, chrom_filter):\n", - " blocks = []\n", - " with open(bed) as fh:\n", - " for line in fh:\n", - " if line.startswith(\"#\") or not line.strip():\n", - " continue\n", - " parts = line.split()\n", - " c = parts[0]\n", - " if not (c.startswith(\"chr\") and c[3:].isdigit()):\n", - " continue\n", - " cnum = int(c[3:])\n", - " if not (1 <= cnum <= 22):\n", - " continue\n", - " if chrom_filter != 0 and cnum != chrom_filter:\n", - " continue\n", - " blocks.append({\"chr\": c, \"start\": int(parts[1]), \"end\": int(parts[2])})\n", - " if not blocks:\n", - " raise ValueError(f\"No blocks found for chrom={chrom_filter} in {bed}\")\n", - " return blocks\n", - "\n", - "blocks = _read_blocks(ld_block_file, chrom)\n", - "print(f\" {len(blocks)} LD blocks queued\")\n", - "\n", - "input: for_each = \"blocks\"\n", - "output: f'{output_dir}/{_blocks[\"chr\"]}/{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}/{cohort_id}.{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}.dosage.gz'\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n", - "python: expand = \"${ }\"\n", - "\n", - " import numpy as np\n", - " import os\n", - " import gzip\n", - " import sys\n", - " import atexit\n", - " from math import nan\n", - " from cyvcf2 import VCF\n", - " from os import listdir\n", - "\n", - " # Block coordinates from for_each loop\n", - " chrm_str = \"${_blocks['chr']}\"\n", - " block_start = ${_blocks[\"start\"]}\n", - " block_end = ${_blocks[\"end\"]}\n", - "\n", - " vcf_base = \"${vcf_base}\"\n", - " vcf_prefix = \"${vcf_prefix}\"\n", - " W_path = \"${W_matrix}\"\n", - " B = ${B}\n", - " maf_min = ${maf_min}\n", - " mac_min = ${mac_min}\n", - " msng_min = ${msng_min}\n", - " sample_list = \"${sample_list}\"\n", - " cohort_id = \"${cohort_id}\"\n", - " base_dir = 
\"${output_dir}\"\n", - "\n", - " block_tag = f\"{chrm_str}_{block_start}_{block_end}\"\n", - " output_dir = os.path.join(base_dir, chrm_str, block_tag)\n", - " os.makedirs(output_dir, exist_ok=True)\n", - "\n", - " log_path = os.path.join(output_dir, f\"{block_tag}.log\")\n", - " log_fh = open(log_path, \"w\")\n", - " sys.stdout = log_fh\n", - " sys.stderr = log_fh\n", - " atexit.register(log_fh.close)\n", - "\n", - " # -- Load sample subset (optional) -----------------------------\n", - " sample_subset = None\n", - " if sample_list:\n", - " if not os.path.exists(sample_list):\n", - " raise FileNotFoundError(f\"sample_list not found: {sample_list}\")\n", - " with open(sample_list) as fh:\n", - " sample_subset = set(line.strip() for line in fh if line.strip())\n", - " print(f\" Sample subset: {len(sample_subset):,} samples\")\n", - "\n", - " # -- Helpers ---------------------------------------------------\n", - " def get_vcf_files(chrm_str):\n", - " files = sorted([\n", - " os.path.join(vcf_base, x)\n", - " for x in listdir(vcf_base)\n", - " if x.endswith(\".bgz\") and (\n", - " x.startswith(vcf_prefix + chrm_str + \":\") or\n", - " x.startswith(vcf_prefix + chrm_str + \".\")\n", - " )\n", - " ])\n", - " if not files:\n", - " raise FileNotFoundError(f\"No VCF files for {chrm_str} in {vcf_base}\")\n", - " return files\n", - "\n", - " def open_vcf(vf, sample_subset):\n", - " \"\"\"Open a VCF file, applying sample subset if provided.\"\"\"\n", - " vcf = VCF(vf)\n", - " if sample_subset is not None:\n", - " vcf_samples = vcf.samples\n", - " keep = [s for s in vcf_samples if s in sample_subset]\n", - " if not keep:\n", - " raise ValueError(f\"No sample_list samples in {os.path.basename(vf)}\")\n", - " vcf.set_samples(keep)\n", - " return vcf\n", - "\n", - " def extract_dosage(var):\n", - " \"\"\"Extract diploid dosage from a cyvcf2 variant. 
Returns list of floats (nan for missing).\"\"\"\n", - " return [sum(x[0:2]) for x in [[nan if v == -1 else v for v in gt] for gt in var.genotypes]]\n", - "\n", - " def fill_missing_col_means(G):\n", - " col_means = np.nanmean(G, axis=0)\n", - " return np.where(np.isnan(G), col_means, G)\n", - "\n", - " # -- Single-pass: scan variants, filter, and collect dosages ---\n", - " # BED is 0-based half-open [start, end); VCF is 1-based.\n", - " print(f\"[1/3] Scanning {chrm_str} [{block_start:,}, {block_end:,}) ...\")\n", - " vcf_files = get_vcf_files(chrm_str)\n", - " region = f\"{chrm_str}:{block_start+1}-{block_end}\"\n", - " var_info = []\n", - " dosage_matrix = []\n", - " n_samples = None\n", - " # Filter counters\n", - " n_total = 0\n", - " n_multiallelic = 0\n", - " n_monomorphic = 0\n", - " n_all_na = 0\n", - " n_low_maf = 0\n", - " n_low_mac = 0\n", - " n_high_msng = 0\n", - "\n", - " for vf in vcf_files:\n", - " vcf = open_vcf(vf, sample_subset)\n", - " if n_samples is None:\n", - " n_samples = len(vcf.samples)\n", - " for var in vcf(region):\n", - " if not (block_start <= var.POS - 1 < block_end):\n", - " continue\n", - " n_total += 1\n", - " if len(var.ALT) != 1:\n", - " n_multiallelic += 1\n", - " continue\n", - " dosage = extract_dosage(var)\n", - " if np.nanvar(dosage) == 0:\n", - " n_monomorphic += 1\n", - " continue\n", - " nan_count = int(np.sum(np.isnan(dosage)))\n", - " n_non_na = len(dosage) - nan_count\n", - " if n_non_na == 0:\n", - " n_all_na += 1\n", - " continue\n", - " alt_sum = float(np.nansum(dosage))\n", - " mac = min(2 * n_non_na - alt_sum, alt_sum)\n", - " maf = mac / (2 * n_non_na)\n", - " af = alt_sum / (2 * n_non_na)\n", - " msng_rate = nan_count / len(dosage)\n", - " if msng_rate > msng_min:\n", - " n_high_msng += 1\n", - " continue\n", - " if maf < maf_min:\n", - " n_low_maf += 1\n", - " continue\n", - " if mac < mac_min:\n", - " n_low_mac += 1\n", - " continue\n", - " var_info.append({\n", - " \"chr\": var.CHROM, \"pos\": var.POS,\n", 
- " \"ref\": var.REF, \"alt\": var.ALT[0],\n", - " \"af\": round(float(af), 6),\n", - " \"id\": f\"{var.CHROM}:{var.POS}:{var.REF}:{var.ALT[0]}\",\n", - " \"obs_ct\": 2 * n_non_na,\n", - " })\n", - " dosage_matrix.append(dosage)\n", - " vcf.close()\n", - "\n", - " n_passed = len(var_info)\n", - " print(f\" {n_total:,} total variants in region\")\n", - " print(f\" {n_passed:,} passed filters (n={n_samples:,})\")\n", - " print(f\" Filtered: {n_multiallelic:,} multiallelic, \"\n", - " f\"{n_monomorphic:,} monomorphic, {n_all_na:,} all-NA, \"\n", - " f\"{n_high_msng:,} high-missingness, \"\n", - " f\"{n_low_maf:,} low-MAF, {n_low_mac:,} low-MAC\")\n", - "\n", - " if not var_info:\n", - " raise ValueError(f\"No passing variants in {chrm_str} [{block_start:,}, {block_end:,})\")\n", - "\n", - " # -- Load W ----------------------------------------------------\n", - " print(f\"[2/3] Loading W ...\")\n", - " W = np.load(W_path)\n", - " if W.shape != (n_samples, B):\n", - " raise ValueError(f\"W shape mismatch: {W.shape} vs ({n_samples},{B})\")\n", - " W = W.astype(np.float32)\n", - " print(f\" W: {W.shape}\")\n", - "\n", - " # -- Compute U = $W^T G$ and write output files --------------------\n", - " print(f\"[3/3] Computing U and writing output files ...\")\n", - "\n", - " dosage_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.dosage.gz\")\n", - " map_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.map\")\n", - " afreq_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.afreq\")\n", - " meta_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.meta\")\n", - "\n", - " # Write .map\n", - " with open(map_path, \"w\") as fh:\n", - " for v in var_info:\n", - " fh.write(f\"{v['chr']}\\t{v['id']}\\t0\\t{v['pos']}\\n\")\n", - "\n", - " # Write .meta\n", - " with open(meta_path, \"w\") as fh:\n", - " fh.write(f\"source_n_samples={n_samples}\\nB={B}\\n\")\n", - " 
fh.write(f\"chrom={chrm_str}\\nblock_start={block_start}\\nblock_end={block_end}\\n\")\n", - " fh.write(f\"n_total={n_total}\\nn_passed={n_passed}\\n\")\n", - " fh.write(f\"n_multiallelic={n_multiallelic}\\nn_monomorphic={n_monomorphic}\\n\")\n", - " fh.write(f\"n_all_na={n_all_na}\\nn_high_msng={n_high_msng}\\n\")\n", - " fh.write(f\"n_low_maf={n_low_maf}\\nn_low_mac={n_low_mac}\\n\")\n", - "\n", - "\n", - " # Build G from collected dosages, compute U = $W^T G$, write dosage.gz\n", - " # Dosage format=1: ID ALT REF val_S1 ... val_SB\n", - " # Min-max scaling to [0, 2] makes the output plink2-compatible as dosage.\n", - " # This preserves correlation structure (cor is scale-invariant) which is\n", - " # what matters for LD computation downstream.\n", - " G = np.array(dosage_matrix, dtype=np.float32).T # (n_samples, n_variants)\n", - " del dosage_matrix\n", - " G = fill_missing_col_means(G)\n", - "\n", - " # variant-wise scaling\n", - " col_mean = G.mean(axis=0, keepdims=True)\n", - " col_std = G.std(axis=0, keepdims=True)\n", - " # avoid division by zero\n", - " col_std[col_std == 0] = 1.0\n", - " G = (G - col_mean) / col_std\n", - "\n", - " U = W.T @ G # (B, n_variants)\n", - " del G\n", - "\n", - " col_min = U.min(axis=0)\n", - " col_max = U.max(axis=0)\n", - " denom = col_max - col_min\n", - " denom[denom == 0] = 1.0\n", - " U = 2.0 * (U - col_min) / denom\n", - " U = np.round(U, 4)\n", - "\n", - " # Record the col min and max for U\n", - "\n", - " with open(afreq_path, \"w\") as fh:\n", - " # Add column headers\n", - " fh.write(\"#CHROM\\tID\\tREF\\tALT\\tALT_FREQS\\tOBS_CT\\tU_MIN\\tU_MAX\\n\")\n", - " for j, v in enumerate(var_info):\n", - " fh.write(f\"{v['chr']}\\t{v['id']}\\t{v['ref']}\\t{v['alt']}\\t\"\n", - " f\"{v['af']:.6f}\\t{v['obs_ct']}\\t\"\n", - " f\"{col_min[j]:.6f}\\t{col_max[j]:.6f}\\n\")\n", - "\n", - " with gzip.open(dosage_path, \"wt\", compresslevel=4) as gz:\n", - " for j, v in enumerate(var_info):\n", - " vals = \" \".join(f\"{x:.4f}\" 
for x in U[:, j])\n", - " gz.write(f\"{v['id']} {v['alt']} {v['ref']} {vals}\\n\")\n", - "\n", - " del U\n", - " print(f\" Written: {len(var_info):,} variants -> {os.path.basename(dosage_path)}\")\n", - " print(f\" Written: {os.path.basename(map_path)}, {os.path.basename(afreq_path)}\")\n", - " print(f\"\\nDone: {chrm_str} [{block_start:,}, {block_end:,})\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9e8fff43", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[merge_chrom]\n", - "parameter: chrom = 0\n", - "parameter: output_dir = str\n", - "parameter: cohort_id = str\n", - "parameter: plink2_bin = \"plink2\"\n", - "\n", - "import os, glob\n", - "\n", - "def _chroms_to_process(output_dir, chrom_filter):\n", - " if chrom_filter != 0:\n", - " return [f\"chr{chrom_filter}\"]\n", - " return sorted(set(\n", - " os.path.basename(d)\n", - " for d in glob.glob(os.path.join(output_dir, \"chr*\"))\n", - " if os.path.isdir(d)\n", - " ))\n", - "\n", - "chroms = _chroms_to_process(output_dir, chrom)\n", - "\n", - "input: for_each = \"chroms\"\n", - "output: f\"{output_dir}/{_chroms}/{cohort_id}.{_chroms}.pgen\"\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n", - "bash: expand = \"$[ ]\"\n", - " set -euo pipefail\n", - " shopt -s nullglob\n", - "\n", - " chrom_dir=\"$[output_dir]/$[_chroms]\"\n", - " final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n", - " merge_list=\"${chrom_dir}/$[cohort_id].$[_chroms]_pmerge_list.txt\"\n", - "\n", - " # Step 1: Convert each block dosage.gz -> sorted per-block pgen\n", - " > \"${merge_list}\"\n", - " files=(\"${chrom_dir}\"/*/*.dosage.gz)\n", - " if [ ${#files[@]} -eq 0 ]; then\n", - " echo \"No dosage files found in ${chrom_dir}\" >&2\n", - " exit 1\n", - " fi\n", - " for dosage_gz in \"${files[@]}\"; do\n", - " block_dir=$(dirname \"${dosage_gz}\")\n", - " block_tag=$(basename \"${block_dir}\")\n", - " 
prefix=\"${block_dir}/$[cohort_id].${block_tag}_tmp\"\n", - " map_file=\"${block_dir}/$[cohort_id].${block_tag}.map\"\n", - " psam_file=\"${block_dir}/$[cohort_id].${block_tag}.psam\"\n", - " meta_file=\"${block_dir}/$[cohort_id].${block_tag}.meta\"\n", - " B=$(grep \"^B=\" \"${meta_file}\" | cut -d= -f2)\n", - " printf '#FID\\tIID\\n' > \"${psam_file}\"\n", - " for i in $(seq 1 ${B}); do\n", - " printf 'S%d\\tS%d\\n' ${i} ${i} >> \"${psam_file}\"\n", - " done\n", - " $[plink2_bin] \\\n", - " --import-dosage \"${dosage_gz}\" format=1 noheader \\\n", - " --psam \"${psam_file}\" \\\n", - " --map \"${map_file}\" \\\n", - " --make-pgen \\\n", - " --out \"${prefix}_unsorted\" \\\n", - " --silent\n", - " $[plink2_bin] \\\n", - " --pfile \"${prefix}_unsorted\" \\\n", - " --make-pgen \\\n", - " --sort-vars \\\n", - " --out \"${prefix}\" \\\n", - " --silent\n", - " rm -f \"${prefix}_unsorted.pgen\" \"${prefix}_unsorted.pvar\" \"${prefix}_unsorted.psam\"\n", - " echo \"${prefix}\" >> \"${merge_list}\"\n", - " done\n", - "\n", - " # Step 2: Merge all per-block pgens -> one per-chrom pgen\n", - " $[plink2_bin] \\\n", - " --pmerge-list \"${merge_list}\" pfile \\\n", - " --make-pgen \\\n", - " --sort-vars \\\n", - " --out \"${final_prefix}\"\n", - "\n", - " # Step 3: Concatenate .afreq\n", - " first=1\n", - " for f in \"${chrom_dir}\"/*/*.afreq; do\n", - " if [ \"${first}\" -eq 1 ]; then\n", - " cat \"${f}\" > \"${final_prefix}.afreq\"\n", - " first=0\n", - " else\n", - " tail -n +2 \"${f}\" >> \"${final_prefix}.afreq\"\n", - " fi\n", - " done\n", - "\n", - "R: expand = \"$[ ]\"\n", - " library(data.table)\n", - " meta_files <- list.files(\"$[output_dir]/$[_chroms]\",\n", - " pattern = \"[.]meta$\", recursive = TRUE,\n", - " full.names = TRUE)\n", - " if (length(meta_files) > 0) {\n", - " fields <- c(\"n_total\", \"n_passed\", \"n_multiallelic\", \"n_monomorphic\",\n", - " \"n_all_na\", \"n_high_msng\", \"n_low_maf\", \"n_low_mac\")\n", - " stats <- rbindlist(lapply(meta_files, 
function(f) {\n", - " lines <- grep(\"^n_\", readLines(f), value = TRUE)\n", - " kv <- strsplit(lines, \"=\")\n", - " vals <- setNames(as.integer(sapply(kv, `[`, 2)), sapply(kv, `[`, 1))\n", - " as.data.table(as.list(vals[fields]))\n", - " }))\n", - " totals <- colSums(stats, na.rm = TRUE)\n", - " summary <- data.frame(t(totals))\n", - " summary$pct_dropped <- round(100 * (1 - summary$n_passed / summary$n_total), 1)\n", - " cat(\"\\n=== Filter Summary for $[_chroms] ===\\n\")\n", - " print(data.frame(value = unlist(summary), row.names = names(summary)))\n", - " }\n", - "bash: expand = \"$[ ]\"\n", - " # Step 5: Cleanup block intermediates\n", - " chrom_dir=\"$[output_dir]/$[_chroms]\"\n", - " final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n", - " rm -f \"${final_prefix}_pmerge_list.txt\"\n", - " rm -f \"${final_prefix}-merge.pgen\" \"${final_prefix}-merge.pvar\" \"${final_prefix}-merge.psam\"\n", - " for block_dir in \"${chrom_dir}\"/*/; do\n", - " rm -rf \"${block_dir}\"\n", - " done\n" - ] - }, - { - "cell_type": "markdown", - "id": "dc998dcc", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Troubleshooting\n", - "\n", - "| Symptom | Cause | Fix |\n", - "|---|---|---|\n", - "| `No VCF files for chrXX in {vcf_base}` | VCF naming or extension mismatch | Files must end in `.bgz` and be named `{vcf_prefix}{chr}.*.bgz`; check `--vcf-base` and `--vcf-prefix`. |\n", - "| `W shape mismatch` | `--n-samples` or `--B` differs from the W used | Re-run `generate_W` with the same `--n-samples` and `--B`, and pass that `W_B{B}.npy` to `process_block`. |\n", - "| `No passing variants in chrXX` | Filters removed everything (small toy cohort) | Widen `--maf-min` / `--mac-min` / `--msng-min`, or choose blocks with more variants. |\n", - "| `No blocks found for chrom=XX` | `--chrom` does not match any BED rows | Ensure the BED `chr` column matches (e.g. `chr22`) and `--chrom` is the matching number. 
|\n", - "| Region query returns nothing | Missing tabix index | Run `tabix -p vcf file.bgz` to create the `.tbi`. |" - ] - }, - { - "cell_type": "markdown", - "id": "ac3cfb79", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Output\n", - "\n", - "Per chromosome, under `--cwd`:\n", - "- `{cohort_id}.chr{N}.pgen` \u2014 binary genotype-sketch data (B pseudo-samples \u00d7 p variants)\n", - "- `{cohort_id}.chr{N}.pvar` \u2014 variant information\n", - "- `{cohort_id}.chr{N}.psam` \u2014 sample (sketch) information\n", - "- `{cohort_id}.chr{N}.afreq` \u2014 allele frequencies\n", - "\n", - "These feed SuSiE-RSS fine-mapping: load with a metadata TSV (one row per chromosome, columns `#chrom start end path`, `path` = pgen prefix). Use the X (genotype) interface for `susie_rss(z, X=X)` or the R (correlation) interface for `susie_rss(z, R=R)`." - ] - }, - { - "cell_type": "markdown", - "id": "ff927c9c", - "metadata": {}, - "source": [ - "## Anticipated Results\n", - "\n", - "The pipeline produces output files in the `output/` subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the **Output** section above for the expected file names and formats." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.13" - }, - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "sos", - "", - "" - ] - ], - "version": "" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file From 92df463fa2f18119615dd7ba5f5c88c9d93edd4f Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 23 Jun 2026 12:00:07 -0400 Subject: [PATCH 2/6] remove dup merge output --- code/SoS/reference_data/rss_ld_sketch.ipynb | 850 ++++++++++++++++++++ 1 file changed, 850 insertions(+) create mode 100644 code/SoS/reference_data/rss_ld_sketch.ipynb diff --git a/code/SoS/reference_data/rss_ld_sketch.ipynb b/code/SoS/reference_data/rss_ld_sketch.ipynb new file mode 100644 index 00000000..0396c03d --- /dev/null +++ b/code/SoS/reference_data/rss_ld_sketch.ipynb @@ -0,0 +1,850 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8bdb623a", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# RSS LD Sketch Pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "4b8d670a", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Description\n", + "\n", + "This pipeline generates a stochastic genotype sample **U = WᵀG** from whole-genome sequencing VCF files and stores it as a PLINK2 pgen file for use as an LD reference panel with SuSiE-RSS fine-mapping.\n", + "\n", + "**Key idea:** Rather than storing the full genotype matrix G (n × p), we compute U = WᵀG (B × p) using a random projection matrix $W \\sim N(0, 1/\\sqrt{n})$. The approximate LD matrix $R = U^T U / B \\approx G^T G / n$ by the Johnson–Lindenstrauss lemma. 
G is never stored.\n", + "\n", + "**Matrix dimensions:**\n", + "- G : (n × p) — n individuals × p variants\n", + "- W : (n × B) — projection matrix, generated once per cohort\n", + "- U : (B × p) — stochastic genotype sample = WᵀG, stored in pgen\n", + "- $\\hat{R}$ : (p × p) — approximate LD matrix, computed on-the-fly by SuSiE-RSS from U\n", + "\n", + "The workflow has three steps run in order: `generate_W` (build the projection matrix), `process_block` (read VCF per LD block and write per-block dosage sketches), and `merge_chrom` (merge per-block dosages into one per-chromosome pgen).\n", + "\n", + "**Note on data:** This example runs on a clearly-labeled synthetic toy dataset — a chr22 VCF (`protocol_example.genotype.chr22.bgz`, 60 individuals) and a 3-block LD-block BED (`protocol_example.ld_blocks.bed`). No access-controlled individual-level human genomic data is used." + ] + }, + { + "cell_type": "markdown", + "id": "ffdf8d92", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Input\n", + "\n", + "- **LD-block BED** (`--ld-block-file`): tab-separated file with columns `chr`, `start`, `end` (0-based half-open) defining the regions to sketch. Toy file: `input/rss_ld_sketch/protocol_example.ld_blocks.bed` (3 chr22 blocks).\n", + "- **VCF directory** (`--vcf-base`) and **prefix** (`--vcf-prefix`): bgzipped (`.bgz`) + tabix-indexed VCF(s) named `{vcf_prefix}{chr}.*.bgz`. Toy file: `input/rss_ld_sketch/protocol_example.genotype.chr22.bgz` (60 individuals), discovered with `--vcf-base input/rss_ld_sketch --vcf-prefix protocol_example.genotype.`\n", + "- **`--n-samples`**: number of individuals in the VCF (here 60). Must match the VCF sample count.\n", + "- **`--B`**: number of sketch (pseudo-)samples / projection dimension (the toy uses a small B for speed; production uses ~10000).\n", + "- **`--chrom`**: chromosome to process (e.g. 
22; 0 = all autosomes found).\n", + "- **`--cohort-id`**: label used to name output files.\n", + "\n", + "Filter thresholds (defaults shown in the implementation): `--maf-min 0.0005`, `--mac-min 5`, `--msng-min 0.05`." + ] + }, + { + "cell_type": "markdown", + "id": "dce33bc7", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Steps\n", + "\n", + "Run the three workflows in order. `generate_W` builds the shared projection matrix once; `process_block` sketches each LD block; `merge_chrom` assembles the per-chromosome pgen." + ] + }, + { + "cell_type": "markdown", + "id": "4633fe31", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "**Timing:** ~10-20 min (chr22) on typical compute infrastructure." + ] + }, + { + "cell_type": "markdown", + "id": "6ffb31a4-7ad4-479d-8955-ba598a16ef07", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Step 1. Generate the projection matrix W (run once per cohort; `--n-samples` must equal the VCF sample count)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "03612385", + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. 
Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mgenerate_W\u001b[0m: \n", + "INFO: \u001b[32mgenerate_W\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mgenerate_W\u001b[0m output: \u001b[32moutput/rss_ld_sketch/W_B50.npy\u001b[0m\n", + "INFO: Workflow generate_W (ID=w68f63c60d8da4b5e) is executed successfully with 1 completed step.\n" + ] + } + ], + "source": [ + "sos run pipeline/rss_ld_sketch.ipynb generate_W \\\n", + " --n-samples 60 \\\n", + " --output-dir output/rss_ld_sketch \\\n", + " --B 50 \\\n", + " --seed 123 \\\n", + " --cwd output/rss_ld_sketch" + ] + }, + { + "cell_type": "markdown", + "id": "b62d37b8-da5d-4d5f-a3b7-7633ff5ff70f", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Step 2. Process all LD blocks for the chromosome — read the VCF, filter variants, and write per-block dosage sketches U = WᵀG.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "d2eccffd", + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. 
Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mprocess_block\u001b[0m: \n", + " 3 LD blocks queued\n", + "INFO: \u001b[32mprocess_block\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mprocess_block\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mprocess_block\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mprocess_block\u001b[0m output: \u001b[32moutput/rss_ld_sketch/chr22/chr22_16000000_20000000/protocol_example..chr22_16000000_20000000.dosage.gz output/rss_ld_sketch/chr22/chr22_30000000_34000000/protocol_example..chr22_30000000_34000000.dosage.gz... (3 items in 3 groups)\u001b[0m\n", + "INFO: Workflow process_block (ID=w8c1e759d62203ef6) is executed successfully with 1 completed step and 3 completed substeps.\n" + ] + } + ], + "source": [ + "sos run pipeline/rss_ld_sketch.ipynb process_block \\\n", + " --ld-block-file input/rss_ld_sketch/protocol_example.ld_blocks.bed \\\n", + " --chrom 22 \\\n", + " --vcf-base input/rss_ld_sketch \\\n", + " --vcf-prefix protocol_example.genotype. \\\n", + " --output-dir output/rss_ld_sketch \\\n", + " --W-matrix output/rss_ld_sketch/W_B50.npy \\\n", + " --B 50 \\\n", + " --cohort-id protocol_example. \\\n", + " --cwd output/rss_ld_sketch" + ] + }, + { + "cell_type": "markdown", + "id": "452c348f-487f-44e4-96c1-75fe118cbc9a", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Step 3. Merge the per-block dosage sketches into one per-chromosome PLINK2 pgen." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "81f28809", + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. 
The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mmerge_chrom\u001b[0m: \n", + "PLINK v2.0.0-a.6.9LM 64-bit Intel (29 Jan 2025) cog-genomics.org/plink/2.0/\n", + "(C) 2005-2025 Shaun Purcell, Christopher Chang GNU General Public License v3\n", + "Logging to output/rss_ld_sketch/chr22/protocol_example..chr22.log.\n", + "Options in effect:\n", + " --make-pgen\n", + " --out output/rss_ld_sketch/chr22/protocol_example..chr22\n", + " --pmerge-list output/rss_ld_sketch/chr22/protocol_example..chr22_pmerge_list.txt pfile\n", + " --sort-vars\n", + "\n", + "Start time: Tue Jun 23 09:52:54 2026\n", + "191527 MiB RAM detected, ~187643 available; reserving 95763 MiB for main\n", + "workspace.\n", + "Using up to 32 threads (change this with --threads).\n", + "--pmerge-list: 3 filesets specified.\n", + "--pmerge-list: 50 samples present.\n", + "--pmerge-list: Merged .psam written to\n", + "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam .\n", + "--pmerge-list: 3 .pvar files scanned.\n", + "Concatenation job detected.\n", + "Concatenating... 673/673 variants complete.\n", + "Results written to\n", + "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pgen +\n", + "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar .\n", + "50 samples (0 females, 0 males, 50 ambiguous; 50 founders) loaded from\n", + "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam.\n", + "673 variants loaded from\n", + "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar.\n", + "Note: No phenotype data present.\n", + "Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pvar ... done.\n", + "Writing output/rss_ld_sketch/chr22/protocol_example..chr22.psam ... done.\n", + "Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pgen ... 
done.\n", + "End time: Tue Jun 23 09:52:54 2026\n", + "\n", + "=== Filter Summary for chr22 ===\n", + " value\n", + "n_total 6157.0\n", + "n_passed 673.0\n", + "n_multiallelic 0.0\n", + "n_monomorphic 4701.0\n", + "n_all_na 0.0\n", + "n_high_msng 101.0\n", + "n_low_maf 0.0\n", + "n_low_mac 682.0\n", + "pct_dropped 89.1\n", + "INFO: \u001b[32mmerge_chrom\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmerge_chrom\u001b[0m output: \u001b[32moutput/rss_ld_sketch/chr22/protocol_example..chr22.pgen\u001b[0m\n", + "INFO: Workflow merge_chrom (ID=w8e5a670551e06660) is executed successfully with 1 completed step.\n" + ] + } + ], + "source": [ + "sos run pipeline/rss_ld_sketch.ipynb merge_chrom \\\n", + " --output-dir output/rss_ld_sketch \\\n", + " --cohort-id protocol_example. \\\n", + " --chrom 22 \\\n", + " --cwd output/rss_ld_sketch" + ] + }, + { + "cell_type": "markdown", + "id": "32c022be", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface\n", + "\n", + "List every workflow and its parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3569a70", + "metadata": { + "kernel": "Bash" + }, + "outputs": [], + "source": [ + "sos run pipeline/rss_ld_sketch.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "id": "ac50d174", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Workflow implementation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7886e46", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path(\"output\")\n", + "parameter: job_size = 1\n", + "parameter: walltime = \"24:00:00\"\n", + "parameter: mem = \"32G\"\n", + "parameter: numThreads = 8\n", + "\n", + "cwd = path(f'{cwd:a}')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c321bef5", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[generate_W]\n", + "# Generate projection matrix $W \\sim N(0, 
1/\\sqrt{n})$, shape (n x B).\n", + "# Run ONCE before processing any chromosome.\n", + "#\n", + "# W depends only on n (total sample size) and B -- not on any variant data.\n", + "# n_samples is passed directly as a parameter; no VCF reading is needed.\n", + "# All 22 chromosomes reuse the same W so that per-chromosome stochastic\n", + "# genotype samples can be arithmetically merged for meta-analysis.\n", + "parameter: n_samples = int\n", + "parameter: output_dir = str\n", + "parameter: B = 10000\n", + "parameter: seed = 123\n", + "\n", + "import os\n", + "input: []\n", + "output: f'{output_dir}/W_B{B}.npy'\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '00:05:00', mem = '4G', cores = 1\n", + "python: expand = \"${ }\", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'\n", + "\n", + " import numpy as np\n", + " import os\n", + "\n", + " n = ${n_samples}\n", + " B = ${B}\n", + " seed = ${seed}\n", + " W_out = \"${_output}\"\n", + "\n", + " # -- Generate $W \\sim N(0, 1/\\sqrt{n})$ -----------------------------\n", + " # Convention: W = np.random.normal(0, 1/np.sqrt(n), size=(n, B))\n", + " # W is shared across all chromosomes -- do not regenerate per chromosome.\n", + " print(f\"Generating W ~ N(0, 1/sqrt({n})), shape ({n}, {B}), seed={seed}\")\n", + " np.random.seed(seed)\n", + " W = np.random.normal(0, 1.0 / np.sqrt(n), size=(n, B)).astype(np.float32)\n", + "\n", + " os.makedirs(os.path.dirname(os.path.abspath(W_out)), exist_ok=True)\n", + " np.save(W_out, W)\n", + " print(f\"Saved: {W_out}\")\n", + " print(f\"Shape: {W.shape}, size: {os.path.getsize(W_out)/1e9:.2f} GB\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68a93ed9", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[process_block]\n", + "parameter: ld_block_file = str\n", + "parameter: chrom = 0\n", + "parameter: vcf_base = str\n", + "parameter: vcf_prefix = str\n", + "parameter: cohort_id = \"ADSP.R5.EUR\"\n", + "parameter: 
output_dir = str\n", + "parameter: W_matrix = str\n", + "parameter: B = 10000\n", + "parameter: maf_min = 0.0005\n", + "parameter: mac_min = 5\n", + "parameter: msng_min = 0.05\n", + "parameter: sample_list = \"\"\n", + "\n", + "import os\n", + "\n", + "def _read_blocks(bed, chrom_filter):\n", + " blocks = []\n", + " with open(bed) as fh:\n", + " for line in fh:\n", + " if line.startswith(\"#\") or not line.strip():\n", + " continue\n", + " parts = line.split()\n", + " c = parts[0]\n", + " if not (c.startswith(\"chr\") and c[3:].isdigit()):\n", + " continue\n", + " cnum = int(c[3:])\n", + " if not (1 <= cnum <= 22):\n", + " continue\n", + " if chrom_filter != 0 and cnum != chrom_filter:\n", + " continue\n", + " blocks.append({\"chr\": c, \"start\": int(parts[1]), \"end\": int(parts[2])})\n", + " if not blocks:\n", + " raise ValueError(f\"No blocks found for chrom={chrom_filter} in {bed}\")\n", + " return blocks\n", + "\n", + "blocks = _read_blocks(ld_block_file, chrom)\n", + "print(f\" {len(blocks)} LD blocks queued\")\n", + "\n", + "input: for_each = \"blocks\"\n", + "output: f'{output_dir}/{_blocks[\"chr\"]}/{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}/{cohort_id}.{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}.dosage.gz'\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n", + "python: expand = \"${ }\"\n", + "\n", + " import numpy as np\n", + " import os\n", + " import gzip\n", + " import sys\n", + " import atexit\n", + " from math import nan\n", + " from cyvcf2 import VCF\n", + " from os import listdir\n", + "\n", + " # Block coordinates from for_each loop\n", + " chrm_str = \"${_blocks['chr']}\"\n", + " block_start = ${_blocks[\"start\"]}\n", + " block_end = ${_blocks[\"end\"]}\n", + "\n", + " vcf_base = \"${vcf_base}\"\n", + " vcf_prefix = \"${vcf_prefix}\"\n", + " W_path = \"${W_matrix}\"\n", + " B = ${B}\n", + " maf_min = ${maf_min}\n", + " mac_min = ${mac_min}\n", + " msng_min = 
${msng_min}\n", + " sample_list = \"${sample_list}\"\n", + " cohort_id = \"${cohort_id}\"\n", + " base_dir = \"${output_dir}\"\n", + "\n", + " block_tag = f\"{chrm_str}_{block_start}_{block_end}\"\n", + " output_dir = os.path.join(base_dir, chrm_str, block_tag)\n", + " os.makedirs(output_dir, exist_ok=True)\n", + "\n", + " log_path = os.path.join(output_dir, f\"{block_tag}.log\")\n", + " log_fh = open(log_path, \"w\")\n", + " sys.stdout = log_fh\n", + " sys.stderr = log_fh\n", + " atexit.register(log_fh.close)\n", + "\n", + " # -- Load sample subset (optional) -----------------------------\n", + " sample_subset = None\n", + " if sample_list:\n", + " if not os.path.exists(sample_list):\n", + " raise FileNotFoundError(f\"sample_list not found: {sample_list}\")\n", + " with open(sample_list) as fh:\n", + " sample_subset = set(line.strip() for line in fh if line.strip())\n", + " print(f\" Sample subset: {len(sample_subset):,} samples\")\n", + "\n", + " # -- Helpers ---------------------------------------------------\n", + " def get_vcf_files(chrm_str):\n", + " files = sorted([\n", + " os.path.join(vcf_base, x)\n", + " for x in listdir(vcf_base)\n", + " if x.endswith(\".bgz\") and (\n", + " x.startswith(vcf_prefix + chrm_str + \":\") or\n", + " x.startswith(vcf_prefix + chrm_str + \".\")\n", + " )\n", + " ])\n", + " if not files:\n", + " raise FileNotFoundError(f\"No VCF files for {chrm_str} in {vcf_base}\")\n", + " return files\n", + "\n", + " def open_vcf(vf, sample_subset):\n", + " \"\"\"Open a VCF file, applying sample subset if provided.\"\"\"\n", + " vcf = VCF(vf)\n", + " if sample_subset is not None:\n", + " vcf_samples = vcf.samples\n", + " keep = [s for s in vcf_samples if s in sample_subset]\n", + " if not keep:\n", + " raise ValueError(f\"No sample_list samples in {os.path.basename(vf)}\")\n", + " vcf.set_samples(keep)\n", + " return vcf\n", + "\n", + " def extract_dosage(var):\n", + " \"\"\"Extract diploid dosage from a cyvcf2 variant. 
Returns list of floats (nan for missing).\"\"\"\n", + " return [sum(x[0:2]) for x in [[nan if v == -1 else v for v in gt] for gt in var.genotypes]]\n", + "\n", + " def fill_missing_col_means(G):\n", + " col_means = np.nanmean(G, axis=0)\n", + " return np.where(np.isnan(G), col_means, G)\n", + "\n", + " # -- Single-pass: scan variants, filter, and collect dosages ---\n", + " # BED is 0-based half-open [start, end); VCF is 1-based.\n", + " print(f\"[1/3] Scanning {chrm_str} [{block_start:,}, {block_end:,}) ...\")\n", + " vcf_files = get_vcf_files(chrm_str)\n", + " region = f\"{chrm_str}:{block_start+1}-{block_end}\"\n", + " var_info = []\n", + " dosage_matrix = []\n", + " n_samples = None\n", + " # Filter counters\n", + " n_total = 0\n", + " n_multiallelic = 0\n", + " n_monomorphic = 0\n", + " n_all_na = 0\n", + " n_low_maf = 0\n", + " n_low_mac = 0\n", + " n_high_msng = 0\n", + "\n", + " for vf in vcf_files:\n", + " vcf = open_vcf(vf, sample_subset)\n", + " if n_samples is None:\n", + " n_samples = len(vcf.samples)\n", + " for var in vcf(region):\n", + " if not (block_start <= var.POS - 1 < block_end):\n", + " continue\n", + " n_total += 1\n", + " if len(var.ALT) != 1:\n", + " n_multiallelic += 1\n", + " continue\n", + " dosage = extract_dosage(var)\n", + " if np.nanvar(dosage) == 0:\n", + " n_monomorphic += 1\n", + " continue\n", + " nan_count = int(np.sum(np.isnan(dosage)))\n", + " n_non_na = len(dosage) - nan_count\n", + " if n_non_na == 0:\n", + " n_all_na += 1\n", + " continue\n", + " alt_sum = float(np.nansum(dosage))\n", + " mac = min(2 * n_non_na - alt_sum, alt_sum)\n", + " maf = mac / (2 * n_non_na)\n", + " af = alt_sum / (2 * n_non_na)\n", + " msng_rate = nan_count / len(dosage)\n", + " if msng_rate > msng_min:\n", + " n_high_msng += 1\n", + " continue\n", + " if maf < maf_min:\n", + " n_low_maf += 1\n", + " continue\n", + " if mac < mac_min:\n", + " n_low_mac += 1\n", + " continue\n", + " var_info.append({\n", + " \"chr\": var.CHROM, \"pos\": var.POS,\n", 
+ " \"ref\": var.REF, \"alt\": var.ALT[0],\n", + " \"af\": round(float(af), 6),\n", + " \"id\": f\"{var.CHROM}:{var.POS}:{var.REF}:{var.ALT[0]}\",\n", + " \"obs_ct\": 2 * n_non_na,\n", + " })\n", + " dosage_matrix.append(dosage)\n", + " vcf.close()\n", + "\n", + " n_passed = len(var_info)\n", + " print(f\" {n_total:,} total variants in region\")\n", + " print(f\" {n_passed:,} passed filters (n={n_samples:,})\")\n", + " print(f\" Filtered: {n_multiallelic:,} multiallelic, \"\n", + " f\"{n_monomorphic:,} monomorphic, {n_all_na:,} all-NA, \"\n", + " f\"{n_high_msng:,} high-missingness, \"\n", + " f\"{n_low_maf:,} low-MAF, {n_low_mac:,} low-MAC\")\n", + "\n", + " if not var_info:\n", + " raise ValueError(f\"No passing variants in {chrm_str} [{block_start:,}, {block_end:,})\")\n", + "\n", + " # -- Load W ----------------------------------------------------\n", + " print(f\"[2/3] Loading W ...\")\n", + " W = np.load(W_path)\n", + " if W.shape != (n_samples, B):\n", + " raise ValueError(f\"W shape mismatch: {W.shape} vs ({n_samples},{B})\")\n", + " W = W.astype(np.float32)\n", + " print(f\" W: {W.shape}\")\n", + "\n", + " # -- Compute U = $W^T G$ and write output files --------------------\n", + " print(f\"[3/3] Computing U and writing output files ...\")\n", + "\n", + " dosage_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.dosage.gz\")\n", + " map_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.map\")\n", + " afreq_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.afreq\")\n", + " meta_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.meta\")\n", + "\n", + " # Write .map\n", + " with open(map_path, \"w\") as fh:\n", + " for v in var_info:\n", + " fh.write(f\"{v['chr']}\\t{v['id']}\\t0\\t{v['pos']}\\n\")\n", + "\n", + " # Write .meta\n", + " with open(meta_path, \"w\") as fh:\n", + " fh.write(f\"source_n_samples={n_samples}\\nB={B}\\n\")\n", + " 
fh.write(f\"chrom={chrm_str}\\nblock_start={block_start}\\nblock_end={block_end}\\n\")\n", + " fh.write(f\"n_total={n_total}\\nn_passed={n_passed}\\n\")\n", + " fh.write(f\"n_multiallelic={n_multiallelic}\\nn_monomorphic={n_monomorphic}\\n\")\n", + " fh.write(f\"n_all_na={n_all_na}\\nn_high_msng={n_high_msng}\\n\")\n", + " fh.write(f\"n_low_maf={n_low_maf}\\nn_low_mac={n_low_mac}\\n\")\n", + "\n", + "\n", + " # Build G from collected dosages, compute U = $W^T G$, write dosage.gz\n", + " # Dosage format=1: ID ALT REF val_S1 ... val_SB\n", + " # Min-max scaling to [0, 2] makes the output plink2-compatible as dosage.\n", + " # This preserves correlation structure (cor is scale-invariant) which is\n", + " # what matters for LD computation downstream.\n", + " G = np.array(dosage_matrix, dtype=np.float32).T # (n_samples, n_variants)\n", + " del dosage_matrix\n", + " G = fill_missing_col_means(G)\n", + "\n", + " # variant-wise scaling\n", + " col_mean = G.mean(axis=0, keepdims=True)\n", + " col_std = G.std(axis=0, keepdims=True)\n", + " # avoid division by zero\n", + " col_std[col_std == 0] = 1.0\n", + " G = (G - col_mean) / col_std\n", + "\n", + " U = W.T @ G # (B, n_variants)\n", + " del G\n", + "\n", + " col_min = U.min(axis=0)\n", + " col_max = U.max(axis=0)\n", + " denom = col_max - col_min\n", + " denom[denom == 0] = 1.0\n", + " U = 2.0 * (U - col_min) / denom\n", + " U = np.round(U, 4)\n", + "\n", + " # Record the col min and max for U\n", + "\n", + " with open(afreq_path, \"w\") as fh:\n", + " # Add column headers\n", + " fh.write(\"#CHROM\\tID\\tREF\\tALT\\tALT_FREQS\\tOBS_CT\\tU_MIN\\tU_MAX\\n\")\n", + " for j, v in enumerate(var_info):\n", + " fh.write(f\"{v['chr']}\\t{v['id']}\\t{v['ref']}\\t{v['alt']}\\t\"\n", + " f\"{v['af']:.6f}\\t{v['obs_ct']}\\t\"\n", + " f\"{col_min[j]:.6f}\\t{col_max[j]:.6f}\\n\")\n", + "\n", + " with gzip.open(dosage_path, \"wt\", compresslevel=4) as gz:\n", + " for j, v in enumerate(var_info):\n", + " vals = \" \".join(f\"{x:.4f}\" 
for x in U[:, j])\n", + " gz.write(f\"{v['id']} {v['alt']} {v['ref']} {vals}\\n\")\n", + "\n", + " del U\n", + " print(f\" Written: {len(var_info):,} variants -> {os.path.basename(dosage_path)}\")\n", + " print(f\" Written: {os.path.basename(map_path)}, {os.path.basename(afreq_path)}\")\n", + " print(f\"\\nDone: {chrm_str} [{block_start:,}, {block_end:,})\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e8fff43", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[merge_chrom]\n", + "parameter: chrom = 0\n", + "parameter: output_dir = str\n", + "parameter: cohort_id = str\n", + "parameter: plink2_bin = \"plink2\"\n", + "\n", + "import os, glob\n", + "\n", + "def chromsto_process(output_dir, chrom_filter):\n", + " if chrom_filter != 0:\n", + " return [f\"chr{chrom_filter}\"]\n", + " return sorted(set(\n", + " os.path.basename(d)\n", + " for d in glob.glob(os.path.join(output_dir, \"chr*\"))\n", + " if os.path.isdir(d)\n", + " ))\n", + "\n", + "chroms = chromsto_process(output_dir, chrom)\n", + "\n", + "input: for_each = \"chroms\"\n", + "output: f\"{output_dir}/{_chroms}/{cohort_id}.{_chroms}.pgen\"\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n", + "bash: expand = \"$[ ]\"\n", + "\n", + " set -euo pipefail\n", + " shopt -s nullglob\n", + "\n", + " chrom_dir=\"$[output_dir]/$[_chroms]\"\n", + " final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n", + " merge_list=\"${chrom_dir}/$[cohort_id].$[_chroms]_pmerge_list.txt\"\n", + "\n", + " # Step 1: Convert each block dosage.gz -> sorted per-block pgen\n", + " > \"${merge_list}\"\n", + " files=(\"${chrom_dir}\"/*/*.dosage.gz)\n", + " if [ ${#files[@]} -eq 0 ]; then\n", + " echo \"No dosage files found in ${chrom_dir}\" >&2\n", + " exit 1\n", + " fi\n", + " for dosage_gz in \"${files[@]}\"; do\n", + " block_dir=$(dirname \"${dosage_gz}\")\n", + " block_tag=$(basename \"${block_dir}\")\n", + " 
prefix=\"${block_dir}/$[cohort_id].${block_tag}_tmp\"\n", + " map_file=\"${block_dir}/$[cohort_id].${block_tag}.map\"\n", + " psam_file=\"${block_dir}/$[cohort_id].${block_tag}.psam\"\n", + " meta_file=\"${block_dir}/$[cohort_id].${block_tag}.meta\"\n", + "\n", + " B=$(grep \"^B=\" \"${meta_file}\" | cut -d= -f2)\n", + " printf '#FID\\tIID\\n' > \"${psam_file}\"\n", + " for i in $(seq 1 ${B}); do\n", + " printf 'S%d\\tS%d\\n' ${i} ${i} >> \"${psam_file}\"\n", + " done\n", + "\n", + " $[plink2_bin] \\\n", + " --import-dosage \"${dosage_gz}\" format=1 noheader \\\n", + " --psam \"${psam_file}\" \\\n", + " --map \"${map_file}\" \\\n", + " --make-pgen \\\n", + " --out \"${prefix}_unsorted\" \\\n", + " --silent\n", + "\n", + " $[plink2_bin] \\\n", + " --pfile \"${prefix}_unsorted\" \\\n", + " --make-pgen \\\n", + " --sort-vars \\\n", + " --out \"${prefix}\" \\\n", + " --silent\n", + "\n", + " rm -f \"${prefix}_unsorted.pgen\" \"${prefix}_unsorted.pvar\" \"${prefix}_unsorted.psam\"\n", + " echo \"${prefix}\" >> \"${merge_list}\"\n", + " done\n", + "\n", + " # Step 2: Merge all per-block pgens -> one per-chrom pgen\n", + " $[plink2_bin] \\\n", + " --pmerge-list \"${merge_list}\" pfile \\\n", + " --make-pgen \\\n", + " --sort-vars \\\n", + " --out \"${final_prefix}\"\n", + "\n", + " # Remove PLINK merge intermediates immediately after merge\n", + " rm -f \"${final_prefix}-merge.pgen\" \"${final_prefix}-merge.pvar\" \"${final_prefix}-merge.psam\"\n", + "\n", + " # Step 3: Concatenate .afreq\n", + " first=1\n", + " for f in \"${chrom_dir}\"/*/*.afreq; do\n", + " if [ \"${first}\" -eq 1 ]; then\n", + " cat \"${f}\" > \"${final_prefix}.afreq\"\n", + " first=0\n", + " else\n", + " tail -n +2 \"${f}\" >> \"${final_prefix}.afreq\"\n", + " fi\n", + " done\n", + "\n", + "R: expand = \"$[ ]\"\n", + "\n", + " library(data.table)\n", + " meta_files <- list.files(\"$[output_dir]/$[_chroms]\",\n", + " pattern = \"[.]meta$\", recursive = TRUE,\n", + " full.names = TRUE)\n", + " if 
(length(meta_files) > 0) {\n", + " fields <- c(\"n_total\", \"n_passed\", \"n_multiallelic\", \"n_monomorphic\",\n", + " \"n_all_na\", \"n_high_msng\", \"n_low_maf\", \"n_low_mac\")\n", + " stats <- rbindlist(lapply(meta_files, function(f) {\n", + " lines <- grep(\"^n_\", readLines(f), value = TRUE)\n", + " kv <- strsplit(lines, \"=\")\n", + " vals <- setNames(as.integer(sapply(kv, `[`, 2)), sapply(kv, `[`, 1))\n", + " as.data.table(as.list(vals[fields]))\n", + " }))\n", + " totals <- colSums(stats, na.rm = TRUE)\n", + " summary <- data.frame(t(totals))\n", + " summary$pct_dropped <- round(100 * (1 - summary$n_passed / summary$n_total), 1)\n", + " cat(\"\\n=== Filter Summary for $[_chroms] ===\\n\")\n", + " print(data.frame(value = unlist(summary), row.names = names(summary)))\n", + " }\n", + "\n", + "bash: expand = \"$[ ]\"\n", + "\n", + " # Step 5: Cleanup block intermediates\n", + " chrom_dir=\"$[output_dir]/$[_chroms]\"\n", + " final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n", + "\n", + " rm -f \"${final_prefix}_pmerge_list.txt\"\n", + " for block_dir in \"${chrom_dir}\"/*/; do\n", + " rm -rf \"${block_dir}\"\n", + " done" + ] + }, + { + "cell_type": "markdown", + "id": "dc998dcc", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Troubleshooting\n", + "\n", + "| Symptom | Cause | Fix |\n", + "|---|---|---|\n", + "| `No VCF files for chrXX in {vcf_base}` | VCF naming or extension mismatch | Files must end in `.bgz` and be named `{vcf_prefix}{chr}.*.bgz`; check `--vcf-base` and `--vcf-prefix`. |\n", + "| `W shape mismatch` | `--n-samples` or `--B` differs from the W used | Re-run `generate_W` with the same `--n-samples` and `--B`, and pass that `W_B{B}.npy` to `process_block`. |\n", + "| `No passing variants in chrXX` | Filters removed everything (small toy cohort) | Lower `--maf-min` / `--mac-min` or raise `--msng-min` (the maximum allowed missingness rate), or choose blocks with more variants. 
|\n", + "| `No blocks found for chrom=XX` | `--chrom` does not match any BED rows | Ensure the BED `chr` column matches (e.g. `chr22`) and `--chrom` is the matching number. |\n", + "| Region query returns nothing | Missing tabix index | Run `tabix -p vcf file.bgz` to create the `.tbi`. |" + ] + }, + { + "cell_type": "markdown", + "id": "ac3cfb79", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Output\n", + "\n", + "Per chromosome, under `{output_dir}/chr{N}/` (set by `--output-dir`):\n", + "- `{cohort_id}.chr{N}.pgen`: binary genotype-sketch data (B pseudo-samples × p variants)\n", + "- `{cohort_id}.chr{N}.pvar`: variant information\n", + "- `{cohort_id}.chr{N}.psam`: sample (sketch) information\n", + "- `{cohort_id}.chr{N}.afreq`: allele frequencies and observation counts, plus the per-variant `U_MIN` / `U_MAX` recorded before min-max scaling\n", + "\n", + "The per-block subdirectories and the merge list are removed by `merge_chrom` once the per-chromosome files are written.\n", + "\n", + "These feed SuSiE-RSS fine-mapping: load with a metadata TSV (one row per chromosome, columns `#chrom start end path`, `path` = pgen prefix). Use the X (genotype) interface for `susie_rss(z, X=X)` or the R (correlation) interface for `susie_rss(z, R=R)`." + ] + }, + { + "cell_type": "markdown", + "id": "ff927c9c", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Anticipated Results\n", + "\n", + "For the toy run above, `merge_chrom` writes `output/rss_ld_sketch/chr22/protocol_example..chr22.pgen` with matching `.pvar`, `.psam`, and `.afreq` files. PLINK2 reports 50 samples (the B = 50 sketch pseudo-samples) and 673 variants; the filter summary shows 673 of 6,157 scanned variants passing (89.1% dropped, mostly monomorphic sites). Verify success by checking that these files exist and are non-empty. See the **Output** section above for the expected file names and formats." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "calysto_bash", + "Bash", + "#E6EEFF", + "" + ], + [ + "SoS", + "sos", + "sos", + "", + "sos" + ] + ], + "version": "" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From d1ed9bffbdaaabbbd1427db5902fa06b00a3e222 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 23 Jun 2026 12:00:55 -0400 Subject: [PATCH 3/6] Delete code/SoS/enrichment/sldsc_enrichment.ipynb --- code/SoS/enrichment/sldsc_enrichment.ipynb | 1383 -------------------- 1 file changed, 1383 deletions(-) delete mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb deleted file mode 100644 index e022ec8a..00000000 --- a/code/SoS/enrichment/sldsc_enrichment.ipynb +++ /dev/null @@ -1,1383 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Stratified LD Score Regression (S-LDSC) Enrichment\n", - "\n", - "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n", - "\n", - "> **Environment note.** Steps 1\u20132 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). 
polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3\u20134 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Description\n", - "This notebook implements the pipeline of [S-LDSC](https://www.nature.com/articles/ng.3404) for LD score and functional enrichment analysis.\n", - "\n", - "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not the original LDSC from `bulik/ldsc` GitHub repo.**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "Markdown" - }, - "source": [ - "Uses GWAS summary statistics together with annotation and LD reference-panel data to compute per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.\n", - "\n", - "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). 
The model details and the tau*/EnrichStat definitions follow below.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Methods - Workflow Overview\n", - "\n", - "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n", - "\n", - "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n", - "\n", - "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). 
Output is an R list with `per_trait` and `meta` entries.\n", - "\n", - "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n", - "\n", - "```r\n", - "res <- readRDS(\"sldsc_results.rds\")\n", - "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", - "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", - " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", - ")\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Theory\n", - "\n", - "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### LDSC model\n", - "\n", - "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n", - "\n", - "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n", - "\n", - "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. 
(2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.\n", - "\n", - "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Stratified LDSC\n", - "\n", - "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n", - "\n", - "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n", - "\n", - "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n", - "\n", - "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n", - "\n", - "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n", - "- **Continuous annotation** (e.g. 
gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n", - "\n", - "Under a polygenic model the per-SNP heritability for SNP $j$ is\n", - "\n", - "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n", - "\n", - "and the expected $\\chi^2$ statistic of SNP $j$ is\n", - "\n", - "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n", - "\n", - "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n", - "\n", - "Interpretation of $\\tau_C$:\n", - "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n", - "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n", - "\n", - "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Tau Estimation and Enrichment Analysis\n", - "\n", - "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n", - "\n", - "The pipeline has two computational layers:\n", - "\n", - "- **Regression layer** \u2014 the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. 
We do not re-implement this.\n", - "- **Post-processing layer** \u2014 standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n", - "\n", - "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n", - "\n", - "#### Notation\n", - "\n", - "For each annotation $C$ we use:\n", - "\n", - "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n", - "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n", - "\n", - "#### Reference panel and MAF cutoff\n", - "\n", - "All LD-derived quantities \u2014 partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set \u2014 are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n", - "\n", - "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. 
If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n", - "\n", - "#### Quantities from the regression layer (polyfun)\n", - "\n", - "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. From each polyfun run we obtain, per annotation:\n", - "\n", - "- $\\tau_C$ and its standard error \u2014 **(polyfun)**.\n", - "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ \u2014 **(polyfun)**.\n", - "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error \u2014 **(polyfun)**.\n", - "- The p-value of the differential per-SNP heritability test (defined below) \u2014 **(polyfun)**, computed internally with the full coefficient covariance matrix.\n", - "\n", - "We also obtain, per run:\n", - "\n", - "- The total trait heritability $h^2_g$ \u2014 **(polyfun)**.\n", - "- The 200-block jackknife delete-values of $\\tau_C$ \u2014 **(polyfun)**.\n", - "\n", - "#### Quantities from the post-processing layer (pecotmr)\n", - "\n", - "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n", - "\n", - "- $sd_C$ \u2014 per-annotation standard deviation over MAF $>$ cutoff SNPs \u2014 **(pecotmr: `compute_sldsc_annot_sd`)**.\n", - "- $M_{\\mathrm{ref}}$ \u2014 reference SNP count at the MAF cutoff \u2014 **(pecotmr: `compute_sldsc_M_ref`)**.\n", - "- Whether each annotation is binary or continuous \u2014 **(pecotmr: `is_binary_sldsc_annot`)**.\n", - "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ \u2014 **(pecotmr: `standardize_sldsc_trait`)**.\n", - "- EnrichStat point estimate and its standard error (formula below) \u2014 **(pecotmr: `standardize_sldsc_trait`)**.\n", - "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits \u2014 **(pecotmr: `meta_sldsc_random`)**.\n", - "\n", - "The top-level entry point 
`pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n", - "\n", - "#### Standardized tau ($\\tau^*$) \u2014 (pecotmr)\n", - "\n", - "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)\n", - "\n", - "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n", - "\n", - "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n", - "\n", - "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n", - "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n", - "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n", - "\n", - "#### Differential per-SNP heritability (\"EnrichStat\") \u2014 (polyfun + pecotmr)\n", - "\n", - "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 
2015):\n", - "\n", - "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n", - "\n", - "The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as `Enrichment_p`. Its standard error is recovered from the reported p-value:\n", - "\n", - "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n", - "\n", - "This per-trait point + SE is the input to cross-trait meta-analysis.\n", - "\n", - "#### Reporting: binary vs. continuous annotations \u2014 (pecotmr)\n", - "\n", - "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n", - "\n", - "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n", - "\n", - "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. 
The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n", - "\n", - "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n", - "\n", - "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n", - ">\n", - "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n", - ">\n", - "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n", - "\n", - "#### Cross-type comparison: always use $\\tau^*_C$ \u2014 (pecotmr)\n", - "\n", - "For an apples-to-apples comparison **across binary and continuous annotations** \u2014 ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types \u2014 use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ here is the proportion of reference SNPs in the category, not the MAF) and $sd_C$ is the empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases \u2014 additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability.
$E_C$ does not have this property and must not be compared across types.\n", - "\n", - "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n", - "\n", - "#### Joint analysis \u2014 (polyfun runs the regression; pecotmr standardizes both modes)\n", - "\n", - "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n", - "\n", - "### Meta-Analysis across Traits (Random Effects) \u2014 (pecotmr)\n", - "\n", - "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n", - "\n", - "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n", - "\n", - "where $\\hat\\theta_i$ is the per-trait estimate and $SE_i$ its standard error:\n", - "\n", - "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n", - "- **For $E_C$ meta**: $SE_i$ is 
the polyfun-reported `Enrichment_std_error`.\n", - "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n", - "\n", - "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n", - "\n", - "For each meta-analyzed quantity, the z-score and two-sided p-value are:\n", - "\n", - "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Minimal Working Example (MWE)\n", - "\n", - "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 1. `make_annotation_files_ldscore`\n", - "\n", - "*Annotation preparation and LD-score computation (polyfun).* This step accepts one target annotation for a single-tau analysis or several target annotations for a joint-tau analysis." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "#### **Inputs**\n", - "\n", - "##### 1. Target Annotation File\n", - "\n", - "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n", - "- **Formats**:\n", - " - Text file (`.txt`) listing per-chromosome paths to annotation files.
Annotation files can be `.rds`/`.tsv`/`.txt`.\n", - " - Alternatively, files for specific chromosomes can be provided directly.\n", - " - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). Single-target and joint-target analyses are produced automatically in one pipeline pass.\n", - " - **Format** (the score column is optional; if absent, score is set to 1):\n", - " - `is_range = False`:\n", - " ```\n", - " chr pos score\n", - " 1 10001 1\n", - " 1 10002 1\n", - " ```\n", - " - `is_range = True`:\n", - " ```\n", - " chr start end score\n", - " 1 10001 20001 1\n", - " 1 30001 40001 1\n", - " ```\n", - "\n", - "##### 2. Reference Annotation File (baseline-LD)\n", - "\n", - "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n", - "- **Formats**:\n", - " - Text file listing baseline annotation files for all chromosomes.\n", - " - Alternatively, files for specific chromosomes can be provided directly.\n", - "\n", - "##### 3. Genome Reference File\n", - "\n", - "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n", - "- **Formats**:\n", - " - Text file listing per-chromosome reference files, or files for specific chromosomes.\n", - "\n", - "##### 4. 
SNP List\n", - "\n", - "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n", - "- **Format**: A list of `rsid`s, one per line.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n", - " --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n", - " --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n", - " --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n", - " --annotation_name protocol_example \\\n", - " --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n", - " --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --cwd output/sldsc_ldscore -j 4\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Munge summary statistics (preprocessing, run before Step 2)\n", - "\n", - "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n", - "# --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n", - "# --n 0 \\\n", - "# --min-info 0.6 \\\n", - "# --min-maf 0.001 \\\n", - "# --chi2-cutoff 30 \\\n", - "# --polyfun_path data/github/polyfun \\\n", - "# --cwd data/polyfun_new/example_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 2. 
`get_heritability`\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "**Inputs**\n", - "\n", - "##### 1. Allele Frequency Files (`.frq`, our panel)\n", - "\n", - "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n", - "\n", - "##### 2. GWAS Summary Statistics\n", - "\n", - "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n", - "- **Format**:\n", - " ```\n", - " CAD_META.filtered.sumstats.gz\n", - " UKB.Lym.BOLT.sumstats.gz\n", - " ```\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n", - " --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n", - " --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", - " --sumstat_dir input/enrichment/sldsc \\\n", - " --baseline_ld_dir input/enrichment/sldsc \\\n", - " --weights_dir input/enrichment/sldsc \\\n", - " --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n", - " --annotation_name protocol_example --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 3. 
`Post-processing (pecotmr) and meta-analysis`\n", - "\n", - "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n", - "\n", - "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n", - "\n", - "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n", - "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n", - "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n", - "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n", - "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n", - "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n", - "\n", - "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n", - "\n", - "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n", - "(which contains the $N$ single-target subdirectories and optionally the\n", - "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n", - "to produce per-trait standardized tables and the default random-effects\n", - "meta across all traits.\n", - "\n", - "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output \u2014 e.g. 
`--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n", - " --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", - " --heritability_cwd output/sldsc_heritability \\\n", - " --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n", - " --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n", - " --annotation_name protocol_example --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 4. Subset Meta-Analysis (`pecotmr::meta_sldsc_random`) (optional)\n", - "\n", - "The default meta in Step 3 pools all traits supplied to `[postprocess]`. To re-run the meta on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:\n", - "\n", - "```r\n", - "res <- readRDS(\"sldsc_results.rds\")\n", - "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", - "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", - " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", - ")\n", - "```\n", - "\n", - "This call is lightweight and can be run interactively.\n", - "\n", - "From the command line, use `[meta_subset]` to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization.
The subset operates on the cached `.sldsc_postprocess.rds` output; it is light-weight and can be run interactively or in batch.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n", - " --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n", - " --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n", - " --subset_name category1 --target_categories ANNOT_0 \\\n", - " --annotation_name protocol_example --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Output\n", - "\n", - "### Output summary (cached artifacts)\n", - "\n", - "| Stage | Cached on disk | Recomputable from | Purpose |\n", - "|---|---|---|---|\n", - "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n", - "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n", - "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n", - "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n", - "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Per-stage outputs\n", - "\n", - "Each workflow writes into its `--cwd`:\n", - "\n", - "- **make_annotation_files_ldscore** \u2014 polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). 
One single-target directory per annotation, plus (when there is more than one annotation) a joint directory.\n", - "- **get_heritability** \u2014 per trait and per target directory, the S-LDSC regression outputs `.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_` suffix.\n", - "- **postprocess** \u2014 a single `.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau* with jackknife SE, EnrichStat with SE back-solved from `Enrichment_p`) and three DerSimonian\u2013Laird random-effects meta tables (tau*, E, EnrichStat).\n", - "- **meta_subset** \u2014 a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Anticipated Results\n", - "\n", - "For each target annotation and trait, the pipeline produces the S-LDSC estimates (tau, standardized tau*, enrichment E, and the EnrichStat p-value), plus random-effects meta-analysis tables across traits." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface\n", - "\n", - "List all workflows and their options:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Workflow implementation\n", - "\n", - "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "# Path to the work directory of the analysis.\n", - "parameter: cwd = path('output')\n", - "# Prefix for the analysis output\n", - "parameter: annotation_name = str\n", - "parameter: python_exec = \"python\" # e.g.
\"/home/you/.conda/envs/polyfun/bin/python\"\n", - "parameter: polyfun_path = path # e.g. \"/home/you/tools/polyfun\"\n", - "\n", - "# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n", - "# and HapMap3-style regression weights are common-variant by construction).\n", - "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n", - "# Other values would require recomputing LD scores at that cutoff.\n", - "parameter: maf_cutoff = 0.05\n", - "\n", - "# for make_annotation_files_ldscore workflow:\n", - "parameter: annotation_file = path()\n", - "parameter: reference_anno_file = path()\n", - "parameter: genome_ref_file = path() # with .bed\n", - "parameter: chromosome = []\n", - "parameter: snp_list = path()\n", - "parameter: ld_wind_kb = 0 # use kb if the value is provided\n", - "parameter: ld_wind_cm = 1.0 # default using ld_wind_cm\n", - "\n", - "# for get_heritability workflow.\n", - "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n", - "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n", - "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n", - "parameter: all_traits_file = path() # txt file, each row contains all GWAS summary statistics name: e.g. 
CAD_META.filtered.sumstats.gz\n", - "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n", - "parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore\n", - "parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)\n", - "parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)\n", - "parameter: plink_name = \"ADSP_chr\"\n", - "parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)\n", - "parameter: baseline_name = \"baseline_chr\" # Prefix of baseline annotation files\n", - "parameter: weight_name = \"weights_chr\" # Prefix of LD weights files\n", - "parameter: n_blocks = 200\n", - "\n", - "# Number of threads\n", - "parameter: numThreads = 16\n", - "# For cluster jobs, number commands to run per job\n", - "parameter: job_size = 1\n", - "parameter: walltime = '12h'\n", - "parameter: mem = '16G'" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "Python 3 (ipykernel)" - }, - "source": [ - "## Make Annotation File" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[make_annotation_files_ldscore]\n", - "# Annotation preparation. 
Takes one annotation_file with N target annotations\n", - "# and produces, in one invocation, any combination of:\n", - "# - N single-target LD-score directories (when compute_single = TRUE, default)\n", - "# - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n", - "# and N >= 2, default)\n", - "#\n", - "# Outputs per chromosome :\n", - "# /_single_/_single_..annot.gz (i in 1..N, when compute_single)\n", - "# /_single_/_single_..l2.ldscore.{parquet|gz}\n", - "# /_single_/_single_..l2.M\n", - "# /_single_/_single_..l2.M_5_50 (when .frq present)\n", - "#\n", - "# /_joint/_joint..{...} (when compute_joint and N>=2)\n", - "#\n", - "# Workflows:\n", - "# - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n", - "# Produces both, fits the case where you have already chosen the joint set.\n", - "# - Workflow B (\"exploratory then conditional\"):\n", - "# Step 1: compute_single=TRUE, compute_joint=FALSE.\n", - "# Run on N candidate annotations -> N single-target dirs.\n", - "# Inspect single-target results, identify K significant ones.\n", - "# Step 2: compute_single=FALSE, compute_joint=TRUE.\n", - "# Run on a NEW annotation_file with the K selected annotations\n", - "# -> 1 joint dir with the conditional model.\n", - "\n", - "#\n", - "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n", - "# column name, and the CM requirement ---\n", - "# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz\n", - "# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet\n", - "#\n", - "# LD-score column name (this is what becomes the .results \"Category\" in\n", - "# [get_heritability], with a \"_\" suffix appended there):\n", - "# * compute_ldscores.py ALWAYS keeps the annot column name(s):\n", - "# single annot column \"ANNOT\" -> ldscore column \"ANNOT\"\n", - "# joint annot columns \"ANNOT_1\",\"ANNOT_2\",... 
-> \"ANNOT_1\",\"ANNOT_2\",...\n", - "# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n", - "# HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n", - "# column name. With >=2 annotations it uses \"L2\"\n", - "# (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n", - "# => a single-target snplist run reports \"L2_0\" in .results, while a\n", - "# single-target no-snplist run reports \"ANNOT_0\". [postprocess] auto-\n", - "# detects either; only matters if you pass --target-categories explicitly.\n", - "#\n", - "# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the\n", - "# target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n", - "# the plink .bim (same SNP set, same row order). This step handles both\n", - "# internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands\n", - "# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n", - "# carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n", - "# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n", - "#\n", - "parameter: compute_single = True\n", - "parameter: compute_joint = True\n", - "parameter: score_column = 3\n", - "parameter: is_range = False\n", - "\n", - "import pandas as pd\n", - "import os\n", - "\n", - "if not (compute_single or compute_joint):\n", - " raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE\")\n", - "\n", - "def adapt_file_path(file_path, reference_file):\n", - " reference_path = os.path.dirname(reference_file)\n", - " if os.path.isfile(file_path):\n", - " return file_path\n", - " file_name = os.path.basename(file_path)\n", - " if os.path.isfile(file_name):\n", - " return file_name\n", - " file_in_ref_dir = os.path.join(reference_path, file_name)\n", - " if os.path.isfile(file_in_ref_dir):\n", - " return file_in_ref_dir\n", - " file_prefixed = 
os.path.join(reference_path, file_path)\n", - " if os.path.isfile(file_prefixed):\n", - " return file_prefixed\n", - " raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n", - "\n", - "\n", - "# ---- Parse inputs and determine N ----\n", - "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n", - " str(reference_anno_file).endswith('annot.gz')):\n", - " # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n", - " target_files_direct = str(annotation_file).split(',')\n", - " N_targets = len(target_files_direct)\n", - " target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n", - " input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n", - " if len(chromosome) > 0:\n", - " input_chroms = [int(x) for x in chromosome]\n", - " else:\n", - " input_chroms = [0]\n", - "else:\n", - " # Case 2: txt list with #id and one or more 'path' columns\n", - " target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n", - " reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n", - " genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n", - "\n", - " target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n", - " reference_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n", - " genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n", - "\n", - " path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n", - " N_targets = len(path_columns)\n", - " target_names = path_columns[:] # 'path', 'path1', 'path2', ...\n", - "\n", - " for col in path_columns:\n", - " target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n", - " reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n", 
- " genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n", - "\n", - " merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n", - " if len(chromosome) > 0:\n", - " merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n", - "\n", - " rows = merged.values.tolist()\n", - " input_chroms = [r[0] for r in rows]\n", - " input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n", - "\n", - "# ---- Determine output format ----\n", - "use_print_snps = snp_list.is_file()\n", - "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n", - "\n", - "if ld_wind_kb > 0:\n", - " use_kb_window = True\n", - " ld_window_param = ld_wind_kb\n", - " ld_window_flag = \"--ld-wind-kb\"\n", - "else:\n", - " use_kb_window = False\n", - " ld_window_param = ld_wind_cm\n", - " ld_window_flag = \"--ld-wind-cm\"\n", - "\n", - "emit_single = compute_single\n", - "emit_joint = compute_joint and N_targets >= 2\n", - "\n", - "# ---- Build per-chromosome output list ----\n", - "def chrom_outputs(chrom):\n", - " outs = []\n", - " if emit_single:\n", - " for i in range(N_targets):\n", - " name = f\"{annotation_name}_single_{i+1}\"\n", - " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", - " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", - " if emit_joint:\n", - " name = f\"{annotation_name}_joint\"\n", - " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", - " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", - " return outs\n", - "\n", - "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n", - "\n", - "output: chrom_outputs(input_chroms[_index])\n", - "\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n", - "\n", - "# 
----------------------------------------------------------------------------\n", - "# Step A: write the requested .annot files for this chromosome.\n", - "# ----------------------------------------------------------------------------\n", - "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", - " library(data.table)\n", - "\n", - " clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n", - "\n", - " process_range_data <- function(data, chr_value) {\n", - " data$chr <- clean_chr(data$chr)\n", - " data <- data[data$chr == chr_value,]\n", - " if (nrow(data) == 0) return(NULL)\n", - " expanded <- lapply(seq_len(nrow(data)), function(j) {\n", - " row <- data[j,]\n", - " pos_seq <- seq(row$start, row$end - 1)\n", - " result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n", - " if (ncol(data) > 3) {\n", - " for (col in 4:ncol(data))\n", - " result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n", - " }\n", - " result\n", - " })\n", - " unique(rbindlist(expanded))\n", - " }\n", - "\n", - " process_annotation <- function(target_anno, ref_anno, score_column_value) {\n", - " target_anno <- as.data.frame(target_anno)\n", - " ref_anno <- as.data.frame(ref_anno)\n", - " target_anno$chr <- clean_chr(target_anno$chr)\n", - " ref_anno$CHR <- clean_chr(ref_anno$CHR)\n", - " chr_value <- unique(ref_anno$CHR)\n", - " anno_scores <- rep(0, nrow(ref_anno))\n", - " match_pos <- match(target_anno$pos, ref_anno$BP)\n", - " valid_pos <- as.numeric(na.omit(match_pos))\n", - " if (score_column_value <= ncol(target_anno)) {\n", - " anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n", - " } else {\n", - " anno_scores[valid_pos] <- 1\n", - " print(\"Warning: score column does not exist; setting scores to 1\")\n", - " }\n", - " anno_scores\n", - " }\n", - "\n", - " read_target_anno <- function(file_path, ref_anno) {\n", - " if (endsWith(file_path, \"rds\")) {\n", - " target_anno <- 
readRDS(file_path)\n", - " return(process_annotation(target_anno, ref_anno, ${score_column}))\n", - " }\n", - " target_anno <- fread(file_path)\n", - " if (${\"TRUE\" if is_range else \"FALSE\"}) {\n", - " names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n", - " target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n", - " if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n", - " } else {\n", - " names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n", - " }\n", - " process_annotation(target_anno, ref_anno, ${score_column})\n", - " }\n", - "\n", - " # ---- Read reference annotation ----\n", - " ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n", - " if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n", - "\n", - " # ---- Compute per-target annotation scores ----\n", - " target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n", - " N_local <- length(target_files)\n", - " score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n", - "\n", - " emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n", - " emit_joint_local <- ${\"TRUE\" if emit_joint else \"FALSE\"}\n", - " use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n", - " bfile_prefix <- \"${_input[-1]:na}\"\n", - "\n", - " # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n", - " # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n", - " normalize_for_ldsc <- function(df) {\n", - " if (!use_print_snps_local) return(df)\n", - " df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n", - " annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n", - " bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n", - " col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n", - " bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n", - " idx <- 
match(bim$SNP, df$SNP)\n", - " out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n", - " stringsAsFactors = FALSE)\n", - " for (col in annot_cols) {\n", - " v <- rep(0, nrow(bim))\n", - " non_na <- !is.na(idx)\n", - " v[non_na] <- df[[col]][idx[non_na]]\n", - " out[[col]] <- v\n", - " }\n", - " out\n", - " }\n", - "\n", - " # ---- Write N single-target .annot files (when requested) ----\n", - " if (emit_single_local) {\n", - " for (i in seq_len(N_local)) {\n", - " out_anno <- ref_anno\n", - " out_anno$ANNOT <- score_list[[i]]\n", - " out_anno <- normalize_for_ldsc(out_anno)\n", - " name <- paste0(\"${annotation_name}\", \"_single_\", i)\n", - " out_path_gz <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n", - " out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n", - " dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n", - " fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", - " }\n", - " }\n", - "\n", - " # ---- Optionally write joint .annot ----\n", - " if (emit_joint_local) {\n", - " joint_anno <- ref_anno\n", - " for (i in seq_len(N_local)) {\n", - " joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n", - " }\n", - " joint_anno <- normalize_for_ldsc(joint_anno)\n", - " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", - " joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n", - " joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n", - " dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n", - " fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", - " }\n", - "\n", - "# ----------------------------------------------------------------------------\n", - "# Step B: gzip all annot files. 
Uses expand=\"$[ ]\" so bash ${var} survives.\n", - "# ----------------------------------------------------------------------------\n", - "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", - " set -e\n", - " annots=()\n", - " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", - " for i in $(seq 1 $[N_targets]); do\n", - " annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n", - " done\n", - " fi\n", - " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", - " annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n", - " fi\n", - " for a in \"${annots[@]}\"; do\n", - " gzip -f \"$a\"\n", - " done\n", - "\n", - "# ----------------------------------------------------------------------------\n", - "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n", - "# ----------------------------------------------------------------------------\n", - "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n", - " set -e\n", - " chrom=\"$[input_chroms[_index]]\"\n", - "\n", - " run_polyfun() {\n", - " local annot=\"$1\"\n", - " local out_prefix=\"$2\"\n", - " if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n", - " $[python_exec] $[polyfun_path]/ldsc.py \\\n", - " --print-snps $[snp_list] \\\n", - " $[ld_window_flag] $[ld_window_param] \\\n", - " --out \"$out_prefix\" \\\n", - " --bfile $[_input[-1]:nar] \\\n", - " --yes-really \\\n", - " --annot \"$annot\" \\\n", - " --l2\n", - " else\n", - " $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n", - " --annot \"$annot\" \\\n", - " --bfile $[_input[-1]:nar] \\\n", - " $[ld_window_flag] $[ld_window_param] \\\n", - " --out \"${out_prefix}.$[ldscore_ext]\" \\\n", - " --allow-missing\n", - " fi\n", - " }\n", - "\n", - " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", - " for i in $(seq 1 $[N_targets]); do\n", - " 
name=\"$[annotation_name]_single_$i\"\n", - " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", - " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", - " run_polyfun \"$annot\" \"$prefix\"\n", - " done\n", - " fi\n", - " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", - " name=\"$[annotation_name]_joint\"\n", - " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", - " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", - " run_polyfun \"$annot\" \"$prefix\"\n", - " fi\n", - "\n", - "# ----------------------------------------------------------------------------\n", - "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n", - "# ----------------------------------------------------------------------------\n", - "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n", - " suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n", - " use_print_snps <- ${str(use_print_snps).upper()}\n", - "\n", - " chrom <- \"${input_chroms[_index]}\"\n", - " # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n", - " frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n", - " has_frq <- file.exists(frq_file)\n", - " frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n", - "\n", - " write_M_files <- function(annot_path, ldscore_path, m_path) {\n", - " if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n", - " cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n", - " }\n", - " ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n", - " suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n", - " } else fread(ldscore_path)\n", - " annot_dt <- fread(annot_path)\n", - " annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n", - " merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n", - " 
std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n", - " annot_cols <- setdiff(names(merged), std_cols)\n", - " if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n", - " M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", - " writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n", - " if (has_frq) {\n", - " common <- merged[!is.na(MAF) & MAF > 0.05, ]\n", - " M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", - " writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n", - " }\n", - " }\n", - "\n", - " targets <- c()\n", - " if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n", - " for (i in seq_len(${N_targets})) {\n", - " targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n", - " }\n", - " }\n", - " if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n", - " targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n", - " }\n", - " for (name in targets) {\n", - " annot_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n", - " ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n", - " m_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n", - " write_M_files(annot_path, ldscore_path, m_path)\n", - " }\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "Python 3 (ipykernel)" - }, - "source": [ - "## Calculate Functional Enrichment using Annotations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[get_heritability]\n", - "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n", - "# each (trait, target_dir) pair becomes one polyfun invocation. 
Outputs go to\n", - "# //.{results,log,part_delete}.\n", - "#\n", - "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n", - "# typically N _single_ directories plus optionally one _joint directory.\n", - "\n", - "#\n", - "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n", - "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n", - "# always gets exactly two LD-score sources, in this order:\n", - "# \"/.\" (index 0) , \"/\" (index 1)\n", - "# With --overlap-annot, every annotation column in the .results \"Category\" is\n", - "# named _:\n", - "# index 0 = the target file -> \"ANNOT_0\" (no-snplist; compute_ldscores.py keeps the annot col name)\n", - "# -> \"L2_0\" (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n", - "# -> \"ANNOT_1_0\",\"ANNOT_2_0\" (no-snplist joint dir, N>=2 annot cols)\n", - "# -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\" (snplist joint dir, N>=2 -> \"L2\")\n", - "# index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ... (the 97 baseline annots)\n", - "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n", - "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n", - "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header \u2014 ldsc.py's\n", - "# \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n", - "# column name.) 
[postprocess] auto-detects the target Category; if you instead pass\n", - "# --target-categories, the names must match this column exactly.\n", - "#\n", - "parameter: target_anno_dirs = paths()\n", - "parameter: all_traits = []\n", - "\n", - "import os\n", - "\n", - "with open(all_traits_file, 'r') as f:\n", - " trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n", - "\n", - "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n", - "input_list = []\n", - "target_meta = []\n", - "for td in target_anno_dirs:\n", - " for t in trait_paths:\n", - " input_list.append(t)\n", - " target_meta.append(str(td))\n", - "\n", - "input: input_list, group_by = 1, group_with = \"target_meta\"\n", - "\n", - "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\", \\\n", - " f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n", - "\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", - "\n", - "bash: expand = \"${ }\"\n", - " target_dir=\"${target_meta[_index]}\"\n", - " target_name=\"$(basename ${target_meta[_index]})\"\n", - " trait=\"$(basename ${_input[0]})\"\n", - " output_dir=\"${cwd:a}/$target_name\"\n", - " mkdir -p \"$output_dir\"\n", - "\n", - " # MAF cutoff handling. 
Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n", - " # other values would require recomputing LD scores at that cutoff.\n", - " frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n", - " if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n", - " echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n", - " frq_option=\"--not-M-5-50\"\n", - " elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n", - " if [ -f \"$frq_file_check\" ]; then\n", - " echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n", - " frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n", - " else\n", - " echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n", - " echo \" but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n", - " echo \" Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n", - " exit 1\n", - " fi\n", - " else\n", - " echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n", - " echo \" 0.05 (sLDSC default) are accepted. 
Other values would require\"\n", - " echo \" recomputing LD scores at that cutoff.\"\n", - " exit 1\n", - " fi\n", - "\n", - " run_ldsc() {\n", - " local extra_args=\"$1\"\n", - " ${python_exec} ${polyfun_path}/ldsc.py \\\n", - " --h2 ${sumstat_dir}/$trait \\\n", - " --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n", - " --out \"$output_dir/$trait\" \\\n", - " --overlap-annot \\\n", - " --w-ld-chr ${weights_dir}/${weight_name} \\\n", - " $frq_option \\\n", - " --print-coefficients \\\n", - " --print-delete-vals \\\n", - " --n-blocks ${n_blocks} \\\n", - " $extra_args\n", - " }\n", - "\n", - " run_ldsc \"\"\n", - " log_file=\"$output_dir/$trait.log\"\n", - "\n", - " # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n", - " for max in 30 20 10; do\n", - " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", - " echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n", - " run_ldsc \"--chisq-max $max\"\n", - " else\n", - " break\n", - " fi\n", - " done\n", - "\n", - " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", - " echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n", - " echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n", - " fi\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[munge_sumstats_polyfun]\n", - "parameter: sumstats = path\n", - "parameter: n = 0\n", - "parameter: min_info = 0.6\n", - "parameter: min_maf = 0.001\n", - "parameter: keep_hla = False\n", - "parameter: chi2_cut = 30\n", - "input: sumstats\n", - "output: f\"{_input:n}.munged.parquet\"\n", - "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n", - " 
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n", - " --sumstats {_input} \\\n", - " --out {_output} \\\n", - " {'--n {}'.format(n) if n>0 else ''} \\\n", - " {'--min-info {}'.format(min_info)} \\\n", - " {'--min-maf {}'.format(min_maf)} \\\n", - " {'--chi2-cutoff {}'.format(chi2_cut)} \\\n", - " {'--keep-hla' if keep_hla else ''} \\\n", - " --remove-strand-ambig" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[postprocess]\n", - "# Post-processing of polyfun outputs via pecotmr::sldscPostprocessingPipeline.\n", - "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n", - "# single-target and (when present) joint-target runs, computes Gazal-style\n", - "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n", - "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n", - "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n", - "\n", - "parameter: traits_file = path() # text file: one trait sumstats filename per line\n", - "parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains _single_/ subdirs and optionally _joint/)\n", - "parameter: target_categories = [] # target annotation names. 
Auto-detected from the joint-run results if empty.\n", - "parameter: target_categories_label = [] # optional display names, same order as target_categories;\n", - " # when given, every \"target\" column / tau*-block colname in\n", - " # the output RDS is renamed to these (params$target_categories\n", - " # holds the labels, params$target_categories_orig the originals).\n", - "parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n", - "\n", - "input: traits_file\n", - "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", - "\n", - "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", - " library(pecotmr)\n", - "\n", - " traits <- readLines(\"${traits_file}\")\n", - " target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n", - " target_lab <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n", - "\n", - " # Auto-detect single-target and joint-target output directories.\n", - " her_root <- \"${heritability_cwd}\"\n", - " all_subdirs <- list.dirs(her_root, recursive = FALSE)\n", - " single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n", - " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", - " single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n", - " single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n", - " single_dirs <- single_dirs[order(single_indices)]\n", - " joint_dir <- file.path(her_root, joint_name)\n", - " has_joint <- dir.exists(joint_dir)\n", - "\n", - " message(sprintf(\"Detected %d single-target dirs%s\",\n", - " length(single_dirs),\n", - " if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n", - "\n", - " # Build per-trait prefix 
maps. Each trait's polyfun output is at /\n", - " # (polyfun appends .results / .log / .part_delete).\n", - " trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n", - " names(trait_single_prefixes) <- traits\n", - "\n", - " if (has_joint) {\n", - " trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n", - " } else {\n", - " trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n", - " }\n", - "\n", - " res <- sldscPostprocessingPipeline(\n", - " traitSinglePrefixes = trait_single_prefixes,\n", - " traitJointPrefix = trait_joint_prefix,\n", - " targetAnnoDir = \"${target_anno_dir}\",\n", - " frqfileDir = \"${frqfile_dir}\",\n", - " plinkName = \"${plink_name}\",\n", - " mafCutoff = ${maf_cutoff},\n", - " targetCategories = if (length(target_cats) > 0) target_cats else NULL,\n", - " targetLabels = if (length(target_lab) > 0) target_lab else NULL\n", - " )\n", - "\n", - " saveRDS(res, \"${_output[0]}\")\n", - " message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[meta_subset]\n", - "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n", - "# the cached per-trait standardized results from [postprocess]. 
No regression rerun.\n", - "\n", - "parameter: postprocess_rds = path() # output of [postprocess]\n", - "parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]\n", - "parameter: subset_name = str # label used in the output filename\n", - "parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output\n", - "# If [postprocess] was run with --target-categories-label, the cached RDS already\n", - "# carries the display names (params$target_categories = the labels), so leave\n", - "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n", - "\n", - "input: postprocess_rds, subset_traits_file\n", - "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", - "\n", - "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", - " library(pecotmr)\n", - "\n", - " res <- readRDS(\"${postprocess_rds}\")\n", - " subset_traits <- readLines(\"${subset_traits_file}\")\n", - " target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n", - " if (length(target_cats) == 0) target_cats <- res$params$target_categories\n", - "\n", - " subset_per_trait <- res$per_trait[subset_traits]\n", - "\n", - " # Map wide names (tau_star_single/joint) to bare names metaSldscRandom expects.\n", - " view_single <- pecotmr:::.sldscViewForMeta(subset_per_trait, \"single\")\n", - " view_joint <- pecotmr:::.sldscViewForMeta(subset_per_trait, \"joint\")\n", - "\n", - " out <- list(\n", - " tau_star_single = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"tauStar\")), target_cats),\n", - " tau_star_joint = setNames(lapply(target_cats, function(c) metaSldscRandom(view_joint, c, \"tauStar\")), target_cats),\n", - " enrichment = setNames(lapply(target_cats, function(c) 
metaSldscRandom(view_single, c, \"enrichment\")), target_cats),\n", - " enrichstat = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"enrichstat\")), target_cats)\n", - " )\n", - "\n", - " saveRDS(out, \"${_output[0]}\")\n", - " message(\"Subset meta complete; results written to ${_output[0]}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "SoS", - "language": "sos", - "name": "sos" - }, - "language_info": { - "codemirror_mode": "sos", - "file_extension": ".sos", - "mimetype": "text/x-sos", - "name": "sos", - "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", - "pygments_lexer": "sos" - }, - "sos": { - "kernels": [ - [ - "Markdown", - "markdown", - "markdown", - "", - "" - ], - [ - "SoS", - "sos", - "", - "", - "sos" - ] - ], - "panel": { - "displayed": true, - "height": 0 - }, - "version": "0.22.4" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file From a943d9903663893d81c5320ea3302ff5249bf7f1 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 23 Jun 2026 12:01:18 -0400 Subject: [PATCH 4/6] fix based on pecotmr 0.5.3 --- code/SoS/enrichment/sldsc_enrichment.ipynb | 1491 ++++++++++++++++++++ 1 file changed, 1491 insertions(+) create mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb new file mode 100644 index 00000000..0569c353 --- /dev/null +++ b/code/SoS/enrichment/sldsc_enrichment.ipynb @@ -0,0 +1,1491 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Stratified LD Score Regression (S-LDSC) Enrichment\n", + "\n", + "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. 
The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n", + "\n", + "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Description\n", + "This notebook implements the [S-LDSC](https://www.nature.com/articles/ng.3404) pipeline for LD score computation and functional enrichment analysis.\n", + "\n", + "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not from the original LDSC in the `bulik/ldsc` GitHub repo.**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "Markdown" + }, + "source": [ + "The pipeline uses GWAS summary statistics together with annotation and LD reference-panel data to compute per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.\n", + "\n", + "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. 
population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Methods - Workflow Overview\n", + "\n", + "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n", + "\n", + "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n", + "\n", + "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing 
three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n", + "\n", + "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n", + "\n", + "```r\n", + "res <- readRDS(\"sldsc_results.rds\")\n", + "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", + "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", + " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", + ")\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Theory\n", + "\n", + "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### LDSC model\n", + "\n", + "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n", + "\n", + "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n", + "\n", + "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. 
(2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.\n", + "\n", + "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Stratified LDSC\n", + "\n", + "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n", + "\n", + "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n", + "\n", + "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n", + "\n", + "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n", + "\n", + "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n", + "- **Continuous annotation** (e.g. 
gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n", + "\n", + "Under a polygenic model the per-SNP heritability for SNP $j$ is\n", + "\n", + "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n", + "\n", + "and the expected $\\chi^2$ statistic of SNP $j$ is\n", + "\n", + "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n", + "\n", + "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n", + "\n", + "Interpretation of $\\tau_C$:\n", + "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n", + "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n", + "\n", + "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Tau Estimation and Enrichment Analysis\n", + "\n", + "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n", + "\n", + "The pipeline has two computational layers:\n", + "\n", + "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. 
We do not re-implement this.\n", + "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n", + "\n", + "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n", + "\n", + "#### Notation\n", + "\n", + "For each annotation $C$ we use:\n", + "\n", + "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n", + "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n", + "\n", + "#### Reference panel and MAF cutoff\n", + "\n", + "All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n", + "\n", + "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n", + "\n", + "#### Quantities from the regression layer (polyfun)\n", + "\n", + "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. 
From each polyfun run we obtain, per annotation:\n", + "\n", + "- $\\tau_C$ and its standard error — **(polyfun)**.\n", + "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n", + "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n", + "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n", + "\n", + "We also obtain, per run:\n", + "\n", + "- The total trait heritability $h^2_g$ — **(polyfun)**.\n", + "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n", + "\n", + "#### Quantities from the post-processing layer (pecotmr)\n", + "\n", + "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n", + "\n", + "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n", + "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n", + "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n", + "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n", + "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n", + "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n", + "\n", + "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n", + "\n", + "#### Standardized tau ($\\tau^*$) — (pecotmr)\n", + "\n", + "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 
2017)\n", + "\n", + "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n", + "\n", + "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n", + "\n", + "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n", + "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n", + "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n", + "\n", + "#### Differential per-SNP heritability (\"EnrichStat\") — (polyfun + pecotmr)\n", + "\n", + "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n", + "\n", + "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n", + "\n", + "The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as `Enrichment_p`. 
Its standard error is recovered from the reported p-value:\n", + "\n", + "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n", + "\n", + "This per-trait point + SE is the input to cross-trait meta-analysis.\n", + "\n", + "#### Reporting: binary vs. continuous annotations — (pecotmr)\n", + "\n", + "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n", + "\n", + "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n", + "\n", + "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n", + "\n", + "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n", + "\n", + "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n", + ">\n", + "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. 
Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n", + ">\n", + "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n", + "\n", + "#### Cross-type comparison: always use $\\tau^*_C$ — (pecotmr)\n", + "\n", + "For an apple-to-apple comparison **across binary and continuous annotations** — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion in the category) and $sd_C = $ empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n", + "\n", + "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n", + "\n", + "#### Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)\n", + "\n", + "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. 
With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n", + "\n", + "### Meta-Analysis across Traits (Random Effects) — (pecotmr)\n", + "\n", + "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n", + "\n", + "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n", + "\n", + "where $\\hat\\theta_i$ is the per-trait estimate and $SE_i$ its standard error:\n", + "\n", + "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n", + "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n", + "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n", + "\n", + "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). 
The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n", + "\n", + "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Minimal Working Example (MWE)\n", + "\n", + "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1. `make_annotation_files_ldscore`\n", + "\n", + "*Annotation preparation and S-LDSC regression (polyfun).* This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "#### **Inputs**\n", + "\n", + "##### 1. Target Annotation File\n", + "\n", + "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n", + "- **Formats**:\n", + " - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n", + " - Alternatively, files for specific chromosomes can be provided directly.\n", + " - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). 
Single-target and joint-target analyses are produced automatically in one pipeline pass.\n", + " - **Format** (the score column is optional; if absent, score is set to 1):\n", + " - `is_range = False`:\n", + " ```\n", + " chr pos score\n", + " 1 10001 1\n", + " 1 10002 1\n", + " ```\n", + " - `is_range = True`:\n", + " ```\n", + " chr start end score\n", + " 1 10001 20001 1\n", + " 1 30001 40001 1\n", + " ```\n", + "\n", + "##### 2. Reference Annotation File (baseline-LD)\n", + "\n", + "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n", + "- **Formats**:\n", + " - Text file listing baseline annotation files for all chromosomes.\n", + " - Alternatively, files for specific chromosomes can be provided directly.\n", + "\n", + "##### 3. Genome Reference File\n", + "\n", + "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n", + "- **Formats**:\n", + " - Text file listing per-chromosome reference files, or files for specific chromosomes.\n", + "\n", + "##### 4. 
SNP List\n", + "\n", + "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n", + "- **Format**: A list of `rsid`s, one per line.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/xqtl-protocol\n" + ] + } + ], + "source": [ + "pwd" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mmake_annotation_files_ldscore\u001b[0m: \n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=3) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=5) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=6) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=4) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=7) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=9) is 
\u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=10) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=8) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=11) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=14) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=13) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=12) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=15) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=18) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=16) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=17) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=19) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=21) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=20) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... 
(66 items in 22 groups)\u001b[0m\n", + "INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n", + " --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n", + " --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n", + " --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n", + " --annotation_name protocol_example \\\n", + " --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n", + " --python_exec python \\\n", + " --polyfun_path polyfun \\\n", + " --cwd output/sldsc_ldscore -j 4\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Munge summary statistics (preprocessing, run before Step 2)\n", + "\n", + "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n", + "# --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n", + "# --n 0 \\\n", + "# --min-info 0.6 \\\n", + "# --min-maf 0.001 \\\n", + "# --chi2-cutoff 30 \\\n", + "# --polyfun_path data/github/polyfun \\\n", + "# --cwd data/polyfun_new/example_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 2. `get_heritability`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "**Inputs**\n", + "\n", + "##### 1. 
Allele Frequency Files (`.frq`, our panel)\n", + "\n", + "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n", + "\n", + "##### 2. GWAS Summary Statistics\n", + "\n", + "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n", + "- **Format**:\n", + " ```\n", + " CAD_META.filtered.sumstats.gz\n", + " UKB.Lym.BOLT.sumstats.gz\n", + " ```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. 
Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mget_heritability\u001b[0m: \n", + "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", + "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", + "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", + "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", + "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", + "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", + "INFO: \u001b[32mget_heritability\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mget_heritability\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mget_heritability\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mget_heritability\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... (6 items in 3 groups)\u001b[0m\n", + "INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n", + " --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n", + " --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", + " --sumstat_dir input/enrichment/sldsc \\\n", + " --baseline_ld_dir input/enrichment/sldsc \\\n", + " --weights_dir input/enrichment/sldsc \\\n", + " --plink_name reference. --baseline_name annotations. --weight_name weights. 
\\\n", + " --annotation_name protocol_example --python_exec python \\\n", + " --polyfun_path ../polyfun \\\n", + " --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 3. `Post-processing (pecotmr) and meta-analysis`\n", + "\n", + "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n", + "\n", + "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n", + "\n", + "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n", + "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n", + "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n", + "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n", + "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n", + "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n", + "\n", + "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n", + "\n", + "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n", + "(which contains the $N$ single-target subdirectories and optionally the\n", + "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n", + "to produce per-trait standardized tables and the default random-effects\n", + "meta 
across all traits.\n", + "\n", + "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output — e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mpostprocess\u001b[0m: \n", + "INFO: \u001b[32mpostprocess\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mpostprocess\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds\u001b[0m\n", + "INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n", + " --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", + " --heritability_cwd output/sldsc_heritability \\\n", + " --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n", + " --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n", + " --annotation_name protocol_example --python_exec python \\\n", + " --polyfun_path ../polyfun \\\n", + " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n", + "\n" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 4. `Subset Meta-Analysis (pecotmr::meta_sldsc_random)` (optional)\n", + "\n", + "The default meta in Step 3 pools all traits the user supplied. To re-run the meta on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:\n", + "\n", + "\n", + "```r\n", + "res <- readRDS(\"sldsc_results.rds\")\n", + "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", + "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", + " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", + ")\n", + "```\n", + "\n", + "This step is lightweight and can be run interactively.\n", + "\n", + "\n", + "The default meta in Step 3 pools all traits supplied to `[postprocess]`. Use `[meta_subset]` to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached `.sldsc_postprocess.rds` output; it is lightweight and can be run interactively or in batch.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30.
Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mmeta_subset\u001b[0m: \n", + "INFO: \u001b[32mmeta_subset\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmeta_subset\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds\u001b[0m\n", + "INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n", + " --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n", + " --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n", + " --subset_name category1 --target_categories ANNOT_0 \\\n", + " --annotation_name protocol_example --python_exec python \\\n", + " --polyfun_path ../polyfun \\\n", + " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Output\n", + "\n", + "### Output summary\n", + "\n", + "| Stage | Cached on disk | Recomputable from | Purpose |\n", + "|---|---|---|---|\n", + "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n", + "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n", + "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n", + "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n", + "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Per-stage outputs\n", + "\n", + "Each workflow writes into its `--cwd`:\n", + "\n", + "- **make_annotation_files_ldscore** — polyfun `.annot.gz` files plus 
per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n", + "- **get_heritability** — per trait and per target directory, the S-LDSC regression outputs `.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_` suffix.\n", + "- **postprocess** — a single `.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).\n", + "- **meta_subset** — a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Anticipated Results\n", + "\n", + "Produces per-annotation enrichment statistics (tau, enrichment, p-value) from stratified LD score regression." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface\n", + "\n", + "List all workflows and their options:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "Bash" + }, + "outputs": [], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Workflow implementation\n", + "\n", + "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "# Path to the work directory of the analysis.\n", + "parameter: cwd = path('output')\n", + "# Prefix for the analysis output\n", + "parameter: annotation_name = str\n", + "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n", + "parameter: polyfun_path = path # e.g. \"/home/you/tools/polyfun\"\n", + "\n", + "# MAF cutoff for sLDSC. 
Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n", + "# and HapMap3-style regression weights are common-variant by construction).\n", + "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n", + "# Other values would require recomputing LD scores at that cutoff.\n", + "parameter: maf_cutoff = 0.05\n", + "\n", + "# for make_annotation_files_ldscore workflow:\n", + "parameter: annotation_file = path()\n", + "parameter: reference_anno_file = path()\n", + "parameter: genome_ref_file = path() # with .bed\n", + "parameter: chromosome = []\n", + "parameter: snp_list = path()\n", + "parameter: ld_wind_kb = 0 # use kb if the value is provided\n", + "parameter: ld_wind_cm = 1.0 # default using ld_wind_cm\n", + "\n", + "# for get_heritability workflow.\n", + "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n", + "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n", + "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n", + "parameter: all_traits_file = path() # txt file, each row contains all GWAS summary statistics name: e.g. 
CAD_META.filtered.sumstats.gz\n", + "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n", + "parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore\n", + "parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)\n", + "parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)\n", + "parameter: plink_name = \"ADSP_chr\"\n", + "parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)\n", + "parameter: baseline_name = \"baseline_chr\" # Prefix of baseline annotation files\n", + "parameter: weight_name = \"weights_chr\" # Prefix of LD weights files\n", + "parameter: n_blocks = 200\n", + "\n", + "# Number of threads\n", + "parameter: numThreads = 16\n", + "# For cluster jobs, number commands to run per job\n", + "parameter: job_size = 1\n", + "parameter: walltime = '12h'\n", + "parameter: mem = '16G'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "Python 3 (ipykernel)" + }, + "source": [ + "## Make Annotation File" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[make_annotation_files_ldscore]\n", + "# Annotation preparation. 
Takes one annotation_file with N target annotations\n", + "# and produces, in one invocation, any combination of:\n", + "# - N single-target LD-score directories (when compute_single = TRUE, default)\n", + "# - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n", + "# and N >= 2, default)\n", + "#\n", + "# Outputs per chromosome <chrom>:\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.annot.gz (i in 1..N, when compute_single)\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.l2.ldscore.{parquet|gz}\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.l2.M\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.l2.M_5_50 (when .frq present)\n", + "#\n", + "# <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chrom>.{...} (when compute_joint and N>=2)\n", + "#\n", + "# Workflows:\n", + "# - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n", + "# Produces both, fits the case where you have already chosen the joint set.\n", + "# - Workflow B (\"exploratory then conditional\"):\n", + "# Step 1: compute_single=TRUE, compute_joint=FALSE.\n", + "# Run on N candidate annotations -> N single-target dirs.\n", + "# Inspect single-target results, identify K significant ones.\n", + "# Step 2: compute_single=FALSE, compute_joint=TRUE.\n", + "# Run on a NEW annotation_file with the K selected annotations\n", + "# -> 1 joint dir with the conditional model.\n", + "\n", + "#\n", + "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n", + "# column name, and the CM requirement ---\n", + "# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz\n", + "# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet\n", + "#\n", + "# LD-score column name (this is what becomes the .results \"Category\" in\n", + "# [get_heritability], with a \"_<index>\" suffix appended there):\n", + "# * compute_ldscores.py ALWAYS keeps the annot column name(s):\n", + "# single annot column \"ANNOT\" -> ldscore column \"ANNOT\"\n", + "# joint annot columns \"ANNOT_1\",\"ANNOT_2\",...
-> \"ANNOT_1\",\"ANNOT_2\",...\n", + "# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n", + "# HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n", + "# column name. With >=2 annotations it uses \"<colname>L2\"\n", + "# (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n", + "# => a single-target snplist run reports \"L2_0\" in .results, while a\n", + "# single-target no-snplist run reports \"ANNOT_0\". [postprocess] auto-\n", + "# detects either; only matters if you pass --target-categories explicitly.\n", + "#\n", + "# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the\n", + "# target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n", + "# the plink .bim (same SNP set, same row order). This step handles both\n", + "# internally (normalize_for_ldsc: takes CM from the .bim 3rd column, re-expands\n", + "# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n", + "# carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n", + "# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n", + "#\n", + "parameter: compute_single = True\n", + "parameter: compute_joint = True\n", + "parameter: score_column = 3\n", + "parameter: is_range = False\n", + "\n", + "import pandas as pd\n", + "import os\n", + "\n", + "if not (compute_single or compute_joint):\n", + " raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE\")\n", + "\n", + "def adapt_file_path(file_path, reference_file):\n", + " reference_path = os.path.dirname(reference_file)\n", + " if os.path.isfile(file_path):\n", + " return file_path\n", + " file_name = os.path.basename(file_path)\n", + " if os.path.isfile(file_name):\n", + " return file_name\n", + " file_in_ref_dir = os.path.join(reference_path, file_name)\n", + " if os.path.isfile(file_in_ref_dir):\n", + " return file_in_ref_dir\n", + " file_prefixed =
os.path.join(reference_path, file_path)\n", + " if os.path.isfile(file_prefixed):\n", + " return file_prefixed\n", + " raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n", + "\n", + "\n", + "# ---- Parse inputs and determine N ----\n", + "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n", + " str(reference_anno_file).endswith('annot.gz')):\n", + " # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n", + " target_files_direct = str(annotation_file).split(',')\n", + " N_targets = len(target_files_direct)\n", + " target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n", + " input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n", + " if len(chromosome) > 0:\n", + " input_chroms = [int(x) for x in chromosome]\n", + " else:\n", + " input_chroms = [0]\n", + "else:\n", + " # Case 2: txt list with #id and one or more 'path' columns\n", + " target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n", + " reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n", + " genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n", + "\n", + " target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n", + " reference_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n", + " genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n", + "\n", + " path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n", + " N_targets = len(path_columns)\n", + " target_names = path_columns[:] # 'path', 'path1', 'path2', ...\n", + "\n", + " for col in path_columns:\n", + " target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n", + " reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n", 
+ " genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n", + "\n", + " merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n", + " if len(chromosome) > 0:\n", + " merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n", + "\n", + " rows = merged.values.tolist()\n", + " input_chroms = [r[0] for r in rows]\n", + " input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n", + "\n", + "# ---- Determine output format ----\n", + "use_print_snps = snp_list.is_file()\n", + "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n", + "\n", + "if ld_wind_kb > 0:\n", + " use_kb_window = True\n", + " ld_window_param = ld_wind_kb\n", + " ld_window_flag = \"--ld-wind-kb\"\n", + "else:\n", + " use_kb_window = False\n", + " ld_window_param = ld_wind_cm\n", + " ld_window_flag = \"--ld-wind-cm\"\n", + "\n", + "emit_single = compute_single\n", + "emit_joint = compute_joint and N_targets >= 2\n", + "\n", + "# ---- Build per-chromosome output list ----\n", + "def chrom_outputs(chrom):\n", + " outs = []\n", + " if emit_single:\n", + " for i in range(N_targets):\n", + " name = f\"{annotation_name}_single_{i+1}\"\n", + " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", + " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", + " if emit_joint:\n", + " name = f\"{annotation_name}_joint\"\n", + " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", + " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", + " return outs\n", + "\n", + "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n", + "\n", + "output: chrom_outputs(input_chroms[_index])\n", + "\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n", + "\n", + "# 
----------------------------------------------------------------------------\n", + "# Step A: write the requested .annot files for this chromosome.\n", + "# ----------------------------------------------------------------------------\n", + "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", + " library(data.table)\n", + "\n", + " clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n", + "\n", + " process_range_data <- function(data, chr_value) {\n", + " data$chr <- clean_chr(data$chr)\n", + " data <- data[data$chr == chr_value,]\n", + " if (nrow(data) == 0) return(NULL)\n", + " expanded <- lapply(seq_len(nrow(data)), function(j) {\n", + " row <- data[j,]\n", + " pos_seq <- seq(row$start, row$end - 1)\n", + " result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n", + " if (ncol(data) > 3) {\n", + " for (col in 4:ncol(data))\n", + " result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n", + " }\n", + " result\n", + " })\n", + " unique(rbindlist(expanded))\n", + " }\n", + "\n", + " process_annotation <- function(target_anno, ref_anno, score_column_value) {\n", + " target_anno <- as.data.frame(target_anno)\n", + " ref_anno <- as.data.frame(ref_anno)\n", + " target_anno$chr <- clean_chr(target_anno$chr)\n", + " ref_anno$CHR <- clean_chr(ref_anno$CHR)\n", + " chr_value <- unique(ref_anno$CHR)\n", + " anno_scores <- rep(0, nrow(ref_anno))\n", + " match_pos <- match(target_anno$pos, ref_anno$BP)\n", + " valid_pos <- as.numeric(na.omit(match_pos))\n", + " if (score_column_value <= ncol(target_anno)) {\n", + " anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n", + " } else {\n", + " anno_scores[valid_pos] <- 1\n", + " print(\"Warning: score column does not exist; setting scores to 1\")\n", + " }\n", + " anno_scores\n", + " }\n", + "\n", + " read_target_anno <- function(file_path, ref_anno) {\n", + " if (endsWith(file_path, \"rds\")) {\n", + " target_anno <- 
readRDS(file_path)\n", + " return(process_annotation(target_anno, ref_anno, ${score_column}))\n", + " }\n", + " target_anno <- fread(file_path)\n", + " if (${\"TRUE\" if is_range else \"FALSE\"}) {\n", + " names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n", + " target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n", + " if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n", + " } else {\n", + " names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n", + " }\n", + " process_annotation(target_anno, ref_anno, ${score_column})\n", + " }\n", + "\n", + " # ---- Read reference annotation ----\n", + " ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n", + " if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n", + "\n", + " # ---- Compute per-target annotation scores ----\n", + " target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n", + " N_local <- length(target_files)\n", + " score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n", + "\n", + " emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n", + " emit_joint_local <- ${\"TRUE\" if emit_joint else \"FALSE\"}\n", + " use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n", + " bfile_prefix <- \"${_input[-1]:na}\"\n", + "\n", + " # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n", + " # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n", + " normalize_for_ldsc <- function(df) {\n", + " if (!use_print_snps_local) return(df)\n", + " df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n", + " annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n", + " bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n", + " col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n", + " bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n", + " idx <- 
match(bim$SNP, df$SNP)\n", + " out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n", + " stringsAsFactors = FALSE)\n", + " for (col in annot_cols) {\n", + " v <- rep(0, nrow(bim))\n", + " non_na <- !is.na(idx)\n", + " v[non_na] <- df[[col]][idx[non_na]]\n", + " out[[col]] <- v\n", + " }\n", + " out\n", + " }\n", + "\n", + " # ---- Write N single-target .annot files (when requested) ----\n", + " if (emit_single_local) {\n", + " for (i in seq_len(N_local)) {\n", + " out_anno <- ref_anno\n", + " out_anno$ANNOT <- score_list[[i]]\n", + " out_anno <- normalize_for_ldsc(out_anno)\n", + " name <- paste0(\"${annotation_name}\", \"_single_\", i)\n", + " out_path_gz <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n", + " out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n", + " dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n", + " fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", + " }\n", + " }\n", + "\n", + " # ---- Optionally write joint .annot ----\n", + " if (emit_joint_local) {\n", + " joint_anno <- ref_anno\n", + " for (i in seq_len(N_local)) {\n", + " joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n", + " }\n", + " joint_anno <- normalize_for_ldsc(joint_anno)\n", + " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", + " joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n", + " joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n", + " dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n", + " fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", + " }\n", + "\n", + "# ----------------------------------------------------------------------------\n", + "# Step B: gzip all annot files. 
Uses expand=\"$[ ]\" so bash ${var} survives.\n", + "# ----------------------------------------------------------------------------\n", + "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", + " set -e\n", + " annots=()\n", + " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", + " for i in $(seq 1 $[N_targets]); do\n", + " annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n", + " done\n", + " fi\n", + " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", + " annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n", + " fi\n", + " for a in \"${annots[@]}\"; do\n", + " gzip -f \"$a\"\n", + " done\n", + "\n", + "# ----------------------------------------------------------------------------\n", + "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n", + "# ----------------------------------------------------------------------------\n", + "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n", + " set -e\n", + " chrom=\"$[input_chroms[_index]]\"\n", + "\n", + " run_polyfun() {\n", + " local annot=\"$1\"\n", + " local out_prefix=\"$2\"\n", + " if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n", + " $[python_exec] $[polyfun_path]/ldsc.py \\\n", + " --print-snps $[snp_list] \\\n", + " $[ld_window_flag] $[ld_window_param] \\\n", + " --out \"$out_prefix\" \\\n", + " --bfile $[_input[-1]:nar] \\\n", + " --yes-really \\\n", + " --annot \"$annot\" \\\n", + " --l2\n", + " else\n", + " $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n", + " --annot \"$annot\" \\\n", + " --bfile $[_input[-1]:nar] \\\n", + " $[ld_window_flag] $[ld_window_param] \\\n", + " --out \"${out_prefix}.$[ldscore_ext]\" \\\n", + " --allow-missing\n", + " fi\n", + " }\n", + "\n", + " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", + " for i in $(seq 1 $[N_targets]); do\n", + " 
name=\"$[annotation_name]_single_$i\"\n", + " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", + " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", + " run_polyfun \"$annot\" \"$prefix\"\n", + " done\n", + " fi\n", + " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", + " name=\"$[annotation_name]_joint\"\n", + " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", + " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", + " run_polyfun \"$annot\" \"$prefix\"\n", + " fi\n", + "\n", + "# ----------------------------------------------------------------------------\n", + "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n", + "# ----------------------------------------------------------------------------\n", + "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n", + " suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n", + " use_print_snps <- ${str(use_print_snps).upper()}\n", + "\n", + " chrom <- \"${input_chroms[_index]}\"\n", + " # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n", + " frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n", + " has_frq <- file.exists(frq_file)\n", + " frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n", + "\n", + " write_M_files <- function(annot_path, ldscore_path, m_path) {\n", + " if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n", + " cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n", + " }\n", + " ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n", + " suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n", + " } else fread(ldscore_path)\n", + " annot_dt <- fread(annot_path)\n", + " annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n", + " merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n", + " 
std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n", + " annot_cols <- setdiff(names(merged), std_cols)\n", + " if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n", + " M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", + " writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n", + " if (has_frq) {\n", + " common <- merged[!is.na(MAF) & MAF > 0.05, ]\n", + " M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", + " writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n", + " }\n", + " }\n", + "\n", + " targets <- c()\n", + " if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n", + " for (i in seq_len(${N_targets})) {\n", + " targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n", + " }\n", + " }\n", + " if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n", + " targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n", + " }\n", + " for (name in targets) {\n", + " annot_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n", + " ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n", + " m_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n", + " write_M_files(annot_path, ldscore_path, m_path)\n", + " }\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "Python 3 (ipykernel)" + }, + "source": [ + "## Calculate Functional Enrichment using Annotations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[get_heritability]\n", + "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n", + "# each (trait, target_dir) pair becomes one polyfun invocation. 
Outputs go to\n", + "# //.{results,log,part_delete}.\n", + "#\n", + "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n", + "# typically N _single_ directories plus optionally one _joint directory.\n", + "\n", + "#\n", + "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n", + "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n", + "# always gets exactly two LD-score sources, in this order:\n", + "# \"/.\" (index 0) , \"/\" (index 1)\n", + "# With --overlap-annot, every annotation column in the .results \"Category\" is\n", + "# named _:\n", + "# index 0 = the target file -> \"ANNOT_0\" (no-snplist; compute_ldscores.py keeps the annot col name)\n", + "# -> \"L2_0\" (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n", + "# -> \"ANNOT_1_0\",\"ANNOT_2_0\" (no-snplist joint dir, N>=2 annot cols)\n", + "# -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\" (snplist joint dir, N>=2 -> \"L2\")\n", + "# index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ... (the 97 baseline annots)\n", + "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n", + "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n", + "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's\n", + "# \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n", + "# column name.) 
[postprocess] auto-detects the target Category; if you instead pass\n", + "# --target-categories, the names must match this column exactly.\n", + "#\n", + "parameter: target_anno_dirs = paths()\n", + "parameter: all_traits = []\n", + "\n", + "import os\n", + "\n", + "with open(all_traits_file, 'r') as f:\n", + " trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n", + "\n", + "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n", + "input_list = []\n", + "target_meta = []\n", + "for td in target_anno_dirs:\n", + " for t in trait_paths:\n", + " input_list.append(t)\n", + " target_meta.append(str(td))\n", + "\n", + "input: input_list, group_by = 1, group_with = \"target_meta\"\n", + "\n", + "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\", \\\n", + " f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n", + "\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", + "\n", + "bash: expand = \"${ }\"\n", + " target_dir=\"${target_meta[_index]}\"\n", + " target_name=\"$(basename ${target_meta[_index]})\"\n", + " trait=\"$(basename ${_input[0]})\"\n", + " output_dir=\"${cwd:a}/$target_name\"\n", + " mkdir -p \"$output_dir\"\n", + "\n", + " # MAF cutoff handling. 
Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n", + " # other values would require recomputing LD scores at that cutoff.\n", + " frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n", + " if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n", + " echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n", + " frq_option=\"--not-M-5-50\"\n", + " elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n", + " if [ -f \"$frq_file_check\" ]; then\n", + " echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n", + " frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n", + " else\n", + " echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n", + " echo \" but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n", + " echo \" Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n", + " exit 1\n", + " fi\n", + " else\n", + " echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n", + " echo \" 0.05 (sLDSC default) are accepted. 
Other values would require\"\n", + " echo \" recomputing LD scores at that cutoff.\"\n", + " exit 1\n", + " fi\n", + "\n", + " run_ldsc() {\n", + " local extra_args=\"$1\"\n", + " ${python_exec} ${polyfun_path}/ldsc.py \\\n", + " --h2 ${sumstat_dir}/$trait \\\n", + " --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n", + " --out \"$output_dir/$trait\" \\\n", + " --overlap-annot \\\n", + " --w-ld-chr ${weights_dir}/${weight_name} \\\n", + " $frq_option \\\n", + " --print-coefficients \\\n", + " --print-delete-vals \\\n", + " --n-blocks ${n_blocks} \\\n", + " $extra_args\n", + " }\n", + "\n", + " run_ldsc \"\"\n", + " log_file=\"$output_dir/$trait.log\"\n", + "\n", + " # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n", + " for max in 30 20 10; do\n", + " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", + " echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n", + " run_ldsc \"--chisq-max $max\"\n", + " else\n", + " break\n", + " fi\n", + " done\n", + "\n", + " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", + " echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n", + " echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n", + " fi\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[munge_sumstats_polyfun]\n", + "parameter: sumstats = path\n", + "parameter: n = 0\n", + "parameter: min_info = 0.6\n", + "parameter: min_maf = 0.001\n", + "parameter: keep_hla = False\n", + "parameter: chi2_cut = 30\n", + "input: sumstats\n", + "output: f\"{_input:n}.munged.parquet\"\n", + "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n", + " 
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n", + " --sumstats {_input} \\\n", + " --out {_output} \\\n", + " {'--n {}'.format(n) if n>0 else ''} \\\n", + " {'--min-info {}'.format(min_info)} \\\n", + " {'--min-maf {}'.format(min_maf)} \\\n", + " {'--chi2-cutoff {}'.format(chi2_cut)} \\\n", + " {'--keep-hla' if keep_hla else ''} \\\n", + " --remove-strand-ambig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[postprocess]\n", + "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n", + "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n", + "# single-target and (when present) joint-target runs, computes Gazal-style\n", + "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n", + "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n", + "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n", + "\n", + "parameter: traits_file = path() # text file: one trait sumstats filename per line\n", + "parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains _single_/ subdirs and optionally _joint/)\n", + "parameter: target_categories = [] # target annotation names. 
Auto-detected from the joint-run results if empty.\n", + "parameter: target_categories_label = [] # optional display names, same order as target_categories;\n", + " # when given, every \"target\" column / tau*-block colname in\n", + " # the output RDS is renamed to these (params$target_categories\n", + " # holds the labels, params$target_categories_orig the originals).\n", + "parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n", + "\n", + "input: traits_file\n", + "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", + "\n", + "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", + " library(pecotmr)\n", + "\n", + " traits <- readLines(\"${traits_file}\")\n", + " target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n", + " target_lab <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n", + "\n", + " # Auto-detect single-target and joint-target output directories.\n", + " her_root <- \"${heritability_cwd}\"\n", + " all_subdirs <- list.dirs(her_root, recursive = FALSE)\n", + " single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n", + " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", + " single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n", + " single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n", + " single_dirs <- single_dirs[order(single_indices)]\n", + " joint_dir <- file.path(her_root, joint_name)\n", + " has_joint <- dir.exists(joint_dir)\n", + "\n", + " message(sprintf(\"Detected %d single-target dirs%s\",\n", + " length(single_dirs),\n", + " if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n", + "\n", + " # Build per-trait prefix 
maps. Each trait's polyfun output is at /\n", + " # (polyfun appends .results / .log / .part_delete).\n", + " trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n", + " names(trait_single_prefixes) <- traits\n", + "\n", + " if (has_joint) {\n", + " trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n", + " } else {\n", + " trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n", + " }\n", + "\n", + " res <- sldsc_postprocessing_pipeline(\n", + " trait_single_prefixes = trait_single_prefixes,\n", + " trait_joint_prefix = trait_joint_prefix,\n", + " target_anno_dir = \"${target_anno_dir}\",\n", + " frqfile_dir = \"${frqfile_dir}\",\n", + " plink_name = \"${plink_name}\",\n", + " maf_cutoff = ${maf_cutoff},\n", + " target_categories = if (length(target_cats) > 0) target_cats else NULL,\n", + " target_labels = if (length(target_lab) > 0) target_lab else NULL\n", + " )\n", + "\n", + " saveRDS(res, \"${_output[0]}\")\n", + " message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[meta_subset]\n", + "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n", + "# the cached per-trait standardized results from [postprocess]. 
No regression rerun.\n", + "\n", + "parameter: postprocess_rds = path() # output of [postprocess]\n", + "parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]\n", + "parameter: subset_name = str # label used in the output filename\n", + "parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output\n", + "# If [postprocess] was run with --target-categories-label, the cached RDS already\n", + "# carries the display names (params$target_categories = the labels), so leave\n", + "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n", + "\n", + "input: postprocess_rds, subset_traits_file\n", + "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", + "\n", + "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", + " library(pecotmr)\n", + "\n", + " res <- readRDS(\"${postprocess_rds}\")\n", + " subset_traits <- readLines(\"${subset_traits_file}\")\n", + " target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n", + " if (length(target_cats) == 0) target_cats <- res$params$target_categories\n", + "\n", + " subset_per_trait <- res$per_trait[subset_traits]\n", + "\n", + " # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n", + " view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n", + " view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n", + "\n", + " out <- list(\n", + " tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"tau_star\")), target_cats),\n", + " tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, \"tau_star\")), target_cats),\n", + " enrichment = setNames(lapply(target_cats, function(c) 
meta_sldsc_random(view_single, c, \"enrichment\")), target_cats),\n", + " enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichstat\")), target_cats)\n", + " )\n", + "\n", + " saveRDS(out, \"${_output[0]}\")\n", + " message(\"Subset meta complete; results written to ${_output[0]}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "calysto_bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "R", + "ir", + "R", + "#DCDCDA", + "r" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "panel": { + "displayed": true, + "height": 0 + }, + "version": "0.22.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 304c2e7401e51469a7d8950066e0f630ca45d1c4 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 23 Jun 2026 12:11:19 -0400 Subject: [PATCH 5/6] Delete code/SoS/enrichment/sldsc_enrichment.ipynb --- code/SoS/enrichment/sldsc_enrichment.ipynb | 1491 -------------------- 1 file changed, 1491 deletions(-) delete mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb deleted file mode 100644 index 0569c353..00000000 --- a/code/SoS/enrichment/sldsc_enrichment.ipynb +++ /dev/null @@ -1,1491 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Stratified LD Score Regression (S-LDSC) Enrichment\n", - "\n", - "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. 
The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n", - "\n", - "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Description\n", - "This notebook implements the pipeline of [S-LDSC](https://www.nature.com/articles/ng.3404) for LD score and functional enrichment analysis.\n", - "\n", - "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not the original LDSC from `bulik/ldsc` GitHub repo.**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "Markdown" - }, - "source": [ - "Uses GWAS summary statistics together with annotation and LD reference-panel data to compute per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.\n", - "\n", - "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. 
population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Methods - Workflow Overview\n", - "\n", - "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n", - "\n", - "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n", - "\n", - "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing 
three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n", - "\n", - "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n", - "\n", - "```r\n", - "res <- readRDS(\"sldsc_results.rds\")\n", - "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", - "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", - " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", - ")\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Theory\n", - "\n", - "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### LDSC model\n", - "\n", - "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n", - "\n", - "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n", - "\n", - "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. 
(2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.\n", - "\n", - "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Stratified LDSC\n", - "\n", - "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n", - "\n", - "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n", - "\n", - "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n", - "\n", - "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n", - "\n", - "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n", - "- **Continuous annotation** (e.g. 
gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n", - "\n", - "Under a polygenic model the per-SNP heritability for SNP $j$ is\n", - "\n", - "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n", - "\n", - "and the expected $\\chi^2$ statistic of SNP $j$ is\n", - "\n", - "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n", - "\n", - "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n", - "\n", - "Interpretation of $\\tau_C$:\n", - "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n", - "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n", - "\n", - "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Tau Estimation and Enrichment Analysis\n", - "\n", - "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n", - "\n", - "The pipeline has two computational layers:\n", - "\n", - "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. 
We do not re-implement this.\n", - "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n", - "\n", - "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n", - "\n", - "#### Notation\n", - "\n", - "For each annotation $C$ we use:\n", - "\n", - "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n", - "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n", - "\n", - "#### Reference panel and MAF cutoff\n", - "\n", - "All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n", - "\n", - "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n", - "\n", - "#### Quantities from the regression layer (polyfun)\n", - "\n", - "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. 
From each polyfun run we obtain, per annotation:\n", - "\n", - "- $\\tau_C$ and its standard error — **(polyfun)**.\n", - "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n", - "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n", - "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n", - "\n", - "We also obtain, per run:\n", - "\n", - "- The total trait heritability $h^2_g$ — **(polyfun)**.\n", - "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n", - "\n", - "#### Quantities from the post-processing layer (pecotmr)\n", - "\n", - "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n", - "\n", - "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n", - "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n", - "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n", - "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n", - "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n", - "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n", - "\n", - "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n", - "\n", - "#### Standardized tau ($\\tau^*$) — (pecotmr)\n", - "\n", - "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 
2017)\n", - "\n", - "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n", - "\n", - "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n", - "\n", - "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n", - "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n", - "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n", - "\n", - "#### Differential per-SNP heritability (\"EnrichStat\") — (polyfun + pecotmr)\n", - "\n", - "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n", - "\n", - "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n", - "\n", - "The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as `Enrichment_p`. 
Its standard error is recovered from the reported p-value:\n", - "\n", - "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n", - "\n", - "This per-trait point + SE is the input to cross-trait meta-analysis.\n", - "\n", - "#### Reporting: binary vs. continuous annotations — (pecotmr)\n", - "\n", - "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n", - "\n", - "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n", - "\n", - "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n", - "\n", - "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n", - "\n", - "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n", - ">\n", - "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. 
Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n", - ">\n", - "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n", - "\n", - "#### Cross-type comparison: always use $\\tau^*_C$ — (pecotmr)\n", - "\n", - "For an apple-to-apple comparison **across binary and continuous annotations** — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion in the category) and $sd_C = $ empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n", - "\n", - "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n", - "\n", - "#### Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)\n", - "\n", - "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. 
With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n", - "\n", - "### Meta-Analysis across Traits (Random Effects) — (pecotmr)\n", - "\n", - "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n", - "\n", - "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n", - "\n", - "where $\\hat\\theta_i$ is the per-trait estimate and $SE_i$ its standard error:\n", - "\n", - "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n", - "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n", - "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n", - "\n", - "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). 
The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n", - "\n", - "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Minimal Working Example (MWE)\n", - "\n", - "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 1. `make_annotation_files_ldscore`\n", - "\n", - "*Annotation preparation and S-LDSC regression (polyfun).* This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "#### **Inputs**\n", - "\n", - "##### 1. Target Annotation File\n", - "\n", - "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n", - "- **Formats**:\n", - " - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n", - " - Alternatively, files for specific chromosomes can be provided directly.\n", - " - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). 
Single-target and joint-target analyses are produced automatically in one pipeline pass.\n", - " - **Format** (the score column is optional; if absent, score is set to 1):\n", - " - `is_range = False`:\n", - " ```\n", - " chr pos score\n", - " 1 10001 1\n", - " 1 10002 1\n", - " ```\n", - " - `is_range = True`:\n", - " ```\n", - " chr start end score\n", - " 1 10001 20001 1\n", - " 1 30001 40001 1\n", - " ```\n", - "\n", - "##### 2. Reference Annotation File (baseline-LD)\n", - "\n", - "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n", - "- **Formats**:\n", - " - Text file listing baseline annotation files for all chromosomes.\n", - " - Alternatively, files for specific chromosomes can be provided directly.\n", - "\n", - "##### 3. Genome Reference File\n", - "\n", - "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n", - "- **Formats**:\n", - " - Text file listing per-chromosome reference files, or files for specific chromosomes.\n", - "\n", - "##### 4. 
SNP List\n", - "\n", - "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n", - "- **Format**: A list of `rsid`s, one per line.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "kernel": "Bash" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/restricted/projectnb/xqtl/jaempawi/xqtl-protocol\n" - ] - } - ], - "source": [ - "pwd" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "kernel": "Bash" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", - " import pkg_resources\n", - "INFO: Running \u001b[32mmake_annotation_files_ldscore\u001b[0m: \n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=3) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=5) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=6) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=4) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=7) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=9) is 
\u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=10) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=8) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=11) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=14) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=13) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=12) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=15) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=18) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=16) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=17) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=19) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=21) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=20) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... 
(66 items in 22 groups)\u001b[0m\n", - "INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.\n" - ] - } - ], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n", - " --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n", - " --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n", - " --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n", - " --annotation_name protocol_example \\\n", - " --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n", - " --python_exec python \\\n", - " --polyfun_path polyfun \\\n", - " --cwd output/sldsc_ldscore -j 4\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Munge summary statistics (preprocessing, run before Step 2)\n", - "\n", - "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n", - "# --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n", - "# --n 0 \\\n", - "# --min-info 0.6 \\\n", - "# --min-maf 0.001 \\\n", - "# --chi2-cutoff 30 \\\n", - "# --polyfun_path data/github/polyfun \\\n", - "# --cwd data/polyfun_new/example_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 2. `get_heritability`\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "**Inputs**\n", - "\n", - "##### 1. 
Allele Frequency Files (`.frq`, our panel)\n", - "\n", - "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n", - "\n", - "##### 2. GWAS Summary Statistics\n", - "\n", - "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n", - "- **Format**:\n", - " ```\n", - " CAD_META.filtered.sumstats.gz\n", - " UKB.Lym.BOLT.sumstats.gz\n", - " ```\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "kernel": "Bash" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. 
Refrain from using this package or pin to Setuptools<81.\n", - " import pkg_resources\n", - "INFO: Running \u001b[32mget_heritability\u001b[0m: \n", - "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", - "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", - "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", - "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", - "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", - "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", - "INFO: \u001b[32mget_heritability\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mget_heritability\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mget_heritability\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mget_heritability\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... (6 items in 3 groups)\u001b[0m\n", - "INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.\n" - ] - } - ], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n", - " --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n", - " --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", - " --sumstat_dir input/enrichment/sldsc \\\n", - " --baseline_ld_dir input/enrichment/sldsc \\\n", - " --weights_dir input/enrichment/sldsc \\\n", - " --plink_name reference. --baseline_name annotations. --weight_name weights. 
\\\n", - " --annotation_name protocol_example --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 3. `Post-processing (pecotmr) and meta-analysis`\n", - "\n", - "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n", - "\n", - "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n", - "\n", - "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n", - "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n", - "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n", - "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n", - "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n", - "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n", - "\n", - "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n", - "\n", - "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n", - "(which contains the $N$ single-target subdirectories and optionally the\n", - "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n", - "to produce per-trait standardized tables and the default random-effects\n", - "meta 
across all traits.\n", - "\n", - "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output — e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "kernel": "Bash" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", - " import pkg_resources\n", - "INFO: Running \u001b[32mpostprocess\u001b[0m: \n", - "INFO: \u001b[32mpostprocess\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mpostprocess\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds\u001b[0m\n", - "INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.\n" - ] - } - ], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n", - " --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", - " --heritability_cwd output/sldsc_heritability \\\n", - " --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n", - " --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n", - " --annotation_name protocol_example --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n", - "\n" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 4. Subset Meta-Analysis (`pecotmr::meta_sldsc_random`) (optional)\n", - "\n", - "The default meta in Step 3 pools all traits the user supplied. To re-run the meta on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:\n", - "\n", - "\n", - "```r\n", - "res <- readRDS(\"sldsc_results.rds\")\n", - "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", - "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", - " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", - ")\n", - "```\n", - "\n", - "This step is lightweight and can be run interactively.\n", - "\n", - "\n", - "The default meta in Step 3 pools all traits supplied to `[postprocess]`. Use `[meta_subset]` to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached `.sldsc_postprocess.rds` output; it is lightweight and can be run interactively or in batch.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "kernel": "Bash" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30.
Refrain from using this package or pin to Setuptools<81.\n", - " import pkg_resources\n", - "INFO: Running \u001b[32mmeta_subset\u001b[0m: \n", - "INFO: \u001b[32mmeta_subset\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", - "INFO: \u001b[32mmeta_subset\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds\u001b[0m\n", - "INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.\n" - ] - } - ], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n", - " --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n", - " --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n", - " --subset_name category1 --target_categories ANNOT_0 \\\n", - " --annotation_name protocol_example --python_exec python \\\n", - " --polyfun_path ../polyfun \\\n", - " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Output\n", - "\n", - "### Output summary\n", - "\n", - "| Stage | Cached on disk | Recomputable from | Purpose |\n", - "|---|---|---|---|\n", - "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n", - "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n", - "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n", - "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n", - "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Per-stage outputs\n", - "\n", - "Each workflow writes into its `--cwd`:\n", - "\n", - "- **make_annotation_files_ldscore** — polyfun `.annot.gz` files plus 
per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n", - "- **get_heritability** — per trait and per target directory, the S-LDSC regression outputs `.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_` suffix.\n", - "- **postprocess** — a single `.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).\n", - "- **meta_subset** — a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Anticipated Results\n", - "\n", - "Produces per-annotation enrichment statistics (tau, enrichment, p-value) from stratified LD score regression." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface\n", - "\n", - "List all workflows and their options:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [ - "sos run pipeline/sldsc_enrichment.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Workflow implementation\n", - "\n", - "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "# Path to the work directory of the analysis.\n", - "parameter: cwd = path('output')\n", - "# Prefix for the analysis output\n", - "parameter: annotation_name = str\n", - "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n", - "parameter: polyfun_path = path # e.g. \"/home/you/tools/polyfun\"\n", - "\n", - "# MAF cutoff for sLDSC. 
Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n", - "# and HapMap3-style regression weights are common-variant by construction).\n", - "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n", - "# Other values would require recomputing LD scores at that cutoff.\n", - "parameter: maf_cutoff = 0.05\n", - "\n", - "# for make_annotation_files_ldscore workflow:\n", - "parameter: annotation_file = path()\n", - "parameter: reference_anno_file = path()\n", - "parameter: genome_ref_file = path() # with .bed\n", - "parameter: chromosome = []\n", - "parameter: snp_list = path()\n", - "parameter: ld_wind_kb = 0 # use kb if the value is provided\n", - "parameter: ld_wind_cm = 1.0 # default using ld_wind_cm\n", - "\n", - "# for get_heritability workflow.\n", - "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n", - "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n", - "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n", - "parameter: all_traits_file = path() # txt file, each row contains all GWAS summary statistics name: e.g. 
CAD_META.filtered.sumstats.gz\n", - "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n", - "parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore\n", - "parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)\n", - "parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)\n", - "parameter: plink_name = \"ADSP_chr\"\n", - "parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)\n", - "parameter: baseline_name = \"baseline_chr\" # Prefix of baseline annotation files\n", - "parameter: weight_name = \"weights_chr\" # Prefix of LD weights files\n", - "parameter: n_blocks = 200\n", - "\n", - "# Number of threads\n", - "parameter: numThreads = 16\n", - "# For cluster jobs, number of commands to run per job\n", - "parameter: job_size = 1\n", - "parameter: walltime = '12h'\n", - "parameter: mem = '16G'" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "Python 3 (ipykernel)" - }, - "source": [ - "## Make Annotation File" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[make_annotation_files_ldscore]\n", - "# Annotation preparation.
Takes one annotation_file with N target annotations\n", - "# and produces, in one invocation, any combination of:\n", - "# - N single-target LD-score directories (when compute_single = TRUE, default)\n", - "# - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n", - "# and N >= 2, default)\n", - "#\n", - "# Outputs per chromosome <chr>:\n", - "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz (i in 1..N, when compute_single)\n", - "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}\n", - "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M\n", - "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50 (when .frq present)\n", - "#\n", - "# <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...} (when compute_joint and N>=2)\n", - "#\n", - "# Workflows:\n", - "# - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n", - "# Produces both, fits the case where you have already chosen the joint set.\n", - "# - Workflow B (\"exploratory then conditional\"):\n", - "# Step 1: compute_single=TRUE, compute_joint=FALSE.\n", - "# Run on N candidate annotations -> N single-target dirs.\n", - "# Inspect single-target results, identify K significant ones.\n", - "# Step 2: compute_single=FALSE, compute_joint=TRUE.\n", - "# Run on a NEW annotation_file with the K selected annotations\n", - "# -> 1 joint dir with the conditional model.\n", - "\n", - "#\n", - "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n", - "# column name, and the CM requirement ---\n", - "# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz\n", - "# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet\n", - "#\n", - "# LD-score column name (this is what becomes the .results \"Category\" in\n", - "# [get_heritability], with a \"_<index>\" suffix appended there):\n", - "# * compute_ldscores.py ALWAYS keeps the annot column name(s):\n", - "# single annot column \"ANNOT\" -> ldscore column \"ANNOT\"\n", - "# joint annot columns \"ANNOT_1\",\"ANNOT_2\",...
-> \"ANNOT_1\",\"ANNOT_2\",...\n", - "# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n", - "# HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n", - "# column name. With >=2 annotations it uses \"<annot>L2\"\n", - "# (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n", - "# => a single-target snplist run reports \"L2_0\" in .results, while a\n", - "# single-target no-snplist run reports \"ANNOT_0\". [postprocess] auto-\n", - "# detects either; only matters if you pass --target-categories explicitly.\n", - "#\n", - "# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the\n", - "# target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n", - "# the plink .bim (same SNP set, same row order). This step handles both\n", - "# internally (normalize_for_ldsc: takes CM from the .bim 3rd column, re-expands\n", - "# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n", - "# carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n", - "# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n", - "#\n", - "parameter: compute_single = True\n", - "parameter: compute_joint = True\n", - "parameter: score_column = 3\n", - "parameter: is_range = False\n", - "\n", - "import pandas as pd\n", - "import os\n", - "\n", - "if not (compute_single or compute_joint):\n", - " raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE\")\n", - "\n", - "def adapt_file_path(file_path, reference_file):\n", - " reference_path = os.path.dirname(reference_file)\n", - " if os.path.isfile(file_path):\n", - " return file_path\n", - " file_name = os.path.basename(file_path)\n", - " if os.path.isfile(file_name):\n", - " return file_name\n", - " file_in_ref_dir = os.path.join(reference_path, file_name)\n", - " if os.path.isfile(file_in_ref_dir):\n", - " return file_in_ref_dir\n", - " file_prefixed =
os.path.join(reference_path, file_path)\n", - " if os.path.isfile(file_prefixed):\n", - " return file_prefixed\n", - " raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n", - "\n", - "\n", - "# ---- Parse inputs and determine N ----\n", - "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n", - " str(reference_anno_file).endswith('annot.gz')):\n", - " # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n", - " target_files_direct = str(annotation_file).split(',')\n", - " N_targets = len(target_files_direct)\n", - " target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n", - " input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n", - " if len(chromosome) > 0:\n", - " input_chroms = [int(x) for x in chromosome]\n", - " else:\n", - " input_chroms = [0]\n", - "else:\n", - " # Case 2: txt list with #id and one or more 'path' columns\n", - " target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n", - " reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n", - " genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n", - "\n", - " target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n", - " reference_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n", - " genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n", - "\n", - " path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n", - " N_targets = len(path_columns)\n", - " target_names = path_columns[:] # 'path', 'path1', 'path2', ...\n", - "\n", - " for col in path_columns:\n", - " target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n", - " reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n", 
- " genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n", - "\n", - " merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n", - " if len(chromosome) > 0:\n", - " merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n", - "\n", - " rows = merged.values.tolist()\n", - " input_chroms = [r[0] for r in rows]\n", - " input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n", - "\n", - "# ---- Determine output format ----\n", - "use_print_snps = snp_list.is_file()\n", - "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n", - "\n", - "if ld_wind_kb > 0:\n", - " use_kb_window = True\n", - " ld_window_param = ld_wind_kb\n", - " ld_window_flag = \"--ld-wind-kb\"\n", - "else:\n", - " use_kb_window = False\n", - " ld_window_param = ld_wind_cm\n", - " ld_window_flag = \"--ld-wind-cm\"\n", - "\n", - "emit_single = compute_single\n", - "emit_joint = compute_joint and N_targets >= 2\n", - "\n", - "# ---- Build per-chromosome output list ----\n", - "def chrom_outputs(chrom):\n", - " outs = []\n", - " if emit_single:\n", - " for i in range(N_targets):\n", - " name = f\"{annotation_name}_single_{i+1}\"\n", - " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", - " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", - " if emit_joint:\n", - " name = f\"{annotation_name}_joint\"\n", - " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", - " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", - " return outs\n", - "\n", - "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n", - "\n", - "output: chrom_outputs(input_chroms[_index])\n", - "\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n", - "\n", - "# 
----------------------------------------------------------------------------\n", - "# Step A: write the requested .annot files for this chromosome.\n", - "# ----------------------------------------------------------------------------\n", - "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", - " library(data.table)\n", - "\n", - " clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n", - "\n", - " process_range_data <- function(data, chr_value) {\n", - " data$chr <- clean_chr(data$chr)\n", - " data <- data[data$chr == chr_value,]\n", - " if (nrow(data) == 0) return(NULL)\n", - " expanded <- lapply(seq_len(nrow(data)), function(j) {\n", - " row <- data[j,]\n", - " pos_seq <- seq(row$start, row$end - 1)\n", - " result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n", - " if (ncol(data) > 3) {\n", - " for (col in 4:ncol(data))\n", - " result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n", - " }\n", - " result\n", - " })\n", - " unique(rbindlist(expanded))\n", - " }\n", - "\n", - " process_annotation <- function(target_anno, ref_anno, score_column_value) {\n", - " target_anno <- as.data.frame(target_anno)\n", - " ref_anno <- as.data.frame(ref_anno)\n", - " target_anno$chr <- clean_chr(target_anno$chr)\n", - " ref_anno$CHR <- clean_chr(ref_anno$CHR)\n", - " chr_value <- unique(ref_anno$CHR)\n", - " anno_scores <- rep(0, nrow(ref_anno))\n", - " match_pos <- match(target_anno$pos, ref_anno$BP)\n", - " valid_pos <- as.numeric(na.omit(match_pos))\n", - " if (score_column_value <= ncol(target_anno)) {\n", - " anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n", - " } else {\n", - " anno_scores[valid_pos] <- 1\n", - " print(\"Warning: score column does not exist; setting scores to 1\")\n", - " }\n", - " anno_scores\n", - " }\n", - "\n", - " read_target_anno <- function(file_path, ref_anno) {\n", - " if (endsWith(file_path, \"rds\")) {\n", - " target_anno <- 
readRDS(file_path)\n", - " return(process_annotation(target_anno, ref_anno, ${score_column}))\n", - " }\n", - " target_anno <- fread(file_path)\n", - " if (${\"TRUE\" if is_range else \"FALSE\"}) {\n", - " names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n", - " target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n", - " if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n", - " } else {\n", - " names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n", - " }\n", - " process_annotation(target_anno, ref_anno, ${score_column})\n", - " }\n", - "\n", - " # ---- Read reference annotation ----\n", - " ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n", - " if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n", - "\n", - " # ---- Compute per-target annotation scores ----\n", - " target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n", - " N_local <- length(target_files)\n", - " score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n", - "\n", - " emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n", - " emit_joint_local <- ${\"TRUE\" if emit_joint else \"FALSE\"}\n", - " use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n", - " bfile_prefix <- \"${_input[-1]:na}\"\n", - "\n", - " # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n", - " # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n", - " normalize_for_ldsc <- function(df) {\n", - " if (!use_print_snps_local) return(df)\n", - " df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n", - " annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n", - " bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n", - " col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n", - " bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n", - " idx <- 
match(bim$SNP, df$SNP)\n", - " out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n", - " stringsAsFactors = FALSE)\n", - " for (col in annot_cols) {\n", - " v <- rep(0, nrow(bim))\n", - " non_na <- !is.na(idx)\n", - " v[non_na] <- df[[col]][idx[non_na]]\n", - " out[[col]] <- v\n", - " }\n", - " out\n", - " }\n", - "\n", - " # ---- Write N single-target .annot files (when requested) ----\n", - " if (emit_single_local) {\n", - " for (i in seq_len(N_local)) {\n", - " out_anno <- ref_anno\n", - " out_anno$ANNOT <- score_list[[i]]\n", - " out_anno <- normalize_for_ldsc(out_anno)\n", - " name <- paste0(\"${annotation_name}\", \"_single_\", i)\n", - " out_path_gz <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n", - " out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n", - " dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n", - " fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", - " }\n", - " }\n", - "\n", - " # ---- Optionally write joint .annot ----\n", - " if (emit_joint_local) {\n", - " joint_anno <- ref_anno\n", - " for (i in seq_len(N_local)) {\n", - " joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n", - " }\n", - " joint_anno <- normalize_for_ldsc(joint_anno)\n", - " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", - " joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n", - " joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n", - " dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n", - " fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", - " }\n", - "\n", - "# ----------------------------------------------------------------------------\n", - "# Step B: gzip all annot files. 
Uses expand=\"$[ ]\" so bash ${var} survives.\n", - "# ----------------------------------------------------------------------------\n", - "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", - " set -e\n", - " annots=()\n", - " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", - " for i in $(seq 1 $[N_targets]); do\n", - " annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n", - " done\n", - " fi\n", - " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", - " annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n", - " fi\n", - " for a in \"${annots[@]}\"; do\n", - " gzip -f \"$a\"\n", - " done\n", - "\n", - "# ----------------------------------------------------------------------------\n", - "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n", - "# ----------------------------------------------------------------------------\n", - "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n", - " set -e\n", - " chrom=\"$[input_chroms[_index]]\"\n", - "\n", - " run_polyfun() {\n", - " local annot=\"$1\"\n", - " local out_prefix=\"$2\"\n", - " if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n", - " $[python_exec] $[polyfun_path]/ldsc.py \\\n", - " --print-snps $[snp_list] \\\n", - " $[ld_window_flag] $[ld_window_param] \\\n", - " --out \"$out_prefix\" \\\n", - " --bfile $[_input[-1]:nar] \\\n", - " --yes-really \\\n", - " --annot \"$annot\" \\\n", - " --l2\n", - " else\n", - " $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n", - " --annot \"$annot\" \\\n", - " --bfile $[_input[-1]:nar] \\\n", - " $[ld_window_flag] $[ld_window_param] \\\n", - " --out \"${out_prefix}.$[ldscore_ext]\" \\\n", - " --allow-missing\n", - " fi\n", - " }\n", - "\n", - " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", - " for i in $(seq 1 $[N_targets]); do\n", - " 
name=\"$[annotation_name]_single_$i\"\n", - " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", - " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", - " run_polyfun \"$annot\" \"$prefix\"\n", - " done\n", - " fi\n", - " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", - " name=\"$[annotation_name]_joint\"\n", - " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", - " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", - " run_polyfun \"$annot\" \"$prefix\"\n", - " fi\n", - "\n", - "# ----------------------------------------------------------------------------\n", - "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n", - "# ----------------------------------------------------------------------------\n", - "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n", - " suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n", - " use_print_snps <- ${str(use_print_snps).upper()}\n", - "\n", - " chrom <- \"${input_chroms[_index]}\"\n", - " # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n", - " frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n", - " has_frq <- file.exists(frq_file)\n", - " frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n", - "\n", - " write_M_files <- function(annot_path, ldscore_path, m_path) {\n", - " if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n", - " cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n", - " }\n", - " ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n", - " suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n", - " } else fread(ldscore_path)\n", - " annot_dt <- fread(annot_path)\n", - " annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n", - " merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n", - " 
std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n", - " annot_cols <- setdiff(names(merged), std_cols)\n", - " if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n", - " M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", - " writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n", - " if (has_frq) {\n", - " common <- merged[!is.na(MAF) & MAF > 0.05, ]\n", - " M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", - " writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n", - " }\n", - " }\n", - "\n", - " targets <- c()\n", - " if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n", - " for (i in seq_len(${N_targets})) {\n", - " targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n", - " }\n", - " }\n", - " if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n", - " targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n", - " }\n", - " for (name in targets) {\n", - " annot_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n", - " ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n", - " m_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n", - " write_M_files(annot_path, ldscore_path, m_path)\n", - " }\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "Python 3 (ipykernel)" - }, - "source": [ - "## Calculate Functional Enrichment using Annotations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[get_heritability]\n", - "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n", - "# each (trait, target_dir) pair becomes one polyfun invocation. 
Outputs go to\n", - "# //.{results,log,part_delete}.\n", - "#\n", - "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n", - "# typically N _single_ directories plus optionally one _joint directory.\n", - "\n", - "#\n", - "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n", - "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n", - "# always gets exactly two LD-score sources, in this order:\n", - "# \"/.\" (index 0) , \"/\" (index 1)\n", - "# With --overlap-annot, every annotation column in the .results \"Category\" is\n", - "# named _:\n", - "# index 0 = the target file -> \"ANNOT_0\" (no-snplist; compute_ldscores.py keeps the annot col name)\n", - "# -> \"L2_0\" (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n", - "# -> \"ANNOT_1_0\",\"ANNOT_2_0\" (no-snplist joint dir, N>=2 annot cols)\n", - "# -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\" (snplist joint dir, N>=2 -> \"L2\")\n", - "# index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ... (the 97 baseline annots)\n", - "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n", - "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n", - "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's\n", - "# \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n", - "# column name.) 
[postprocess] auto-detects the target Category; if you instead pass\n", - "# --target-categories, the names must match this column exactly.\n", - "#\n", - "parameter: target_anno_dirs = paths()\n", - "parameter: all_traits = []\n", - "\n", - "import os\n", - "\n", - "with open(all_traits_file, 'r') as f:\n", - " trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n", - "\n", - "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n", - "input_list = []\n", - "target_meta = []\n", - "for td in target_anno_dirs:\n", - " for t in trait_paths:\n", - " input_list.append(t)\n", - " target_meta.append(str(td))\n", - "\n", - "input: input_list, group_by = 1, group_with = \"target_meta\"\n", - "\n", - "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\", \\\n", - " f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n", - "\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", - "\n", - "bash: expand = \"${ }\"\n", - " target_dir=\"${target_meta[_index]}\"\n", - " target_name=\"$(basename ${target_meta[_index]})\"\n", - " trait=\"$(basename ${_input[0]})\"\n", - " output_dir=\"${cwd:a}/$target_name\"\n", - " mkdir -p \"$output_dir\"\n", - "\n", - " # MAF cutoff handling. 
Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n", - " # other values would require recomputing LD scores at that cutoff.\n", - " frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n", - " if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n", - " echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n", - " frq_option=\"--not-M-5-50\"\n", - " elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n", - " if [ -f \"$frq_file_check\" ]; then\n", - " echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n", - " frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n", - " else\n", - " echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n", - " echo \" but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n", - " echo \" Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n", - " exit 1\n", - " fi\n", - " else\n", - " echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n", - " echo \" 0.05 (sLDSC default) are accepted. 
Other values would require\"\n", - " echo \" recomputing LD scores at that cutoff.\"\n", - " exit 1\n", - " fi\n", - "\n", - " run_ldsc() {\n", - " local extra_args=\"$1\"\n", - " ${python_exec} ${polyfun_path}/ldsc.py \\\n", - " --h2 ${sumstat_dir}/$trait \\\n", - " --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n", - " --out \"$output_dir/$trait\" \\\n", - " --overlap-annot \\\n", - " --w-ld-chr ${weights_dir}/${weight_name} \\\n", - " $frq_option \\\n", - " --print-coefficients \\\n", - " --print-delete-vals \\\n", - " --n-blocks ${n_blocks} \\\n", - " $extra_args\n", - " }\n", - "\n", - " run_ldsc \"\"\n", - " log_file=\"$output_dir/$trait.log\"\n", - "\n", - " # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n", - " for max in 30 20 10; do\n", - " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", - " echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n", - " run_ldsc \"--chisq-max $max\"\n", - " else\n", - " break\n", - " fi\n", - " done\n", - "\n", - " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", - " echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n", - " echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n", - " fi\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[munge_sumstats_polyfun]\n", - "parameter: sumstats = path\n", - "parameter: n = 0\n", - "parameter: min_info = 0.6\n", - "parameter: min_maf = 0.001\n", - "parameter: keep_hla = False\n", - "parameter: chi2_cut = 30\n", - "input: sumstats\n", - "output: f\"{_input:n}.munged.parquet\"\n", - "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n", - " 
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n", - " --sumstats {_input} \\\n", - " --out {_output} \\\n", - " {'--n {}'.format(n) if n>0 else ''} \\\n", - " {'--min-info {}'.format(min_info)} \\\n", - " {'--min-maf {}'.format(min_maf)} \\\n", - " {'--chi2-cutoff {}'.format(chi2_cut)} \\\n", - " {'--keep-hla' if keep_hla else ''} \\\n", - " --remove-strand-ambig" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[postprocess]\n", - "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n", - "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n", - "# single-target and (when present) joint-target runs, computes Gazal-style\n", - "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n", - "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n", - "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n", - "\n", - "parameter: traits_file = path() # text file: one trait sumstats filename per line\n", - "parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains _single_/ subdirs and optionally _joint/)\n", - "parameter: target_categories = [] # target annotation names. 
Auto-detected from the joint-run results if empty.\n", - "parameter: target_categories_label = [] # optional display names, same order as target_categories;\n", - " # when given, every \"target\" column / tau*-block colname in\n", - " # the output RDS is renamed to these (params$target_categories\n", - " # holds the labels, params$target_categories_orig the originals).\n", - "parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n", - "\n", - "input: traits_file\n", - "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", - "\n", - "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", - " library(pecotmr)\n", - "\n", - " traits <- readLines(\"${traits_file}\")\n", - " target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n", - " target_lab <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n", - "\n", - " # Auto-detect single-target and joint-target output directories.\n", - " her_root <- \"${heritability_cwd}\"\n", - " all_subdirs <- list.dirs(her_root, recursive = FALSE)\n", - " single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n", - " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", - " single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n", - " single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n", - " single_dirs <- single_dirs[order(single_indices)]\n", - " joint_dir <- file.path(her_root, joint_name)\n", - " has_joint <- dir.exists(joint_dir)\n", - "\n", - " message(sprintf(\"Detected %d single-target dirs%s\",\n", - " length(single_dirs),\n", - " if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n", - "\n", - " # Build per-trait prefix 
maps. Each trait's polyfun output is at /\n", - " # (polyfun appends .results / .log / .part_delete).\n", - " trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n", - " names(trait_single_prefixes) <- traits\n", - "\n", - " if (has_joint) {\n", - " trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n", - " } else {\n", - " trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n", - " }\n", - "\n", - " res <- sldsc_postprocessing_pipeline(\n", - " trait_single_prefixes = trait_single_prefixes,\n", - " trait_joint_prefix = trait_joint_prefix,\n", - " target_anno_dir = \"${target_anno_dir}\",\n", - " frqfile_dir = \"${frqfile_dir}\",\n", - " plink_name = \"${plink_name}\",\n", - " maf_cutoff = ${maf_cutoff},\n", - " target_categories = if (length(target_cats) > 0) target_cats else NULL,\n", - " target_labels = if (length(target_lab) > 0) target_lab else NULL\n", - " )\n", - "\n", - " saveRDS(res, \"${_output[0]}\")\n", - " message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[meta_subset]\n", - "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n", - "# the cached per-trait standardized results from [postprocess]. 
No regression rerun.\n", - "\n", - "parameter: postprocess_rds = path() # output of [postprocess]\n", - "parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]\n", - "parameter: subset_name = str # label used in the output filename\n", - "parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output\n", - "# If [postprocess] was run with --target-categories-label, the cached RDS already\n", - "# carries the display names (params$target_categories = the labels), so leave\n", - "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n", - "\n", - "input: postprocess_rds, subset_traits_file\n", - "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", - "\n", - "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", - " library(pecotmr)\n", - "\n", - " res <- readRDS(\"${postprocess_rds}\")\n", - " subset_traits <- readLines(\"${subset_traits_file}\")\n", - " target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n", - " if (length(target_cats) == 0) target_cats <- res$params$target_categories\n", - "\n", - " subset_per_trait <- res$per_trait[subset_traits]\n", - "\n", - " # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n", - " view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n", - " view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n", - "\n", - " out <- list(\n", - " tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"tau_star\")), target_cats),\n", - " tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, \"tau_star\")), target_cats),\n", - " enrichment = setNames(lapply(target_cats, function(c) 
meta_sldsc_random(view_single, c, \"enrichment\")), target_cats),\n", - " enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichstat\")), target_cats)\n", - " )\n", - "\n", - " saveRDS(out, \"${_output[0]}\")\n", - " message(\"Subset meta complete; results written to ${_output[0]}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "SoS", - "language": "sos", - "name": "sos" - }, - "language_info": { - "codemirror_mode": "sos", - "file_extension": ".sos", - "mimetype": "text/x-sos", - "name": "sos", - "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", - "pygments_lexer": "sos" - }, - "sos": { - "kernels": [ - [ - "Bash", - "calysto_bash", - "Bash", - "#E6EEFF", - "shell" - ], - [ - "R", - "ir", - "R", - "#DCDCDA", - "r" - ], - [ - "SoS", - "sos", - "", - "", - "sos" - ] - ], - "panel": { - "displayed": true, - "height": 0 - }, - "version": "0.22.4" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} From d1284945db46c6ee4175d7ea3c8149c0b96c9141 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 23 Jun 2026 12:11:46 -0400 Subject: [PATCH 6/6] remove absolute local path --- code/SoS/enrichment/sldsc_enrichment.ipynb | 1472 ++++++++++++++++++++ 1 file changed, 1472 insertions(+) create mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb new file mode 100644 index 00000000..8b352789 --- /dev/null +++ b/code/SoS/enrichment/sldsc_enrichment.ipynb @@ -0,0 +1,1472 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Stratified LD Score Regression (S-LDSC) Enrichment\n", + "\n", + "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. 
The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n", + "\n", + "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Description\n", + "This notebook implements the [S-LDSC](https://www.nature.com/articles/ng.3404) pipeline for LD-score computation and functional enrichment analysis.\n", + "\n", + "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not from the original LDSC in the `bulik/ldsc` GitHub repo.**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "Markdown" + }, + "source": [ + "The pipeline uses GWAS summary statistics together with annotation and LD reference-panel data to estimate per-SNP heritability enrichment for each annotation. It supports single-annotation analysis (each annotation's contribution on its own) and joint multi-annotation analysis (each annotation's contribution on top of the other annotations in the model).\n", + "\n", + "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. 
population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Methods - Workflow Overview\n", + "\n", + "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n", + "\n", + "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n", + "\n", + "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing 
three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n", + "\n", + "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n", + "\n", + "```r\n", + "res <- readRDS(\"sldsc_results.rds\")\n", + "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", + "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", + " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", + ")\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Theory\n", + "\n", + "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### LDSC model\n", + "\n", + "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n", + "\n", + "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n", + "\n", + "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. 
(2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.\n", + "\n", + "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Stratified LDSC\n", + "\n", + "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n", + "\n", + "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n", + "\n", + "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n", + "\n", + "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n", + "\n", + "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n", + "- **Continuous annotation** (e.g. 
gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n", + "\n", + "Under a polygenic model the per-SNP heritability for SNP $j$ is\n", + "\n", + "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n", + "\n", + "and the expected $\\chi^2$ statistic of SNP $j$ is\n", + "\n", + "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n", + "\n", + "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n", + "\n", + "Interpretation of $\\tau_C$:\n", + "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n", + "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n", + "\n", + "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Tau Estimation and Enrichment Analysis\n", + "\n", + "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n", + "\n", + "The pipeline has two computational layers:\n", + "\n", + "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. 
We do not re-implement this.\n", + "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n", + "\n", + "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n", + "\n", + "#### Notation\n", + "\n", + "For each annotation $C$ we use:\n", + "\n", + "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n", + "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n", + "\n", + "#### Reference panel and MAF cutoff\n", + "\n", + "All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n", + "\n", + "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n", + "\n", + "#### Quantities from the regression layer (polyfun)\n", + "\n", + "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. 
From each polyfun run we obtain, per annotation:\n", + "\n", + "- $\\tau_C$ and its standard error — **(polyfun)**.\n", + "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n", + "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n", + "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n", + "\n", + "We also obtain, per run:\n", + "\n", + "- The total trait heritability $h^2_g$ — **(polyfun)**.\n", + "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n", + "\n", + "#### Quantities from the post-processing layer (pecotmr)\n", + "\n", + "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n", + "\n", + "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n", + "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n", + "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n", + "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n", + "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n", + "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n", + "\n", + "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n", + "\n", + "#### Standardized tau ($\\tau^*$) — (pecotmr)\n", + "\n", + "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 
2017)\n", + "\n", + "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n", + "\n", + "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n", + "\n", + "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n", + "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n", + "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n", + "\n", + "#### Differential per-SNP heritability (\"EnrichStat\") — (polyfun + pecotmr)\n", + "\n", + "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n", + "\n", + "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n", + "\n", + "The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as `Enrichment_p`. 
Its standard error is recovered from the reported p-value:\n", + "\n", + "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n", + "\n", + "This per-trait point + SE is the input to cross-trait meta-analysis.\n", + "\n", + "#### Reporting: binary vs. continuous annotations — (pecotmr)\n", + "\n", + "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n", + "\n", + "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n", + "\n", + "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n", + "\n", + "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n", + "\n", + "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n", + ">\n", + "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. 
Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n", + ">\n", + "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n", + "\n", + "#### Cross-type comparison: always use $\\tau^*_C$ — (pecotmr)\n", + "\n", + "For an apples-to-apples comparison **across binary and continuous annotations** — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion of reference SNPs in the category) and $sd_C$ is the empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n", + "\n", + "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n", + "\n", + "#### Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)\n", + "\n", + "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. 
With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n", + "\n", + "### Meta-Analysis across Traits (Random Effects) — (pecotmr)\n", + "\n", + "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n", + "\n", + "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n", + "\n", + "where $\\hat\\theta_i$ is the per-trait estimate and $SE_i$ its standard error:\n", + "\n", + "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n", + "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n", + "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n", + "\n", + "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). 
The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n", + "\n", + "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Minimal Working Example (MWE)\n", + "\n", + "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1. `make_annotation_files_ldscore`\n", + "\n", + "*Annotation preparation and S-LDSC regression (polyfun).* This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "#### **Inputs**\n", + "\n", + "##### 1. Target Annotation File\n", + "\n", + "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n", + "- **Formats**:\n", + " - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n", + " - Alternatively, files for specific chromosomes can be provided directly.\n", + " - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). 
Single-target and joint-target analyses are produced automatically in one pipeline pass.\n", + " - **Format** (the score column is optional; if absent, score is set to 1):\n", + " - `is_range = False`:\n", + " ```\n", + " chr pos score\n", + " 1 10001 1\n", + " 1 10002 1\n", + " ```\n", + " - `is_range = True`:\n", + " ```\n", + " chr start end score\n", + " 1 10001 20001 1\n", + " 1 30001 40001 1\n", + " ```\n", + "\n", + "##### 2. Reference Annotation File (baseline-LD)\n", + "\n", + "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n", + "- **Formats**:\n", + " - Text file listing baseline annotation files for all chromosomes.\n", + " - Alternatively, files for specific chromosomes can be provided directly.\n", + "\n", + "##### 3. Genome Reference File\n", + "\n", + "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n", + "- **Formats**:\n", + " - Text file listing per-chromosome reference files, or files for specific chromosomes.\n", + "\n", + "##### 4. SNP List\n", + "\n", + "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n", + "- **Format**: A list of `rsid`s, one per line.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. 
The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mmake_annotation_files_ldscore\u001b[0m: \n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=3) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=5) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=6) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=4) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=7) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=9) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=10) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=8) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=11) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=14) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=13) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=12) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=15) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=18) is 
\u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=16) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=17) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=19) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=21) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=20) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... (66 items in 22 groups)\u001b[0m\n", + "INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n", + " --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n", + " --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n", + " --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n", + " --annotation_name protocol_example \\\n", + " --plink_name reference. --baseline_name annotations. --weight_name weights. 
\\\n", + " --python_exec python \\\n", + " --polyfun_path polyfun \\\n", + " --cwd output/sldsc_ldscore -j 4\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Munge summary statistics (preprocessing, run before Step 2)\n", + "\n", + "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n", + "# --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n", + "# --n 0 \\\n", + "# --min-info 0.6 \\\n", + "# --min-maf 0.001 \\\n", + "# --chi2-cutoff 30 \\\n", + "# --polyfun_path data/github/polyfun \\\n", + "# --cwd data/polyfun_new/example_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 2. `get_heritability`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "**Inputs**\n", + "\n", + "##### 1. Allele Frequency Files (`.frq`, our panel)\n", + "\n", + "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n", + "\n", + "##### 2. GWAS Summary Statistics\n", + "\n", + "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). 
The pipeline runs the regression once per trait per single/joint mode.\n", + "- **Format**:\n", + " ```\n", + " CAD_META.filtered.sumstats.gz\n", + " UKB.Lym.BOLT.sumstats.gz\n", + " ```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mget_heritability\u001b[0m: \n", + "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", + "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", + "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n", + "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", + "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", + "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n", + "INFO: \u001b[32mget_heritability\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mget_heritability\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mget_heritability\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mget_heritability\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... 
(6 items in 3 groups)\u001b[0m\n", + "INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n", + " --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n", + " --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", + " --sumstat_dir input/enrichment/sldsc \\\n", + " --baseline_ld_dir input/enrichment/sldsc \\\n", + " --weights_dir input/enrichment/sldsc \\\n", + " --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n", + " --annotation_name protocol_example --python_exec python \\\n", + " --polyfun_path ../polyfun \\\n", + " --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 3. Post-processing (pecotmr) and meta-analysis\n", + "\n", + "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n", + "\n", + "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n", + "\n", + "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n", + "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n", + "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n", + "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n", + "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n", + "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n", + "\n", + 
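"As a worked illustration of the standardization step in the list above (a sketch with purely hypothetical numbers, not values from this run), take a binary annotation covering 1% of reference SNPs, so that $sd_C = \\sqrt{0.01 \\times 0.99} \\approx 0.0995$, and assume $\\tau_C = 10^{-7}$, $M_{\\mathrm{ref}} = 5 \\times 10^{6}$ and $h^2_g = 0.25$:\n", + "\n", + "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g} \\;=\\; 10^{-7} \\cdot \\frac{0.0995 \\times 5 \\times 10^{6}}{0.25} \\;\\approx\\; 0.199$$\n", + "\n", + "Under these assumed inputs, a one-standard-deviation increase in the annotation corresponds to a change in per-SNP heritability of roughly 0.2 times the genome-wide average. The same factor $sd_C \\cdot M_{\\mathrm{ref}} / h^2_g$ is applied to each per-block jackknife $\\tau_C$ value to obtain the per-block $\\tau^*_C$ used for the jackknife SE.\n", + "\n", + 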
"Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n", + "\n", + "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n", + "(which contains the $N$ single-target subdirectories and optionally the\n", + "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n", + "to produce per-trait standardized tables and the default random-effects\n", + "meta across all traits.\n", + "\n", + "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output — e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. 
Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mpostprocess\u001b[0m: \n", + "INFO: \u001b[32mpostprocess\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mpostprocess\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds\u001b[0m\n", + "INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n", + " --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n", + " --heritability_cwd output/sldsc_heritability \\\n", + " --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n", + " --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n", + " --annotation_name protocol_example --python_exec python \\\n", + " --polyfun_path ../polyfun \\\n", + " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 4. Subset Meta-Analysis (`pecotmr::meta_sldsc_random`) (optional)\n", + "\n", + "To re-run the meta from R on a subset of traits (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:\n", + "\n", + "\n", + "```r\n", + "res <- readRDS(\"sldsc_results.rds\")\n", + "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n", + "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n", + " res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n", + ")\n", + "```\n", + "\n", + "This call is lightweight and can be run interactively.\n", + "\n", + "\n", + "The default meta in Step 3 pools all traits supplied to `[postprocess]`. 
Use `[meta_subset]` to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset meta operates on the cached `.sldsc_postprocess.rds` output; it is lightweight and can be run interactively or in batch.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "kernel": "Bash" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n", + " import pkg_resources\n", + "INFO: Running \u001b[32mmeta_subset\u001b[0m: \n", + "INFO: \u001b[32mmeta_subset\u001b[0m is \u001b[32mcompleted\u001b[0m.\n", + "INFO: \u001b[32mmeta_subset\u001b[0m output: \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds\u001b[0m\n", + "INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.\n" + ] + } + ], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n", + " --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n", + " --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n", + " --subset_name category1 --target_categories ANNOT_0 \\\n", + " --annotation_name protocol_example --python_exec python \\\n", + " --polyfun_path ../polyfun \\\n", + " --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Output\n", + "\n", + "### Output summary\n", + "\n", + "| Stage | Cached on disk | Recomputable 
from | Purpose |\n", + "|---|---|---|---|\n", + "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n", + "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n", + "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n", + "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n", + "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Per-stage outputs\n", + "\n", + "Each workflow writes into its `--cwd`:\n", + "\n", + "- **make_annotation_files_ldscore**: polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n", + "- **get_heritability**: per trait and per target directory, the S-LDSC regression outputs `.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_<index>` suffix (`_0` for the target, `_1` for the baseline).\n", + "- **postprocess**: a single `.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).\n", + "- **meta_subset**: a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Anticipated Results\n", + "\n", + "Produces per-annotation enrichment statistics (tau, enrichment, p-value) from stratified LD score regression." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface\n", + "\n", + "List all workflows and their options:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "Bash" + }, + "outputs": [], + "source": [ + "sos run pipeline/sldsc_enrichment.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Workflow implementation\n", + "\n", + "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "# Path to the work directory of the analysis.\n", + "parameter: cwd = path('output')\n", + "# Prefix for the analysis output\n", + "parameter: annotation_name = str\n", + "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n", + "parameter: polyfun_path = path # e.g. \"/home/you/tools/polyfun\"\n", + "\n", + "# MAF cutoff for sLDSC. 
Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n", + "# and HapMap3-style regression weights are common-variant by construction).\n", + "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n", + "# Other values would require recomputing LD scores at that cutoff.\n", + "parameter: maf_cutoff = 0.05\n", + "\n", + "# for make_annotation_files_ldscore workflow:\n", + "parameter: annotation_file = path()\n", + "parameter: reference_anno_file = path()\n", + "parameter: genome_ref_file = path() # with .bed\n", + "parameter: chromosome = []\n", + "parameter: snp_list = path()\n", + "parameter: ld_wind_kb = 0 # use kb if the value is provided\n", + "parameter: ld_wind_cm = 1.0 # default using ld_wind_cm\n", + "\n", + "# for get_heritability workflow.\n", + "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n", + "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n", + "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n", + "parameter: all_traits_file = path() # txt file with one GWAS summary statistics filename per row, e.g.
CAD_META.filtered.sumstats.gz\n", + "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n", + "parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore\n", + "parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)\n", + "parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)\n", + "parameter: plink_name = \"ADSP_chr\"\n", + "parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)\n", + "parameter: baseline_name = \"baseline_chr\" # Prefix of baseline annotation files\n", + "parameter: weight_name = \"weights_chr\" # Prefix of LD weights files\n", + "parameter: n_blocks = 200\n", + "\n", + "# Number of threads\n", + "parameter: numThreads = 16\n", + "# For cluster jobs, number of commands to run per job\n", + "parameter: job_size = 1\n", + "parameter: walltime = '12h'\n", + "parameter: mem = '16G'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "Python 3 (ipykernel)" + }, + "source": [ + "## Make Annotation File" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[make_annotation_files_ldscore]\n", + "# Annotation preparation. 
Takes one annotation_file with N target annotations\n", + "# and produces, in one invocation, any combination of:\n", + "# - N single-target LD-score directories (when compute_single = TRUE, default)\n", + "# - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n", + "# and N >= 2, default)\n", + "#\n", + "# Outputs per chromosome <chrom>:\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.annot.gz (i in 1..N, when compute_single)\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.l2.ldscore.{parquet|gz}\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.l2.M\n", + "# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chrom>.l2.M_5_50 (when .frq present)\n", + "#\n", + "# <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chrom>.{...} (when compute_joint and N>=2)\n", + "#\n", + "# Workflows:\n", + "# - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n", + "# Produces both, fits the case where you have already chosen the joint set.\n", + "# - Workflow B (\"exploratory then conditional\"):\n", + "# Step 1: compute_single=TRUE, compute_joint=FALSE.\n", + "# Run on N candidate annotations -> N single-target dirs.\n", + "# Inspect single-target results, identify K significant ones.\n", + "# Step 2: compute_single=FALSE, compute_joint=TRUE.\n", + "# Run on a NEW annotation_file with the K selected annotations\n", + "# -> 1 joint dir with the conditional model.\n", + "\n", + "#\n", + "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n", + "# column name, and the CM requirement ---\n", + "# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz\n", + "# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet\n", + "#\n", + "# LD-score column name (this is what becomes the .results \"Category\" in\n", + "# [get_heritability], with a \"_<index>\" suffix appended there):\n", + "# * compute_ldscores.py ALWAYS keeps the annot column name(s):\n", + "# single annot column \"ANNOT\" -> ldscore column \"ANNOT\"\n", + "# joint annot columns \"ANNOT_1\",\"ANNOT_2\",... 
-> \"ANNOT_1\",\"ANNOT_2\",...\n", + "# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n", + "# HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n", + "# column name. With >=2 annotations it uses \"<annot>L2\"\n", + "# (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n", + "# => a single-target snplist run reports \"L2_0\" in .results, while a\n", + "# single-target no-snplist run reports \"ANNOT_0\". [postprocess] auto-\n", + "# detects either; only matters if you pass --target-categories explicitly.\n", + "#\n", + "# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the\n", + "# target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n", + "# the plink .bim (same SNP set, same row order). This step handles both\n", + "# internally (normalize_for_ldsc: takes CM from the .bim 3rd column, re-expands\n", + "# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n", + "# carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n", + "# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n", + "#\n", + "parameter: compute_single = True\n", + "parameter: compute_joint = True\n", + "parameter: score_column = 3\n", + "parameter: is_range = False\n", + "\n", + "import pandas as pd\n", + "import os\n", + "\n", + "if not (compute_single or compute_joint):\n", + " raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE\")\n", + "\n", + "def adapt_file_path(file_path, reference_file):\n", + " reference_path = os.path.dirname(reference_file)\n", + " if os.path.isfile(file_path):\n", + " return file_path\n", + " file_name = os.path.basename(file_path)\n", + " if os.path.isfile(file_name):\n", + " return file_name\n", + " file_in_ref_dir = os.path.join(reference_path, file_name)\n", + " if os.path.isfile(file_in_ref_dir):\n", + " return file_in_ref_dir\n", + " file_prefixed = 
os.path.join(reference_path, file_path)\n", + " if os.path.isfile(file_prefixed):\n", + " return file_prefixed\n", + " raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n", + "\n", + "\n", + "# ---- Parse inputs and determine N ----\n", + "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n", + " str(reference_anno_file).endswith('annot.gz')):\n", + " # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n", + " target_files_direct = str(annotation_file).split(',')\n", + " N_targets = len(target_files_direct)\n", + " target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n", + " input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n", + " if len(chromosome) > 0:\n", + " input_chroms = [int(x) for x in chromosome]\n", + " else:\n", + " input_chroms = [0]\n", + "else:\n", + " # Case 2: txt list with #id and one or more 'path' columns\n", + " target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n", + " reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n", + " genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n", + "\n", + " target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n", + " reference_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n", + " genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n", + "\n", + " path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n", + " N_targets = len(path_columns)\n", + " target_names = path_columns[:] # 'path', 'path1', 'path2', ...\n", + "\n", + " for col in path_columns:\n", + " target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n", + " reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n", 
+ " genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n", + "\n", + " merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n", + " if len(chromosome) > 0:\n", + " merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n", + "\n", + " rows = merged.values.tolist()\n", + " input_chroms = [r[0] for r in rows]\n", + " input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n", + "\n", + "# ---- Determine output format ----\n", + "use_print_snps = snp_list.is_file()\n", + "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n", + "\n", + "if ld_wind_kb > 0:\n", + " use_kb_window = True\n", + " ld_window_param = ld_wind_kb\n", + " ld_window_flag = \"--ld-wind-kb\"\n", + "else:\n", + " use_kb_window = False\n", + " ld_window_param = ld_wind_cm\n", + " ld_window_flag = \"--ld-wind-cm\"\n", + "\n", + "emit_single = compute_single\n", + "emit_joint = compute_joint and N_targets >= 2\n", + "\n", + "# ---- Build per-chromosome output list ----\n", + "def chrom_outputs(chrom):\n", + " outs = []\n", + " if emit_single:\n", + " for i in range(N_targets):\n", + " name = f\"{annotation_name}_single_{i+1}\"\n", + " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", + " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", + " if emit_joint:\n", + " name = f\"{annotation_name}_joint\"\n", + " prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n", + " outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n", + " return outs\n", + "\n", + "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n", + "\n", + "output: chrom_outputs(input_chroms[_index])\n", + "\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n", + "\n", + "# 
----------------------------------------------------------------------------\n", + "# Step A: write the requested .annot files for this chromosome.\n", + "# ----------------------------------------------------------------------------\n", + "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", + " library(data.table)\n", + "\n", + " clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n", + "\n", + " process_range_data <- function(data, chr_value) {\n", + " data$chr <- clean_chr(data$chr)\n", + " data <- data[data$chr == chr_value,]\n", + " if (nrow(data) == 0) return(NULL)\n", + " expanded <- lapply(seq_len(nrow(data)), function(j) {\n", + " row <- data[j,]\n", + " pos_seq <- seq(row$start, row$end - 1)\n", + " result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n", + " if (ncol(data) > 3) {\n", + " for (col in 4:ncol(data))\n", + " result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n", + " }\n", + " result\n", + " })\n", + " unique(rbindlist(expanded))\n", + " }\n", + "\n", + " process_annotation <- function(target_anno, ref_anno, score_column_value) {\n", + " target_anno <- as.data.frame(target_anno)\n", + " ref_anno <- as.data.frame(ref_anno)\n", + " target_anno$chr <- clean_chr(target_anno$chr)\n", + " ref_anno$CHR <- clean_chr(ref_anno$CHR)\n", + " chr_value <- unique(ref_anno$CHR)\n", + " anno_scores <- rep(0, nrow(ref_anno))\n", + " match_pos <- match(target_anno$pos, ref_anno$BP)\n", + " valid_pos <- as.numeric(na.omit(match_pos))\n", + " if (score_column_value <= ncol(target_anno)) {\n", + " anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n", + " } else {\n", + " anno_scores[valid_pos] <- 1\n", + " print(\"Warning: score column does not exist; setting scores to 1\")\n", + " }\n", + " anno_scores\n", + " }\n", + "\n", + " read_target_anno <- function(file_path, ref_anno) {\n", + " if (endsWith(file_path, \"rds\")) {\n", + " target_anno <- 
readRDS(file_path)\n", + " return(process_annotation(target_anno, ref_anno, ${score_column}))\n", + " }\n", + " target_anno <- fread(file_path)\n", + " if (${\"TRUE\" if is_range else \"FALSE\"}) {\n", + " names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n", + " target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n", + " if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n", + " } else {\n", + " names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n", + " }\n", + " process_annotation(target_anno, ref_anno, ${score_column})\n", + " }\n", + "\n", + " # ---- Read reference annotation ----\n", + " ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n", + " if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n", + "\n", + " # ---- Compute per-target annotation scores ----\n", + " target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n", + " N_local <- length(target_files)\n", + " score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n", + "\n", + " emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n", + " emit_joint_local <- ${\"TRUE\" if emit_joint else \"FALSE\"}\n", + " use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n", + " bfile_prefix <- \"${_input[-1]:na}\"\n", + "\n", + " # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n", + " # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n", + " normalize_for_ldsc <- function(df) {\n", + " if (!use_print_snps_local) return(df)\n", + " df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n", + " annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n", + " bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n", + " col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n", + " bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n", + " idx <- 
match(bim$SNP, df$SNP)\n", + " out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n", + " stringsAsFactors = FALSE)\n", + " for (col in annot_cols) {\n", + " v <- rep(0, nrow(bim))\n", + " non_na <- !is.na(idx)\n", + " v[non_na] <- df[[col]][idx[non_na]]\n", + " out[[col]] <- v\n", + " }\n", + " out\n", + " }\n", + "\n", + " # ---- Write N single-target .annot files (when requested) ----\n", + " if (emit_single_local) {\n", + " for (i in seq_len(N_local)) {\n", + " out_anno <- ref_anno\n", + " out_anno$ANNOT <- score_list[[i]]\n", + " out_anno <- normalize_for_ldsc(out_anno)\n", + " name <- paste0(\"${annotation_name}\", \"_single_\", i)\n", + " out_path_gz <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n", + " out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n", + " dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n", + " fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", + " }\n", + " }\n", + "\n", + " # ---- Optionally write joint .annot ----\n", + " if (emit_joint_local) {\n", + " joint_anno <- ref_anno\n", + " for (i in seq_len(N_local)) {\n", + " joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n", + " }\n", + " joint_anno <- normalize_for_ldsc(joint_anno)\n", + " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", + " joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n", + " joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n", + " dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n", + " fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n", + " }\n", + "\n", + "# ----------------------------------------------------------------------------\n", + "# Step B: gzip all annot files. 
Uses expand=\"$[ ]\" so bash ${var} survives.\n", + "# ----------------------------------------------------------------------------\n", + "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n", + " set -e\n", + " annots=()\n", + " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", + " for i in $(seq 1 $[N_targets]); do\n", + " annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n", + " done\n", + " fi\n", + " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", + " annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n", + " fi\n", + " for a in \"${annots[@]}\"; do\n", + " gzip -f \"$a\"\n", + " done\n", + "\n", + "# ----------------------------------------------------------------------------\n", + "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n", + "# ----------------------------------------------------------------------------\n", + "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n", + " set -e\n", + " chrom=\"$[input_chroms[_index]]\"\n", + "\n", + " run_polyfun() {\n", + " local annot=\"$1\"\n", + " local out_prefix=\"$2\"\n", + " if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n", + " $[python_exec] $[polyfun_path]/ldsc.py \\\n", + " --print-snps $[snp_list] \\\n", + " $[ld_window_flag] $[ld_window_param] \\\n", + " --out \"$out_prefix\" \\\n", + " --bfile $[_input[-1]:nar] \\\n", + " --yes-really \\\n", + " --annot \"$annot\" \\\n", + " --l2\n", + " else\n", + " $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n", + " --annot \"$annot\" \\\n", + " --bfile $[_input[-1]:nar] \\\n", + " $[ld_window_flag] $[ld_window_param] \\\n", + " --out \"${out_prefix}.$[ldscore_ext]\" \\\n", + " --allow-missing\n", + " fi\n", + " }\n", + "\n", + " if [ \"$[str(emit_single)]\" = \"True\" ]; then\n", + " for i in $(seq 1 $[N_targets]); do\n", + " 
name=\"$[annotation_name]_single_$i\"\n", + " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", + " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", + " run_polyfun \"$annot\" \"$prefix\"\n", + " done\n", + " fi\n", + " if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n", + " name=\"$[annotation_name]_joint\"\n", + " annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n", + " prefix=\"$[cwd:a]/$name/$name.$chrom\"\n", + " run_polyfun \"$annot\" \"$prefix\"\n", + " fi\n", + "\n", + "# ----------------------------------------------------------------------------\n", + "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n", + "# ----------------------------------------------------------------------------\n", + "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n", + " suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n", + " use_print_snps <- ${str(use_print_snps).upper()}\n", + "\n", + " chrom <- \"${input_chroms[_index]}\"\n", + " # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n", + " frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n", + " has_frq <- file.exists(frq_file)\n", + " frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n", + "\n", + " write_M_files <- function(annot_path, ldscore_path, m_path) {\n", + " if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n", + " cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n", + " }\n", + " ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n", + " suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n", + " } else fread(ldscore_path)\n", + " annot_dt <- fread(annot_path)\n", + " annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n", + " merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n", + " 
std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n", + " annot_cols <- setdiff(names(merged), std_cols)\n", + " if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n", + " M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", + " writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n", + " if (has_frq) {\n", + " common <- merged[!is.na(MAF) & MAF > 0.05, ]\n", + " M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n", + " writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n", + " }\n", + " }\n", + "\n", + " targets <- c()\n", + " if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n", + " for (i in seq_len(${N_targets})) {\n", + " targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n", + " }\n", + " }\n", + " if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n", + " targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n", + " }\n", + " for (name in targets) {\n", + " annot_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n", + " ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n", + " m_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n", + " write_M_files(annot_path, ldscore_path, m_path)\n", + " }\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "Python 3 (ipykernel)" + }, + "source": [ + "## Calculate Functional Enrichment using Annotations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[get_heritability]\n", + "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n", + "# each (trait, target_dir) pair becomes one polyfun invocation. 
Outputs go to\n", + "# <cwd>/<target_name>/<trait>.{results,log,part_delete}.\n", + "#\n", + "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n", + "# typically N _single_ directories plus optionally one _joint directory.\n", + "\n", + "#\n", + "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n", + "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n", + "# always gets exactly two LD-score sources, in this order:\n", + "# \"<target_dir>/<target_name>.\" (index 0) , \"<baseline_ld_dir>/<baseline_name>\" (index 1)\n", + "# With --overlap-annot, every annotation column in the .results \"Category\" is\n", + "# named <column>_<index>:\n", + "# index 0 = the target file -> \"ANNOT_0\" (no-snplist; compute_ldscores.py keeps the annot col name)\n", + "# -> \"L2_0\" (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n", + "# -> \"ANNOT_1_0\",\"ANNOT_2_0\" (no-snplist joint dir, N>=2 annot cols)\n", + "# -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\" (snplist joint dir, N>=2 -> \"<annot>L2\")\n", + "# index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ... (the 97 baseline annots)\n", + "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n", + "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n", + "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header: ldsc.py's\n", + "# \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n", + "# column name.) 
[postprocess] auto-detects the target Category; if you instead pass\n", + "# --target-categories, the names must match this column exactly.\n", + "#\n", + "parameter: target_anno_dirs = paths()\n", + "parameter: all_traits = []\n", + "\n", + "import os\n", + "\n", + "with open(all_traits_file, 'r') as f:\n", + " trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n", + "\n", + "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n", + "input_list = []\n", + "target_meta = []\n", + "for td in target_anno_dirs:\n", + " for t in trait_paths:\n", + " input_list.append(t)\n", + " target_meta.append(str(td))\n", + "\n", + "input: input_list, group_by = 1, group_with = \"target_meta\"\n", + "\n", + "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\", \\\n", + " f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n", + "\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", + "\n", + "bash: expand = \"${ }\"\n", + " target_dir=\"${target_meta[_index]}\"\n", + " target_name=\"$(basename ${target_meta[_index]})\"\n", + " trait=\"$(basename ${_input[0]})\"\n", + " output_dir=\"${cwd:a}/$target_name\"\n", + " mkdir -p \"$output_dir\"\n", + "\n", + " # MAF cutoff handling. 
Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n", + " # other values would require recomputing LD scores at that cutoff.\n", + " frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n", + " if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n", + " echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n", + " frq_option=\"--not-M-5-50\"\n", + " elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n", + " if [ -f \"$frq_file_check\" ]; then\n", + " echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n", + " frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n", + " else\n", + " echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n", + " echo \" but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n", + " echo \" Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n", + " exit 1\n", + " fi\n", + " else\n", + " echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n", + " echo \" 0.05 (sLDSC default) are accepted. 
Other values would require\"\n", + " echo \" recomputing LD scores at that cutoff.\"\n", + " exit 1\n", + " fi\n", + "\n", + " run_ldsc() {\n", + " local extra_args=\"$1\"\n", + " ${python_exec} ${polyfun_path}/ldsc.py \\\n", + " --h2 ${sumstat_dir}/$trait \\\n", + " --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n", + " --out \"$output_dir/$trait\" \\\n", + " --overlap-annot \\\n", + " --w-ld-chr ${weights_dir}/${weight_name} \\\n", + " $frq_option \\\n", + " --print-coefficients \\\n", + " --print-delete-vals \\\n", + " --n-blocks ${n_blocks} \\\n", + " $extra_args\n", + " }\n", + "\n", + " run_ldsc \"\"\n", + " log_file=\"$output_dir/$trait.log\"\n", + "\n", + " # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n", + " for max in 30 20 10; do\n", + " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", + " echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n", + " run_ldsc \"--chisq-max $max\"\n", + " else\n", + " break\n", + " fi\n", + " done\n", + "\n", + " if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n", + " echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n", + " echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n", + " fi\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[munge_sumstats_polyfun]\n", + "parameter: sumstats = path\n", + "parameter: n = 0\n", + "parameter: min_info = 0.6\n", + "parameter: min_maf = 0.001\n", + "parameter: keep_hla = False\n", + "parameter: chi2_cut = 30\n", + "input: sumstats\n", + "output: f\"{_input:n}.munged.parquet\"\n", + "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n", + " 
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n", + " --sumstats {_input} \\\n", + " --out {_output} \\\n", + " {'--n {}'.format(n) if n>0 else ''} \\\n", + " {'--min-info {}'.format(min_info)} \\\n", + " {'--min-maf {}'.format(min_maf)} \\\n", + " {'--chi2-cutoff {}'.format(chi2_cut)} \\\n", + " {'--keep-hla' if keep_hla else ''} \\\n", + " --remove-strand-ambig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[postprocess]\n", + "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n", + "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n", + "# single-target and (when present) joint-target runs, computes Gazal-style\n", + "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n", + "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n", + "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n", + "\n", + "parameter: traits_file = path() # text file: one trait sumstats filename per line\n", + "parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains _single_/ subdirs and optionally _joint/)\n", + "parameter: target_categories = [] # target annotation names. 
Auto-detected from the joint-run results if empty.\n", + "parameter: target_categories_label = [] # optional display names, same order as target_categories;\n", + " # when given, every \"target\" column / tau*-block colname in\n", + " # the output RDS is renamed to these (params$target_categories\n", + " # holds the labels, params$target_categories_orig the originals).\n", + "parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n", + "\n", + "input: traits_file\n", + "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", + "\n", + "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", + " library(pecotmr)\n", + "\n", + " traits <- readLines(\"${traits_file}\")\n", + " target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n", + " target_lab <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n", + "\n", + " # Auto-detect single-target and joint-target output directories.\n", + " her_root <- \"${heritability_cwd}\"\n", + " all_subdirs <- list.dirs(her_root, recursive = FALSE)\n", + " single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n", + " joint_name <- paste0(\"${annotation_name}\", \"_joint\")\n", + " single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n", + " single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n", + " single_dirs <- single_dirs[order(single_indices)]\n", + " joint_dir <- file.path(her_root, joint_name)\n", + " has_joint <- dir.exists(joint_dir)\n", + "\n", + " message(sprintf(\"Detected %d single-target dirs%s\",\n", + " length(single_dirs),\n", + " if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n", + "\n", + " # Build per-trait prefix 
maps. Each trait's polyfun output prefix is the run directory joined with the trait name\n", + " # (polyfun appends .results / .log / .part_delete).\n", + " trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n", + " names(trait_single_prefixes) <- traits\n", + "\n", + " if (has_joint) {\n", + " trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n", + " } else {\n", + " trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n", + " }\n", + "\n", + " res <- sldsc_postprocessing_pipeline(\n", + " trait_single_prefixes = trait_single_prefixes,\n", + " trait_joint_prefix = trait_joint_prefix,\n", + " target_anno_dir = \"${target_anno_dir}\",\n", + " frqfile_dir = \"${frqfile_dir}\",\n", + " plink_name = \"${plink_name}\",\n", + " maf_cutoff = ${maf_cutoff},\n", + " target_categories = if (length(target_cats) > 0) target_cats else NULL,\n", + " target_labels = if (length(target_lab) > 0) target_lab else NULL\n", + " )\n", + "\n", + " saveRDS(res, \"${_output[0]}\")\n", + " message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[meta_subset]\n", + "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n", + "# the cached per-trait standardized results from [postprocess]. 
No regression rerun.\n", + "\n", + "parameter: postprocess_rds = path() # output of [postprocess]\n", + "parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]\n", + "parameter: subset_name = str # label used in the output filename\n", + "parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output\n", + "# If [postprocess] was run with --target-categories-label, the cached RDS already\n", + "# carries the display names (params$target_categories = the labels), so leave\n", + "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n", + "\n", + "input: postprocess_rds, subset_traits_file\n", + "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", + "\n", + "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", + " library(pecotmr)\n", + "\n", + " res <- readRDS(\"${postprocess_rds}\")\n", + " subset_traits <- readLines(\"${subset_traits_file}\")\n", + " target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n", + " if (length(target_cats) == 0) target_cats <- res$params$target_categories\n", + "\n", + " subset_per_trait <- res$per_trait[subset_traits]\n", + "\n", + " # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n", + " view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n", + " view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n", + "\n", + " out <- list(\n", + " tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"tau_star\")), target_cats),\n", + " tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, \"tau_star\")), target_cats),\n", + " enrichment = setNames(lapply(target_cats, function(c) 
meta_sldsc_random(view_single, c, \"enrichment\")), target_cats),\n", + " enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichstat\")), target_cats)\n", + " )\n", + "\n", + " saveRDS(out, \"${_output[0]}\")\n", + " message(\"Subset meta complete; results written to ${_output[0]}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "calysto_bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "R", + "ir", + "R", + "#DCDCDA", + "r" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "panel": { + "displayed": true, + "height": 0 + }, + "version": "0.22.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}