From 305d38cf578bb2adb79d5b6482b772c1aaaf19e7 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 23 Jun 2026 11:59:38 -0400
Subject: [PATCH 1/6] Delete code/SoS/reference_data/rss_ld_sketch.ipynb

---
 code/SoS/reference_data/rss_ld_sketch.ipynb | 738 --------------------
 1 file changed, 738 deletions(-)
 delete mode 100644 code/SoS/reference_data/rss_ld_sketch.ipynb

diff --git a/code/SoS/reference_data/rss_ld_sketch.ipynb b/code/SoS/reference_data/rss_ld_sketch.ipynb
deleted file mode 100644
index c1ac6c9a..00000000
--- a/code/SoS/reference_data/rss_ld_sketch.ipynb
+++ /dev/null
@@ -1,738 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "8bdb623a",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "# RSS LD Sketch Pipeline"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "4b8d670a",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Description\n",
-    "\n",
-    "This pipeline generates a stochastic genotype sample **U = W\u1d40G** from whole-genome sequencing VCF files and stores it as a PLINK2 pgen file for use as an LD reference panel with SuSiE-RSS fine-mapping.\n",
-    "\n",
-    "**Key idea:** Rather than storing the full genotype matrix G (n \u00d7 p), we compute U = W\u1d40G (B \u00d7 p) using a random projection matrix $W \\sim N(0, 1/\\sqrt{n})$. The approximate LD matrix $R = U^T U / B \\approx G^T G / n$ by the Johnson\u2013Lindenstrauss lemma. G is never stored.\n",
-    "\n",
-    "**Matrix dimensions:**\n",
-    "- G : (n \u00d7 p) \u2014 n individuals \u00d7 p variants\n",
-    "- W : (n \u00d7 B) \u2014 projection matrix, generated once per cohort\n",
-    "- U : (B \u00d7 p) \u2014 stochastic genotype sample = W\u1d40G, stored in pgen\n",
-    "- $\\hat{R}$ : (p \u00d7 p) \u2014 approximate LD matrix, computed on-the-fly by SuSiE-RSS from U\n",
-    "\n",
-    "The workflow has three steps run in order: `generate_W` (build the projection matrix), `process_block` (read VCF per LD block and write per-block dosage sketches), and `merge_chrom` (merge per-block dosages into one per-chromosome pgen).\n",
-    "\n",
-    "**Note on data:** This example runs on a clearly-labeled synthetic toy dataset \u2014 a chr22 VCF (`protocol_example.genotype.chr22.bgz`, 60 individuals) and a 3-block LD-block BED (`protocol_example.ld_blocks.bed`). No access-controlled individual-level human genomic data is used."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ffdf8d92",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Input\n",
-    "\n",
-    "- **LD-block BED** (`--ld-block-file`): tab-separated file with columns `chr`, `start`, `end` (0-based half-open) defining the regions to sketch. Toy file: `input/rss_ld_sketch/protocol_example.ld_blocks.bed` (3 chr22 blocks).\n",
-    "- **VCF directory** (`--vcf-base`) and **prefix** (`--vcf-prefix`): bgzipped (`.bgz`) + tabix-indexed VCF(s) named `{vcf_prefix}{chr}.*.bgz`. Toy file: `input/rss_ld_sketch/protocol_example.genotype.chr22.bgz` (60 individuals), discovered with `--vcf-base input/rss_ld_sketch --vcf-prefix protocol_example.genotype.`\n",
-    "- **`--n-samples`**: number of individuals in the VCF (here 60). Must match the VCF sample count.\n",
-    "- **`--B`**: number of sketch (pseudo-)samples / projection dimension (the toy uses a small B for speed; production uses ~10000).\n",
-    "- **`--chrom`**: chromosome to process (e.g. 22; 0 = all autosomes found).\n",
-    "- **`--cohort-id`**: label used to name output files.\n",
-    "\n",
-    "Filter thresholds (defaults shown in the implementation): `--maf-min 0.0005`, `--mac-min 5`, `--msng-min 0.05`."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "dce33bc7",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Steps\n",
-    "\n",
-    "Run the three workflows in order. `generate_W` builds the shared projection matrix once; `process_block` sketches each LD block; `merge_chrom` assembles the per-chromosome pgen."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "4633fe31",
-   "metadata": {},
-   "source": [
-    "**Timing:** ~10-20 min (chr22) on typical compute infrastructure."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6ffb31a4-7ad4-479d-8955-ba598a16ef07",
-   "metadata": {},
-   "source": [
-    "### Step 1. Generate the projection matrix W (run once per cohort; `--n-samples` must equal the VCF sample count)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "03612385",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sos run pipeline/rss_ld_sketch.ipynb generate_W \\\n",
-    "    --n-samples 60 \\\n",
-    "    --output-dir output/rss_ld_sketch \\\n",
-    "    --B 50 \\\n",
-    "    --seed 123 \\\n",
-    "    --cwd output/rss_ld_sketch"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b62d37b8-da5d-4d5f-a3b7-7633ff5ff70f",
-   "metadata": {},
-   "source": [
-    "### Step 2. Process all LD blocks for the chromosome \u2014 read the VCF, filter variants, and write per-block dosage sketches U = W\u1d40G.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d2eccffd",
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/rss_ld_sketch.ipynb process_block \\\n",
-    "    --ld-block-file input/rss_ld_sketch/protocol_example.ld_blocks.bed \\\n",
-    "    --chrom 22 \\\n",
-    "    --vcf-base input/rss_ld_sketch \\\n",
-    "    --vcf-prefix protocol_example.genotype. \\\n",
-    "    --output-dir output/rss_ld_sketch \\\n",
-    "    --W-matrix output/rss_ld_sketch/W_B50.npy \\\n",
-    "    --B 50 \\\n",
-    "    --cohort-id protocol_example. \\\n",
-    "    --cwd output/rss_ld_sketch"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "452c348f-487f-44e4-96c1-75fe118cbc9a",
-   "metadata": {},
-   "source": [
-    "### Step 3. Merge the per-block dosage sketches into one per-chromosome PLINK2 pgen."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "81f28809",
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/rss_ld_sketch.ipynb merge_chrom \\\n",
-    "    --output-dir output/rss_ld_sketch \\\n",
-    "    --cohort-id protocol_example. \\\n",
-    "    --chrom 22 \\\n",
-    "    --cwd output/rss_ld_sketch"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "32c022be",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Command interface\n",
-    "\n",
-    "List every workflow and its parameters:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f3569a70",
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/rss_ld_sketch.ipynb -h"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ac50d174",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Workflow implementation"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a7886e46",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[global]\n",
-    "parameter: cwd        = path(\"output\")\n",
-    "parameter: job_size   = 1\n",
-    "parameter: walltime   = \"24:00:00\"\n",
-    "parameter: mem        = \"32G\"\n",
-    "parameter: numThreads = 8\n",
-    "\n",
-    "cwd = path(f'{cwd:a}')\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c321bef5",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[generate_W]\n",
-    "# Generate projection matrix $W \\sim N(0, 1/\\sqrt{n})$, shape (n x B).\n",
-    "# Run ONCE before processing any chromosome.\n",
-    "#\n",
-    "# W depends only on n (total sample size) and B -- not on any variant data.\n",
-    "# n_samples is passed directly as a parameter; no VCF reading is needed.\n",
-    "# All 22 chromosomes reuse the same W so that per-chromosome stochastic\n",
-    "# genotype samples can be arithmetically merged for meta-analysis.\n",
-    "parameter: n_samples = int\n",
-    "parameter: output_dir    = str\n",
-    "parameter: B         = 10000\n",
-    "parameter: seed      = 123\n",
-    "\n",
-    "import os\n",
-    "input:  []\n",
-    "output: f'{output_dir}/W_B{B}.npy'\n",
-    "task: trunk_workers = 1, trunk_size = 1, walltime = '00:05:00', mem = '4G', cores = 1\n",
-    "python: expand = \"${ }\", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'\n",
-    "\n",
-    "    import numpy as np\n",
-    "    import os\n",
-    "\n",
-    "    n      = ${n_samples}\n",
-    "    B      = ${B}\n",
-    "    seed   = ${seed}\n",
-    "    W_out  = \"${_output}\"\n",
-    "\n",
-    "    # -- Generate $W \\sim N(0, 1/\\sqrt{n})$ -----------------------------\n",
-    "    # Convention: W = np.random.normal(0, 1/np.sqrt(n), size=(n, B))\n",
-    "    # W is shared across all chromosomes -- do not regenerate per chromosome.\n",
-    "    print(f\"Generating W ~ N(0, 1/sqrt({n})),  shape ({n}, {B}),  seed={seed}\")\n",
-    "    np.random.seed(seed)\n",
-    "    W = np.random.normal(0, 1.0 / np.sqrt(n), size=(n, B)).astype(np.float32)\n",
-    "\n",
-    "    os.makedirs(os.path.dirname(os.path.abspath(W_out)), exist_ok=True)\n",
-    "    np.save(W_out, W)\n",
-    "    print(f\"Saved: {W_out}\")\n",
-    "    print(f\"Shape: {W.shape},  size: {os.path.getsize(W_out)/1e9:.2f} GB\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "68a93ed9",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[process_block]\n",
-    "parameter: ld_block_file = str\n",
-    "parameter: chrom         = 0\n",
-    "parameter: vcf_base      = str\n",
-    "parameter: vcf_prefix    = str\n",
-    "parameter: cohort_id     = \"ADSP.R5.EUR\"\n",
-    "parameter: output_dir    = str\n",
-    "parameter: W_matrix      = str\n",
-    "parameter: B             = 10000\n",
-    "parameter: maf_min       = 0.0005\n",
-    "parameter: mac_min       = 5\n",
-    "parameter: msng_min      = 0.05\n",
-    "parameter: sample_list   = \"\"\n",
-    "\n",
-    "import os\n",
-    "\n",
-    "def _read_blocks(bed, chrom_filter):\n",
-    "    blocks = []\n",
-    "    with open(bed) as fh:\n",
-    "        for line in fh:\n",
-    "            if line.startswith(\"#\") or not line.strip():\n",
-    "                continue\n",
-    "            parts = line.split()\n",
-    "            c = parts[0]\n",
-    "            if not (c.startswith(\"chr\") and c[3:].isdigit()):\n",
-    "                continue\n",
-    "            cnum = int(c[3:])\n",
-    "            if not (1 <= cnum <= 22):\n",
-    "                continue\n",
-    "            if chrom_filter != 0 and cnum != chrom_filter:\n",
-    "                continue\n",
-    "            blocks.append({\"chr\": c, \"start\": int(parts[1]), \"end\": int(parts[2])})\n",
-    "    if not blocks:\n",
-    "        raise ValueError(f\"No blocks found for chrom={chrom_filter} in {bed}\")\n",
-    "    return blocks\n",
-    "\n",
-    "blocks = _read_blocks(ld_block_file, chrom)\n",
-    "print(f\"  {len(blocks)} LD blocks queued\")\n",
-    "\n",
-    "input: for_each = \"blocks\"\n",
-    "output: f'{output_dir}/{_blocks[\"chr\"]}/{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}/{cohort_id}.{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}.dosage.gz'\n",
-    "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n",
-    "python: expand = \"${ }\"\n",
-    "\n",
-    "    import numpy as np\n",
-    "    import os\n",
-    "    import gzip\n",
-    "    import sys\n",
-    "    import atexit\n",
-    "    from math import nan\n",
-    "    from cyvcf2 import VCF\n",
-    "    from os import listdir\n",
-    "\n",
-    "    # Block coordinates from for_each loop\n",
-    "    chrm_str    = \"${_blocks['chr']}\"\n",
-    "    block_start = ${_blocks[\"start\"]}\n",
-    "    block_end   = ${_blocks[\"end\"]}\n",
-    "\n",
-    "    vcf_base    = \"${vcf_base}\"\n",
-    "    vcf_prefix  = \"${vcf_prefix}\"\n",
-    "    W_path      = \"${W_matrix}\"\n",
-    "    B           = ${B}\n",
-    "    maf_min     = ${maf_min}\n",
-    "    mac_min     = ${mac_min}\n",
-    "    msng_min    = ${msng_min}\n",
-    "    sample_list = \"${sample_list}\"\n",
-    "    cohort_id   = \"${cohort_id}\"\n",
-    "    base_dir    = \"${output_dir}\"\n",
-    "\n",
-    "    block_tag   = f\"{chrm_str}_{block_start}_{block_end}\"\n",
-    "    output_dir  = os.path.join(base_dir, chrm_str, block_tag)\n",
-    "    os.makedirs(output_dir, exist_ok=True)\n",
-    "\n",
-    "    log_path = os.path.join(output_dir, f\"{block_tag}.log\")\n",
-    "    log_fh   = open(log_path, \"w\")\n",
-    "    sys.stdout = log_fh\n",
-    "    sys.stderr = log_fh\n",
-    "    atexit.register(log_fh.close)\n",
-    "\n",
-    "    # -- Load sample subset (optional) -----------------------------\n",
-    "    sample_subset = None\n",
-    "    if sample_list:\n",
-    "        if not os.path.exists(sample_list):\n",
-    "            raise FileNotFoundError(f\"sample_list not found: {sample_list}\")\n",
-    "        with open(sample_list) as fh:\n",
-    "            sample_subset = set(line.strip() for line in fh if line.strip())\n",
-    "        print(f\"  Sample subset: {len(sample_subset):,} samples\")\n",
-    "\n",
-    "    # -- Helpers ---------------------------------------------------\n",
-    "    def get_vcf_files(chrm_str):\n",
-    "        files = sorted([\n",
-    "            os.path.join(vcf_base, x)\n",
-    "            for x in listdir(vcf_base)\n",
-    "            if x.endswith(\".bgz\") and (\n",
-    "                x.startswith(vcf_prefix + chrm_str + \":\") or\n",
-    "                x.startswith(vcf_prefix + chrm_str + \".\")\n",
-    "            )\n",
-    "        ])\n",
-    "        if not files:\n",
-    "            raise FileNotFoundError(f\"No VCF files for {chrm_str} in {vcf_base}\")\n",
-    "        return files\n",
-    "\n",
-    "    def open_vcf(vf, sample_subset):\n",
-    "        \"\"\"Open a VCF file, applying sample subset if provided.\"\"\"\n",
-    "        vcf = VCF(vf)\n",
-    "        if sample_subset is not None:\n",
-    "            vcf_samples = vcf.samples\n",
-    "            keep = [s for s in vcf_samples if s in sample_subset]\n",
-    "            if not keep:\n",
-    "                raise ValueError(f\"No sample_list samples in {os.path.basename(vf)}\")\n",
-    "            vcf.set_samples(keep)\n",
-    "        return vcf\n",
-    "\n",
-    "    def extract_dosage(var):\n",
-    "        \"\"\"Extract diploid dosage from a cyvcf2 variant. Returns list of floats (nan for missing).\"\"\"\n",
-    "        return [sum(x[0:2]) for x in [[nan if v == -1 else v for v in gt] for gt in var.genotypes]]\n",
-    "\n",
-    "    def fill_missing_col_means(G):\n",
-    "        col_means = np.nanmean(G, axis=0)\n",
-    "        return np.where(np.isnan(G), col_means, G)\n",
-    "\n",
-    "    # -- Single-pass: scan variants, filter, and collect dosages ---\n",
-    "    # BED is 0-based half-open [start, end); VCF is 1-based.\n",
-    "    print(f\"[1/3] Scanning {chrm_str} [{block_start:,}, {block_end:,}) ...\")\n",
-    "    vcf_files = get_vcf_files(chrm_str)\n",
-    "    region    = f\"{chrm_str}:{block_start+1}-{block_end}\"\n",
-    "    var_info  = []\n",
-    "    dosage_matrix = []\n",
-    "    n_samples = None\n",
-    "    # Filter counters\n",
-    "    n_total = 0\n",
-    "    n_multiallelic = 0\n",
-    "    n_monomorphic = 0\n",
-    "    n_all_na = 0\n",
-    "    n_low_maf = 0\n",
-    "    n_low_mac = 0\n",
-    "    n_high_msng = 0\n",
-    "\n",
-    "    for vf in vcf_files:\n",
-    "        vcf = open_vcf(vf, sample_subset)\n",
-    "        if n_samples is None:\n",
-    "            n_samples = len(vcf.samples)\n",
-    "        for var in vcf(region):\n",
-    "            if not (block_start <= var.POS - 1 < block_end):\n",
-    "                continue\n",
-    "            n_total += 1\n",
-    "            if len(var.ALT) != 1:\n",
-    "                n_multiallelic += 1\n",
-    "                continue\n",
-    "            dosage = extract_dosage(var)\n",
-    "            if np.nanvar(dosage) == 0:\n",
-    "                n_monomorphic += 1\n",
-    "                continue\n",
-    "            nan_count = int(np.sum(np.isnan(dosage)))\n",
-    "            n_non_na  = len(dosage) - nan_count\n",
-    "            if n_non_na == 0:\n",
-    "                n_all_na += 1\n",
-    "                continue\n",
-    "            alt_sum   = float(np.nansum(dosage))\n",
-    "            mac       = min(2 * n_non_na - alt_sum, alt_sum)\n",
-    "            maf       = mac / (2 * n_non_na)\n",
-    "            af        = alt_sum / (2 * n_non_na)\n",
-    "            msng_rate = nan_count / len(dosage)\n",
-    "            if msng_rate > msng_min:\n",
-    "                n_high_msng += 1\n",
-    "                continue\n",
-    "            if maf < maf_min:\n",
-    "                n_low_maf += 1\n",
-    "                continue\n",
-    "            if mac < mac_min:\n",
-    "                n_low_mac += 1\n",
-    "                continue\n",
-    "            var_info.append({\n",
-    "                \"chr\": var.CHROM, \"pos\": var.POS,\n",
-    "                \"ref\": var.REF,   \"alt\": var.ALT[0],\n",
-    "                \"af\":  round(float(af), 6),\n",
-    "                \"id\":  f\"{var.CHROM}:{var.POS}:{var.REF}:{var.ALT[0]}\",\n",
-    "                \"obs_ct\": 2 * n_non_na,\n",
-    "            })\n",
-    "            dosage_matrix.append(dosage)\n",
-    "        vcf.close()\n",
-    "\n",
-    "    n_passed = len(var_info)\n",
-    "    print(f\"  {n_total:,} total variants in region\")\n",
-    "    print(f\"  {n_passed:,} passed filters (n={n_samples:,})\")\n",
-    "    print(f\"  Filtered: {n_multiallelic:,} multiallelic, \"\n",
-    "          f\"{n_monomorphic:,} monomorphic, {n_all_na:,} all-NA, \"\n",
-    "          f\"{n_high_msng:,} high-missingness, \"\n",
-    "          f\"{n_low_maf:,} low-MAF, {n_low_mac:,} low-MAC\")\n",
-    "\n",
-    "    if not var_info:\n",
-    "        raise ValueError(f\"No passing variants in {chrm_str} [{block_start:,}, {block_end:,})\")\n",
-    "\n",
-    "    # -- Load W ----------------------------------------------------\n",
-    "    print(f\"[2/3] Loading W ...\")\n",
-    "    W = np.load(W_path)\n",
-    "    if W.shape != (n_samples, B):\n",
-    "        raise ValueError(f\"W shape mismatch: {W.shape} vs ({n_samples},{B})\")\n",
-    "    W = W.astype(np.float32)\n",
-    "    print(f\"  W: {W.shape}\")\n",
-    "\n",
-    "    # -- Compute U = $W^T G$ and write output files --------------------\n",
-    "    print(f\"[3/3] Computing U and writing output files ...\")\n",
-    "\n",
-    "    dosage_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.dosage.gz\")\n",
-    "    map_path    = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.map\")\n",
-    "    afreq_path  = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.afreq\")\n",
-    "    meta_path   = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.meta\")\n",
-    "\n",
-    "    # Write .map\n",
-    "    with open(map_path, \"w\") as fh:\n",
-    "        for v in var_info:\n",
-    "            fh.write(f\"{v['chr']}\\t{v['id']}\\t0\\t{v['pos']}\\n\")\n",
-    "\n",
-    "    # Write .meta\n",
-    "    with open(meta_path, \"w\") as fh:\n",
-    "        fh.write(f\"source_n_samples={n_samples}\\nB={B}\\n\")\n",
-    "        fh.write(f\"chrom={chrm_str}\\nblock_start={block_start}\\nblock_end={block_end}\\n\")\n",
-    "        fh.write(f\"n_total={n_total}\\nn_passed={n_passed}\\n\")\n",
-    "        fh.write(f\"n_multiallelic={n_multiallelic}\\nn_monomorphic={n_monomorphic}\\n\")\n",
-    "        fh.write(f\"n_all_na={n_all_na}\\nn_high_msng={n_high_msng}\\n\")\n",
-    "        fh.write(f\"n_low_maf={n_low_maf}\\nn_low_mac={n_low_mac}\\n\")\n",
-    "\n",
-    "\n",
-    "    # Build G from collected dosages, compute U = $W^T G$, write dosage.gz\n",
-    "    # Dosage format=1: ID ALT REF val_S1 ... val_SB\n",
-    "    # Min-max scaling to [0, 2] makes the output plink2-compatible as dosage.\n",
-    "    # This preserves correlation structure (cor is scale-invariant) which is\n",
-    "    # what matters for LD computation downstream.\n",
-    "    G = np.array(dosage_matrix, dtype=np.float32).T  # (n_samples, n_variants)\n",
-    "    del dosage_matrix\n",
-    "    G = fill_missing_col_means(G)\n",
-    "\n",
-    "    # variant-wise scaling\n",
-    "    col_mean = G.mean(axis=0, keepdims=True)\n",
-    "    col_std  = G.std(axis=0, keepdims=True)\n",
-    "    # avoid division by zero\n",
-    "    col_std[col_std == 0] = 1.0\n",
-    "    G = (G - col_mean) / col_std\n",
-    "\n",
-    "    U = W.T @ G  # (B, n_variants)\n",
-    "    del G\n",
-    "\n",
-    "    col_min = U.min(axis=0)\n",
-    "    col_max = U.max(axis=0)\n",
-    "    denom   = col_max - col_min\n",
-    "    denom[denom == 0] = 1.0\n",
-    "    U = 2.0 * (U - col_min) / denom\n",
-    "    U = np.round(U, 4)\n",
-    "\n",
-    "    # Record the col min and max for U\n",
-    "\n",
-    "    with open(afreq_path, \"w\") as fh:\n",
-    "        # Add column headers\n",
-    "        fh.write(\"#CHROM\\tID\\tREF\\tALT\\tALT_FREQS\\tOBS_CT\\tU_MIN\\tU_MAX\\n\")\n",
-    "        for j, v in enumerate(var_info):\n",
-    "            fh.write(f\"{v['chr']}\\t{v['id']}\\t{v['ref']}\\t{v['alt']}\\t\"\n",
-    "                     f\"{v['af']:.6f}\\t{v['obs_ct']}\\t\"\n",
-    "                     f\"{col_min[j]:.6f}\\t{col_max[j]:.6f}\\n\")\n",
-    "\n",
-    "    with gzip.open(dosage_path, \"wt\", compresslevel=4) as gz:\n",
-    "        for j, v in enumerate(var_info):\n",
-    "            vals = \" \".join(f\"{x:.4f}\" for x in U[:, j])\n",
-    "            gz.write(f\"{v['id']} {v['alt']} {v['ref']} {vals}\\n\")\n",
-    "\n",
-    "    del U\n",
-    "    print(f\"  Written: {len(var_info):,} variants -> {os.path.basename(dosage_path)}\")\n",
-    "    print(f\"  Written: {os.path.basename(map_path)}, {os.path.basename(afreq_path)}\")\n",
-    "    print(f\"\\nDone: {chrm_str} [{block_start:,}, {block_end:,})\")\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9e8fff43",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[merge_chrom]\n",
-    "parameter: chrom      = 0\n",
-    "parameter: output_dir = str\n",
-    "parameter: cohort_id  = str\n",
-    "parameter: plink2_bin = \"plink2\"\n",
-    "\n",
-    "import os, glob\n",
-    "\n",
-    "def _chroms_to_process(output_dir, chrom_filter):\n",
-    "    if chrom_filter != 0:\n",
-    "        return [f\"chr{chrom_filter}\"]\n",
-    "    return sorted(set(\n",
-    "        os.path.basename(d)\n",
-    "        for d in glob.glob(os.path.join(output_dir, \"chr*\"))\n",
-    "        if os.path.isdir(d)\n",
-    "    ))\n",
-    "\n",
-    "chroms = _chroms_to_process(output_dir, chrom)\n",
-    "\n",
-    "input: for_each = \"chroms\"\n",
-    "output: f\"{output_dir}/{_chroms}/{cohort_id}.{_chroms}.pgen\"\n",
-    "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n",
-    "bash: expand = \"$[ ]\"\n",
-    "    set -euo pipefail\n",
-    "    shopt -s nullglob\n",
-    "\n",
-    "    chrom_dir=\"$[output_dir]/$[_chroms]\"\n",
-    "    final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n",
-    "    merge_list=\"${chrom_dir}/$[cohort_id].$[_chroms]_pmerge_list.txt\"\n",
-    "\n",
-    "    # Step 1: Convert each block dosage.gz -> sorted per-block pgen\n",
-    "    > \"${merge_list}\"\n",
-    "    files=(\"${chrom_dir}\"/*/*.dosage.gz)\n",
-    "    if [ ${#files[@]} -eq 0 ]; then\n",
-    "        echo \"No dosage files found in ${chrom_dir}\" >&2\n",
-    "        exit 1\n",
-    "    fi\n",
-    "    for dosage_gz in \"${files[@]}\"; do\n",
-    "        block_dir=$(dirname \"${dosage_gz}\")\n",
-    "        block_tag=$(basename \"${block_dir}\")\n",
-    "        prefix=\"${block_dir}/$[cohort_id].${block_tag}_tmp\"\n",
-    "        map_file=\"${block_dir}/$[cohort_id].${block_tag}.map\"\n",
-    "        psam_file=\"${block_dir}/$[cohort_id].${block_tag}.psam\"\n",
-    "        meta_file=\"${block_dir}/$[cohort_id].${block_tag}.meta\"\n",
-    "        B=$(grep \"^B=\" \"${meta_file}\" | cut -d= -f2)\n",
-    "        printf '#FID\\tIID\\n' > \"${psam_file}\"\n",
-    "        for i in $(seq 1 ${B}); do\n",
-    "            printf 'S%d\\tS%d\\n' ${i} ${i} >> \"${psam_file}\"\n",
-    "        done\n",
-    "        $[plink2_bin] \\\n",
-    "            --import-dosage \"${dosage_gz}\" format=1 noheader \\\n",
-    "            --psam \"${psam_file}\" \\\n",
-    "            --map  \"${map_file}\" \\\n",
-    "            --make-pgen \\\n",
-    "            --out  \"${prefix}_unsorted\" \\\n",
-    "            --silent\n",
-    "        $[plink2_bin] \\\n",
-    "            --pfile \"${prefix}_unsorted\" \\\n",
-    "            --make-pgen \\\n",
-    "            --sort-vars \\\n",
-    "            --out  \"${prefix}\" \\\n",
-    "            --silent\n",
-    "        rm -f \"${prefix}_unsorted.pgen\" \"${prefix}_unsorted.pvar\" \"${prefix}_unsorted.psam\"\n",
-    "        echo \"${prefix}\" >> \"${merge_list}\"\n",
-    "    done\n",
-    "\n",
-    "    # Step 2: Merge all per-block pgens -> one per-chrom pgen\n",
-    "    $[plink2_bin] \\\n",
-    "        --pmerge-list \"${merge_list}\" pfile \\\n",
-    "        --make-pgen \\\n",
-    "        --sort-vars \\\n",
-    "        --out  \"${final_prefix}\"\n",
-    "\n",
-    "    # Step 3: Concatenate .afreq\n",
-    "    first=1\n",
-    "    for f in \"${chrom_dir}\"/*/*.afreq; do\n",
-    "        if [ \"${first}\" -eq 1 ]; then\n",
-    "            cat \"${f}\" > \"${final_prefix}.afreq\"\n",
-    "            first=0\n",
-    "        else\n",
-    "            tail -n +2 \"${f}\" >> \"${final_prefix}.afreq\"\n",
-    "        fi\n",
-    "    done\n",
-    "\n",
-    "R: expand = \"$[ ]\"\n",
-    "    library(data.table)\n",
-    "    meta_files <- list.files(\"$[output_dir]/$[_chroms]\",\n",
-    "                             pattern = \"[.]meta$\", recursive = TRUE,\n",
-    "                             full.names = TRUE)\n",
-    "    if (length(meta_files) > 0) {\n",
-    "      fields <- c(\"n_total\", \"n_passed\", \"n_multiallelic\", \"n_monomorphic\",\n",
-    "                  \"n_all_na\", \"n_high_msng\", \"n_low_maf\", \"n_low_mac\")\n",
-    "      stats <- rbindlist(lapply(meta_files, function(f) {\n",
-    "        lines <- grep(\"^n_\", readLines(f), value = TRUE)\n",
-    "        kv <- strsplit(lines, \"=\")\n",
-    "        vals <- setNames(as.integer(sapply(kv, `[`, 2)), sapply(kv, `[`, 1))\n",
-    "        as.data.table(as.list(vals[fields]))\n",
-    "      }))\n",
-    "      totals <- colSums(stats, na.rm = TRUE)\n",
-    "      summary <- data.frame(t(totals))\n",
-    "      summary$pct_dropped <- round(100 * (1 - summary$n_passed / summary$n_total), 1)\n",
-    "      cat(\"\\n=== Filter Summary for $[_chroms] ===\\n\")\n",
-    "      print(data.frame(value = unlist(summary), row.names = names(summary)))\n",
-    "    }\n",
-    "bash: expand = \"$[ ]\"\n",
-    "    # Step 5: Cleanup block intermediates\n",
-    "    chrom_dir=\"$[output_dir]/$[_chroms]\"\n",
-    "    final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n",
-    "    rm -f \"${final_prefix}_pmerge_list.txt\"\n",
-    "    rm -f \"${final_prefix}-merge.pgen\" \"${final_prefix}-merge.pvar\" \"${final_prefix}-merge.psam\"\n",
-    "    for block_dir in \"${chrom_dir}\"/*/; do\n",
-    "        rm -rf \"${block_dir}\"\n",
-    "    done\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "dc998dcc",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Troubleshooting\n",
-    "\n",
-    "| Symptom | Cause | Fix |\n",
-    "|---|---|---|\n",
-    "| `No VCF files for chrXX in {vcf_base}` | VCF naming or extension mismatch | Files must end in `.bgz` and be named `{vcf_prefix}{chr}.*.bgz`; check `--vcf-base` and `--vcf-prefix`. |\n",
-    "| `W shape mismatch` | `--n-samples` or `--B` differs from the W used | Re-run `generate_W` with the same `--n-samples` and `--B`, and pass that `W_B{B}.npy` to `process_block`. |\n",
-    "| `No passing variants in chrXX` | Filters removed everything (small toy cohort) | Widen `--maf-min` / `--mac-min` / `--msng-min`, or choose blocks with more variants. |\n",
-    "| `No blocks found for chrom=XX` | `--chrom` does not match any BED rows | Ensure the BED `chr` column matches (e.g. `chr22`) and `--chrom` is the matching number. |\n",
-    "| Region query returns nothing | Missing tabix index | Run `tabix -p vcf file.bgz` to create the `.tbi`. |"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ac3cfb79",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Output\n",
-    "\n",
-    "Per chromosome, under `--cwd`:\n",
-    "- `{cohort_id}.chr{N}.pgen` \u2014 binary genotype-sketch data (B pseudo-samples \u00d7 p variants)\n",
-    "- `{cohort_id}.chr{N}.pvar` \u2014 variant information\n",
-    "- `{cohort_id}.chr{N}.psam` \u2014 sample (sketch) information\n",
-    "- `{cohort_id}.chr{N}.afreq` \u2014 allele frequencies\n",
-    "\n",
-    "These feed SuSiE-RSS fine-mapping: load with a metadata TSV (one row per chromosome, columns `#chrom start end path`, `path` = pgen prefix). Use the X (genotype) interface for `susie_rss(z, X=X)` or the R (correlation) interface for `susie_rss(z, R=R)`."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ff927c9c",
-   "metadata": {},
-   "source": [
-    "## Anticipated Results\n",
-    "\n",
-    "The pipeline produces output files in the `output/` subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the **Output** section above for the expected file names and formats."
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.12.13"
-  },
-  "sos": {
-   "kernels": [
-    [
-     "SoS",
-     "sos",
-     "sos",
-     "",
-     ""
-    ]
-   ],
-   "version": ""
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file

From 92df463fa2f18119615dd7ba5f5c88c9d93edd4f Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 23 Jun 2026 12:00:07 -0400
Subject: [PATCH 2/6] remove dup merge output

---
 code/SoS/reference_data/rss_ld_sketch.ipynb | 850 ++++++++++++++++++++
 1 file changed, 850 insertions(+)
 create mode 100644 code/SoS/reference_data/rss_ld_sketch.ipynb

diff --git a/code/SoS/reference_data/rss_ld_sketch.ipynb b/code/SoS/reference_data/rss_ld_sketch.ipynb
new file mode 100644
index 00000000..0396c03d
--- /dev/null
+++ b/code/SoS/reference_data/rss_ld_sketch.ipynb
@@ -0,0 +1,850 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "8bdb623a",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "# RSS LD Sketch Pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4b8d670a",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Description\n",
+    "\n",
+    "This pipeline generates a stochastic genotype sample **U = WᵀG** from whole-genome sequencing VCF files and stores it as a PLINK2 pgen file for use as an LD reference panel with SuSiE-RSS fine-mapping.\n",
+    "\n",
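+    "**Key idea:** Rather than storing the full genotype matrix G (n × p), we compute U = WᵀG (B × p) using a random projection matrix W whose entries are drawn i.i.d. from a normal distribution with mean 0 and standard deviation $1/\\sqrt{n}$ (variance $1/n$). Because $E[W W^T] = (B/n) I$, the approximate LD matrix satisfies $\\hat{R} = U^T U / B \\approx G^T G / n$, in the spirit of the Johnson–Lindenstrauss lemma; the implementation standardizes each column of G before projecting. G is never stored.\n",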
+    "\n",
+    "**Matrix dimensions:**\n",
+    "- G : (n × p) — n individuals × p variants\n",
+    "- W : (n × B) — projection matrix, generated once per cohort\n",
+    "- U : (B × p) — stochastic genotype sample = WᵀG, stored in pgen\n",
+    "- $\\hat{R}$ : (p × p) — approximate LD matrix, computed on-the-fly by SuSiE-RSS from U\n",
+    "\n",
+    "The workflow has three steps run in order: `generate_W` (build the projection matrix), `process_block` (read VCF per LD block and write per-block dosage sketches), and `merge_chrom` (merge per-block dosages into one per-chromosome pgen).\n",
+    "\n",
+    "**Note on data:** This example runs on a clearly labeled synthetic toy dataset: a chr22 VCF (`protocol_example.genotype.chr22.bgz`, 60 individuals) and a 3-block LD-block BED (`protocol_example.ld_blocks.bed`). No access-controlled individual-level human genomic data is used."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ffdf8d92",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Input\n",
+    "\n",
+    "- **LD-block BED** (`--ld-block-file`): tab-separated file with columns `chr`, `start`, `end` (0-based half-open) defining the regions to sketch. Toy file: `input/rss_ld_sketch/protocol_example.ld_blocks.bed` (3 chr22 blocks).\n",
+    "- **VCF directory** (`--vcf-base`) and **prefix** (`--vcf-prefix`): bgzipped and tabix-indexed VCF(s) whose file names start with `{vcf_prefix}{chr}.` or `{vcf_prefix}{chr}:` and end in `.bgz`. Toy file: `input/rss_ld_sketch/protocol_example.genotype.chr22.bgz` (60 individuals), discovered with `--vcf-base input/rss_ld_sketch --vcf-prefix protocol_example.genotype.`\n",
+    "- **`--n-samples`**: number of individuals in the VCF (here 60). Must match the VCF sample count.\n",
+    "- **`--B`**: number of sketch (pseudo-)samples / projection dimension (the toy uses a small B for speed; production uses ~10000).\n",
+    "- **`--chrom`**: chromosome to process (e.g. 22; 0 = all autosomes found).\n",
+    "- **`--cohort-id`**: label used to name output files.\n",
+    "\n",
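+    "Filter thresholds (defaults from the implementation): `--maf-min 0.0005` (minimum minor allele frequency), `--mac-min 5` (minimum minor allele count), and `--msng-min 0.05` (despite the name, the maximum allowed missing-genotype rate; variants with a higher rate are dropped)."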
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dce33bc7",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Steps\n",
+    "\n",
+    "Run the three workflows in order. `generate_W` builds the shared projection matrix once; `process_block` sketches each LD block; `merge_chrom` assembles the per-chromosome pgen."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4633fe31",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "**Timing:** ~10-20 min (chr22) on typical compute infrastructure."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6ffb31a4-7ad4-479d-8955-ba598a16ef07",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Step 1. Generate the projection matrix W (run once per cohort; `--n-samples` must equal the VCF sample count)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "03612385",
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mgenerate_W\u001b[0m: \n",
+      "INFO: \u001b[32mgenerate_W\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mgenerate_W\u001b[0m output:   \u001b[32moutput/rss_ld_sketch/W_B50.npy\u001b[0m\n",
+      "INFO: Workflow generate_W (ID=w68f63c60d8da4b5e) is executed successfully with 1 completed step.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/rss_ld_sketch.ipynb generate_W \\\n",
+    "    --n-samples 60 \\\n",
+    "    --output-dir output/rss_ld_sketch \\\n",
+    "    --B 50 \\\n",
+    "    --seed 123 \\\n",
+    "    --cwd output/rss_ld_sketch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b62d37b8-da5d-4d5f-a3b7-7633ff5ff70f",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Step 2. Process all LD blocks for the chromosome — read the VCF, filter variants, and write per-block dosage sketches U = WᵀG.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "d2eccffd",
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mprocess_block\u001b[0m: \n",
+      "  3 LD blocks queued\n",
+      "INFO: \u001b[32mprocess_block\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mprocess_block\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mprocess_block\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mprocess_block\u001b[0m output:   \u001b[32moutput/rss_ld_sketch/chr22/chr22_16000000_20000000/protocol_example..chr22_16000000_20000000.dosage.gz output/rss_ld_sketch/chr22/chr22_30000000_34000000/protocol_example..chr22_30000000_34000000.dosage.gz... (3 items in 3 groups)\u001b[0m\n",
+      "INFO: Workflow process_block (ID=w8c1e759d62203ef6) is executed successfully with 1 completed step and 3 completed substeps.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/rss_ld_sketch.ipynb process_block \\\n",
+    "    --ld-block-file input/rss_ld_sketch/protocol_example.ld_blocks.bed \\\n",
+    "    --chrom 22 \\\n",
+    "    --vcf-base input/rss_ld_sketch \\\n",
+    "    --vcf-prefix protocol_example.genotype. \\\n",
+    "    --output-dir output/rss_ld_sketch \\\n",
+    "    --W-matrix output/rss_ld_sketch/W_B50.npy \\\n",
+    "    --B 50 \\\n",
+    "    --cohort-id protocol_example. \\\n",
+    "    --cwd output/rss_ld_sketch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "452c348f-487f-44e4-96c1-75fe118cbc9a",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Step 3. Merge the per-block dosage sketches into one per-chromosome PLINK2 pgen."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "81f28809",
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mmerge_chrom\u001b[0m: \n",
+      "PLINK v2.0.0-a.6.9LM 64-bit Intel (29 Jan 2025)    cog-genomics.org/plink/2.0/\n",
+      "(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3\n",
+      "Logging to output/rss_ld_sketch/chr22/protocol_example..chr22.log.\n",
+      "Options in effect:\n",
+      "  --make-pgen\n",
+      "  --out output/rss_ld_sketch/chr22/protocol_example..chr22\n",
+      "  --pmerge-list output/rss_ld_sketch/chr22/protocol_example..chr22_pmerge_list.txt pfile\n",
+      "  --sort-vars\n",
+      "\n",
+      "Start time: Tue Jun 23 09:52:54 2026\n",
+      "191527 MiB RAM detected, ~187643 available; reserving 95763 MiB for main\n",
+      "workspace.\n",
+      "Using up to 32 threads (change this with --threads).\n",
+      "--pmerge-list: 3 filesets specified.\n",
+      "--pmerge-list: 50 samples present.\n",
+      "--pmerge-list: Merged .psam written to\n",
+      "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam .\n",
+      "--pmerge-list: 3 .pvar files scanned.\n",
+      "Concatenation job detected.\n",
+      "Concatenating... 673/673 variants complete.\n",
+      "Results written to\n",
+      "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pgen +\n",
+      "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar .\n",
+      "50 samples (0 females, 0 males, 50 ambiguous; 50 founders) loaded from\n",
+      "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam.\n",
+      "673 variants loaded from\n",
+      "output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar.\n",
+      "Note: No phenotype data present.\n",
+      "Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pvar ... done.\n",
+      "Writing output/rss_ld_sketch/chr22/protocol_example..chr22.psam ... done.\n",
+      "Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pgen ... done.\n",
+      "End time: Tue Jun 23 09:52:54 2026\n",
+      "\n",
+      "=== Filter Summary for chr22 ===\n",
+      "                value\n",
+      "n_total        6157.0\n",
+      "n_passed        673.0\n",
+      "n_multiallelic    0.0\n",
+      "n_monomorphic  4701.0\n",
+      "n_all_na          0.0\n",
+      "n_high_msng     101.0\n",
+      "n_low_maf         0.0\n",
+      "n_low_mac       682.0\n",
+      "pct_dropped      89.1\n",
+      "INFO: \u001b[32mmerge_chrom\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmerge_chrom\u001b[0m output:   \u001b[32moutput/rss_ld_sketch/chr22/protocol_example..chr22.pgen\u001b[0m\n",
+      "INFO: Workflow merge_chrom (ID=w8e5a670551e06660) is executed successfully with 1 completed step.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/rss_ld_sketch.ipynb merge_chrom \\\n",
+    "    --output-dir output/rss_ld_sketch \\\n",
+    "    --cohort-id protocol_example. \\\n",
+    "    --chrom 22 \\\n",
+    "    --cwd output/rss_ld_sketch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32c022be",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Command interface\n",
+    "\n",
+    "List every workflow and its parameters:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3569a70",
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/rss_ld_sketch.ipynb -h"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ac50d174",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Workflow implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a7886e46",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[global]\n",
+    "parameter: cwd        = path(\"output\")\n",
+    "parameter: job_size   = 1\n",
+    "parameter: walltime   = \"24:00:00\"\n",
+    "parameter: mem        = \"32G\"\n",
+    "parameter: numThreads = 8\n",
+    "\n",
+    "cwd = path(f'{cwd:a}')\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c321bef5",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[generate_W]\n",
+    "# Generate projection matrix W with i.i.d. N(mean 0, sd 1/sqrt(n)) entries, shape (n x B).\n",
+    "# Run ONCE before processing any chromosome.\n",
+    "#\n",
+    "# W depends only on n (total sample size) and B -- not on any variant data.\n",
+    "# n_samples is passed directly as a parameter; no VCF reading is needed.\n",
+    "# All 22 chromosomes reuse the same W so that per-chromosome stochastic\n",
+    "# genotype samples can be arithmetically merged for meta-analysis.\n",
+    "parameter: n_samples = int\n",
+    "parameter: output_dir    = str\n",
+    "parameter: B         = 10000\n",
+    "parameter: seed      = 123\n",
+    "\n",
+    "import os\n",
+    "input:  []\n",
+    "output: f'{output_dir}/W_B{B}.npy'\n",
+    "task: trunk_workers = 1, trunk_size = 1, walltime = '00:05:00', mem = '4G', cores = 1\n",
+    "python: expand = \"${ }\", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'\n",
+    "\n",
+    "    import numpy as np\n",
+    "    import os\n",
+    "\n",
+    "    n      = ${n_samples}\n",
+    "    B      = ${B}\n",
+    "    seed   = ${seed}\n",
+    "    W_out  = \"${_output}\"\n",
+    "\n",
+    "    # -- Generate $W \\sim N(0, 1/\\sqrt{n})$ -----------------------------\n",
+    "    # Convention: W = np.random.normal(0, 1/np.sqrt(n), size=(n, B))\n",
+    "    # W is shared across all chromosomes -- do not regenerate per chromosome.\n",
+    "    print(f\"Generating W ~ N(0, 1/sqrt({n})),  shape ({n}, {B}),  seed={seed}\")\n",
+    "    np.random.seed(seed)\n",
+    "    W = np.random.normal(0, 1.0 / np.sqrt(n), size=(n, B)).astype(np.float32)\n",
+    "\n",
+    "    os.makedirs(os.path.dirname(os.path.abspath(W_out)), exist_ok=True)\n",
+    "    np.save(W_out, W)\n",
+    "    print(f\"Saved: {W_out}\")\n",
+    "    print(f\"Shape: {W.shape},  size: {os.path.getsize(W_out)/1e9:.2f} GB\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "68a93ed9",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[process_block]\n",
+    "parameter: ld_block_file = str\n",
+    "parameter: chrom         = 0\n",
+    "parameter: vcf_base      = str\n",
+    "parameter: vcf_prefix    = str\n",
+    "parameter: cohort_id     = \"ADSP.R5.EUR\"\n",
+    "parameter: output_dir    = str\n",
+    "parameter: W_matrix      = str\n",
+    "parameter: B             = 10000\n",
+    "parameter: maf_min       = 0.0005\n",
+    "parameter: mac_min       = 5\n",
+    "parameter: msng_min      = 0.05\n",
+    "parameter: sample_list   = \"\"\n",
+    "\n",
+    "import os\n",
+    "\n",
+    "def _read_blocks(bed, chrom_filter):\n",
+    "    blocks = []\n",
+    "    with open(bed) as fh:\n",
+    "        for line in fh:\n",
+    "            if line.startswith(\"#\") or not line.strip():\n",
+    "                continue\n",
+    "            parts = line.split()\n",
+    "            c = parts[0]\n",
+    "            if not (c.startswith(\"chr\") and c[3:].isdigit()):\n",
+    "                continue\n",
+    "            cnum = int(c[3:])\n",
+    "            if not (1 <= cnum <= 22):\n",
+    "                continue\n",
+    "            if chrom_filter != 0 and cnum != chrom_filter:\n",
+    "                continue\n",
+    "            blocks.append({\"chr\": c, \"start\": int(parts[1]), \"end\": int(parts[2])})\n",
+    "    if not blocks:\n",
+    "        raise ValueError(f\"No blocks found for chrom={chrom_filter} in {bed}\")\n",
+    "    return blocks\n",
+    "\n",
+    "blocks = _read_blocks(ld_block_file, chrom)\n",
+    "print(f\"  {len(blocks)} LD blocks queued\")\n",
+    "\n",
+    "input: for_each = \"blocks\"\n",
+    "output: f'{output_dir}/{_blocks[\"chr\"]}/{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}/{cohort_id}.{_blocks[\"chr\"]}_{_blocks[\"start\"]}_{_blocks[\"end\"]}.dosage.gz'\n",
+    "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n",
+    "python: expand = \"${ }\"\n",
+    "\n",
+    "    import numpy as np\n",
+    "    import os\n",
+    "    import gzip\n",
+    "    import sys\n",
+    "    import atexit\n",
+    "    from math import nan\n",
+    "    from cyvcf2 import VCF\n",
+    "    from os import listdir\n",
+    "\n",
+    "    # Block coordinates from for_each loop\n",
+    "    chrm_str    = \"${_blocks['chr']}\"\n",
+    "    block_start = ${_blocks[\"start\"]}\n",
+    "    block_end   = ${_blocks[\"end\"]}\n",
+    "\n",
+    "    vcf_base    = \"${vcf_base}\"\n",
+    "    vcf_prefix  = \"${vcf_prefix}\"\n",
+    "    W_path      = \"${W_matrix}\"\n",
+    "    B           = ${B}\n",
+    "    maf_min     = ${maf_min}\n",
+    "    mac_min     = ${mac_min}\n",
+    "    msng_min    = ${msng_min}\n",
+    "    sample_list = \"${sample_list}\"\n",
+    "    cohort_id   = \"${cohort_id}\"\n",
+    "    base_dir    = \"${output_dir}\"\n",
+    "\n",
+    "    block_tag   = f\"{chrm_str}_{block_start}_{block_end}\"\n",
+    "    output_dir  = os.path.join(base_dir, chrm_str, block_tag)\n",
+    "    os.makedirs(output_dir, exist_ok=True)\n",
+    "\n",
+    "    log_path = os.path.join(output_dir, f\"{block_tag}.log\")\n",
+    "    log_fh   = open(log_path, \"w\")\n",
+    "    sys.stdout = log_fh\n",
+    "    sys.stderr = log_fh\n",
+    "    atexit.register(log_fh.close)\n",
+    "\n",
+    "    # -- Load sample subset (optional) -----------------------------\n",
+    "    sample_subset = None\n",
+    "    if sample_list:\n",
+    "        if not os.path.exists(sample_list):\n",
+    "            raise FileNotFoundError(f\"sample_list not found: {sample_list}\")\n",
+    "        with open(sample_list) as fh:\n",
+    "            sample_subset = set(line.strip() for line in fh if line.strip())\n",
+    "        print(f\"  Sample subset: {len(sample_subset):,} samples\")\n",
+    "\n",
+    "    # -- Helpers ---------------------------------------------------\n",
+    "    def get_vcf_files(chrm_str):\n",
+    "        files = sorted([\n",
+    "            os.path.join(vcf_base, x)\n",
+    "            for x in listdir(vcf_base)\n",
+    "            if x.endswith(\".bgz\") and (\n",
+    "                x.startswith(vcf_prefix + chrm_str + \":\") or\n",
+    "                x.startswith(vcf_prefix + chrm_str + \".\")\n",
+    "            )\n",
+    "        ])\n",
+    "        if not files:\n",
+    "            raise FileNotFoundError(f\"No VCF files for {chrm_str} in {vcf_base}\")\n",
+    "        return files\n",
+    "\n",
+    "    def open_vcf(vf, sample_subset):\n",
+    "        \"\"\"Open a VCF file, applying sample subset if provided.\"\"\"\n",
+    "        vcf = VCF(vf)\n",
+    "        if sample_subset is not None:\n",
+    "            vcf_samples = vcf.samples\n",
+    "            keep = [s for s in vcf_samples if s in sample_subset]\n",
+    "            if not keep:\n",
+    "                raise ValueError(f\"No sample_list samples in {os.path.basename(vf)}\")\n",
+    "            vcf.set_samples(keep)\n",
+    "        return vcf\n",
+    "\n",
+    "    def extract_dosage(var):\n",
+    "        \"\"\"Extract diploid dosage from a cyvcf2 variant. Returns list of floats (nan for missing).\"\"\"\n",
+    "        return [sum(x[0:2]) for x in [[nan if v == -1 else v for v in gt] for gt in var.genotypes]]\n",
+    "\n",
+    "    def fill_missing_col_means(G):\n",
+    "        col_means = np.nanmean(G, axis=0)\n",
+    "        return np.where(np.isnan(G), col_means, G)\n",
+    "\n",
+    "    # -- Single-pass: scan variants, filter, and collect dosages ---\n",
+    "    # BED is 0-based half-open [start, end); VCF is 1-based.\n",
+    "    print(f\"[1/3] Scanning {chrm_str} [{block_start:,}, {block_end:,}) ...\")\n",
+    "    vcf_files = get_vcf_files(chrm_str)\n",
+    "    region    = f\"{chrm_str}:{block_start+1}-{block_end}\"\n",
+    "    var_info  = []\n",
+    "    dosage_matrix = []\n",
+    "    n_samples = None\n",
+    "    # Filter counters\n",
+    "    n_total = 0\n",
+    "    n_multiallelic = 0\n",
+    "    n_monomorphic = 0\n",
+    "    n_all_na = 0\n",
+    "    n_low_maf = 0\n",
+    "    n_low_mac = 0\n",
+    "    n_high_msng = 0\n",
+    "\n",
+    "    for vf in vcf_files:\n",
+    "        vcf = open_vcf(vf, sample_subset)\n",
+    "        if n_samples is None:\n",
+    "            n_samples = len(vcf.samples)\n",
+    "        for var in vcf(region):\n",
+    "            if not (block_start <= var.POS - 1 < block_end):\n",
+    "                continue\n",
+    "            n_total += 1\n",
+    "            if len(var.ALT) != 1:\n",
+    "                n_multiallelic += 1\n",
+    "                continue\n",
+    "            dosage = extract_dosage(var)\n",
+    "            if np.nanvar(dosage) == 0:\n",
+    "                n_monomorphic += 1\n",
+    "                continue\n",
+    "            nan_count = int(np.sum(np.isnan(dosage)))\n",
+    "            n_non_na  = len(dosage) - nan_count\n",
+    "            if n_non_na == 0:\n",
+    "                n_all_na += 1\n",
+    "                continue\n",
+    "            alt_sum   = float(np.nansum(dosage))\n",
+    "            mac       = min(2 * n_non_na - alt_sum, alt_sum)\n",
+    "            maf       = mac / (2 * n_non_na)\n",
+    "            af        = alt_sum / (2 * n_non_na)\n",
+    "            msng_rate = nan_count / len(dosage)\n",
+    "            if msng_rate > msng_min:\n",
+    "                n_high_msng += 1\n",
+    "                continue\n",
+    "            if maf < maf_min:\n",
+    "                n_low_maf += 1\n",
+    "                continue\n",
+    "            if mac < mac_min:\n",
+    "                n_low_mac += 1\n",
+    "                continue\n",
+    "            var_info.append({\n",
+    "                \"chr\": var.CHROM, \"pos\": var.POS,\n",
+    "                \"ref\": var.REF,   \"alt\": var.ALT[0],\n",
+    "                \"af\":  round(float(af), 6),\n",
+    "                \"id\":  f\"{var.CHROM}:{var.POS}:{var.REF}:{var.ALT[0]}\",\n",
+    "                \"obs_ct\": 2 * n_non_na,\n",
+    "            })\n",
+    "            dosage_matrix.append(dosage)\n",
+    "        vcf.close()\n",
+    "\n",
+    "    n_passed = len(var_info)\n",
+    "    print(f\"  {n_total:,} total variants in region\")\n",
+    "    print(f\"  {n_passed:,} passed filters (n={n_samples:,})\")\n",
+    "    print(f\"  Filtered: {n_multiallelic:,} multiallelic, \"\n",
+    "          f\"{n_monomorphic:,} monomorphic, {n_all_na:,} all-NA, \"\n",
+    "          f\"{n_high_msng:,} high-missingness, \"\n",
+    "          f\"{n_low_maf:,} low-MAF, {n_low_mac:,} low-MAC\")\n",
+    "\n",
+    "    if not var_info:\n",
+    "        raise ValueError(f\"No passing variants in {chrm_str} [{block_start:,}, {block_end:,})\")\n",
+    "\n",
+    "    # -- Load W ----------------------------------------------------\n",
+    "    print(f\"[2/3] Loading W ...\")\n",
+    "    W = np.load(W_path)\n",
+    "    if W.shape != (n_samples, B):\n",
+    "        raise ValueError(f\"W shape mismatch: {W.shape} vs ({n_samples},{B})\")\n",
+    "    W = W.astype(np.float32)\n",
+    "    print(f\"  W: {W.shape}\")\n",
+    "\n",
+    "    # -- Compute U = $W^T G$ and write output files --------------------\n",
+    "    print(f\"[3/3] Computing U and writing output files ...\")\n",
+    "\n",
+    "    dosage_path = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.dosage.gz\")\n",
+    "    map_path    = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.map\")\n",
+    "    afreq_path  = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.afreq\")\n",
+    "    meta_path   = os.path.join(output_dir, f\"{cohort_id}.{block_tag}.meta\")\n",
+    "\n",
+    "    # Write .map\n",
+    "    with open(map_path, \"w\") as fh:\n",
+    "        for v in var_info:\n",
+    "            fh.write(f\"{v['chr']}\\t{v['id']}\\t0\\t{v['pos']}\\n\")\n",
+    "\n",
+    "    # Write .meta\n",
+    "    with open(meta_path, \"w\") as fh:\n",
+    "        fh.write(f\"source_n_samples={n_samples}\\nB={B}\\n\")\n",
+    "        fh.write(f\"chrom={chrm_str}\\nblock_start={block_start}\\nblock_end={block_end}\\n\")\n",
+    "        fh.write(f\"n_total={n_total}\\nn_passed={n_passed}\\n\")\n",
+    "        fh.write(f\"n_multiallelic={n_multiallelic}\\nn_monomorphic={n_monomorphic}\\n\")\n",
+    "        fh.write(f\"n_all_na={n_all_na}\\nn_high_msng={n_high_msng}\\n\")\n",
+    "        fh.write(f\"n_low_maf={n_low_maf}\\nn_low_mac={n_low_mac}\\n\")\n",
+    "\n",
+    "\n",
+    "    # Build G from collected dosages, compute U = $W^T G$, write dosage.gz\n",
+    "    # Dosage format=1: ID ALT REF val_S1 ... val_SB\n",
+    "    # Min-max scaling to [0, 2] makes the output plink2-compatible as dosage.\n",
+    "    # Pearson correlation is invariant to a per-variant shift and positive\n",
+    "    # rescaling, so the LD structure used downstream is preserved.\n",
+    "    G = np.array(dosage_matrix, dtype=np.float32).T  # (n_samples, n_variants)\n",
+    "    del dosage_matrix\n",
+    "    G = fill_missing_col_means(G)\n",
+    "\n",
+    "    # variant-wise scaling\n",
+    "    col_mean = G.mean(axis=0, keepdims=True)\n",
+    "    col_std  = G.std(axis=0, keepdims=True)\n",
+    "    # avoid division by zero\n",
+    "    col_std[col_std == 0] = 1.0\n",
+    "    G = (G - col_mean) / col_std\n",
+    "\n",
+    "    U = W.T @ G  # (B, n_variants)\n",
+    "    del G\n",
+    "\n",
+    "    col_min = U.min(axis=0)\n",
+    "    col_max = U.max(axis=0)\n",
+    "    denom   = col_max - col_min\n",
+    "    denom[denom == 0] = 1.0\n",
+    "    U = 2.0 * (U - col_min) / denom\n",
+    "    U = np.round(U, 4)\n",
+    "\n",
+    "    # Record the col min and max for U\n",
+    "\n",
+    "    with open(afreq_path, \"w\") as fh:\n",
+    "        # Add column headers\n",
+    "        fh.write(\"#CHROM\\tID\\tREF\\tALT\\tALT_FREQS\\tOBS_CT\\tU_MIN\\tU_MAX\\n\")\n",
+    "        for j, v in enumerate(var_info):\n",
+    "            fh.write(f\"{v['chr']}\\t{v['id']}\\t{v['ref']}\\t{v['alt']}\\t\"\n",
+    "                     f\"{v['af']:.6f}\\t{v['obs_ct']}\\t\"\n",
+    "                     f\"{col_min[j]:.6f}\\t{col_max[j]:.6f}\\n\")\n",
+    "\n",
+    "    with gzip.open(dosage_path, \"wt\", compresslevel=4) as gz:\n",
+    "        for j, v in enumerate(var_info):\n",
+    "            vals = \" \".join(f\"{x:.4f}\" for x in U[:, j])\n",
+    "            gz.write(f\"{v['id']} {v['alt']} {v['ref']} {vals}\\n\")\n",
+    "\n",
+    "    del U\n",
+    "    print(f\"  Written: {len(var_info):,} variants -> {os.path.basename(dosage_path)}\")\n",
+    "    print(f\"  Written: {os.path.basename(map_path)}, {os.path.basename(afreq_path)}\")\n",
+    "    print(f\"\\nDone: {chrm_str} [{block_start:,}, {block_end:,})\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e8fff43",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[merge_chrom]\n",
+    "parameter: chrom      = 0\n",
+    "parameter: output_dir = str\n",
+    "parameter: cohort_id  = str\n",
+    "parameter: plink2_bin = \"plink2\"\n",
+    "\n",
+    "import os, glob\n",
+    "\n",
+    "def chroms_to_process(output_dir, chrom_filter):\n",
+    "    if chrom_filter != 0:\n",
+    "        return [f\"chr{chrom_filter}\"]\n",
+    "    return sorted(set(\n",
+    "        os.path.basename(d)\n",
+    "        for d in glob.glob(os.path.join(output_dir, \"chr*\"))\n",
+    "        if os.path.isdir(d)\n",
+    "    ))\n",
+    "\n",
+    "chroms = chroms_to_process(output_dir, chrom)\n",
+    "\n",
+    "input: for_each = \"chroms\"\n",
+    "output: f\"{output_dir}/{_chroms}/{cohort_id}.{_chroms}.pgen\"\n",
+    "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads\n",
+    "bash: expand = \"$[ ]\"\n",
+    "\n",
+    "    set -euo pipefail\n",
+    "    shopt -s nullglob\n",
+    "\n",
+    "    chrom_dir=\"$[output_dir]/$[_chroms]\"\n",
+    "    final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n",
+    "    merge_list=\"${chrom_dir}/$[cohort_id].$[_chroms]_pmerge_list.txt\"\n",
+    "\n",
+    "    # Step 1: Convert each block dosage.gz -> sorted per-block pgen\n",
+    "    > \"${merge_list}\"\n",
+    "    files=(\"${chrom_dir}\"/*/*.dosage.gz)\n",
+    "    if [ ${#files[@]} -eq 0 ]; then\n",
+    "        echo \"No dosage files found in ${chrom_dir}\" >&2\n",
+    "        exit 1\n",
+    "    fi\n",
+    "    for dosage_gz in \"${files[@]}\"; do\n",
+    "        block_dir=$(dirname \"${dosage_gz}\")\n",
+    "        block_tag=$(basename \"${block_dir}\")\n",
+    "        prefix=\"${block_dir}/$[cohort_id].${block_tag}_tmp\"\n",
+    "        map_file=\"${block_dir}/$[cohort_id].${block_tag}.map\"\n",
+    "        psam_file=\"${block_dir}/$[cohort_id].${block_tag}.psam\"\n",
+    "        meta_file=\"${block_dir}/$[cohort_id].${block_tag}.meta\"\n",
+    "\n",
+    "        B=$(grep \"^B=\" \"${meta_file}\" | cut -d= -f2)\n",
+    "        printf '#FID\\tIID\\n' > \"${psam_file}\"\n",
+    "        for i in $(seq 1 ${B}); do\n",
+    "            printf 'S%d\\tS%d\\n' ${i} ${i} >> \"${psam_file}\"\n",
+    "        done\n",
+    "\n",
+    "        $[plink2_bin] \\\n",
+    "            --import-dosage \"${dosage_gz}\" format=1 noheader \\\n",
+    "            --psam \"${psam_file}\" \\\n",
+    "            --map  \"${map_file}\" \\\n",
+    "            --make-pgen \\\n",
+    "            --out  \"${prefix}_unsorted\" \\\n",
+    "            --silent\n",
+    "\n",
+    "        $[plink2_bin] \\\n",
+    "            --pfile \"${prefix}_unsorted\" \\\n",
+    "            --make-pgen \\\n",
+    "            --sort-vars \\\n",
+    "            --out  \"${prefix}\" \\\n",
+    "            --silent\n",
+    "\n",
+    "        rm -f \"${prefix}_unsorted.pgen\" \"${prefix}_unsorted.pvar\" \"${prefix}_unsorted.psam\"\n",
+    "        echo \"${prefix}\" >> \"${merge_list}\"\n",
+    "    done\n",
+    "\n",
+    "    # Step 2: Merge all per-block pgens -> one per-chrom pgen\n",
+    "    $[plink2_bin] \\\n",
+    "        --pmerge-list \"${merge_list}\" pfile \\\n",
+    "        --make-pgen \\\n",
+    "        --sort-vars \\\n",
+    "        --out  \"${final_prefix}\"\n",
+    "\n",
+    "    # Remove PLINK merge intermediates immediately after merge\n",
+    "    rm -f \"${final_prefix}-merge.pgen\" \"${final_prefix}-merge.pvar\" \"${final_prefix}-merge.psam\"\n",
+    "\n",
+    "    # Step 3: Concatenate .afreq\n",
+    "    first=1\n",
+    "    for f in \"${chrom_dir}\"/*/*.afreq; do\n",
+    "        if [ \"${first}\" -eq 1 ]; then\n",
+    "            cat \"${f}\" > \"${final_prefix}.afreq\"\n",
+    "            first=0\n",
+    "        else\n",
+    "            tail -n +2 \"${f}\" >> \"${final_prefix}.afreq\"\n",
+    "        fi\n",
+    "    done\n",
+    "\n",
+    "R: expand = \"$[ ]\"\n",
+    "    # Step 4: Summarize per-block filter counts from the .meta files\n",
+    "    library(data.table)\n",
+    "    meta_files <- list.files(\"$[output_dir]/$[_chroms]\",\n",
+    "                             pattern = \"[.]meta$\", recursive = TRUE,\n",
+    "                             full.names = TRUE)\n",
+    "    if (length(meta_files) > 0) {\n",
+    "      fields <- c(\"n_total\", \"n_passed\", \"n_multiallelic\", \"n_monomorphic\",\n",
+    "                  \"n_all_na\", \"n_high_msng\", \"n_low_maf\", \"n_low_mac\")\n",
+    "      stats <- rbindlist(lapply(meta_files, function(f) {\n",
+    "        lines <- grep(\"^n_\", readLines(f), value = TRUE)\n",
+    "        kv <- strsplit(lines, \"=\")\n",
+    "        vals <- setNames(as.integer(sapply(kv, `[`, 2)), sapply(kv, `[`, 1))\n",
+    "        as.data.table(setNames(as.list(vals[fields]), fields))  # keep column names when a field is absent\n",
+    "      }))\n",
+    "      totals <- colSums(stats, na.rm = TRUE)\n",
+    "      summary <- data.frame(t(totals))\n",
+    "      summary$pct_dropped <- round(100 * (1 - summary$n_passed / summary$n_total), 1)\n",
+    "      cat(\"\\n=== Filter Summary for $[_chroms] ===\\n\")\n",
+    "      print(data.frame(value = unlist(summary), row.names = names(summary)))\n",
+    "    }\n",
+    "\n",
+    "bash: expand = \"$[ ]\"\n",
+    "\n",
+    "    # Step 5: Cleanup block intermediates\n",
+    "    chrom_dir=\"$[output_dir]/$[_chroms]\"\n",
+    "    final_prefix=\"${chrom_dir}/$[cohort_id].$[_chroms]\"\n",
+    "\n",
+    "    rm -f \"${final_prefix}_pmerge_list.txt\"\n",
+    "    for block_dir in \"${chrom_dir}\"/*/; do\n",
+    "        rm -rf \"${block_dir}\"\n",
+    "    done"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc998dcc",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Troubleshooting\n",
+    "\n",
+    "| Symptom | Cause | Fix |\n",
+    "|---|---|---|\n",
+    "| `No VCF files for chrXX in {vcf_base}` | VCF naming or extension mismatch | Files must end in `.bgz` and be named `{vcf_prefix}{chr}.*.bgz`; check `--vcf-base` and `--vcf-prefix`. |\n",
+    "| `W shape mismatch` | `--n-samples` or `--B` differs from the values used to generate W | Re-run `generate_W` with the same `--n-samples` and `--B` that `process_block` receives, and pass the resulting `W_B{B}.npy` to `process_block`. |\n",
+    "| `No passing variants in chrXX` | Filters removed every variant (common with a small toy cohort) | Relax `--maf-min` / `--mac-min` / `--msng-min`, or choose blocks with more variants. |\n",
+    "| `No blocks found for chrom=XX` | `--chrom` does not match any BED rows | Ensure the BED `chr` column uses the prefixed form (e.g. `chr22`) and that `--chrom` is the matching bare number (e.g. `22`). |\n",
+    "| Region query returns nothing | Missing tabix index | Run `tabix -p vcf file.bgz` to create the `.tbi`. |"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ac3cfb79",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Output\n",
+    "\n",
+    "Per chromosome, under `--cwd`:\n",
+    "- `{cohort_id}.chr{N}.pgen` — binary genotype-sketch data (B pseudo-samples × p variants)\n",
+    "- `{cohort_id}.chr{N}.pvar` — variant information\n",
+    "- `{cohort_id}.chr{N}.psam` — sample (sketch) information\n",
+    "- `{cohort_id}.chr{N}.afreq` — allele frequencies\n",
+    "\n",
+    "These files serve as the LD reference panel for SuSiE-RSS fine-mapping. Load them through a metadata TSV with one row per chromosome and the columns `#chrom start end path`, where `path` is the pgen prefix. Use the X (genotype) interface for `susie_rss(z, X=X)` or the R (correlation) interface for `susie_rss(z, R=R)`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff927c9c",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Anticipated Results\n",
+    "\n",
+    "For each chromosome, a successful run leaves the four files listed in the **Output** section above (`.pgen`, `.pvar`, `.psam`, `.afreq`) under `--cwd`; the per-block intermediate directories are removed by the final cleanup step. Verify success by checking that these files exist and are non-empty."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "SoS",
+   "language": "sos",
+   "name": "sos"
+  },
+  "language_info": {
+   "codemirror_mode": "sos",
+   "file_extension": ".sos",
+   "mimetype": "text/x-sos",
+   "name": "sos",
+   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
+   "pygments_lexer": "sos"
+  },
+  "sos": {
+   "kernels": [
+    [
+     "Bash",
+     "calysto_bash",
+     "Bash",
+     "#E6EEFF",
+     ""
+    ],
+    [
+     "SoS",
+     "sos",
+     "sos",
+     "",
+     "sos"
+    ]
+   ],
+   "version": ""
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From d1ed9bffbdaaabbbd1427db5902fa06b00a3e222 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 23 Jun 2026 12:00:55 -0400
Subject: [PATCH 3/6] Delete code/SoS/enrichment/sldsc_enrichment.ipynb

---
 code/SoS/enrichment/sldsc_enrichment.ipynb | 1383 --------------------
 1 file changed, 1383 deletions(-)
 delete mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb

diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb
deleted file mode 100644
index e022ec8a..00000000
--- a/code/SoS/enrichment/sldsc_enrichment.ipynb
+++ /dev/null
@@ -1,1383 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "# Stratified LD Score Regression (S-LDSC) Enrichment\n",
-    "\n",
-    "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n",
-    "\n",
-    "> **Environment note.** Steps 1\u20132 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3\u20134 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Description\n",
-    "This notebook implements the pipeline of [S-LDSC](https://www.nature.com/articles/ng.3404) for LD score and functional enrichment analysis.\n",
-    "\n",
-    "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not the original LDSC from `bulik/ldsc` GitHub repo.**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "Markdown"
-   },
-   "source": [
-    "Uses GWAS summary statistics together with annotation and LD reference-panel data to compute per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.\n",
-    "\n",
-    "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Methods - Workflow Overview\n",
-    "\n",
-    "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n",
-    "\n",
-    "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n",
-    "\n",
-    "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n",
-    "\n",
-    "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n",
-    "\n",
-    "```r\n",
-    "res <- readRDS(\"sldsc_results.rds\")\n",
-    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
-    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
-    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
-    ")\n",
-    "```\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Theory\n",
-    "\n",
-    "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### LDSC model\n",
-    "\n",
-    "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n",
-    "\n",
-    "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n",
-    "\n",
-    "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.\n",
-    "\n",
-    "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Stratified LDSC\n",
-    "\n",
-    "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n",
-    "\n",
-    "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n",
-    "\n",
-    "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n",
-    "\n",
-    "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n",
-    "\n",
-    "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n",
-    "- **Continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n",
-    "\n",
-    "Under a polygenic model the per-SNP heritability for SNP $j$ is\n",
-    "\n",
-    "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n",
-    "\n",
-    "and the expected $\\chi^2$ statistic of SNP $j$ is\n",
-    "\n",
-    "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n",
-    "\n",
-    "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n",
-    "\n",
-    "Interpretation of $\\tau_C$:\n",
-    "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n",
-    "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n",
-    "\n",
-    "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Tau Estimation and Enrichment Analysis\n",
-    "\n",
-    "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n",
-    "\n",
-    "The pipeline has two computational layers:\n",
-    "\n",
-    "- **Regression layer** \u2014 the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. We do not re-implement this.\n",
-    "- **Post-processing layer** \u2014 standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n",
-    "\n",
-    "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n",
-    "\n",
-    "#### Notation\n",
-    "\n",
-    "For each annotation $C$ we use:\n",
-    "\n",
-    "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n",
-    "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n",
-    "\n",
-    "#### Reference panel and MAF cutoff\n",
-    "\n",
-    "All LD-derived quantities \u2014 partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set \u2014 are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n",
-    "\n",
-    "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n",
-    "\n",
-    "#### Quantities from the regression layer (polyfun)\n",
-    "\n",
-    "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. From each polyfun run we obtain, per annotation:\n",
-    "\n",
-    "- $\\tau_C$ and its standard error \u2014 **(polyfun)**.\n",
-    "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ \u2014 **(polyfun)**.\n",
-    "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error \u2014 **(polyfun)**.\n",
-    "- The p-value of the differential per-SNP heritability test (defined below) \u2014 **(polyfun)**, computed internally with the full coefficient covariance matrix.\n",
-    "\n",
-    "We also obtain, per run:\n",
-    "\n",
-    "- The total trait heritability $h^2_g$ \u2014 **(polyfun)**.\n",
-    "- The 200-block jackknife delete-values of $\\tau_C$ \u2014 **(polyfun)**.\n",
-    "\n",
-    "#### Quantities from the post-processing layer (pecotmr)\n",
-    "\n",
-    "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n",
-    "\n",
-    "- $sd_C$ \u2014 per-annotation standard deviation over MAF $>$ cutoff SNPs \u2014 **(pecotmr: `compute_sldsc_annot_sd`)**.\n",
-    "- $M_{\\mathrm{ref}}$ \u2014 reference SNP count at the MAF cutoff \u2014 **(pecotmr: `compute_sldsc_M_ref`)**.\n",
-    "- Whether each annotation is binary or continuous \u2014 **(pecotmr: `is_binary_sldsc_annot`)**.\n",
-    "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ \u2014 **(pecotmr: `standardize_sldsc_trait`)**.\n",
-    "- EnrichStat point estimate and its standard error (formula below) \u2014 **(pecotmr: `standardize_sldsc_trait`)**.\n",
-    "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits \u2014 **(pecotmr: `meta_sldsc_random`)**.\n",
-    "\n",
-    "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n",
-    "\n",
-    "#### Standardized tau ($\\tau^*$)  \u2014  (pecotmr)\n",
-    "\n",
-    "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)\n",
-    "\n",
-    "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n",
-    "\n",
-    "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n",
-    "\n",
-    "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n",
-    "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n",
-    "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n",
-    "\n",
-    "#### Differential per-SNP heritability (\"EnrichStat\")  \u2014  (polyfun + pecotmr)\n",
-    "\n",
-    "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n",
-    "\n",
-    "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n",
-    "\n",
-    "The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as `Enrichment_p`. Its standard error is recovered from the reported p-value:\n",
-    "\n",
-    "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n",
-    "\n",
-    "This per-trait point + SE is the input to cross-trait meta-analysis.\n",
-    "\n",
-    "#### Reporting: binary vs. continuous annotations  \u2014  (pecotmr)\n",
-    "\n",
-    "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n",
-    "\n",
-    "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n",
-    "\n",
-    "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n",
-    "\n",
-    "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n",
-    "\n",
-    "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n",
-    ">\n",
-    "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n",
-    ">\n",
-    "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n",
-    "\n",
-    "#### Cross-type comparison: always use $\\tau^*_C$  \u2014  (pecotmr)\n",
-    "\n",
-    "For an apple-to-apple comparison **across binary and continuous annotations** \u2014 ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types \u2014 use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion in the category) and $sd_C = $ empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases \u2014 additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n",
-    "\n",
-    "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n",
-    "\n",
-    "#### Joint analysis  \u2014  (polyfun runs the regression; pecotmr standardizes both modes)\n",
-    "\n",
-    "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n",
-    "\n",
-    "### Meta-Analysis across Traits (Random Effects)  \u2014  (pecotmr)\n",
-    "\n",
-    "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n",
-    "\n",
-    "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n",
-    "\n",
-    "where $\\hat\\theta_i$ is the per-trait estimate and $SE_i$ its standard error:\n",
-    "\n",
-    "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n",
-    "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n",
-    "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n",
-    "\n",
-    "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n",
-    "\n",
-    "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Minimal Working Example (MWE)\n",
-    "\n",
-    "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 1. `make_annotation_files_ldscore`\n",
-    "\n",
-    "*Annotation preparation and S-LDSC regression (polyfun).* This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "#### **Inputs**\n",
-    "\n",
-    "##### 1. Target Annotation File\n",
-    "\n",
-    "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n",
-    "- **Formats**:\n",
-    "    - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n",
-    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
-    "    - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). Single-target and joint-target analyses are produced automatically in one pipeline pass.\n",
-    "    - **Format** (the score column is optional; if absent, score is set to 1):\n",
-    "        - `is_range = False`:\n",
-    "        ```\n",
-    "        chr   pos   score\n",
-    "        1    10001   1\n",
-    "        1    10002   1\n",
-    "        ```\n",
-    "        - `is_range = True`:\n",
-    "        ```\n",
-    "        chr   start   end   score\n",
-    "        1    10001  20001  1\n",
-    "        1    30001  40001  1\n",
-    "        ```\n",
-    "\n",
-    "##### 2. Reference Annotation File (baseline-LD)\n",
-    "\n",
-    "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n",
-    "- **Formats**:\n",
-    "    - Text file listing baseline annotation files for all chromosomes.\n",
-    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
-    "\n",
-    "##### 3. Genome Reference File\n",
-    "\n",
-    "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n",
-    "- **Formats**:\n",
-    "    - Text file listing per-chromosome reference files, or files for specific chromosomes.\n",
-    "\n",
-    "##### 4. SNP List\n",
-    "\n",
-    "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n",
-    "- **Format**: A list of `rsid`s, one per line.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n",
-    "  --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n",
-    "  --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n",
-    "  --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n",
-    "  --annotation_name protocol_example \\\n",
-    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
-    "  --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --cwd output/sldsc_ldscore -j 4\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Munge summary statistics (preprocessing, run before Step 2)\n",
-    "\n",
-    "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n",
-    "#     --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n",
-    "#     --n 0 \\\n",
-    "#     --min-info 0.6 \\\n",
-    "#     --min-maf 0.001 \\\n",
-    "#     --chi2-cutoff 30 \\\n",
-    "#     --polyfun_path data/github/polyfun \\\n",
-    "#     --cwd data/polyfun_new/example_data"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 2. `get_heritability`\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "**Inputs**\n",
-    "\n",
-    "##### 1. Allele Frequency Files (`.frq`, our panel)\n",
-    "\n",
-    "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n",
-    "\n",
-    "##### 2. GWAS Summary Statistics\n",
-    "\n",
-    "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n",
-    "- **Format**:\n",
-    "    ```\n",
-    "    CAD_META.filtered.sumstats.gz\n",
-    "    UKB.Lym.BOLT.sumstats.gz\n",
-    "    ```\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n",
-    "  --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n",
-    "  --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
-    "  --sumstat_dir input/enrichment/sldsc \\\n",
-    "  --baseline_ld_dir input/enrichment/sldsc \\\n",
-    "  --weights_dir input/enrichment/sldsc \\\n",
-    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
-    "  --annotation_name protocol_example --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 3. Post-processing (`pecotmr`) and meta-analysis\n",
-    "\n",
-    "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n",
-    "\n",
-    "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n",
-    "\n",
-    "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n",
-    "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n",
-    "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n",
-    "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n",
-    "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n",
-    "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n",
-    "\n",
-    "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n",
-    "\n",
-    "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n",
-    "(which contains the $N$ single-target subdirectories and optionally the\n",
-    "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n",
-    "to produce per-trait standardized tables and the default random-effects\n",
-    "meta across all traits.\n",
-    "\n",
-    "Use `--target_categories_label` (same order as `--target_categories`) to give the target annotations friendly names in the output. For example, `--target_categories ANNOT_1_0 ANNOT_2_0 --target_categories_label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n",
-    "  --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
-    "  --heritability_cwd output/sldsc_heritability \\\n",
-    "  --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n",
-    "  --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n",
-    "  --annotation_name protocol_example --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 4. Subset meta-analysis (`pecotmr::meta_sldsc_random`) (optional)\n",
-    "\n",
-    "The default meta-analysis in Step 3 pools all traits the user supplied. To re-run it on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:\n",
-    "\n",
-    "```r\n",
-    "res <- readRDS(\"sldsc_results.rds\")\n",
-    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
-    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
-    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
-    ")\n",
-    "```\n",
-    "\n",
-    "This R call is lightweight and can be run interactively.\n",
-    "\n",
-    "Alternatively, use `[meta_subset]` to re-run the meta-analysis on a user-defined trait subset from the command line. It operates on the cached `.sldsc_postprocess.rds` output, so neither the regression nor the per-trait standardization is re-run, and it can be run interactively or in batch.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n",
-    "  --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n",
-    "  --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n",
-    "  --subset_name category1 --target_categories ANNOT_0 \\\n",
-    "  --annotation_name protocol_example --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Output\n",
-    "\n",
-    "### Output summary (cached artifacts)\n",
-    "\n",
-    "| Stage | Cached on disk | Recomputable from | Purpose |\n",
-    "|---|---|---|---|\n",
-    "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n",
-    "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n",
-    "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n",
-    "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n",
-    "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Per-stage outputs\n",
-    "\n",
-    "Each workflow writes into its `--cwd`:\n",
-    "\n",
-    "- **make_annotation_files_ldscore** \u2014 polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n",
-    "- **get_heritability** \u2014 per trait and per target directory, the S-LDSC regression outputs `<trait>.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_<ref-ld-index>` suffix.\n",
-    "- **postprocess** \u2014 a single `<annotation_name>.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian\u2013Laird random-effects meta tables (tau*, E, EnrichStat).\n",
-    "- **meta_subset** \u2014 a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Anticipated Results\n",
-    "\n",
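-    "The pipeline produces per-annotation enrichment statistics from stratified LD score regression: per-trait tables of $\\tau$, standardized $\\tau^*$, enrichment $E$, and EnrichStat p-values, plus DerSimonian-Laird random-effects meta-analysis tables across traits."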
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Command interface\n",
-    "\n",
-    "List all workflows and their options:\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb -h"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Workflow implementation\n",
-    "\n",
-    "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[global]\n",
-    "# Path to the work directory of the analysis.\n",
-    "parameter: cwd = path('output')\n",
-    "# Prefix for the analysis output\n",
-    "parameter: annotation_name = str\n",
-    "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n",
-    "parameter: polyfun_path   = path # e.g. \"/home/you/tools/polyfun\"\n",
-    "\n",
-    "# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n",
-    "# and HapMap3-style regression weights are common-variant by construction).\n",
-    "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n",
-    "# Other values would require recomputing LD scores at that cutoff.\n",
-    "parameter: maf_cutoff = 0.05\n",
-    "\n",
-    "# for make_annotation_files_ldscore workflow:\n",
-    "parameter: annotation_file = path()\n",
-    "parameter: reference_anno_file = path()\n",
-    "parameter: genome_ref_file = path() # with .bed\n",
-    "parameter: chromosome = []\n",
-    "parameter: snp_list = path()\n",
-    "parameter: ld_wind_kb = 0 # use kb if the value is provided\n",
-    "parameter: ld_wind_cm = 1.0 # default using ld_wind_cm\n",
-    "\n",
-    "# for get_heritability workflow.\n",
-    "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n",
-    "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n",
-    "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n",
-    "parameter: all_traits_file = path() # text file listing one munged GWAS summary-statistics file name per row, e.g. CAD_META.filtered.sumstats.gz\n",
-    "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n",
-    "parameter: target_anno_dir = path()  # Directory containing target annotation files: output of ldscore\n",
-    "parameter: baseline_ld_dir = path()  # Directory containing baseline LD score files (computed against our panel)\n",
-    "parameter: frqfile_dir = path()  # Directory containing allele frequency files (.frq, our panel)\n",
-    "parameter: plink_name = \"ADSP_chr\"\n",
-    "parameter: weights_dir = path()  # Directory containing LD weights (computed against our panel)\n",
-    "parameter: baseline_name = \"baseline_chr\"  # Prefix of baseline annotation files\n",
-    "parameter: weight_name = \"weights_chr\"  # Prefix of LD weights files\n",
-    "parameter: n_blocks = 200\n",
-    "\n",
-    "# Number of threads\n",
-    "parameter: numThreads = 16\n",
-    "# For cluster jobs, number of commands to run per job\n",
-    "parameter: job_size = 1\n",
-    "parameter: walltime = '12h'\n",
-    "parameter: mem = '16G'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "Python 3 (ipykernel)"
-   },
-   "source": [
-    "## Make Annotation File"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[make_annotation_files_ldscore]\n",
-    "# Annotation preparation. Takes one annotation_file with N target annotations\n",
-    "# and produces, in one invocation, any combination of:\n",
-    "#   - N single-target LD-score directories (when compute_single = TRUE, default)\n",
-    "#   - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n",
-    "#     and N >= 2, default)\n",
-    "#\n",
-    "# Outputs per chromosome <chr>:\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz   (i in 1..N, when compute_single)\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50  (when .frq present)\n",
-    "#\n",
-    "#   <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...}                (when compute_joint and N>=2)\n",
-    "#\n",
-    "# Workflows:\n",
-    "#   - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n",
-    "#     Produces both, fits the case where you have already chosen the joint set.\n",
-    "#   - Workflow B (\"exploratory then conditional\"):\n",
-    "#       Step 1: compute_single=TRUE, compute_joint=FALSE.\n",
-    "#               Run on N candidate annotations -> N single-target dirs.\n",
-    "#               Inspect single-target results, identify K significant ones.\n",
-    "#       Step 2: compute_single=FALSE, compute_joint=TRUE.\n",
-    "#               Run on a NEW annotation_file with the K selected annotations\n",
-    "#               -> 1 joint dir with the conditional model.\n",
-    "\n",
-    "#\n",
-    "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n",
-    "#     column name, and the CM requirement ---\n",
-    "#   --snp_list given  -> ldsc.py --l2 --print-snps   -> output .l2.ldscore.gz\n",
-    "#   --snp_list absent -> compute_ldscores.py         -> output .l2.ldscore.parquet\n",
-    "#\n",
-    "#   LD-score column name (this is what becomes the .results \"Category\" in\n",
-    "#   [get_heritability], with a \"_<ref-ld-index>\" suffix appended there):\n",
-    "#     * compute_ldscores.py  ALWAYS keeps the annot column name(s):\n",
-    "#         single annot column \"ANNOT\"          -> ldscore column \"ANNOT\"\n",
-    "#         joint  annot columns \"ANNOT_1\",\"ANNOT_2\",...  -> \"ANNOT_1\",\"ANNOT_2\",...\n",
-    "#     * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n",
-    "#       HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n",
-    "#       column name. With >=2 annotations it uses \"<annot_name>L2\"\n",
-    "#       (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n",
-    "#     => a single-target snplist run reports \"L2_0\" in .results, while a\n",
-    "#        single-target no-snplist run reports \"ANNOT_0\".  [postprocess] auto-\n",
-    "#        detects either; only matters if you pass --target-categories explicitly.\n",
-    "#\n",
-    "#   CM column requirement for snplist:  ldsc.py --l2 --print-snps requires the\n",
-    "#   target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n",
-    "#   the plink .bim (same SNP set, same row order). This step handles both\n",
-    "#   internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands\n",
-    "#   the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n",
-    "#   carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n",
-    "#   if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n",
-    "#\n",
-    "parameter: compute_single = True\n",
-    "parameter: compute_joint = True\n",
-    "parameter: score_column = 3\n",
-    "parameter: is_range = False\n",
-    "\n",
-    "import pandas as pd\n",
-    "import os\n",
-    "\n",
-    "if not (compute_single or compute_joint):\n",
-    "    raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE\")\n",
-    "\n",
-    "def adapt_file_path(file_path, reference_file):\n",
-    "    reference_path = os.path.dirname(reference_file)\n",
-    "    if os.path.isfile(file_path):\n",
-    "        return file_path\n",
-    "    file_name = os.path.basename(file_path)\n",
-    "    if os.path.isfile(file_name):\n",
-    "        return file_name\n",
-    "    file_in_ref_dir = os.path.join(reference_path, file_name)\n",
-    "    if os.path.isfile(file_in_ref_dir):\n",
-    "        return file_in_ref_dir\n",
-    "    file_prefixed = os.path.join(reference_path, file_path)\n",
-    "    if os.path.isfile(file_prefixed):\n",
-    "        return file_prefixed\n",
-    "    raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n",
-    "\n",
-    "\n",
-    "# ---- Parse inputs and determine N ----\n",
-    "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n",
-    "    str(reference_anno_file).endswith('annot.gz')):\n",
-    "    # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n",
-    "    target_files_direct = str(annotation_file).split(',')\n",
-    "    N_targets = len(target_files_direct)\n",
-    "    target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n",
-    "    input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n",
-    "    if len(chromosome) > 0:\n",
-    "        input_chroms = [int(x) for x in chromosome]\n",
-    "    else:\n",
-    "        input_chroms = [0]\n",
-    "else:\n",
-    "    # Case 2: txt list with #id and one or more 'path' columns\n",
-    "    target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n",
-    "    reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n",
-    "    genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n",
-    "\n",
-    "    target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n",
-    "    reference_files[\"#id\"]  = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n",
-    "    genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n",
-    "\n",
-    "    path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n",
-    "    N_targets = len(path_columns)\n",
-    "    target_names = path_columns[:]   # 'path', 'path1', 'path2', ...\n",
-    "\n",
-    "    for col in path_columns:\n",
-    "        target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n",
-    "    reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n",
-    "    genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n",
-    "\n",
-    "    merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n",
-    "    if len(chromosome) > 0:\n",
-    "        merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n",
-    "\n",
-    "    rows = merged.values.tolist()\n",
-    "    input_chroms = [r[0] for r in rows]\n",
-    "    input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n",
-    "\n",
-    "# ---- Determine output format ----\n",
-    "use_print_snps = snp_list.is_file()\n",
-    "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n",
-    "\n",
-    "if ld_wind_kb > 0:\n",
-    "    use_kb_window = True\n",
-    "    ld_window_param = ld_wind_kb\n",
-    "    ld_window_flag = \"--ld-wind-kb\"\n",
-    "else:\n",
-    "    use_kb_window = False\n",
-    "    ld_window_param = ld_wind_cm\n",
-    "    ld_window_flag = \"--ld-wind-cm\"\n",
-    "\n",
-    "emit_single = compute_single\n",
-    "emit_joint  = compute_joint and N_targets >= 2\n",
-    "\n",
-    "# ---- Build per-chromosome output list ----\n",
-    "def chrom_outputs(chrom):\n",
-    "    outs = []\n",
-    "    if emit_single:\n",
-    "        for i in range(N_targets):\n",
-    "            name = f\"{annotation_name}_single_{i+1}\"\n",
-    "            prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
-    "            outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
-    "    if emit_joint:\n",
-    "        name = f\"{annotation_name}_joint\"\n",
-    "        prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
-    "        outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
-    "    return outs\n",
-    "\n",
-    "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n",
-    "\n",
-    "output: chrom_outputs(input_chroms[_index])\n",
-    "\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step A: write the requested .annot files for this chromosome.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
-    "    library(data.table)\n",
-    "\n",
-    "    clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n",
-    "\n",
-    "    process_range_data <- function(data, chr_value) {\n",
-    "        data$chr <- clean_chr(data$chr)\n",
-    "        data <- data[data$chr == chr_value,]\n",
-    "        if (nrow(data) == 0) return(NULL)\n",
-    "        expanded <- lapply(seq_len(nrow(data)), function(j) {\n",
-    "            row <- data[j,]\n",
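-    "            # Half-open interval [start, end); guard against seq() counting down when end <= start.\n",
-    "            pos_seq <- if (row$end > row$start) seq(row$start, row$end - 1) else integer(0)\n",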
-    "            result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n",
-    "            if (ncol(data) > 3) {\n",
-    "                for (col in 4:ncol(data))\n",
-    "                    result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n",
-    "            }\n",
-    "            result\n",
-    "        })\n",
-    "        unique(rbindlist(expanded))\n",
-    "    }\n",
-    "\n",
-    "    process_annotation <- function(target_anno, ref_anno, score_column_value) {\n",
-    "        target_anno <- as.data.frame(target_anno)\n",
-    "        ref_anno    <- as.data.frame(ref_anno)\n",
-    "        target_anno$chr <- clean_chr(target_anno$chr)\n",
-    "        ref_anno$CHR    <- clean_chr(ref_anno$CHR)\n",
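-    "        chr_value <- unique(ref_anno$CHR)\n",
-    "        # Keep only target rows on the reference chromosome so positions from other chromosomes cannot match.\n",
-    "        target_anno <- target_anno[target_anno$chr %in% chr_value, , drop = FALSE]\n",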
-    "        anno_scores <- rep(0, nrow(ref_anno))\n",
-    "        match_pos <- match(target_anno$pos, ref_anno$BP)\n",
-    "        valid_pos <- as.numeric(na.omit(match_pos))\n",
-    "        if (score_column_value <= ncol(target_anno)) {\n",
-    "            anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n",
-    "        } else {\n",
-    "            anno_scores[valid_pos] <- 1\n",
-    "            print(\"Warning: score column does not exist; setting scores to 1\")\n",
-    "        }\n",
-    "        anno_scores\n",
-    "    }\n",
-    "\n",
-    "    read_target_anno <- function(file_path, ref_anno) {\n",
-    "        if (endsWith(file_path, \"rds\")) {\n",
-    "            target_anno <- readRDS(file_path)\n",
-    "            return(process_annotation(target_anno, ref_anno, ${score_column}))\n",
-    "        }\n",
-    "        target_anno <- fread(file_path)\n",
-    "        if (${\"TRUE\" if is_range else \"FALSE\"}) {\n",
-    "            names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
-    "            target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n",
-    "            if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n",
-    "        } else {\n",
-    "            names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n",
-    "        }\n",
-    "        process_annotation(target_anno, ref_anno, ${score_column})\n",
-    "    }\n",
-    "\n",
-    "    # ---- Read reference annotation ----\n",
-    "    ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n",
-    "    if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n",
-    "\n",
-    "    # ---- Compute per-target annotation scores ----\n",
-    "    target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n",
-    "    N_local <- length(target_files)\n",
-    "    score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n",
-    "\n",
-    "    emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n",
-    "    emit_joint_local  <- ${\"TRUE\" if emit_joint  else \"FALSE\"}\n",
-    "    use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n",
-    "    bfile_prefix         <- \"${_input[-1]:na}\"\n",
-    "\n",
-    "    # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n",
-    "    # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n",
-    "    normalize_for_ldsc <- function(df) {\n",
-    "        if (!use_print_snps_local) return(df)\n",
-    "        df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n",
-    "        annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n",
-    "        bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n",
-    "                                   col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n",
-    "        bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n",
-    "        idx <- match(bim$SNP, df$SNP)\n",
-    "        out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n",
-    "                          stringsAsFactors = FALSE)\n",
-    "        for (col in annot_cols) {\n",
-    "            v <- rep(0, nrow(bim))\n",
-    "            non_na <- !is.na(idx)\n",
-    "            v[non_na] <- df[[col]][idx[non_na]]\n",
-    "            out[[col]] <- v\n",
-    "        }\n",
-    "        out\n",
-    "    }\n",
-    "\n",
-    "    # ---- Write N single-target .annot files (when requested) ----\n",
-    "    if (emit_single_local) {\n",
-    "        for (i in seq_len(N_local)) {\n",
-    "            out_anno <- ref_anno\n",
-    "            out_anno$ANNOT <- score_list[[i]]\n",
-    "            out_anno <- normalize_for_ldsc(out_anno)\n",
-    "            name <- paste0(\"${annotation_name}\", \"_single_\", i)\n",
-    "            out_path_gz  <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n",
-    "            out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n",
-    "            dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n",
-    "            fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
-    "        }\n",
-    "    }\n",
-    "\n",
-    "    # ---- Optionally write joint .annot ----\n",
-    "    if (emit_joint_local) {\n",
-    "        joint_anno <- ref_anno\n",
-    "        for (i in seq_len(N_local)) {\n",
-    "            joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n",
-    "        }\n",
-    "        joint_anno <- normalize_for_ldsc(joint_anno)\n",
-    "        joint_name   <- paste0(\"${annotation_name}\", \"_joint\")\n",
-    "        joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n",
-    "        joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n",
-    "        dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n",
-    "        fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
-    "    }\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step B: gzip all annot files. Uses expand=\"$[ ]\" so bash ${var} survives.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
-    "    set -e\n",
-    "    annots=()\n",
-    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
-    "        for i in $(seq 1 $[N_targets]); do\n",
-    "            annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n",
-    "        done\n",
-    "    fi\n",
-    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
-    "        annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n",
-    "    fi\n",
-    "    for a in \"${annots[@]}\"; do\n",
-    "        gzip -f \"$a\"\n",
-    "    done\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n",
-    "    set -e\n",
-    "    chrom=\"$[input_chroms[_index]]\"\n",
-    "\n",
-    "    run_polyfun() {\n",
-    "        local annot=\"$1\"\n",
-    "        local out_prefix=\"$2\"\n",
-    "        if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n",
-    "            $[python_exec] $[polyfun_path]/ldsc.py \\\n",
-    "                --print-snps $[snp_list] \\\n",
-    "                $[ld_window_flag] $[ld_window_param] \\\n",
-    "                --out \"$out_prefix\" \\\n",
-    "                --bfile $[_input[-1]:nar] \\\n",
-    "                --yes-really \\\n",
-    "                --annot \"$annot\" \\\n",
-    "                --l2\n",
-    "        else\n",
-    "            $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n",
-    "                --annot \"$annot\" \\\n",
-    "                --bfile $[_input[-1]:nar] \\\n",
-    "                $[ld_window_flag] $[ld_window_param] \\\n",
-    "                --out \"${out_prefix}.$[ldscore_ext]\" \\\n",
-    "                --allow-missing\n",
-    "        fi\n",
-    "    }\n",
-    "\n",
-    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
-    "        for i in $(seq 1 $[N_targets]); do\n",
-    "            name=\"$[annotation_name]_single_$i\"\n",
-    "            annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
-    "            prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
-    "            run_polyfun \"$annot\" \"$prefix\"\n",
-    "        done\n",
-    "    fi\n",
-    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
-    "        name=\"$[annotation_name]_joint\"\n",
-    "        annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
-    "        prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
-    "        run_polyfun \"$annot\" \"$prefix\"\n",
-    "    fi\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n",
-    "    suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n",
-    "    use_print_snps <- ${str(use_print_snps).upper()}\n",
-    "\n",
-    "    chrom <- \"${input_chroms[_index]}\"\n",
-    "    # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n",
-    "    frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n",
-    "    has_frq  <- file.exists(frq_file)\n",
-    "    frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n",
-    "\n",
-    "    write_M_files <- function(annot_path, ldscore_path, m_path) {\n",
-    "        if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n",
-    "            cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n",
-    "        }\n",
-    "        ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n",
-    "            suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n",
-    "        } else fread(ldscore_path)\n",
-    "        annot_dt <- fread(annot_path)\n",
-    "        annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n",
-    "        merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n",
-    "        std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n",
-    "        annot_cols <- setdiff(names(merged), std_cols)\n",
-    "        if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n",
-    "        M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
-    "        writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n",
-    "        if (has_frq) {\n",
-    "            common <- merged[!is.na(MAF) & MAF > 0.05, ]\n",
-    "            M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
-    "            writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n",
-    "        }\n",
-    "    }\n",
-    "\n",
-    "    targets <- c()\n",
-    "    if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n",
-    "        for (i in seq_len(${N_targets})) {\n",
-    "            targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n",
-    "        }\n",
-    "    }\n",
-    "    if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n",
-    "        targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n",
-    "    }\n",
-    "    for (name in targets) {\n",
-    "        annot_path   <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n",
-    "        ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n",
-    "        m_path       <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n",
-    "        write_M_files(annot_path, ldscore_path, m_path)\n",
-    "    }\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "Python 3 (ipykernel)"
-   },
-   "source": [
-    "## Calculate Functional Enrichment using Annotations"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[get_heritability]\n",
-    "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n",
-    "# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to\n",
-    "# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.\n",
-    "#\n",
-    "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n",
-    "# typically N _single_<i> directories plus optionally one _joint directory.\n",
-    "\n",
-    "#\n",
-    "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n",
-    "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n",
-    "# always gets exactly two LD-score sources, in this order:\n",
-    "#     \"<target_dir>/<target>.\"   (index 0)  ,  \"<baseline_dir>/<baseline>\"   (index 1)\n",
-    "# With --overlap-annot, every annotation column in the .results \"Category\" is\n",
-    "# named  <ldscore_column_name>_<ref-ld-index>:\n",
-    "#     index 0 = the target file   -> \"ANNOT_0\"  (no-snplist; compute_ldscores.py keeps the annot col name)\n",
-    "#                                  -> \"L2_0\"    (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n",
-    "#                                  -> \"ANNOT_1_0\",\"ANNOT_2_0\"      (no-snplist joint dir, N>=2 annot cols)\n",
-    "#                                  -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\"  (snplist joint dir, N>=2 -> \"<name>L2\")\n",
-    "#     index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ...  (the 97 baseline annots)\n",
-    "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n",
-    "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n",
-    "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header \u2014 ldsc.py's\n",
-    "#  \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n",
-    "#  column name.)  [postprocess] auto-detects the target Category; if you instead pass\n",
-    "# --target-categories, the names must match this column exactly.\n",
-    "#\n",
-    "parameter: target_anno_dirs = paths()\n",
-    "parameter: all_traits = []\n",
-    "\n",
-    "import os\n",
-    "\n",
-    "with open(all_traits_file, 'r') as f:\n",
-    "    trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n",
-    "\n",
-    "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n",
-    "input_list  = []\n",
-    "target_meta = []\n",
-    "for td in target_anno_dirs:\n",
-    "    for t in trait_paths:\n",
-    "        input_list.append(t)\n",
-    "        target_meta.append(str(td))\n",
-    "\n",
-    "input: input_list, group_by = 1, group_with = \"target_meta\"\n",
-    "\n",
-    "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\",  \\\n",
-    "        f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n",
-    "\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
-    "\n",
-    "bash: expand = \"${ }\"\n",
-    "    target_dir=\"${target_meta[_index]}\"\n",
-    "    target_name=\"$(basename ${target_meta[_index]})\"\n",
-    "    trait=\"$(basename ${_input[0]})\"\n",
-    "    output_dir=\"${cwd:a}/$target_name\"\n",
-    "    mkdir -p \"$output_dir\"\n",
-    "\n",
-    "    # MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n",
-    "    # other values would require recomputing LD scores at that cutoff.\n",
-    "    frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n",
-    "    if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n",
-    "        echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n",
-    "        frq_option=\"--not-M-5-50\"\n",
-    "    elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n",
-    "        if [ -f \"$frq_file_check\" ]; then\n",
-    "            echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n",
-    "            frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n",
-    "        else\n",
-    "            echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n",
-    "            echo \"       but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n",
-    "            echo \"       Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n",
-    "            exit 1\n",
-    "        fi\n",
-    "    else\n",
-    "        echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n",
-    "        echo \"       0.05 (sLDSC default) are accepted. Other values would require\"\n",
-    "        echo \"       recomputing LD scores at that cutoff.\"\n",
-    "        exit 1\n",
-    "    fi\n",
-    "\n",
-    "    run_ldsc() {\n",
-    "        local extra_args=\"$1\"\n",
-    "        ${python_exec} ${polyfun_path}/ldsc.py \\\n",
-    "            --h2 ${sumstat_dir}/$trait \\\n",
-    "            --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n",
-    "            --out \"$output_dir/$trait\" \\\n",
-    "            --overlap-annot \\\n",
-    "            --w-ld-chr ${weights_dir}/${weight_name} \\\n",
-    "            $frq_option \\\n",
-    "            --print-coefficients \\\n",
-    "            --print-delete-vals \\\n",
-    "            --n-blocks ${n_blocks} \\\n",
-    "            $extra_args\n",
-    "    }\n",
-    "\n",
-    "    run_ldsc \"\"\n",
-    "    log_file=\"$output_dir/$trait.log\"\n",
-    "\n",
-    "    # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n",
-    "    for max in 30 20 10; do\n",
-    "        if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
-    "            echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n",
-    "            run_ldsc \"--chisq-max $max\"\n",
-    "        else\n",
-    "            break\n",
-    "        fi\n",
-    "    done\n",
-    "\n",
-    "    if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
-    "        echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n",
-    "        echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n",
-    "    fi\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[munge_sumstats_polyfun]\n",
-    "parameter: sumstats  = path\n",
-    "parameter: n       = 0\n",
-    "parameter: min_info = 0.6\n",
-    "parameter: min_maf  = 0.001\n",
-    "parameter: keep_hla = False\n",
-    "parameter: chi2_cut = 30\n",
-    "input: sumstats\n",
-    "output: f\"{_input:n}.munged.parquet\"\n",
-    "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n",
-    "    {python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n",
-    "        --sumstats {_input} \\\n",
-    "        --out {_output} \\\n",
-    "        {'--n {}'.format(n) if n>0 else ''} \\\n",
-    "        {'--min-info {}'.format(min_info)} \\\n",
-    "        {'--min-maf {}'.format(min_maf)} \\\n",
-    "        {'--chi2-cutoff {}'.format(chi2_cut)} \\\n",
-    "        {'--keep-hla' if keep_hla else ''} \\\n",
-    "        --remove-strand-ambig"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[postprocess]\n",
-    "# Post-processing of polyfun outputs via pecotmr::sldscPostprocessingPipeline.\n",
-    "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n",
-    "# single-target and (when present) joint-target runs, computes Gazal-style\n",
-    "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n",
-    "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n",
-    "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n",
-    "\n",
-    "parameter: traits_file = path()             # text file: one trait sumstats filename per line\n",
-    "parameter: heritability_cwd = path()        # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)\n",
-    "parameter: target_categories = []           # target annotation names. Auto-detected from the joint-run results if empty.\n",
-    "parameter: target_categories_label = []     # optional display names, same order as target_categories;\n",
-    "                                            # when given, every \"target\" column / tau*-block colname in\n",
-    "                                            # the output RDS is renamed to these (params$target_categories\n",
-    "                                            # holds the labels, params$target_categories_orig the originals).\n",
-    "parameter: target_anno_dir = path()         # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n",
-    "\n",
-    "input: traits_file\n",
-    "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
-    "\n",
-    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
-    "    library(pecotmr)\n",
-    "\n",
-    "    traits <- readLines(\"${traits_file}\")\n",
-    "    target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n",
-    "    target_lab  <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n",
-    "\n",
-    "    # Auto-detect single-target and joint-target output directories.\n",
-    "    her_root  <- \"${heritability_cwd}\"\n",
-    "    all_subdirs <- list.dirs(her_root, recursive = FALSE)\n",
-    "    single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n",
-    "    joint_name     <- paste0(\"${annotation_name}\", \"_joint\")\n",
-    "    single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n",
-    "    single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n",
-    "    single_dirs <- single_dirs[order(single_indices)]\n",
-    "    joint_dir   <- file.path(her_root, joint_name)\n",
-    "    has_joint   <- dir.exists(joint_dir)\n",
-    "\n",
-    "    message(sprintf(\"Detected %d single-target dirs%s\",\n",
-    "                    length(single_dirs),\n",
-    "                    if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n",
-    "\n",
-    "    # Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>\n",
-    "    # (polyfun appends .results / .log / .part_delete).\n",
-    "    trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n",
-    "    names(trait_single_prefixes) <- traits\n",
-    "\n",
-    "    if (has_joint) {\n",
-    "        trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n",
-    "    } else {\n",
-    "        trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n",
-    "    }\n",
-    "\n",
-    "    res <- sldscPostprocessingPipeline(\n",
-    "        traitSinglePrefixes = trait_single_prefixes,\n",
-    "        traitJointPrefix    = trait_joint_prefix,\n",
-    "        targetAnnoDir       = \"${target_anno_dir}\",\n",
-    "        frqfileDir          = \"${frqfile_dir}\",\n",
-    "        plinkName           = \"${plink_name}\",\n",
-    "        mafCutoff           = ${maf_cutoff},\n",
-    "        targetCategories    = if (length(target_cats) > 0) target_cats else NULL,\n",
-    "        targetLabels        = if (length(target_lab)  > 0) target_lab  else NULL\n",
-    "    )\n",
-    "\n",
-    "    saveRDS(res, \"${_output[0]}\")\n",
-    "    message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[meta_subset]\n",
-    "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n",
-    "# the cached per-trait standardized results from [postprocess]. No regression rerun.\n",
-    "\n",
-    "parameter: postprocess_rds = path()           # output of [postprocess]\n",
-    "parameter: subset_traits_file = path()        # text file: one trait id per line, subset of those passed to [postprocess]\n",
-    "parameter: subset_name = str                  # label used in the output filename\n",
-    "parameter: target_categories = []             # target annotation names to meta on; if empty, uses all from postprocess output\n",
-    "# If [postprocess] was run with --target-categories-label, the cached RDS already\n",
-    "# carries the display names (params$target_categories = the labels), so leave\n",
-    "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n",
-    "\n",
-    "input: postprocess_rds, subset_traits_file\n",
-    "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
-    "\n",
-    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
-    "    library(pecotmr)\n",
-    "\n",
-    "    res <- readRDS(\"${postprocess_rds}\")\n",
-    "    subset_traits <- readLines(\"${subset_traits_file}\")\n",
-    "    target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n",
-    "    if (length(target_cats) == 0) target_cats <- res$params$target_categories\n",
-    "\n",
-    "    subset_per_trait <- res$per_trait[subset_traits]\n",
-    "\n",
-    "    # Map wide names (tau_star_single/joint) to bare names metaSldscRandom expects.\n",
-    "    view_single <- pecotmr:::.sldscViewForMeta(subset_per_trait, \"single\")\n",
-    "    view_joint  <- pecotmr:::.sldscViewForMeta(subset_per_trait, \"joint\")\n",
-    "\n",
-    "    out <- list(\n",
-    "        tau_star_single = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"tauStar\")),   target_cats),\n",
-    "        tau_star_joint  = setNames(lapply(target_cats, function(c) metaSldscRandom(view_joint,  c, \"tauStar\")),   target_cats),\n",
-    "        enrichment      = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"enrichment\")), target_cats),\n",
-    "        enrichstat      = setNames(lapply(target_cats, function(c) metaSldscRandom(view_single, c, \"enrichstat\")), target_cats)\n",
-    "    )\n",
-    "\n",
-    "    saveRDS(out, \"${_output[0]}\")\n",
-    "    message(\"Subset meta complete; results written to ${_output[0]}\")"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "SoS",
-   "language": "sos",
-   "name": "sos"
-  },
-  "language_info": {
-   "codemirror_mode": "sos",
-   "file_extension": ".sos",
-   "mimetype": "text/x-sos",
-   "name": "sos",
-   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
-   "pygments_lexer": "sos"
-  },
-  "sos": {
-   "kernels": [
-    [
-     "Markdown",
-     "markdown",
-     "markdown",
-     "",
-     ""
-    ],
-    [
-     "SoS",
-     "sos",
-     "",
-     "",
-     "sos"
-    ]
-   ],
-   "panel": {
-    "displayed": true,
-    "height": 0
-   },
-   "version": "0.22.4"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
\ No newline at end of file

From a943d9903663893d81c5320ea3302ff5249bf7f1 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 23 Jun 2026 12:01:18 -0400
Subject: [PATCH 4/6] fix based on pecotmr 0.5.3

---
 code/SoS/enrichment/sldsc_enrichment.ipynb | 1491 ++++++++++++++++++++
 1 file changed, 1491 insertions(+)
 create mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb

diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb
new file mode 100644
index 00000000..0569c353
--- /dev/null
+++ b/code/SoS/enrichment/sldsc_enrichment.ipynb
@@ -0,0 +1,1491 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "# Stratified LD Score Regression (S-LDSC) Enrichment\n",
+    "\n",
+    "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n",
+    "\n",
+    "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Description\n",
+    "This notebook implements the [S-LDSC](https://www.nature.com/articles/ng.3404) pipeline for LD score computation and functional enrichment analysis.\n",
+    "\n",
+    "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not from the original LDSC in the `bulik/ldsc` GitHub repository.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "Markdown"
+   },
+   "source": [
+    "The pipeline uses GWAS summary statistics together with annotation and LD reference-panel data to estimate per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.\n",
+    "\n",
+    "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Methods - Workflow Overview\n",
+    "\n",
+    "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n",
+    "\n",
+    "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n",
+    "\n",
+    "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n",
+    "\n",
+    "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n",
+    "\n",
+    "```r\n",
+    "res <- readRDS(\"sldsc_results.rds\")\n",
+    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
+    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
+    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
+    ")\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Theory\n",
+    "\n",
+    "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### LDSC model\n",
+    "\n",
+    "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n",
+    "\n",
+    "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n",
+    "\n",
+    "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017, *Annals of Applied Statistics*).\n",
+    "\n",
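+    "As a purely illustrative check of Equation (1), assume hypothetical values $N = 50{,}000$, $h^2 = 0.5$, $M = 10^6$, $\\ell_j = 100$, and no confounding ($a = 0$); these numbers are not taken from any dataset used in this notebook:\n",
+    "\n",
+    "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{50{,}000 \\times 0.5 \\times 100}{10^6} \\;+\\; 50{,}000 \\times 0 \\;+\\; 1 \\;=\\; 2.5 + 0 + 1 \\;=\\; 3.5$$\n",
+    "\n",
+    "Under these assumed values, a variant with twice the LD Score ($\\ell_j = 200$) would have an expected $\\chi^2$ of $6$, which is the LD-dependent signal that the regression slope picks up.\n",
+    "\n",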
+    "Equation (1) shows that LD Score regression can estimate SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Stratified LDSC\n",
+    "\n",
+    "Heritability is the proportion of phenotypic variance attributable to variation in genetic values; it can be partitioned across disjoint or overlapping categories of SNPs.\n",
+    "\n",
+    "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n",
+    "\n",
+    "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n",
+    "\n",
+    "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n",
+    "\n",
+    "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n",
+    "- **Continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n",
+    "\n",
+    "Under a polygenic model the per-SNP heritability for SNP $j$ is\n",
+    "\n",
+    "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n",
+    "\n",
+    "and the expected $\\chi^2$ statistic of SNP $j$ is\n",
+    "\n",
+    "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n",
+    "\n",
+    "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n",
+    "\n",
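+    "To make the partitioned LD score concrete, consider a small hypothetical case (the values are assumptions for illustration only): SNP $j$ is in LD with three SNPs, including itself, with $r^2_{jk} = (1,\\ 0.5,\\ 0.2)$, and a binary annotation $C$ takes values $a_{kC} = (1,\\ 0,\\ 1)$ at those SNPs. Then\n",
+    "\n",
+    "$$\\ell(j, C) \\;=\\; \\sum_k a_{kC}\\, r^2_{jk} \\;=\\; 1 \\cdot 1 \\;+\\; 0 \\cdot 0.5 \\;+\\; 1 \\cdot 0.2 \\;=\\; 1.2$$\n",
+    "\n",
+    "whereas the unpartitioned LD Score of the same SNP would be $\\ell_j = 1 + 0.5 + 0.2 = 1.7$. Only the SNPs inside $C$ contribute to $\\ell(j, C)$.\n",
+    "\n",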
+    "Interpretation of $\\tau_C$:\n",
+    "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n",
+    "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n",
+    "\n",
+    "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Tau Estimation and Enrichment Analysis\n",
+    "\n",
+    "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n",
+    "\n",
+    "The pipeline has two computational layers:\n",
+    "\n",
+    "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. We do not re-implement this.\n",
+    "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n",
+    "\n",
+    "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n",
+    "\n",
+    "#### Notation\n",
+    "\n",
+    "For each annotation $C$ we use:\n",
+    "\n",
+    "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n",
+    "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n",
+    "\n",
+    "#### Reference panel and MAF cutoff\n",
+    "\n",
+    "All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n",
+    "\n",
+    "By default we restrict to MAF $> 5\\%$ per the S-LDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standard deviation $sd_C$ used for standardization, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails with an error; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n",
+    "\n",
+    "#### Quantities from the regression layer (polyfun)\n",
+    "\n",
+    "Polyfun's `ldsc.py` solves Equation (2) jointly across annotations, using a 200-block genomic jackknife for inference. From each polyfun run we obtain, per annotation:\n",
+    "\n",
+    "- $\\tau_C$ and its standard error — **(polyfun)**.\n",
+    "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n",
+    "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n",
+    "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n",
+    "\n",
+    "We also obtain, per run:\n",
+    "\n",
+    "- The total trait heritability $h^2_g$ — **(polyfun)**.\n",
+    "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n",
+    "\n",
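+    "For intuition about $E_C$, consider hypothetical proportions (assumed for illustration): an annotation containing 5% of reference SNPs that accounts for 20% of $h^2_g$ gives\n",
+    "\n",
+    "$$E_C \\;=\\; \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;=\\; \\frac{0.20}{0.05} \\;=\\; 4$$\n",
+    "\n",
+    "so SNPs in that annotation would carry, on average, four times the average per-SNP heritability across all SNPs; $E_C = 1$ corresponds to no enrichment.\n",
+    "\n",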
+    "#### Quantities from the post-processing layer (pecotmr)\n",
+    "\n",
+    "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n",
+    "\n",
+    "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n",
+    "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n",
+    "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n",
+    "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n",
+    "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n",
+    "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n",
+    "\n",
+    "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n",
+    "\n",
+    "#### Standardized tau ($\\tau^*$)  —  (pecotmr)\n",
+    "\n",
+    "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)\n",
+    "\n",
+    "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n",
+    "\n",
+    "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n",
+    "\n",
+    "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n",
+    "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n",
+    "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n",
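+    "\n",
+    "As a worked illustration with hypothetical values (not pipeline output), take $\\tau_C = 2 \\times 10^{-8}$, $sd_C = 0.3$, $M_{\\mathrm{ref}} = 5 \\times 10^{6}$, and $h^2_g = 0.5$:\n",
+    "\n",
+    "$$\\tau^*_C \\;=\\; \\frac{(2 \\times 10^{-8})(0.3)(5 \\times 10^{6})}{0.5} \\;=\\; 0.06$$\n",
+    "\n",
+    "The factor $(B-1)^2/B$ follows from the delete-one jackknife variance, assuming $\\mathrm{Var}_b$ denotes the unbiased sample variance (divisor $B-1$) of the per-block values around their mean $\\bar\\tau^*_C$:\n",
+    "\n",
+    "$$\\frac{B-1}{B} \\sum_{b=1}^{B} \\left(\\tau^*_{C,(b)} - \\bar\\tau^*_C\\right)^2 \\;=\\; \\frac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})$$\n",
+    "\n",
+    "With $B = 200$ the multiplier is $199^2/200 \\approx 198.0$, so the jackknife SE is roughly $14.07$ times the standard deviation of the per-block values.\n",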
+    "\n",
+    "#### Differential per-SNP heritability (\"EnrichStat\")  —  (polyfun + pecotmr)\n",
+    "\n",
+    "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n",
+    "\n",
+    "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n",
+    "\n",
+    "The p-value of this test is computed internally by polyfun using the full coefficient covariance and reported as `Enrichment_p`. The standard error of EnrichStat is then recovered from the reported p-value:\n",
+    "\n",
+    "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n",
+    "\n",
+    "This per-trait point estimate and its SE are the input to cross-trait meta-analysis.\n",
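+    "\n",
+    "As a worked illustration with hypothetical values (not pipeline output): if polyfun reports $p_C = 0.05$ and $\\text{EnrichStat}_C = 3.0 \\times 10^{-7}$, then\n",
+    "\n",
+    "$$|Z_C| \\;=\\; \\Phi^{-1}(0.975) \\;\\approx\\; 1.96, \\qquad SE_{\\text{EnrichStat}_C} \\;\\approx\\; \\frac{3.0 \\times 10^{-7}}{1.96} \\;\\approx\\; 1.53 \\times 10^{-7}.$$\n",
+    "\n",
+    "Note that the back-solved SE grows without bound as $p_C \\to 1$ (where $|Z_C| \\to 0$), so such estimates receive almost no weight in the meta-analysis.\n",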
+    "\n",
+    "#### Reporting: binary vs. continuous annotations  —  (pecotmr)\n",
+    "\n",
+    "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n",
+    "\n",
+    "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n",
+    "\n",
+    "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n",
+    "\n",
+    "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n",
+    "\n",
+    "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n",
+    ">\n",
+    "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n",
+    ">\n",
+    "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n",
+    "\n",
+    "#### Cross-type comparison: always use $\\tau^*_C$  —  (pecotmr)\n",
+    "\n",
+    "For an apples-to-apples comparison **across binary and continuous annotations** (ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types), use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion of SNPs in the category) and $sd_C$ is the empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases: the additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n",
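+    "\n",
+    "The binary form of $sd_C$ can be checked directly. For annotation values $a_j \\in \\{0, 1\\}$ over $M$ SNPs with mean $p$, and using the population variance (divisor $M$; an assumption here, the difference from divisor $M-1$ being negligible at genome scale), $a_j^2 = a_j$ gives\n",
+    "\n",
+    "$$sd_C^2 \\;=\\; \\frac{1}{M} \\sum_{j=1}^{M} a_j^2 \\;-\\; p^2 \\;=\\; p - p^2 \\;=\\; p(1-p).$$\n",
+    "\n",
+    "For example, a hypothetical category covering $10\\%$ of reference SNPs has $sd_C = \\sqrt{0.1 \\times 0.9} = 0.3$.\n",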
+    "\n",
+    "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n",
+    "\n",
+    "#### Joint analysis  —  (polyfun runs the regression; pecotmr standardizes both modes)\n",
+    "\n",
+    "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n",
+    "\n",
+    "### Meta-Analysis across Traits (Random Effects)  —  (pecotmr)\n",
+    "\n",
+    "We run a DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n",
+    "\n",
+    "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n",
+    "\n",
+    "where $\\hat\\sigma^2$ is the estimated between-trait variance, $\\hat\\theta_i$ is the per-trait estimate, and $SE_i$ is its standard error:\n",
+    "\n",
+    "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n",
+    "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n",
+    "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n",
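+    "\n",
+    "The weights depend on the between-trait variance $\\hat\\sigma^2$. As a sketch, the textbook DerSimonian-Laird moment estimator for $k$ traits (stated here for orientation; the pipeline delegates the actual numerics to `rmeta`) uses the fixed-effect weights $v_i = 1/SE_i^2$:\n",
+    "\n",
+    "$$\\hat\\theta_F = \\frac{\\sum_i v_i \\hat\\theta_i}{\\sum_i v_i}, \\qquad Q = \\sum_i v_i (\\hat\\theta_i - \\hat\\theta_F)^2, \\qquad \\hat\\sigma^2 = \\max\\!\\left(0,\\; \\frac{Q - (k-1)}{\\sum_i v_i - \\sum_i v_i^2 / \\sum_i v_i}\\right)$$\n",
+    "\n",
+    "For hypothetical estimates $\\hat\\theta = (0.04, 0.08, 0.06)$ with $SE_i = 0.01$ for all three traits: $v_i = 10^4$, $\\hat\\theta_F = 0.06$, $Q = 8$, and $\\hat\\sigma^2 = 6 / (2 \\times 10^4) = 3 \\times 10^{-4}$. Each random-effects weight is then $w_i = 1/(10^{-4} + 3 \\times 10^{-4}) = 2500$, giving $\\hat\\theta_{\\mathrm{meta}} = 0.06$ and $SE_{\\mathrm{meta}} = \\sqrt{1/7500} \\approx 0.0115$.\n",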
+    "\n",
+    "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n",
+    "\n",
+    "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Minimal Working Example (MWE)\n",
+    "\n",
+    "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 1. `make_annotation_files_ldscore`\n",
+    "\n",
+    "*Annotation preparation and S-LDSC regression (polyfun).* This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "#### **Inputs**\n",
+    "\n",
+    "##### 1. Target Annotation File\n",
+    "\n",
+    "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n",
+    "- **Formats**:\n",
+    "    - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n",
+    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
+    "    - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). Single-target and joint-target analyses are produced automatically in one pipeline pass.\n",
+    "    - **Format** (the score column is optional; if absent, score is set to 1):\n",
+    "        - `is_range = False`:\n",
+    "        ```\n",
+    "        chr   pos   score\n",
+    "        1    10001   1\n",
+    "        1    10002   1\n",
+    "        ```\n",
+    "        - `is_range = True`:\n",
+    "        ```\n",
+    "        chr   start   end   score\n",
+    "        1    10001  20001  1\n",
+    "        1    30001  40001  1\n",
+    "        ```\n",
+    "\n",
+    "##### 2. Reference Annotation File (baseline-LD)\n",
+    "\n",
+    "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n",
+    "- **Formats**:\n",
+    "    - Text file listing baseline annotation files for all chromosomes.\n",
+    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
+    "\n",
+    "##### 3. Genome Reference File\n",
+    "\n",
+    "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n",
+    "- **Formats**:\n",
+    "    - Text file listing per-chromosome reference files, or files for specific chromosomes.\n",
+    "\n",
+    "##### 4. SNP List\n",
+    "\n",
+    "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n",
+    "- **Format**: A list of `rsid`s, one per line.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/xqtl-protocol\n"
+     ]
+    }
+   ],
+   "source": [
+    "pwd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mmake_annotation_files_ldscore\u001b[0m: \n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=3) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=5) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=6) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=4) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=7) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=9) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=10) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=8) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=11) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=14) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=13) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=12) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=15) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=18) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=16) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=17) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=19) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=21) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=20) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... (66 items in 22 groups)\u001b[0m\n",
+      "INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n",
+    "  --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n",
+    "  --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n",
+    "  --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n",
+    "  --annotation_name protocol_example \\\n",
+    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
+    "  --python_exec python \\\n",
+    "  --polyfun_path polyfun \\\n",
+    "  --cwd output/sldsc_ldscore -j 4\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Munge summary statistics (preprocessing, run before Step 2)\n",
+    "\n",
+    "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n",
+    "#     --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n",
+    "#     --n 0 \\\n",
+    "#     --min-info 0.6 \\\n",
+    "#     --min-maf 0.001 \\\n",
+    "#     --chi2-cutoff 30 \\\n",
+    "#     --polyfun_path data/github/polyfun \\\n",
+    "#     --cwd data/polyfun_new/example_data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 2. `get_heritability`\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "**Inputs**\n",
+    "\n",
+    "##### 1. Allele Frequency Files (`.frq`, our panel)\n",
+    "\n",
+    "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n",
+    "\n",
+    "##### 2. GWAS Summary Statistics\n",
+    "\n",
+    "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n",
+    "- **Format**:\n",
+    "    ```\n",
+    "    CAD_META.filtered.sumstats.gz\n",
+    "    UKB.Lym.BOLT.sumstats.gz\n",
+    "    ```\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mget_heritability\u001b[0m: \n",
+      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
+      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
+      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
+      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
+      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
+      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... (6 items in 3 groups)\u001b[0m\n",
+      "INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n",
+    "  --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n",
+    "  --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
+    "  --sumstat_dir input/enrichment/sldsc \\\n",
+    "  --baseline_ld_dir input/enrichment/sldsc \\\n",
+    "  --weights_dir input/enrichment/sldsc \\\n",
+    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
+    "  --annotation_name protocol_example --python_exec python \\\n",
+    "  --polyfun_path ../polyfun \\\n",
+    "  --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 3. `postprocess`\n",
+    "\n",
+    "*Post-processing (`pecotmr::sldsc_postprocessing_pipeline`) and meta-analysis.*\n",
+    "\n",
+    "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n",
+    "\n",
+    "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n",
+    "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n",
+    "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n",
+    "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n",
+    "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n",
+    "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n",
+    "\n",
+    "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n",
+    "\n",
+    "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n",
+    "(which contains the $N$ single-target subdirectories and optionally the\n",
+    "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n",
+    "to produce per-trait standardized tables and the default random-effects\n",
+    "meta across all traits.\n",
+    "\n",
+    "Use `--target_categories_label` (same order as `--target_categories`) to give the target annotations friendly names in the output. For example, `--target_categories ANNOT_1_0 ANNOT_2_0 --target_categories_label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mpostprocess\u001b[0m: \n",
+      "INFO: \u001b[32mpostprocess\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mpostprocess\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds\u001b[0m\n",
+      "INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n",
+    "  --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
+    "  --heritability_cwd output/sldsc_heritability \\\n",
+    "  --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n",
+    "  --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n",
+    "  --annotation_name protocol_example --python_exec python \\\n",
+    "  --polyfun_path ../polyfun \\\n",
+    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 4. `meta_subset` (optional)\n",
+    "\n",
+    "*Subset meta-analysis (`pecotmr::meta_sldsc_random`).*\n",
+    "\n",
+    "The default meta in Step 3 pools all traits supplied to `[postprocess]`. Use `[meta_subset]` to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached `.sldsc_postprocess.rds` output; it is lightweight and can be run interactively or in batch.\n",
+    "\n",
+    "The same subset meta can also be run directly in R on the cached results:\n",
+    "\n",
+    "```r\n",
+    "res <- readRDS(\"sldsc_results.rds\")\n",
+    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
+    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
+    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
+    ")\n",
+    "```\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mmeta_subset\u001b[0m: \n",
+      "INFO: \u001b[32mmeta_subset\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmeta_subset\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds\u001b[0m\n",
+      "INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n",
+    "  --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n",
+    "  --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n",
+    "  --subset_name category1 --target_categories ANNOT_0 \\\n",
+    "  --annotation_name protocol_example --python_exec python \\\n",
+    "  --polyfun_path ../polyfun \\\n",
+    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Output\n",
+    "\n",
+    "### Output summary\n",
+    "\n",
+    "| Stage | Cached on disk | Recomputable from | Purpose |\n",
+    "|---|---|---|---|\n",
+    "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n",
+    "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n",
+    "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n",
+    "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n",
+    "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Per-stage outputs\n",
+    "\n",
+    "Each workflow writes into its `--cwd`:\n",
+    "\n",
+    "- **make_annotation_files_ldscore**: polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation is supplied) a joint directory.\n",
+    "- **get_heritability**: per trait and per target directory, the S-LDSC regression outputs `<trait>.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_<ref-ld-index>` suffix.\n",
+    "- **postprocess**: a single `<annotation_name>.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau* with jackknife SE, EnrichStat with SE back-solved from `Enrichment_p`) and three DerSimonian-Laird random-effects meta tables (tau*, E, EnrichStat).\n",
+    "- **meta_subset**: a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Anticipated Results\n",
+    "\n",
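+    "For each target annotation and trait, the pipeline produces $\\tau$, standardized $\\tau^*$, enrichment $E$, and EnrichStat with their standard errors and p-values from stratified LD score regression, plus DerSimonian-Laird random-effects meta-analysis tables across traits."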
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Command interface\n",
+    "\n",
+    "List all workflows and their options:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb -h"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Workflow implementation\n",
+    "\n",
+    "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[global]\n",
+    "# Path to the work directory of the analysis.\n",
+    "parameter: cwd = path('output')\n",
+    "# Prefix for the analysis output\n",
+    "parameter: annotation_name = str\n",
+    "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n",
+    "parameter: polyfun_path   = path # e.g. \"/home/you/tools/polyfun\"\n",
+    "\n",
+    "# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n",
+    "# and HapMap3-style regression weights are common-variant by construction).\n",
+    "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n",
+    "# Other values would require recomputing LD scores at that cutoff.\n",
+    "parameter: maf_cutoff = 0.05\n",
+    "\n",
+    "# for make_annotation_files_ldscore workflow:\n",
+    "parameter: annotation_file = path()\n",
+    "parameter: reference_anno_file = path()\n",
+    "parameter: genome_ref_file = path() # with .bed\n",
+    "parameter: chromosome = []\n",
+    "parameter: snp_list = path()\n",
+    "parameter: ld_wind_kb = 0 # LD window in kb; used instead of ld_wind_cm when set to a value > 0\n",
+    "parameter: ld_wind_cm = 1.0 # LD window in cM; used by default (when ld_wind_kb is 0)\n",
+    "\n",
+    "# for get_heritability workflow.\n",
+    "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n",
+    "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n",
+    "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n",
+    "parameter: all_traits_file = path() # txt file with one GWAS summary statistics filename per row, e.g. CAD_META.filtered.sumstats.gz\n",
+    "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n",
+    "parameter: target_anno_dir = path()  # Directory containing target annotation files: output of ldscore\n",
+    "parameter: baseline_ld_dir = path()  # Directory containing baseline LD score files (computed against our panel)\n",
+    "parameter: frqfile_dir = path()  # Directory containing allele frequency files (.frq, our panel)\n",
+    "parameter: plink_name = \"ADSP_chr\"\n",
+    "parameter: weights_dir = path()  # Directory containing LD weights (computed against our panel)\n",
+    "parameter: baseline_name = \"baseline_chr\"  # Prefix of baseline annotation files\n",
+    "parameter: weight_name = \"weights_chr\"  # Prefix of LD weights files\n",
+    "parameter: n_blocks = 200\n",
+    "\n",
+    "# Number of threads\n",
+    "parameter: numThreads = 16\n",
+    "# For cluster jobs, number of commands to run per job\n",
+    "parameter: job_size = 1\n",
+    "parameter: walltime = '12h'\n",
+    "parameter: mem = '16G'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "Python 3 (ipykernel)"
+   },
+   "source": [
+    "## Make Annotation File"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[make_annotation_files_ldscore]\n",
+    "# Annotation preparation. Takes one annotation_file with N target annotations\n",
+    "# and produces, in one invocation, any combination of:\n",
+    "#   - N single-target LD-score directories (when compute_single = TRUE, default)\n",
+    "#   - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n",
+    "#     and N >= 2, default)\n",
+    "#\n",
+    "# Outputs per chromosome <chr>:\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz   (i in 1..N, when compute_single)\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50  (when .frq present)\n",
+    "#\n",
+    "#   <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...}                (when compute_joint and N>=2)\n",
+    "#\n",
+    "# Workflows:\n",
+    "#   - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n",
+    "#     Produces both; suits the case where you have already chosen the joint set.\n",
+    "#   - Workflow B (\"exploratory then conditional\"):\n",
+    "#       Step 1: compute_single=TRUE, compute_joint=FALSE.\n",
+    "#               Run on N candidate annotations -> N single-target dirs.\n",
+    "#               Inspect single-target results, identify K significant ones.\n",
+    "#       Step 2: compute_single=FALSE, compute_joint=TRUE.\n",
+    "#               Run on a NEW annotation_file with the K selected annotations\n",
+    "#               -> 1 joint dir with the conditional model.\n",
+    "\n",
+    "#\n",
+    "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n",
+    "#     column name, and the CM requirement ---\n",
+    "#   --snp_list given  -> ldsc.py --l2 --print-snps   -> output .l2.ldscore.gz\n",
+    "#   --snp_list absent -> compute_ldscores.py         -> output .l2.ldscore.parquet\n",
+    "#\n",
+    "#   LD-score column name (this is what becomes the .results \"Category\" in\n",
+    "#   [get_heritability], with a \"_<ref-ld-index>\" suffix appended there):\n",
+    "#     * compute_ldscores.py  ALWAYS keeps the annot column name(s):\n",
+    "#         single annot column \"ANNOT\"          -> ldscore column \"ANNOT\"\n",
+    "#         joint  annot columns \"ANNOT_1\",\"ANNOT_2\",...  -> \"ANNOT_1\",\"ANNOT_2\",...\n",
+    "#     * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n",
+    "#       HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n",
+    "#       column name. With >=2 annotations it uses \"<annot_name>L2\"\n",
+    "#       (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n",
+    "#     => a single-target snplist run reports \"L2_0\" in .results, while a\n",
+    "#        single-target no-snplist run reports \"ANNOT_0\".  [postprocess] auto-\n",
+    "#        detects either; only matters if you pass --target-categories explicitly.\n",
+    "#\n",
+    "#   CM column requirement for snplist:  ldsc.py --l2 --print-snps requires the\n",
+    "#   target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n",
+    "#   the plink .bim (same SNP set, same row order). This step handles both\n",
+    "#   internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands\n",
+    "#   the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n",
+    "#   carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n",
+    "#   if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n",
+    "#\n",
+    "parameter: compute_single = True\n",
+    "parameter: compute_joint = True\n",
+    "parameter: score_column = 3\n",
+    "parameter: is_range = False\n",
+    "\n",
+    "import pandas as pd\n",
+    "import os\n",
+    "\n",
+    "if not (compute_single or compute_joint):\n",
+    "    raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE\")\n",
+    "\n",
+    "def adapt_file_path(file_path, reference_file):\n",
+    "    reference_path = os.path.dirname(reference_file)\n",
+    "    if os.path.isfile(file_path):\n",
+    "        return file_path\n",
+    "    file_name = os.path.basename(file_path)\n",
+    "    if os.path.isfile(file_name):\n",
+    "        return file_name\n",
+    "    file_in_ref_dir = os.path.join(reference_path, file_name)\n",
+    "    if os.path.isfile(file_in_ref_dir):\n",
+    "        return file_in_ref_dir\n",
+    "    file_prefixed = os.path.join(reference_path, file_path)\n",
+    "    if os.path.isfile(file_prefixed):\n",
+    "        return file_prefixed\n",
+    "    raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n",
+    "\n",
+    "\n",
+    "# ---- Parse inputs and determine N ----\n",
+    "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n",
+    "    str(reference_anno_file).endswith('annot.gz')):\n",
+    "    # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n",
+    "    target_files_direct = str(annotation_file).split(',')\n",
+    "    N_targets = len(target_files_direct)\n",
+    "    target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n",
+    "    input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n",
+    "    if len(chromosome) > 0:\n",
+    "        input_chroms = [int(x) for x in chromosome]\n",
+    "    else:\n",
+    "        input_chroms = [0]\n",
+    "else:\n",
+    "    # Case 2: txt list with #id and one or more 'path' columns\n",
+    "    target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n",
+    "    reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n",
+    "    genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n",
+    "\n",
+    "    target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n",
+    "    reference_files[\"#id\"]  = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n",
+    "    genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n",
+    "\n",
+    "    path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n",
+    "    N_targets = len(path_columns)\n",
+    "    target_names = path_columns[:]   # 'path', 'path1', 'path2', ...\n",
+    "\n",
+    "    for col in path_columns:\n",
+    "        target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n",
+    "    reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n",
+    "    genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n",
+    "\n",
+    "    merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n",
+    "    if len(chromosome) > 0:\n",
+    "        merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n",
+    "\n",
+    "    rows = merged.values.tolist()\n",
+    "    input_chroms = [r[0] for r in rows]\n",
+    "    input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n",
+    "\n",
+    "# ---- Determine output format ----\n",
+    "use_print_snps = snp_list.is_file()\n",
+    "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n",
+    "\n",
+    "if ld_wind_kb > 0:\n",
+    "    use_kb_window = True\n",
+    "    ld_window_param = ld_wind_kb\n",
+    "    ld_window_flag = \"--ld-wind-kb\"\n",
+    "else:\n",
+    "    use_kb_window = False\n",
+    "    ld_window_param = ld_wind_cm\n",
+    "    ld_window_flag = \"--ld-wind-cm\"\n",
+    "\n",
+    "emit_single = compute_single\n",
+    "emit_joint  = compute_joint and N_targets >= 2\n",
+    "\n",
+    "# ---- Build per-chromosome output list ----\n",
+    "def chrom_outputs(chrom):\n",
+    "    outs = []\n",
+    "    if emit_single:\n",
+    "        for i in range(N_targets):\n",
+    "            name = f\"{annotation_name}_single_{i+1}\"\n",
+    "            prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
+    "            outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
+    "    if emit_joint:\n",
+    "        name = f\"{annotation_name}_joint\"\n",
+    "        prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
+    "        outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
+    "    return outs\n",
+    "\n",
+    "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n",
+    "\n",
+    "output: chrom_outputs(input_chroms[_index])\n",
+    "\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step A: write the requested .annot files for this chromosome.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
+    "    library(data.table)\n",
+    "\n",
+    "    clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n",
+    "\n",
+    "    process_range_data <- function(data, chr_value) {\n",
+    "        data$chr <- clean_chr(data$chr)\n",
+    "        data <- data[data$chr == chr_value,]\n",
+    "        if (nrow(data) == 0) return(NULL)\n",
+    "        expanded <- lapply(seq_len(nrow(data)), function(j) {\n",
+    "            row <- data[j,]\n",
+    "            pos_seq <- seq(row$start, row$end - 1)\n",
+    "            result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n",
+    "            if (ncol(data) > 3) {\n",
+    "                for (col in 4:ncol(data))\n",
+    "                    result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n",
+    "            }\n",
+    "            result\n",
+    "        })\n",
+    "        unique(rbindlist(expanded))\n",
+    "    }\n",
+    "\n",
+    "    process_annotation <- function(target_anno, ref_anno, score_column_value) {\n",
+    "        target_anno <- as.data.frame(target_anno)\n",
+    "        ref_anno    <- as.data.frame(ref_anno)\n",
+    "        target_anno$chr <- clean_chr(target_anno$chr)\n",
+    "        ref_anno$CHR    <- clean_chr(ref_anno$CHR)\n",
+    "        target_anno <- target_anno[target_anno$chr %in% unique(ref_anno$CHR), , drop = FALSE]  # match positions within this chromosome only\n",
+    "        anno_scores <- rep(0, nrow(ref_anno))\n",
+    "        match_pos <- match(target_anno$pos, ref_anno$BP)\n",
+    "        valid_pos <- as.numeric(na.omit(match_pos))\n",
+    "        if (score_column_value <= ncol(target_anno)) {\n",
+    "            anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n",
+    "        } else {\n",
+    "            anno_scores[valid_pos] <- 1\n",
+    "            print(\"Warning: score column does not exist; setting scores to 1\")\n",
+    "        }\n",
+    "        anno_scores\n",
+    "    }\n",
+    "\n",
+    "    read_target_anno <- function(file_path, ref_anno) {\n",
+    "        if (endsWith(file_path, \"rds\")) {\n",
+    "            target_anno <- readRDS(file_path)\n",
+    "            return(process_annotation(target_anno, ref_anno, ${score_column}))\n",
+    "        }\n",
+    "        target_anno <- fread(file_path)\n",
+    "        if (${\"TRUE\" if is_range else \"FALSE\"}) {\n",
+    "            names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
+    "            target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n",
+    "            if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n",
+    "        } else {\n",
+    "            names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n",
+    "        }\n",
+    "        process_annotation(target_anno, ref_anno, ${score_column})\n",
+    "    }\n",
+    "\n",
+    "    # ---- Read reference annotation ----\n",
+    "    ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n",
+    "    if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n",
+    "\n",
+    "    # ---- Compute per-target annotation scores ----\n",
+    "    target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n",
+    "    N_local <- length(target_files)\n",
+    "    score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n",
+    "\n",
+    "    emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n",
+    "    emit_joint_local  <- ${\"TRUE\" if emit_joint  else \"FALSE\"}\n",
+    "    use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n",
+    "    bfile_prefix         <- \"${_input[-1]:na}\"\n",
+    "\n",
+    "    # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n",
+    "    # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n",
+    "    normalize_for_ldsc <- function(df) {\n",
+    "        if (!use_print_snps_local) return(df)\n",
+    "        df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n",
+    "        annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n",
+    "        bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n",
+    "                                   col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n",
+    "        bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n",
+    "        idx <- match(bim$SNP, df$SNP)\n",
+    "        out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n",
+    "                          stringsAsFactors = FALSE)\n",
+    "        for (col in annot_cols) {\n",
+    "            v <- rep(0, nrow(bim))\n",
+    "            non_na <- !is.na(idx)\n",
+    "            v[non_na] <- df[[col]][idx[non_na]]\n",
+    "            out[[col]] <- v\n",
+    "        }\n",
+    "        out\n",
+    "    }\n",
+    "\n",
+    "    # ---- Write N single-target .annot files (when requested) ----\n",
+    "    if (emit_single_local) {\n",
+    "        for (i in seq_len(N_local)) {\n",
+    "            out_anno <- ref_anno\n",
+    "            out_anno$ANNOT <- score_list[[i]]\n",
+    "            out_anno <- normalize_for_ldsc(out_anno)\n",
+    "            name <- paste0(\"${annotation_name}\", \"_single_\", i)\n",
+    "            out_path_gz  <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n",
+    "            out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n",
+    "            dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n",
+    "            fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
+    "        }\n",
+    "    }\n",
+    "\n",
+    "    # ---- Optionally write joint .annot ----\n",
+    "    if (emit_joint_local) {\n",
+    "        joint_anno <- ref_anno\n",
+    "        for (i in seq_len(N_local)) {\n",
+    "            joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n",
+    "        }\n",
+    "        joint_anno <- normalize_for_ldsc(joint_anno)\n",
+    "        joint_name   <- paste0(\"${annotation_name}\", \"_joint\")\n",
+    "        joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n",
+    "        joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n",
+    "        dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n",
+    "        fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
+    "    }\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step B: gzip all annot files. Uses expand=\"$[ ]\" so bash ${var} survives.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
+    "    set -e\n",
+    "    annots=()\n",
+    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
+    "        for i in $(seq 1 $[N_targets]); do\n",
+    "            annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n",
+    "        done\n",
+    "    fi\n",
+    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
+    "        annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n",
+    "    fi\n",
+    "    for a in \"${annots[@]}\"; do\n",
+    "        gzip -f \"$a\"\n",
+    "    done\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n",
+    "    set -e\n",
+    "    chrom=\"$[input_chroms[_index]]\"\n",
+    "\n",
+    "    run_polyfun() {\n",
+    "        local annot=\"$1\"\n",
+    "        local out_prefix=\"$2\"\n",
+    "        if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n",
+    "            $[python_exec] $[polyfun_path]/ldsc.py \\\n",
+    "                --print-snps $[snp_list] \\\n",
+    "                $[ld_window_flag] $[ld_window_param] \\\n",
+    "                --out \"$out_prefix\" \\\n",
+    "                --bfile $[_input[-1]:nar] \\\n",
+    "                --yes-really \\\n",
+    "                --annot \"$annot\" \\\n",
+    "                --l2\n",
+    "        else\n",
+    "            $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n",
+    "                --annot \"$annot\" \\\n",
+    "                --bfile $[_input[-1]:nar] \\\n",
+    "                $[ld_window_flag] $[ld_window_param] \\\n",
+    "                --out \"${out_prefix}.$[ldscore_ext]\" \\\n",
+    "                --allow-missing\n",
+    "        fi\n",
+    "    }\n",
+    "\n",
+    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
+    "        for i in $(seq 1 $[N_targets]); do\n",
+    "            name=\"$[annotation_name]_single_$i\"\n",
+    "            annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
+    "            prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
+    "            run_polyfun \"$annot\" \"$prefix\"\n",
+    "        done\n",
+    "    fi\n",
+    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
+    "        name=\"$[annotation_name]_joint\"\n",
+    "        annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
+    "        prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
+    "        run_polyfun \"$annot\" \"$prefix\"\n",
+    "    fi\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n",
+    "    suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n",
+    "    use_print_snps <- ${str(use_print_snps).upper()}\n",
+    "\n",
+    "    chrom <- \"${input_chroms[_index]}\"\n",
+    "    # Look up the .frq file under frqfile_dir as <plink_name><chrom>.frq (same naming as [get_heritability]).\n",
+    "    frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n",
+    "    has_frq  <- file.exists(frq_file)\n",
+    "    frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n",
+    "\n",
+    "    write_M_files <- function(annot_path, ldscore_path, m_path) {\n",
+    "        if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n",
+    "            cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n",
+    "        }\n",
+    "        ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n",
+    "            suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n",
+    "        } else fread(ldscore_path)\n",
+    "        annot_dt <- fread(annot_path)\n",
+    "        annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n",
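+    "        merged <- if (has_frq) merge(annot_filtered[, setdiff(names(annot_filtered), \"MAF\"), with = FALSE], frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered  # drop any annot MAF so the merge yields one MAF column\n",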
+    "        std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n",
+    "        annot_cols <- setdiff(names(merged), std_cols)\n",
+    "        if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n",
+    "        M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
+    "        writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n",
+    "        if (has_frq) {\n",
+    "            common <- merged[!is.na(MAF) & MAF > 0.05, ]\n",
+    "            M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
+    "            writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n",
+    "        }\n",
+    "    }\n",
+    "\n",
+    "    targets <- c()\n",
+    "    if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n",
+    "        for (i in seq_len(${N_targets})) {\n",
+    "            targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n",
+    "        }\n",
+    "    }\n",
+    "    if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n",
+    "        targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n",
+    "    }\n",
+    "    for (name in targets) {\n",
+    "        annot_path   <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n",
+    "        ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n",
+    "        m_path       <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n",
+    "        write_M_files(annot_path, ldscore_path, m_path)\n",
+    "    }\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "Python 3 (ipykernel)"
+   },
+   "source": [
+    "## Calculate Functional Enrichment using Annotations"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[get_heritability]\n",
+    "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n",
+    "# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to\n",
+    "# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.\n",
+    "#\n",
+    "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n",
+    "# typically N _single_<i> directories plus optionally one _joint directory.\n",
+    "\n",
+    "#\n",
+    "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n",
+    "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n",
+    "# always gets exactly two LD-score sources, in this order:\n",
+    "#     \"<target_dir>/<target>.\"   (index 0)  ,  \"<baseline_dir>/<baseline>\"   (index 1)\n",
+    "# With --overlap-annot, every annotation column in the .results \"Category\" is\n",
+    "# named  <ldscore_column_name>_<ref-ld-index>:\n",
+    "#     index 0 = the target file   -> \"ANNOT_0\"  (no-snplist; compute_ldscores.py keeps the annot col name)\n",
+    "#                                  -> \"L2_0\"    (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n",
+    "#                                  -> \"ANNOT_1_0\",\"ANNOT_2_0\"      (no-snplist joint dir, N>=2 annot cols)\n",
+    "#                                  -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\"  (snplist joint dir, N>=2 -> \"<name>L2\")\n",
+    "#     index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ...  (the 97 baseline annots)\n",
+    "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n",
+    "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n",
+    "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's\n",
+    "#  \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n",
+    "#  column name.)  [postprocess] auto-detects the target Category; if you instead pass\n",
+    "# --target-categories, the names must match this column exactly.\n",
+    "#\n",
+    "parameter: target_anno_dirs = paths()\n",
+    "parameter: all_traits = []\n",
+    "\n",
+    "import os\n",
+    "\n",
+    "with open(all_traits_file, 'r') as f:\n",
+    "    trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n",
+    "\n",
+    "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n",
+    "input_list  = []\n",
+    "target_meta = []\n",
+    "for td in target_anno_dirs:\n",
+    "    for t in trait_paths:\n",
+    "        input_list.append(t)\n",
+    "        target_meta.append(str(td))\n",
+    "\n",
+    "input: input_list, group_by = 1, group_with = \"target_meta\"\n",
+    "\n",
+    "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\",  \\\n",
+    "        f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n",
+    "\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
+    "\n",
+    "bash: expand = \"${ }\"\n",
+    "    target_dir=\"${target_meta[_index]}\"\n",
+    "    target_name=\"$(basename ${target_meta[_index]})\"\n",
+    "    trait=\"$(basename ${_input[0]})\"\n",
+    "    output_dir=\"${cwd:a}/$target_name\"\n",
+    "    mkdir -p \"$output_dir\"\n",
+    "\n",
+    "    # MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n",
+    "    # other values would require recomputing LD scores at that cutoff.\n",
+    "    frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n",
+    "    if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n",
+    "        echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n",
+    "        frq_option=\"--not-M-5-50\"\n",
+    "    elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n",
+    "        if [ -f \"$frq_file_check\" ]; then\n",
+    "            echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n",
+    "            frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n",
+    "        else\n",
+    "            echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n",
+    "            echo \"       but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n",
+    "            echo \"       Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n",
+    "            exit 1\n",
+    "        fi\n",
+    "    else\n",
+    "        echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n",
+    "        echo \"       0.05 (sLDSC default) are accepted. Other values would require\"\n",
+    "        echo \"       recomputing LD scores at that cutoff.\"\n",
+    "        exit 1\n",
+    "    fi\n",
+    "\n",
+    "    run_ldsc() {\n",
+    "        local extra_args=\"$1\"\n",
+    "        ${python_exec} ${polyfun_path}/ldsc.py \\\n",
+    "            --h2 ${sumstat_dir}/$trait \\\n",
+    "            --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n",
+    "            --out \"$output_dir/$trait\" \\\n",
+    "            --overlap-annot \\\n",
+    "            --w-ld-chr ${weights_dir}/${weight_name} \\\n",
+    "            $frq_option \\\n",
+    "            --print-coefficients \\\n",
+    "            --print-delete-vals \\\n",
+    "            --n-blocks ${n_blocks} \\\n",
+    "            $extra_args\n",
+    "    }\n",
+    "\n",
+    "    run_ldsc \"\"\n",
+    "    log_file=\"$output_dir/$trait.log\"\n",
+    "\n",
+    "    # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n",
+    "    for max in 30 20 10; do\n",
+    "        if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
+    "            echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n",
+    "            run_ldsc \"--chisq-max $max\"\n",
+    "        else\n",
+    "            break\n",
+    "        fi\n",
+    "    done\n",
+    "\n",
+    "    if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
+    "        echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n",
+    "        echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n",
+    "    fi\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[munge_sumstats_polyfun]\n",
+    "parameter: sumstats = path\n",
+    "parameter: n       = 0\n",
+    "parameter: min_info = 0.6\n",
+    "parameter: min_maf  = 0.001\n",
+    "parameter: keep_hla = False\n",
+    "parameter: chi2_cut = 30\n",
+    "input: sumstats\n",
+    "output: f\"{_input:n}.munged.parquet\"\n",
+    "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n",
+    "    {python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n",
+    "        --sumstats {_input} \\\n",
+    "        --out {_output} \\\n",
+    "        {'--n {}'.format(n) if n>0 else ''} \\\n",
+    "        {'--min-info {}'.format(min_info)} \\\n",
+    "        {'--min-maf {}'.format(min_maf)} \\\n",
+    "        {'--chi2-cutoff {}'.format(chi2_cut)} \\\n",
+    "        {'--keep-hla' if keep_hla else ''} \\\n",
+    "        --remove-strand-ambig"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[postprocess]\n",
+    "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n",
+    "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n",
+    "# single-target and (when present) joint-target runs, computes Gazal-style\n",
+    "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n",
+    "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n",
+    "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n",
+    "\n",
+    "parameter: traits_file = path()             # text file: one trait sumstats filename per line\n",
+    "parameter: heritability_cwd = path()        # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)\n",
+    "parameter: target_categories = []           # target annotation names. Auto-detected from the joint-run results if empty.\n",
+    "parameter: target_categories_label = []     # optional display names, same order as target_categories;\n",
+    "                                            # when given, every \"target\" column / tau*-block colname in\n",
+    "                                            # the output RDS is renamed to these (params$target_categories\n",
+    "                                            # holds the labels, params$target_categories_orig the originals).\n",
+    "parameter: target_anno_dir = path()         # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n",
+    "\n",
+    "input: traits_file\n",
+    "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
+    "\n",
+    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    library(pecotmr)\n",
+    "\n",
+    "    traits <- readLines(\"${traits_file}\")\n",
+    "    target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n",
+    "    target_lab  <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n",
+    "\n",
+    "    # Auto-detect single-target and joint-target output directories.\n",
+    "    her_root  <- \"${heritability_cwd}\"\n",
+    "    all_subdirs <- list.dirs(her_root, recursive = FALSE)\n",
+    "    single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n",
+    "    joint_name     <- paste0(\"${annotation_name}\", \"_joint\")\n",
+    "    single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n",
+    "    single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n",
+    "    single_dirs <- single_dirs[order(single_indices)]\n",
+    "    joint_dir   <- file.path(her_root, joint_name)\n",
+    "    has_joint   <- dir.exists(joint_dir)\n",
+    "\n",
+    "    message(sprintf(\"Detected %d single-target dirs%s\",\n",
+    "                    length(single_dirs),\n",
+    "                    if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n",
+    "\n",
+    "    # Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>\n",
+    "    # (polyfun appends .results / .log / .part_delete).\n",
+    "    trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n",
+    "    names(trait_single_prefixes) <- traits\n",
+    "\n",
+    "    if (has_joint) {\n",
+    "        trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n",
+    "    } else {\n",
+    "        trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n",
+    "    }\n",
+    "\n",
+    "    res <- sldsc_postprocessing_pipeline(\n",
+    "        trait_single_prefixes = trait_single_prefixes,\n",
+    "        trait_joint_prefix    = trait_joint_prefix,\n",
+    "        target_anno_dir       = \"${target_anno_dir}\",\n",
+    "        frqfile_dir           = \"${frqfile_dir}\",\n",
+    "        plink_name            = \"${plink_name}\",\n",
+    "        maf_cutoff            = ${maf_cutoff},\n",
+    "        target_categories     = if (length(target_cats) > 0) target_cats else NULL,\n",
+    "        target_labels         = if (length(target_lab)  > 0) target_lab  else NULL\n",
+    "    )\n",
+    "\n",
+    "    saveRDS(res, \"${_output[0]}\")\n",
+    "    message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[meta_subset]\n",
+    "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n",
+    "# the cached per-trait standardized results from [postprocess]. No regression rerun.\n",
+    "\n",
+    "parameter: postprocess_rds = path()           # output of [postprocess]\n",
+    "parameter: subset_traits_file = path()        # text file: one trait id per line, subset of those passed to [postprocess]\n",
+    "parameter: subset_name = str                  # label used in the output filename\n",
+    "parameter: target_categories = []             # target annotation names to meta on; if empty, uses all from postprocess output\n",
+    "# If [postprocess] was run with --target-categories-label, the cached RDS already\n",
+    "# carries the display names (params$target_categories = the labels), so leave\n",
+    "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n",
+    "\n",
+    "input: postprocess_rds, subset_traits_file\n",
+    "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
+    "\n",
+    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    library(pecotmr)\n",
+    "\n",
+    "    res <- readRDS(\"${postprocess_rds}\")\n",
+    "    subset_traits <- readLines(\"${subset_traits_file}\")\n",
+    "    target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n",
+    "    if (length(target_cats) == 0) target_cats <- res$params$target_categories\n",
+    "\n",
+    "    subset_per_trait <- res$per_trait[subset_traits]\n",
+    "\n",
+    "    # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n",
+    "    view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n",
+    "    view_joint  <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n",
+    "\n",
+    "    out <- list(\n",
+    "        tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"tau_star\")),   target_cats),\n",
+    "        tau_star_joint  = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint,  c, \"tau_star\")),   target_cats),\n",
+    "        enrichment      = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichment\")), target_cats),\n",
+    "        enrichstat      = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichstat\")), target_cats)\n",
+    "    )\n",
+    "\n",
+    "    saveRDS(out, \"${_output[0]}\")\n",
+    "    message(\"Subset meta complete; results written to ${_output[0]}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "SoS",
+   "language": "sos",
+   "name": "sos"
+  },
+  "language_info": {
+   "codemirror_mode": "sos",
+   "file_extension": ".sos",
+   "mimetype": "text/x-sos",
+   "name": "sos",
+   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
+   "pygments_lexer": "sos"
+  },
+  "sos": {
+   "kernels": [
+    [
+     "Bash",
+     "calysto_bash",
+     "Bash",
+     "#E6EEFF",
+     "shell"
+    ],
+    [
+     "R",
+     "ir",
+     "R",
+     "#DCDCDA",
+     "r"
+    ],
+    [
+     "SoS",
+     "sos",
+     "",
+     "",
+     "sos"
+    ]
+   ],
+   "panel": {
+    "displayed": true,
+    "height": 0
+   },
+   "version": "0.22.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

From 304c2e7401e51469a7d8950066e0f630ca45d1c4 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 23 Jun 2026 12:11:19 -0400
Subject: [PATCH 5/6] Delete code/SoS/enrichment/sldsc_enrichment.ipynb

---
 code/SoS/enrichment/sldsc_enrichment.ipynb | 1491 --------------------
 1 file changed, 1491 deletions(-)
 delete mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb

diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb
deleted file mode 100644
index 0569c353..00000000
--- a/code/SoS/enrichment/sldsc_enrichment.ipynb
+++ /dev/null
@@ -1,1491 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "# Stratified LD Score Regression (S-LDSC) Enrichment\n",
-    "\n",
-    "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n",
-    "\n",
-    "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Description\n",
-    "This notebook implements the pipeline of [S-LDSC](https://www.nature.com/articles/ng.3404) for LD score and functional enrichment analysis.\n",
-    "\n",
-    "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not the original LDSC from `bulik/ldsc` GitHub repo.**"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "Markdown"
-   },
-   "source": [
-    "Uses GWAS summary statistics together with annotation and LD reference-panel data to compute per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.\n",
-    "\n",
-    "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Methods - Workflow Overview\n",
-    "\n",
-    "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n",
-    "\n",
-    "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint dir is only emitted when $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per `--target-anno-dir`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n",
-    "\n",
-    "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n",
-    "\n",
-    "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n",
-    "\n",
-    "```r\n",
-    "res <- readRDS(\"sldsc_results.rds\")\n",
-    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
-    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
-    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
-    ")\n",
-    "```\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Theory\n",
-    "\n",
-    "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### LDSC model\n",
-    "\n",
-    "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n",
-    "\n",
-    "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n",
-    "\n",
-    "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.\n",
-    "\n",
-    "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Stratified LDSC\n",
-    "\n",
-    "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n",
-    "\n",
-    "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n",
-    "\n",
-    "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n",
-    "\n",
-    "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n",
-    "\n",
-    "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n",
-    "- **Continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n",
-    "\n",
-    "Under a polygenic model the per-SNP heritability for SNP $j$ is\n",
-    "\n",
-    "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n",
-    "\n",
-    "and the expected $\\chi^2$ statistic of SNP $j$ is\n",
-    "\n",
-    "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n",
-    "\n",
-    "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n",
-    "\n",
-    "Interpretation of $\\tau_C$:\n",
-    "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n",
-    "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n",
-    "\n",
-    "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Tau Estimation and Enrichment Analysis\n",
-    "\n",
-    "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n",
-    "\n",
-    "The pipeline has two computational layers:\n",
-    "\n",
-    "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. We do not re-implement this.\n",
-    "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n",
-    "\n",
-    "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n",
-    "\n",
-    "#### Notation\n",
-    "\n",
-    "For each annotation $C$ we use:\n",
-    "\n",
-    "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n",
-    "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n",
-    "\n",
-    "#### Reference panel and MAF cutoff\n",
-    "\n",
-    "All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our $K$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel.\n",
-    "\n",
-    "By default we restrict to MAF $> 5\\%$ per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n",
-    "\n",
-    "#### Quantities from the regression layer (polyfun)\n",
-    "\n",
-    "Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. From each polyfun run we obtain, per annotation:\n",
-    "\n",
-    "- $\\tau_C$ and its standard error — **(polyfun)**.\n",
-    "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n",
-    "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n",
-    "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n",
-    "\n",
-    "We also obtain, per run:\n",
-    "\n",
-    "- The total trait heritability $h^2_g$ — **(polyfun)**.\n",
-    "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n",
-    "\n",
-    "#### Quantities from the post-processing layer (pecotmr)\n",
-    "\n",
-    "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n",
-    "\n",
-    "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n",
-    "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n",
-    "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n",
-    "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n",
-    "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n",
-    "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n",
-    "\n",
-    "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n",
-    "\n",
-    "#### Standardized tau ($\\tau^*$)  —  (pecotmr)\n",
-    "\n",
-    "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)\n",
-    "\n",
-    "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n",
-    "\n",
-    "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n",
-    "\n",
-    "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n",
-    "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n",
-    "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n",
-    "\n",
-    "#### Differential per-SNP heritability (\"EnrichStat\")  —  (polyfun + pecotmr)\n",
-    "\n",
-    "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n",
-    "\n",
-    "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n",
-    "\n",
-    "The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as `Enrichment_p`. Its standard error is recovered from the reported p-value:\n",
-    "\n",
-    "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n",
-    "\n",
-    "This per-trait point + SE is the input to cross-trait meta-analysis.\n",
-    "\n",
-    "#### Reporting: binary vs. continuous annotations  —  (pecotmr)\n",
-    "\n",
-    "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n",
-    "\n",
-    "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n",
-    "\n",
-    "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n",
-    "\n",
-    "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n",
-    "\n",
-    "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n",
-    ">\n",
-    "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n",
-    ">\n",
-    "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n",
-    "\n",
-    "#### Cross-type comparison: always use $\\tau^*_C$  —  (pecotmr)\n",
-    "\n",
-    "For an apple-to-apple comparison **across binary and continuous annotations** — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion in the category) and $sd_C = $ empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n",
-    "\n",
-    "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n",
-    "\n",
-    "#### Joint analysis  —  (polyfun runs the regression; pecotmr standardizes both modes)\n",
-    "\n",
-    "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n",
-    "\n",
-    "### Meta-Analysis across Traits (Random Effects)  —  (pecotmr)\n",
-    "\n",
-    "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n",
-    "\n",
-    "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n",
-    "\n",
-    "where $\\hat\\theta_i$ is the per-trait estimate and $SE_i$ its standard error:\n",
-    "\n",
-    "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n",
-    "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n",
-    "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n",
-    "\n",
-    "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n",
-    "\n",
-    "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Minimal Working Example (MWE)\n",
-    "\n",
-    "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 1. `make_annotation_files_ldscore`\n",
-    "\n",
-    "*Annotation preparation and S-LDSC regression (polyfun).* This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "#### **Inputs**\n",
-    "\n",
-    "##### 1. Target Annotation File\n",
-    "\n",
-    "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n",
-    "- **Formats**:\n",
-    "    - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n",
-    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
-    "    - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). Single-target and joint-target analyses are produced automatically in one pipeline pass.\n",
-    "    - **Format** (the score column is optional; if absent, score is set to 1):\n",
-    "        - `is_range = False`:\n",
-    "        ```\n",
-    "        chr   pos   score\n",
-    "        1    10001   1\n",
-    "        1    10002   1\n",
-    "        ```\n",
-    "        - `is_range = True`:\n",
-    "        ```\n",
-    "        chr   start   end   score\n",
-    "        1    10001  20001  1\n",
-    "        1    30001  40001  1\n",
-    "        ```\n",
-    "\n",
-    "##### 2. Reference Annotation File (baseline-LD)\n",
-    "\n",
-    "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n",
-    "- **Formats**:\n",
-    "    - Text file listing baseline annotation files for all chromosomes.\n",
-    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
-    "\n",
-    "##### 3. Genome Reference File\n",
-    "\n",
-    "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n",
-    "- **Formats**:\n",
-    "    - Text file listing per-chromosome reference files, or files for specific chromosomes.\n",
-    "\n",
-    "##### 4. SNP List\n",
-    "\n",
-    "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n",
-    "- **Format**: A list of `rsid`s, one per line.\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/restricted/projectnb/xqtl/jaempawi/xqtl-protocol\n"
-     ]
-    }
-   ],
-   "source": [
-    "pwd"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
-      "  import pkg_resources\n",
-      "INFO: Running \u001b[32mmake_annotation_files_ldscore\u001b[0m: \n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=3) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=5) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=6) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=4) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=7) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=9) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=10) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=8) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=11) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=14) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=13) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=12) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=15) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=18) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=16) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=17) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=19) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=21) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=20) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... (66 items in 22 groups)\u001b[0m\n",
-      "INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.\n"
-     ]
-    }
-   ],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n",
-    "  --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n",
-    "  --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n",
-    "  --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n",
-    "  --annotation_name protocol_example \\\n",
-    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
-    "  --python_exec python \\\n",
-    "  --polyfun_path polyfun \\\n",
-    "  --cwd output/sldsc_ldscore -j 4\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Munge summary statistics (preprocessing, run before Step 2)\n",
-    "\n",
-    "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n",
-    "#     --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n",
-    "#     --n 0 \\\n",
-    "#     --min-info 0.6 \\\n",
-    "#     --min-maf 0.001 \\\n",
-    "#     --chi2-cutoff 30 \\\n",
-    "#     --polyfun_path data/github/polyfun \\\n",
-    "#     --cwd data/polyfun_new/example_data"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 2. `get_heritability`\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "#### **Inputs**\n",
-    "\n",
-    "##### 1. Allele Frequency Files (`.frq`, our panel)\n",
-    "\n",
-    "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n",
-    "\n",
-    "##### 2. GWAS Summary Statistics\n",
-    "\n",
-    "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n",
-    "- **Format**:\n",
-    "    ```\n",
-    "    CAD_META.filtered.sumstats.gz\n",
-    "    UKB.Lym.BOLT.sumstats.gz\n",
-    "    ```\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
-      "  import pkg_resources\n",
-      "INFO: Running \u001b[32mget_heritability\u001b[0m: \n",
-      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
-      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
-      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
-      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
-      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
-      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
-      "INFO: \u001b[32mget_heritability\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mget_heritability\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mget_heritability\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mget_heritability\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... (6 items in 3 groups)\u001b[0m\n",
-      "INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.\n"
-     ]
-    }
-   ],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n",
-    "  --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n",
-    "  --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
-    "  --sumstat_dir input/enrichment/sldsc \\\n",
-    "  --baseline_ld_dir input/enrichment/sldsc \\\n",
-    "  --weights_dir input/enrichment/sldsc \\\n",
-    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
-    "  --annotation_name protocol_example --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 3. `postprocess`: post-processing (pecotmr) and meta-analysis\n",
-    "\n",
-    "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n",
-    "\n",
-    "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n",
-    "\n",
-    "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n",
-    "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n",
-    "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n",
-    "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n",
-    "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n",
-    "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n",
-    "\n",
-    "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n",
-    "\n",
-    "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n",
-    "(which contains the $N$ single-target subdirectories and optionally the\n",
-    "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n",
-    "to produce per-trait standardized tables and the default random-effects\n",
-    "meta across all traits.\n",
-    "\n",
-    "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output — e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
-      "  import pkg_resources\n",
-      "INFO: Running \u001b[32mpostprocess\u001b[0m: \n",
-      "INFO: \u001b[32mpostprocess\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mpostprocess\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds\u001b[0m\n",
-      "INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.\n"
-     ]
-    }
-   ],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n",
-    "  --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
-    "  --heritability_cwd output/sldsc_heritability \\\n",
-    "  --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n",
-    "  --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n",
-    "  --annotation_name protocol_example --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Step 4. `meta_subset`: subset meta-analysis (`pecotmr::meta_sldsc_random`, optional)\n",
-    "\n",
-    "The default meta in Step 3 pools all traits supplied to `[postprocess]`. Use `[meta_subset]` to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached `.sldsc_postprocess.rds` output; it is lightweight and can be run interactively or in batch.\n",
-    "\n",
-    "The same re-analysis can be done directly in R:\n",
-    "\n",
-    "```r\n",
-    "res <- readRDS(\"sldsc_results.rds\")\n",
-    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
-    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
-    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
-    ")\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 17,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
-      "  import pkg_resources\n",
-      "INFO: Running \u001b[32mmeta_subset\u001b[0m: \n",
-      "INFO: \u001b[32mmeta_subset\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
-      "INFO: \u001b[32mmeta_subset\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds\u001b[0m\n",
-      "INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.\n"
-     ]
-    }
-   ],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n",
-    "  --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n",
-    "  --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n",
-    "  --subset_name category1 --target_categories ANNOT_0 \\\n",
-    "  --annotation_name protocol_example --python_exec python \\\n",
-    "  --polyfun_path ../polyfun \\\n",
-    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Output\n",
-    "\n",
-    "### Output summary\n",
-    "\n",
-    "| Stage | Cached on disk | Recomputable from | Purpose |\n",
-    "|---|---|---|---|\n",
-    "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n",
-    "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n",
-    "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n",
-    "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n",
-    "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Per-stage outputs\n",
-    "\n",
-    "Each workflow writes into its `--cwd`:\n",
-    "\n",
-    "- **make_annotation_files_ldscore** — polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n",
-    "- **get_heritability** — per trait and per target directory, the S-LDSC regression outputs `<trait>.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_<ref-ld-index>` suffix.\n",
-    "- **postprocess** — a single `<annotation_name>.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).\n",
-    "- **meta_subset** — a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Anticipated Results\n",
-    "\n",
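-    "For each target annotation, the pipeline produces per-trait S-LDSC estimates ($\\tau^*$, enrichment $E$, and the EnrichStat p-value) together with DerSimonian-Laird random-effects meta-analysis tables across traits."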
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Command interface\n",
-    "\n",
-    "List all workflows and their options:\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/sldsc_enrichment.ipynb -h"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Workflow implementation\n",
-    "\n",
-    "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[global]\n",
-    "# Path to the work directory of the analysis.\n",
-    "parameter: cwd = path('output')\n",
-    "# Prefix for the analysis output\n",
-    "parameter: annotation_name = str\n",
-    "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n",
-    "parameter: polyfun_path   = path # e.g. \"/home/you/tools/polyfun\"\n",
-    "\n",
-    "# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n",
-    "# and HapMap3-style regression weights are common-variant by construction).\n",
-    "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n",
-    "# Other values would require recomputing LD scores at that cutoff.\n",
-    "parameter: maf_cutoff = 0.05\n",
-    "\n",
-    "# for make_annotation_files_ldscore workflow:\n",
-    "parameter: annotation_file = path()\n",
-    "parameter: reference_anno_file = path()\n",
-    "parameter: genome_ref_file = path() # with .bed\n",
-    "parameter: chromosome = []\n",
-    "parameter: snp_list = path()\n",
-    "parameter: ld_wind_kb = 0 # use kb if the value is provided\n",
-    "parameter: ld_wind_cm = 1.0 # default using ld_wind_cm\n",
-    "\n",
-    "# for get_heritability workflow.\n",
-    "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n",
-    "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n",
-    "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n",
-    "parameter: all_traits_file = path() # Text file with one munged GWAS summary-statistics file name per row, e.g. CAD_META.filtered.sumstats.gz\n",
-    "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n",
-    "parameter: target_anno_dir = path()  # Directory containing target annotation files: output of ldscore\n",
-    "parameter: baseline_ld_dir = path()  # Directory containing baseline LD score files (computed against our panel)\n",
-    "parameter: frqfile_dir = path()  # Directory containing allele frequency files (.frq, our panel)\n",
-    "parameter: plink_name = \"ADSP_chr\"\n",
-    "parameter: weights_dir = path()  # Directory containing LD weights (computed against our panel)\n",
-    "parameter: baseline_name = \"baseline_chr\"  # Prefix of baseline annotation files\n",
-    "parameter: weight_name = \"weights_chr\"  # Prefix of LD weights files\n",
-    "parameter: n_blocks = 200\n",
-    "\n",
-    "# Number of threads\n",
-    "parameter: numThreads = 16\n",
-    "# For cluster jobs, number of commands to run per job\n",
-    "parameter: job_size = 1\n",
-    "parameter: walltime = '12h'\n",
-    "parameter: mem = '16G'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "Python 3 (ipykernel)"
-   },
-   "source": [
-    "## Make Annotation File"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[make_annotation_files_ldscore]\n",
-    "# Annotation preparation. Takes one annotation_file with N target annotations\n",
-    "# and produces, in one invocation, any combination of:\n",
-    "#   - N single-target LD-score directories (when compute_single = True, default)\n",
-    "#   - 1 joint LD-score directory containing all N (when compute_joint = True\n",
-    "#     and N >= 2, default)\n",
-    "#\n",
-    "# Outputs per chromosome <chr>:\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz   (i in 1..N, when compute_single)\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M\n",
-    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50  (when .frq present)\n",
-    "#\n",
-    "#   <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...}                (when compute_joint and N>=2)\n",
-    "#\n",
-    "# Workflows:\n",
-    "#   - Workflow A (\"all at once\"): compute_single=True, compute_joint=True (defaults).\n",
-    "#     Produces both; fits the case where you have already chosen the joint set.\n",
-    "#   - Workflow B (\"exploratory then conditional\"):\n",
-    "#       Step 1: compute_single=True, compute_joint=False.\n",
-    "#               Run on N candidate annotations -> N single-target dirs.\n",
-    "#               Inspect single-target results, identify K significant ones.\n",
-    "#       Step 2: compute_single=False, compute_joint=True.\n",
-    "#               Run on a NEW annotation_file with the K selected annotations\n",
-    "#               -> 1 joint dir with the conditional model.\n",
-    "\n",
-    "#\n",
-    "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n",
-    "#     column name, and the CM requirement ---\n",
-    "#   --snp_list given  -> ldsc.py --l2 --print-snps   -> output .l2.ldscore.gz\n",
-    "#   --snp_list absent -> compute_ldscores.py         -> output .l2.ldscore.parquet\n",
-    "#\n",
-    "#   LD-score column name (this is what becomes the .results \"Category\" in\n",
-    "#   [get_heritability], with a \"_<ref-ld-index>\" suffix appended there):\n",
-    "#     * compute_ldscores.py  ALWAYS keeps the annot column name(s):\n",
-    "#         single annot column \"ANNOT\"          -> ldscore column \"ANNOT\"\n",
-    "#         joint  annot columns \"ANNOT_1\",\"ANNOT_2\",...  -> \"ANNOT_1\",\"ANNOT_2\",...\n",
-    "#     * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n",
-    "#       HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n",
-    "#       column name. With >=2 annotations it uses \"<annot_name>L2\"\n",
-    "#       (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n",
-    "#     => a single-target snplist run reports \"L2_0\" in .results, while a\n",
-    "#        single-target no-snplist run reports \"ANNOT_0\".  [postprocess]\n",
-    "#        auto-detects either; this only matters if you pass --target-categories explicitly.\n",
-    "#\n",
-    "#   CM column requirement for snplist:  ldsc.py --l2 --print-snps requires the\n",
-    "#   target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n",
-    "#   the plink .bim (same SNP set, same row order). This step handles both\n",
-    "#   internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands\n",
-    "#   the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n",
-    "#   carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n",
-    "#   if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n",
-    "#\n",
-    "parameter: compute_single = True\n",
-    "parameter: compute_joint = True\n",
-    "parameter: score_column = 3\n",
-    "parameter: is_range = False\n",
-    "\n",
-    "import pandas as pd\n",
-    "import os\n",
-    "\n",
-    "if not (compute_single or compute_joint):\n",
-    "    raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be True\")\n",
-    "\n",
-    "def adapt_file_path(file_path, reference_file):\n",
-    "    reference_path = os.path.dirname(reference_file)\n",
-    "    if os.path.isfile(file_path):\n",
-    "        return file_path\n",
-    "    file_name = os.path.basename(file_path)\n",
-    "    if os.path.isfile(file_name):\n",
-    "        return file_name\n",
-    "    file_in_ref_dir = os.path.join(reference_path, file_name)\n",
-    "    if os.path.isfile(file_in_ref_dir):\n",
-    "        return file_in_ref_dir\n",
-    "    file_prefixed = os.path.join(reference_path, file_path)\n",
-    "    if os.path.isfile(file_prefixed):\n",
-    "        return file_prefixed\n",
-    "    raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n",
-    "\n",
-    "\n",
-    "# ---- Parse inputs and determine N ----\n",
-    "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n",
-    "    str(reference_anno_file).endswith('annot.gz')):\n",
-    "    # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n",
-    "    target_files_direct = str(annotation_file).split(',')\n",
-    "    N_targets = len(target_files_direct)\n",
-    "    target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n",
-    "    input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n",
-    "    if len(chromosome) > 0:\n",
-    "        input_chroms = [int(x) for x in chromosome]\n",
-    "    else:\n",
-    "        input_chroms = [0]\n",
-    "else:\n",
-    "    # Case 2: txt list with #id and one or more 'path' columns\n",
-    "    target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n",
-    "    reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n",
-    "    genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n",
-    "\n",
-    "    target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n",
-    "    reference_files[\"#id\"]  = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n",
-    "    genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n",
-    "\n",
-    "    path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n",
-    "    N_targets = len(path_columns)\n",
-    "    target_names = path_columns[:]   # 'path', 'path1', 'path2', ...\n",
-    "\n",
-    "    for col in path_columns:\n",
-    "        target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n",
-    "    reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n",
-    "    genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n",
-    "\n",
-    "    merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n",
-    "    if len(chromosome) > 0:\n",
-    "        merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n",
-    "\n",
-    "    rows = merged.values.tolist()\n",
-    "    input_chroms = [r[0] for r in rows]\n",
-    "    input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n",
-    "\n",
-    "# ---- Determine output format ----\n",
-    "use_print_snps = snp_list.is_file()\n",
-    "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n",
-    "\n",
-    "if ld_wind_kb > 0:\n",
-    "    use_kb_window = True\n",
-    "    ld_window_param = ld_wind_kb\n",
-    "    ld_window_flag = \"--ld-wind-kb\"\n",
-    "else:\n",
-    "    use_kb_window = False\n",
-    "    ld_window_param = ld_wind_cm\n",
-    "    ld_window_flag = \"--ld-wind-cm\"\n",
-    "\n",
-    "emit_single = compute_single\n",
-    "emit_joint  = compute_joint and N_targets >= 2\n",
-    "\n",
-    "# ---- Build per-chromosome output list ----\n",
-    "def chrom_outputs(chrom):\n",
-    "    outs = []\n",
-    "    if emit_single:\n",
-    "        for i in range(N_targets):\n",
-    "            name = f\"{annotation_name}_single_{i+1}\"\n",
-    "            prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
-    "            outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
-    "    if emit_joint:\n",
-    "        name = f\"{annotation_name}_joint\"\n",
-    "        prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
-    "        outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
-    "    return outs\n",
-    "\n",
-    "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n",
-    "\n",
-    "output: chrom_outputs(input_chroms[_index])\n",
-    "\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step A: write the requested .annot files for this chromosome.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
-    "    library(data.table)\n",
-    "\n",
-    "    clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n",
-    "\n",
-    "    process_range_data <- function(data, chr_value) {\n",
-    "        data$chr <- clean_chr(data$chr)\n",
-    "        data <- data[data$chr == chr_value,]\n",
-    "        if (nrow(data) == 0) return(NULL)\n",
-    "        expanded <- lapply(seq_len(nrow(data)), function(j) {\n",
-    "            row <- data[j,]\n",
-    "            pos_seq <- seq(row$start, row$end - 1)\n",
-    "            result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n",
-    "            if (ncol(data) > 3) {\n",
-    "                for (col in 4:ncol(data))\n",
-    "                    result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n",
-    "            }\n",
-    "            result\n",
-    "        })\n",
-    "        unique(rbindlist(expanded))\n",
-    "    }\n",
-    "\n",
-    "    process_annotation <- function(target_anno, ref_anno, score_column_value) {\n",
-    "        target_anno <- as.data.frame(target_anno)\n",
-    "        ref_anno    <- as.data.frame(ref_anno)\n",
-    "        target_anno$chr <- clean_chr(target_anno$chr)\n",
-    "        ref_anno$CHR    <- clean_chr(ref_anno$CHR)\n",
-    "        chr_value <- unique(ref_anno$CHR)\n",
-    "        anno_scores <- rep(0, nrow(ref_anno))\n",
-    "        match_pos <- match(target_anno$pos, ref_anno$BP)\n",
-    "        valid_pos <- as.numeric(na.omit(match_pos))\n",
-    "        if (score_column_value <= ncol(target_anno)) {\n",
-    "            anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n",
-    "        } else {\n",
-    "            anno_scores[valid_pos] <- 1\n",
-    "            print(\"Warning: score column does not exist; setting scores to 1\")\n",
-    "        }\n",
-    "        anno_scores\n",
-    "    }\n",
-    "\n",
-    "    read_target_anno <- function(file_path, ref_anno) {\n",
-    "        if (endsWith(file_path, \"rds\")) {\n",
-    "            target_anno <- readRDS(file_path)\n",
-    "            return(process_annotation(target_anno, ref_anno, ${score_column}))\n",
-    "        }\n",
-    "        target_anno <- fread(file_path)\n",
-    "        if (${\"TRUE\" if is_range else \"FALSE\"}) {\n",
-    "            names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
-    "            target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n",
-    "            if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n",
-    "        } else {\n",
-    "            names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n",
-    "        }\n",
-    "        process_annotation(target_anno, ref_anno, ${score_column})\n",
-    "    }\n",
-    "\n",
-    "    # ---- Read reference annotation ----\n",
-    "    ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n",
-    "    if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n",
-    "\n",
-    "    # ---- Compute per-target annotation scores ----\n",
-    "    target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n",
-    "    N_local <- length(target_files)\n",
-    "    score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n",
-    "\n",
-    "    emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n",
-    "    emit_joint_local  <- ${\"TRUE\" if emit_joint  else \"FALSE\"}\n",
-    "    use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n",
-    "    bfile_prefix         <- \"${_input[-1]:na}\"\n",
-    "\n",
-    "    # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n",
-    "    # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n",
-    "    normalize_for_ldsc <- function(df) {\n",
-    "        if (!use_print_snps_local) return(df)\n",
-    "        df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n",
-    "        annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n",
-    "        bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n",
-    "                                   col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n",
-    "        bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n",
-    "        idx <- match(bim$SNP, df$SNP)\n",
-    "        out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n",
-    "                          stringsAsFactors = FALSE)\n",
-    "        for (col in annot_cols) {\n",
-    "            v <- rep(0, nrow(bim))\n",
-    "            non_na <- !is.na(idx)\n",
-    "            v[non_na] <- df[[col]][idx[non_na]]\n",
-    "            out[[col]] <- v\n",
-    "        }\n",
-    "        out\n",
-    "    }\n",
-    "\n",
-    "    # ---- Write N single-target .annot files (when requested) ----\n",
-    "    if (emit_single_local) {\n",
-    "        for (i in seq_len(N_local)) {\n",
-    "            out_anno <- ref_anno\n",
-    "            out_anno$ANNOT <- score_list[[i]]\n",
-    "            out_anno <- normalize_for_ldsc(out_anno)\n",
-    "            name <- paste0(\"${annotation_name}\", \"_single_\", i)\n",
-    "            out_path_gz  <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n",
-    "            out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n",
-    "            dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n",
-    "            fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
-    "        }\n",
-    "    }\n",
-    "\n",
-    "    # ---- Optionally write joint .annot ----\n",
-    "    if (emit_joint_local) {\n",
-    "        joint_anno <- ref_anno\n",
-    "        for (i in seq_len(N_local)) {\n",
-    "            joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n",
-    "        }\n",
-    "        joint_anno <- normalize_for_ldsc(joint_anno)\n",
-    "        joint_name   <- paste0(\"${annotation_name}\", \"_joint\")\n",
-    "        joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n",
-    "        joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n",
-    "        dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n",
-    "        fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
-    "    }\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step B: gzip all annot files. Uses expand=\"$[ ]\" so bash ${var} survives.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
-    "    set -e\n",
-    "    annots=()\n",
-    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
-    "        for i in $(seq 1 $[N_targets]); do\n",
-    "            annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n",
-    "        done\n",
-    "    fi\n",
-    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
-    "        annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n",
-    "    fi\n",
-    "    for a in \"${annots[@]}\"; do\n",
-    "        gzip -f \"$a\"\n",
-    "    done\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n",
-    "    set -e\n",
-    "    chrom=\"$[input_chroms[_index]]\"\n",
-    "\n",
-    "    run_polyfun() {\n",
-    "        local annot=\"$1\"\n",
-    "        local out_prefix=\"$2\"\n",
-    "        if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n",
-    "            $[python_exec] $[polyfun_path]/ldsc.py \\\n",
-    "                --print-snps $[snp_list] \\\n",
-    "                $[ld_window_flag] $[ld_window_param] \\\n",
-    "                --out \"$out_prefix\" \\\n",
-    "                --bfile $[_input[-1]:nar] \\\n",
-    "                --yes-really \\\n",
-    "                --annot \"$annot\" \\\n",
-    "                --l2\n",
-    "        else\n",
-    "            $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n",
-    "                --annot \"$annot\" \\\n",
-    "                --bfile $[_input[-1]:nar] \\\n",
-    "                $[ld_window_flag] $[ld_window_param] \\\n",
-    "                --out \"${out_prefix}.$[ldscore_ext]\" \\\n",
-    "                --allow-missing\n",
-    "        fi\n",
-    "    }\n",
-    "\n",
-    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
-    "        for i in $(seq 1 $[N_targets]); do\n",
-    "            name=\"$[annotation_name]_single_$i\"\n",
-    "            annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
-    "            prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
-    "            run_polyfun \"$annot\" \"$prefix\"\n",
-    "        done\n",
-    "    fi\n",
-    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
-    "        name=\"$[annotation_name]_joint\"\n",
-    "        annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
-    "        prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
-    "        run_polyfun \"$annot\" \"$prefix\"\n",
-    "    fi\n",
-    "\n",
-    "# ----------------------------------------------------------------------------\n",
-    "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n",
-    "# ----------------------------------------------------------------------------\n",
-    "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n",
-    "    suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n",
-    "    use_print_snps <- ${str(use_print_snps).upper()}\n",
-    "\n",
-    "    chrom <- \"${input_chroms[_index]}\"\n",
-    "    # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n",
-    "    frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n",
-    "    has_frq  <- file.exists(frq_file)\n",
-    "    frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n",
-    "\n",
-    "    write_M_files <- function(annot_path, ldscore_path, m_path) {\n",
-    "        if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n",
-    "            cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n",
-    "        }\n",
-    "        ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n",
-    "            suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n",
-    "        } else fread(ldscore_path)\n",
-    "        annot_dt <- fread(annot_path)\n",
-    "        annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n",
-    "        merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n",
-    "        std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n",
-    "        annot_cols <- setdiff(names(merged), std_cols)\n",
-    "        if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n",
-    "        M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
-    "        writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n",
-    "        if (has_frq) {\n",
-    "            common <- merged[!is.na(MAF) & MAF > 0.05, ]\n",
-    "            M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
-    "            writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n",
-    "        }\n",
-    "    }\n",
-    "\n",
-    "    targets <- c()\n",
-    "    if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n",
-    "        for (i in seq_len(${N_targets})) {\n",
-    "            targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n",
-    "        }\n",
-    "    }\n",
-    "    if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n",
-    "        targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n",
-    "    }\n",
-    "    for (name in targets) {\n",
-    "        annot_path   <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n",
-    "        ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n",
-    "        m_path       <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n",
-    "        write_M_files(annot_path, ldscore_path, m_path)\n",
-    "    }\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "Python 3 (ipykernel)"
-   },
-   "source": [
-    "## Calculate Functional Enrichment using Annotations"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[get_heritability]\n",
-    "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n",
-    "# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to\n",
-    "# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.\n",
-    "#\n",
-    "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n",
-    "# typically N _single_<i> directories plus optionally one _joint directory.\n",
-    "\n",
-    "#\n",
-    "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n",
-    "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n",
-    "# always gets exactly two LD-score sources, in this order:\n",
-    "#     \"<target_dir>/<target>.\"   (index 0)  ,  \"<baseline_dir>/<baseline>\"   (index 1)\n",
-    "# With --overlap-annot, every annotation column in the .results \"Category\" is\n",
-    "# named  <ldscore_column_name>_<ref-ld-index>:\n",
-    "#     index 0 = the target file   -> \"ANNOT_0\"  (no-snplist; compute_ldscores.py keeps the annot col name)\n",
-    "#                                  -> \"L2_0\"    (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n",
-    "#                                  -> \"ANNOT_1_0\",\"ANNOT_2_0\"      (no-snplist joint dir, N>=2 annot cols)\n",
-    "#                                  -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\"  (snplist joint dir, N>=2 -> \"<name>L2\")\n",
-    "#     index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ...  (the 97 baseline annots)\n",
-    "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n",
-    "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n",
-    "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's\n",
-    "#  \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n",
-    "#  column name.)  [postprocess] auto-detects the target Category; if you instead pass\n",
-    "# --target-categories, the names must match this column exactly.\n",
-    "#\n",
-    "parameter: target_anno_dirs = paths()\n",
-    "parameter: all_traits = []\n",
-    "\n",
-    "import os\n",
-    "\n",
-    "with open(all_traits_file, 'r') as f:\n",
-    "    trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n",
-    "\n",
-    "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n",
-    "input_list  = []\n",
-    "target_meta = []\n",
-    "for td in target_anno_dirs:\n",
-    "    for t in trait_paths:\n",
-    "        input_list.append(t)\n",
-    "        target_meta.append(str(td))\n",
-    "\n",
-    "input: input_list, group_by = 1, group_with = \"target_meta\"\n",
-    "\n",
-    "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\",  \\\n",
-    "        f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n",
-    "\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
-    "\n",
-    "bash: expand = \"${ }\"\n",
-    "    target_dir=\"${target_meta[_index]}\"\n",
-    "    target_name=\"$(basename ${target_meta[_index]})\"\n",
-    "    trait=\"$(basename ${_input[0]})\"\n",
-    "    output_dir=\"${cwd:a}/$target_name\"\n",
-    "    mkdir -p \"$output_dir\"\n",
-    "\n",
-    "    # MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n",
-    "    # other values would require recomputing LD scores at that cutoff.\n",
-    "    frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n",
-    "    if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n",
-    "        echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n",
-    "        frq_option=\"--not-M-5-50\"\n",
-    "    elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n",
-    "        if [ -f \"$frq_file_check\" ]; then\n",
-    "            echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n",
-    "            frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n",
-    "        else\n",
-    "            echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n",
-    "            echo \"       but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n",
-    "            echo \"       Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n",
-    "            exit 1\n",
-    "        fi\n",
-    "    else\n",
-    "        echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n",
-    "        echo \"       0.05 (sLDSC default) are accepted. Other values would require\"\n",
-    "        echo \"       recomputing LD scores at that cutoff.\"\n",
-    "        exit 1\n",
-    "    fi\n",
-    "\n",
-    "    run_ldsc() {\n",
-    "        local extra_args=\"$1\"\n",
-    "        ${python_exec} ${polyfun_path}/ldsc.py \\\n",
-    "            --h2 ${sumstat_dir}/$trait \\\n",
-    "            --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n",
-    "            --out \"$output_dir/$trait\" \\\n",
-    "            --overlap-annot \\\n",
-    "            --w-ld-chr ${weights_dir}/${weight_name} \\\n",
-    "            $frq_option \\\n",
-    "            --print-coefficients \\\n",
-    "            --print-delete-vals \\\n",
-    "            --n-blocks ${n_blocks} \\\n",
-    "            $extra_args\n",
-    "    }\n",
-    "\n",
-    "    run_ldsc \"\"\n",
-    "    log_file=\"$output_dir/$trait.log\"\n",
-    "\n",
-    "    # FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10\n",
-    "    for max in 30 20 10; do\n",
-    "        if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
-    "            echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n",
-    "            run_ldsc \"--chisq-max $max\"\n",
-    "        else\n",
-    "            break\n",
-    "        fi\n",
-    "    done\n",
-    "\n",
-    "    if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
-    "        echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n",
-    "        echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n",
-    "    fi\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[munge_sumstats_polyfun]\n",
-    "parameter: sumstats  = path\n",
-    "parameter: n       = 0\n",
-    "parameter: min_info = 0.6\n",
-    "parameter: min_maf  = 0.001\n",
-    "parameter: keep_hla = False\n",
-    "parameter: chi2_cut = 30\n",
-    "input: sumstats\n",
-    "output: f\"{_input:n}.munged.parquet\"\n",
-    "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n",
-    "    {python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n",
-    "        --sumstats {_input} \\\n",
-    "        --out {_output} \\\n",
-    "        {'--n {}'.format(n) if n>0 else ''} \\\n",
-    "        {'--min-info {}'.format(min_info)} \\\n",
-    "        {'--min-maf {}'.format(min_maf)} \\\n",
-    "        {'--chi2-cutoff {}'.format(chi2_cut)} \\\n",
-    "        {'--keep-hla' if keep_hla else ''} \\\n",
-    "        --remove-strand-ambig"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[postprocess]\n",
-    "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n",
-    "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n",
-    "# single-target and (when present) joint-target runs, computes Gazal-style\n",
-    "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n",
-    "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n",
-    "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n",
-    "\n",
-    "parameter: traits_file = path()             # text file: one trait sumstats filename per line\n",
-    "parameter: heritability_cwd = path()        # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)\n",
-    "parameter: target_categories = []           # target annotation names. Auto-detected from the joint-run results if empty.\n",
-    "parameter: target_categories_label = []     # optional display names, same order as target_categories;\n",
-    "                                            # when given, every \"target\" column / tau*-block colname in\n",
-    "                                            # the output RDS is renamed to these (params$target_categories\n",
-    "                                            # holds the labels, params$target_categories_orig the originals).\n",
-    "parameter: target_anno_dir = path()         # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n",
-    "\n",
-    "input: traits_file\n",
-    "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
-    "\n",
-    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
-    "    library(pecotmr)\n",
-    "\n",
-    "    traits <- readLines(\"${traits_file}\")\n",
-    "    target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n",
-    "    target_lab  <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n",
-    "\n",
-    "    # Auto-detect single-target and joint-target output directories.\n",
-    "    her_root  <- \"${heritability_cwd}\"\n",
-    "    all_subdirs <- list.dirs(her_root, recursive = FALSE)\n",
-    "    single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n",
-    "    joint_name     <- paste0(\"${annotation_name}\", \"_joint\")\n",
-    "    single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n",
-    "    single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n",
-    "    single_dirs <- single_dirs[order(single_indices)]\n",
-    "    joint_dir   <- file.path(her_root, joint_name)\n",
-    "    has_joint   <- dir.exists(joint_dir)\n",
-    "\n",
-    "    message(sprintf(\"Detected %d single-target dirs%s\",\n",
-    "                    length(single_dirs),\n",
-    "                    if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n",
-    "\n",
-    "    # Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>\n",
-    "    # (polyfun appends .results / .log / .part_delete).\n",
-    "    trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n",
-    "    names(trait_single_prefixes) <- traits\n",
-    "\n",
-    "    if (has_joint) {\n",
-    "        trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n",
-    "    } else {\n",
-    "        trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n",
-    "    }\n",
-    "\n",
-    "    res <- sldsc_postprocessing_pipeline(\n",
-    "        trait_single_prefixes = trait_single_prefixes,\n",
-    "        trait_joint_prefix    = trait_joint_prefix,\n",
-    "        target_anno_dir       = \"${target_anno_dir}\",\n",
-    "        frqfile_dir          = \"${frqfile_dir}\",\n",
-    "        plink_name           = \"${plink_name}\",\n",
-    "        maf_cutoff           = ${maf_cutoff},\n",
-    "        target_categories    = if (length(target_cats) > 0) target_cats else NULL,\n",
-    "        target_labels        = if (length(target_lab)  > 0) target_lab  else NULL\n",
-    "    )\n",
-    "\n",
-    "    saveRDS(res, \"${_output[0]}\")\n",
-    "    message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[meta_subset]\n",
-    "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n",
-    "# the cached per-trait standardized results from [postprocess]. No regression rerun.\n",
-    "\n",
-    "parameter: postprocess_rds = path()           # output of [postprocess]\n",
-    "parameter: subset_traits_file = path()        # text file: one trait id per line, subset of those passed to [postprocess]\n",
-    "parameter: subset_name = str                  # label used in the output filename\n",
-    "parameter: target_categories = []             # target annotation names to meta on; if empty, uses all from postprocess output\n",
-    "# If [postprocess] was run with --target-categories-label, the cached RDS already\n",
-    "# carries the display names (params$target_categories = the labels), so leave\n",
-    "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n",
-    "\n",
-    "input: postprocess_rds, subset_traits_file\n",
-    "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
-    "\n",
-    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
-    "    library(pecotmr)\n",
-    "\n",
-    "    res <- readRDS(\"${postprocess_rds}\")\n",
-    "    subset_traits <- readLines(\"${subset_traits_file}\")\n",
-    "    target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n",
-    "    if (length(target_cats) == 0) target_cats <- res$params$target_categories\n",
-    "\n",
-    "    subset_per_trait <- res$per_trait[subset_traits]\n",
-    "\n",
-    "    # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n",
-    "    view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n",
-    "    view_joint  <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n",
-    "\n",
-    "    out <- list(\n",
-    "        tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"tau_star\")),   target_cats),\n",
-    "        tau_star_joint  = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint,  c, \"tau_star\")),   target_cats),\n",
-    "        enrichment      = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichment\")), target_cats),\n",
-    "        enrichstat      = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, \"enrichstat\")), target_cats)\n",
-    "    )\n",
-    "\n",
-    "    saveRDS(out, \"${_output[0]}\")\n",
-    "    message(\"Subset meta complete; results written to ${_output[0]}\")"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "SoS",
-   "language": "sos",
-   "name": "sos"
-  },
-  "language_info": {
-   "codemirror_mode": "sos",
-   "file_extension": ".sos",
-   "mimetype": "text/x-sos",
-   "name": "sos",
-   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
-   "pygments_lexer": "sos"
-  },
-  "sos": {
-   "kernels": [
-    [
-     "Bash",
-     "calysto_bash",
-     "Bash",
-     "#E6EEFF",
-     "shell"
-    ],
-    [
-     "R",
-     "ir",
-     "R",
-     "#DCDCDA",
-     "r"
-    ],
-    [
-     "SoS",
-     "sos",
-     "",
-     "",
-     "sos"
-    ]
-   ],
-   "panel": {
-    "displayed": true,
-    "height": 0
-   },
-   "version": "0.22.4"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}

From d1284945db46c6ee4175d7ea3c8149c0b96c9141 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 23 Jun 2026 12:11:46 -0400
Subject: [PATCH 6/6] remove absolute local path

---
 code/SoS/enrichment/sldsc_enrichment.ipynb | 1472 ++++++++++++++++++++
 1 file changed, 1472 insertions(+)
 create mode 100644 code/SoS/enrichment/sldsc_enrichment.ipynb

diff --git a/code/SoS/enrichment/sldsc_enrichment.ipynb b/code/SoS/enrichment/sldsc_enrichment.ipynb
new file mode 100644
index 00000000..8b352789
--- /dev/null
+++ b/code/SoS/enrichment/sldsc_enrichment.ipynb
@@ -0,0 +1,1472 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "# Stratified LD Score Regression (S-LDSC) Enrichment\n",
+    "\n",
+    "Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The **Steps** section below gives one ready-to-run `sos run` command per workflow, using the toy inputs symlinked under `input/`.\n",
+    "\n",
+    "> **Environment note.** Steps 1–2 (`make_annotation_files_ldscore`, `get_heritability`) wrap the external **polyfun** toolkit (`compute_ldscores.py`, `ldsc.py`, `munge_polyfun_sumstats.py`) and require pre-computed reference-panel files (baseline-LD scores, LD weights, `.frq`, and PLINK `.bed/.bim/.fam`). polyfun is **not installed in this environment** and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (`postprocess`, `meta_subset`) use `pecotmr::sldsc_postprocessing_pipeline` (available here) and read the `.results`/`.log` files produced by Step 2.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Description\n",
+    "This notebook implements the pipeline of [S-LDSC](https://www.nature.com/articles/ng.3404) for LD score and functional enrichment analysis.\n",
+    "\n",
+    "**Important: the S-LDSC implementation comes from the [polyfun](https://github.com/omerwe/polyfun/tree/master) package, not the original LDSC from `bulik/ldsc` GitHub repo.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "Markdown"
+   },
+   "source": [
+    "The pipeline uses GWAS summary statistics together with annotation and LD reference-panel data to estimate per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analyses.\n",
+    "\n",
+    "**Background.** LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; the standardized coefficient $\\tau^*$ (Gazal et al. 2017) puts annotation effects on a common scale across annotations and traits. The model details and the $\\tau^*$/EnrichStat definitions follow below.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Methods - Workflow Overview\n",
+    "\n",
+    "The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized $\\tau^*$ and meta-analysis (the `pecotmr` package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the **Steps** section below.\n",
+    "\n",
+    "**Stage 1 - polyfun.** Three SoS workflows wrap polyfun: `make_annotation_files_ldscore` converts target annotations into polyfun `.annot.gz` and runs `compute_ldscores.py` (toggles `compute_single` and `compute_joint`, both default `True`; the joint directory is only emitted when the number of target annotations $N \\geq 2$); `munge_sumstats_polyfun` preprocesses each GWAS into LDSC format; `get_heritability` runs polyfun's `ldsc.py` once per directory passed via `--target_anno_dirs`, enforcing the MAF cutoff via `--frqfile-chr` (`maf_cutoff` accepts only `0` or `0.05`).\n",
+    "\n",
+    "**Stage 2 - pecotmr post-processing.** A single `pecotmr::sldsc_postprocessing_pipeline` call consumes all polyfun outputs: it extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value and per-block jackknife $\\tau$ values; computes $sd_C$ and $M_{\\mathrm{ref}}$ over the regression's MAF-cutoff SNP set; standardizes $\\tau \\to \\tau^*$ for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing three meta tables ($\\tau^*$ cross-type comparable, $E$ within-binary, EnrichStat within-binary). Output is an R list with `per_trait` and `meta` entries.\n",
+    "\n",
+    "**Stage 3 - subset meta-analysis.** `pecotmr::meta_sldsc_random` re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):\n",
+    "\n",
+    "```r\n",
+    "res <- readRDS(\"sldsc_results.rds\")\n",
+    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
+    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
+    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
+    ")\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Theory\n",
+    "\n",
+    "The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### LDSC model\n",
+    "\n",
+    "Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$ where $p$ is the minor allele frequency (MAF), the expected $\\chi^2$ statistic of variant $j$ is:\n",
+    "\n",
+    "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{N\\,h^2\\,\\ell_j}{M} \\;+\\; N a \\;+\\; 1 \\quad (1)$$\n",
+    "\n",
+    "where $N$ is the sample size; $M$ is the number of SNPs, so that $h^2/M$ is the average heritability per SNP; $a$ measures the contribution of confounding biases such as cryptic relatedness and population stratification; and $\\ell_j = \\sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017, *Annals of Applied Statistics*).\n",
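+    "\n",
+    "As a rough worked illustration of Equation (1), take hypothetical values (chosen only for easy arithmetic, not taken from any dataset) $N = 10^5$, $h^2 = 0.5$, $M = 10^6$, $\\ell_j = 100$ and no confounding ($a = 0$):\n",
+    "\n",
+    "$$E[\\chi^2_j \\mid \\ell_j] \\;=\\; \\frac{10^5 \\times 0.5 \\times 100}{10^6} \\;+\\; 10^5 \\times 0 \\;+\\; 1 \\;=\\; 5 + 0 + 1 \\;=\\; 6$$\n",
+    "\n",
+    "Under the same assumed values, a variant with $\\ell_j = 10$ would have an expected $\\chi^2$ of $1.5$, so the slope of $\\chi^2$ on LD score carries the heritability signal while the intercept absorbs $Na + 1$.\n",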
+    "\n",
+    "Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Stratified LDSC\n",
+    "\n",
+    "Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.\n",
+    "\n",
+    "Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the $\\chi^2$ statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.\n",
+    "\n",
+    "S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher $\\chi^2$ statistics than SNPs with low LD to it.\n",
+    "\n",
+    "Let $a_{jC}$ denote the value of annotation $C$ at SNP $j$:\n",
+    "\n",
+    "- **Binary annotation** (e.g. an indicator for \"in enhancer\", \"in exon\", \"in cell-type-specific peak\"): $a_{jC} \\in \\{0, 1\\}$.\n",
+    "- **Continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal): $a_{jC} \\in \\mathbb{R}$.\n",
+    "\n",
+    "Under a polygenic model the per-SNP heritability for SNP $j$ is\n",
+    "\n",
+    "$$\\mathrm{Var}(\\beta_j) \\;=\\; \\sum_C a_{jC}\\, \\tau_C$$\n",
+    "\n",
+    "and the expected $\\chi^2$ statistic of SNP $j$ is\n",
+    "\n",
+    "$$E[\\chi^2_j \\mid \\mathbf{a}_j] \\;=\\; N \\sum_C \\tau_C\\, \\ell(j, C) \\;+\\; N a \\;+\\; 1 \\quad (2)$$\n",
+    "\n",
+    "where $\\ell(j, C) = \\sum_k a_{kC}\\, r^2_{jk}$ is the partitioned LD score of SNP $j$ with respect to annotation $C$, and $a$ measures confounding bias. Equation (2) allows joint estimation of all $\\tau_C$ via a (computationally simple) multiple regression of $\\chi^2_j$ against $\\ell(j, C)$.\n",
+    "\n",
+    "Interpretation of $\\tau_C$:\n",
+    "- **Binary $C$**: $\\tau_C$ is the *additive increase in per-SNP heritability* for SNPs in category $C$, on top of the contributions from any other annotations they belong to.\n",
+    "- **Continuous $C$**: $\\tau_C$ is the *additive change in per-SNP heritability per unit increase* in the value of annotation $C$.\n",
+    "\n",
+    "For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Tau Estimation and Enrichment Analysis\n",
+    "\n",
+    "Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.\n",
+    "\n",
+    "The pipeline has two computational layers:\n",
+    "\n",
+    "- **Regression layer** — the S-LDSC regression itself, performed by the [polyfun](https://github.com/omerwe/polyfun) engine. We do not re-implement this.\n",
+    "- **Post-processing layer** — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the [`pecotmr`](https://github.com/StatFunGen/pecotmr) R package (`R/sldsc_wrapper.R`).\n",
+    "\n",
+    "The notation below tags each modeling quantity as **(polyfun)** or **(pecotmr)**.\n",
+    "\n",
+    "#### Notation\n",
+    "\n",
+    "For each annotation $C$ we use:\n",
+    "\n",
+    "- $\\pi^{h^2}_C$ = proportion of trait heritability $h^2_g$ assigned to annotation $C$.\n",
+    "- $\\pi^{M}_C$ = proportion of (effective) SNPs in annotation $C$. For binary annotations this is $M_C / M_{\\mathrm{ref}}$; for continuous annotations it is the share of total annotation weight in $C$.\n",
+    "\n",
+    "#### Reference panel and MAF cutoff\n",
+    "\n",
+    "All LD-derived quantities (partitioned LD scores for the 97 baseline annotations and for our $N$ target annotations, the LD-score-regression weights, allele frequencies, and the SNP set) are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); $M_{\\mathrm{ref}}$ throughout this notebook denotes the number of common SNPs in our panel, i.e. those passing the MAF cutoff.\n",
+    "\n",
+    "By default we restrict to MAF $> 5\\%$, per the S-LDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter `maf_cutoff` (default $0.05$); the regression, the standardized $sd_C$, and $M_{\\mathrm{ref}}$ are all evaluated on the same MAF $>$ cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set `maf_cutoff = 0` to opt out (not recommended).\n",
+    "\n",
+    "#### Quantities from the regression layer (polyfun)\n",
+    "\n",
+    "Solving Equation (2) jointly across annotations, with a 200-block genomic jackknife for inference, is performed by polyfun's `ldsc.py`. From each polyfun run we obtain, per annotation:\n",
+    "\n",
+    "- $\\tau_C$ and its standard error — **(polyfun)**.\n",
+    "- $\\pi^{h^2}_C$ and $\\pi^{M}_C$ — **(polyfun)**.\n",
+    "- $E_C = \\pi^{h^2}_C / \\pi^{M}_C$ and its standard error — **(polyfun)**.\n",
+    "- The p-value of the differential per-SNP heritability test (defined below) — **(polyfun)**, computed internally with the full coefficient covariance matrix.\n",
+    "\n",
+    "We also obtain, per run:\n",
+    "\n",
+    "- The total trait heritability $h^2_g$ — **(polyfun)**.\n",
+    "- The 200-block jackknife delete-values of $\\tau_C$ — **(polyfun)**.\n",
+    "\n",
+    "#### Quantities from the post-processing layer (pecotmr)\n",
+    "\n",
+    "From the polyfun outputs above plus our reference panel, the post-processing layer computes:\n",
+    "\n",
+    "- $sd_C$ — per-annotation standard deviation over MAF $>$ cutoff SNPs — **(pecotmr: `compute_sldsc_annot_sd`)**.\n",
+    "- $M_{\\mathrm{ref}}$ — reference SNP count at the MAF cutoff — **(pecotmr: `compute_sldsc_M_ref`)**.\n",
+    "- Whether each annotation is binary or continuous — **(pecotmr: `is_binary_sldsc_annot`)**.\n",
+    "- $\\tau^*_C$ point estimate and per-block $\\tau^*_C$ — **(pecotmr: `standardize_sldsc_trait`)**.\n",
+    "- EnrichStat point estimate and its standard error (formula below) — **(pecotmr: `standardize_sldsc_trait`)**.\n",
+    "- DerSimonian-Laird random-effects meta-analysis of $\\tau^*_C$, $E_C$, or EnrichStat across traits — **(pecotmr: `meta_sldsc_random`)**.\n",
+    "\n",
+    "The top-level entry point `pecotmr::sldsc_postprocessing_pipeline` orchestrates all of the above.\n",
+    "\n",
+    "#### Standardized tau ($\\tau^*$)  —  (pecotmr)\n",
+    "\n",
+    "$\\tau_C$ has units that depend on the scale of the annotation and on the total heritability of the trait, so raw $\\tau$ is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)\n",
+    "\n",
+    "$$\\tau^*_C \\;=\\; \\tau_C \\cdot \\frac{sd_C \\cdot M_{\\mathrm{ref}}}{h^2_g}$$\n",
+    "\n",
+    "interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation $C$, divided by the average per-SNP heritability across all SNPs. $\\tau^*_C$ is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the *independent contribution* of annotation $C$ after controlling for overlapping effects of the others.\n",
+    "\n",
+    "Here $sd_C$ is the standard deviation of annotation $C$ across reference SNPs (MAF $>$ cutoff), $M_{\\mathrm{ref}}$ is the count of those SNPs, and $h^2_g$ is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block $\\tau^*_C$ values; their sample variance gives the jackknife standard error\n",
+    "$$SE^{\\text{jackknife}}(\\tau^*_C) \\;=\\; \\sqrt{\\,\\tfrac{(B-1)^2}{B}\\, \\mathrm{Var}_b(\\tau^*_{C,(b)})\\,}$$\n",
+    "with $B = 200$, used as the per-trait input to cross-trait meta-analysis.\n",
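+    "\n",
+    "A worked example with hypothetical numbers (illustrative only): for a binary annotation covering 10% of reference SNPs, $sd_C = \\sqrt{0.1 \\times 0.9} = 0.3$. Assuming $\\tau_C = 10^{-7}$, $M_{\\mathrm{ref}} = 5 \\times 10^6$ and $h^2_g = 0.5$:\n",
+    "\n",
+    "$$\\tau^*_C \\;=\\; 10^{-7} \\cdot \\frac{0.3 \\times 5 \\times 10^{6}}{0.5} \\;=\\; 10^{-7} \\times 3 \\times 10^{6} \\;=\\; 0.3$$\n",
+    "\n",
+    "so, under these assumed values, a 1 SD increase in the annotation corresponds to an additive per-SNP heritability change equal to 30% of the genome-wide average.\n",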
+    "\n",
+    "#### Differential per-SNP heritability (\"EnrichStat\")  —  (polyfun + pecotmr)\n",
+    "\n",
+    "To test whether the per-SNP heritability *inside* annotation $C$ differs from *outside* it (Finucane et al. 2015):\n",
+    "\n",
+    "$$\\text{EnrichStat}_C \\;=\\; \\frac{h^2_g}{M_{\\mathrm{ref}}} \\!\\left[\\, \\frac{\\pi^{h^2}_C}{\\pi^{M}_C} \\;-\\; \\frac{1 - \\pi^{h^2}_C}{1 - \\pi^{M}_C} \\,\\right]$$\n",
+    "\n",
+    "The p-value of this test is computed internally by polyfun using the full coefficient covariance matrix and reported as `Enrichment_p`. The standard error of $\\text{EnrichStat}_C$ is then recovered from the reported p-value:\n",
+    "\n",
+    "$$|Z_C| \\;=\\; \\Phi^{-1}\\!\\left(1 - \\tfrac{p_C}{2}\\right), \\qquad SE_{\\text{EnrichStat}_C} \\;=\\; \\frac{|\\text{EnrichStat}_C|}{|Z_C|}.$$\n",
+    "\n",
+    "This per-trait point estimate and SE are the input to cross-trait meta-analysis.\n",
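+    "\n",
+    "For intuition, a worked example with hypothetical inputs (not from any real run): $\\pi^{h^2}_C = 0.3$, $\\pi^{M}_C = 0.1$, $h^2_g = 0.5$, $M_{\\mathrm{ref}} = 5 \\times 10^6$ and a reported $p_C = 0.05$:\n",
+    "\n",
+    "$$\\text{EnrichStat}_C \\;=\\; \\frac{0.5}{5 \\times 10^{6}} \\left[\\, \\frac{0.3}{0.1} - \\frac{0.7}{0.9} \\,\\right] \\;\\approx\\; 10^{-7} \\times 2.222 \\;\\approx\\; 2.22 \\times 10^{-7}, \\qquad |Z_C| \\;=\\; \\Phi^{-1}(0.975) \\;\\approx\\; 1.96, \\qquad SE_{\\text{EnrichStat}_C} \\;\\approx\\; \\frac{2.22 \\times 10^{-7}}{1.96} \\;\\approx\\; 1.13 \\times 10^{-7}$$\n",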
+    "\n",
+    "#### Reporting: binary vs. continuous annotations  —  (pecotmr)\n",
+    "\n",
+    "The estimation machinery applies to both annotation types, but the *headline* quantity to report **within each type** differs.\n",
+    "\n",
+    "For a **binary annotation** (e.g. enhancer indicator, exon, in/out of a cell-type peak), $\\pi^{M}_C = M_C / M_{\\mathrm{ref}}$ has a direct interpretation and $E_C$ reads as \"the category explains $E_C$-fold more heritability than its share of SNPs.\" The within-type headline quantities are therefore $E_C$ and the EnrichStat p-value; $\\tau^*_C$ is reported alongside.\n",
+    "\n",
+    "For a **continuous annotation** (e.g. gene-specificity score, conservation score, continuous epigenomic signal), $E_C$ depends on the scale of the annotation: rescaling the annotation by a constant changes $E_C$ even though the underlying biology is unchanged. The within-type headline quantities are therefore $\\tau^*_C$ and its p-value; $E_C$ is reported alongside but should not be interpreted for continuous annotations.\n",
+    "\n",
+    "The pipeline determines whether an annotation is binary by inspecting whether its values lie in $\\{0, 1\\}$ and selects the appropriate within-type headline statistic automatically (pecotmr).\n",
+    "\n",
+    "> **From the official LDSC tutorial** ([Partitioned Heritability from Continuous Annotations](https://github.com/bulik/ldsc/wiki/Partitioned-Heritability-from-Continuous-Annotations)):\n",
+    ">\n",
+    "> *\"Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using `--print-coefficients` outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.\"*\n",
+    ">\n",
+    "> The pipeline always passes `--print-coefficients` to polyfun for this reason.\n",
+    "\n",
+    "#### Cross-type comparison: always use $\\tau^*_C$  —  (pecotmr)\n",
+    "\n",
+    "For an apples-to-apples comparison **across binary and continuous annotations** (ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types), use $\\tau^*_C$. The standardization in Gazal et al. (2017) was designed for exactly this purpose: $sd_C = \\sqrt{p(1-p)}$ for a binary annotation (where $p$ is the proportion of SNPs in the category) and $sd_C$ is the empirical standard deviation for a continuous annotation, so the resulting $\\tau^*_C$ is dimensionless and has the same interpretation in both cases: the additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. $E_C$ does not have this property and must not be compared across types.\n",
+    "\n",
+    "The pipeline emits both $E_C$ and $\\tau^*_C$ for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.\n",
+    "\n",
+    "#### Joint analysis  —  (polyfun runs the regression; pecotmr standardizes both modes)\n",
+    "\n",
+    "For **joint analysis** (multiple annotations fit together), both $\\tau$ and $E$ are conditional on the other annotations in the model. We report joint $\\tau^*_C$ as the independent contribution of annotation $C$ after controlling for the others. The annotation-prep step exposes two independent toggles, `compute_single` and `compute_joint` (both default `True`), so the user can produce the $N$ single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all $N+1$ regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory $\\rightarrow$ conditional workflow), the user runs the annotation-prep step a second time with `compute_single=False` on the curated subset.\n",
+    "\n",
+    "### Meta-Analysis across Traits (Random Effects)  —  (pecotmr)\n",
+    "\n",
+    "DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in `pecotmr::meta_sldsc_random` (which delegates the numerics to `rmeta::meta.summaries(..., method = \"random\")`):\n",
+    "\n",
+    "$$\\hat\\theta_{\\mathrm{meta}} \\;=\\; \\frac{\\sum_i w_i\\, \\hat\\theta_i}{\\sum_i w_i}, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\frac{1}{\\sum_i w_i}}, \\qquad w_i \\;=\\; \\frac{1}{SE_i^2 + \\hat\\sigma^2}$$\n",
+    "\n",
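+    "where $\\hat\\theta_i$ is the per-trait estimate, $\\hat\\sigma^2$ is the DerSimonian-Laird estimate of the between-trait variance, and $SE_i$ is the per-trait standard error, which depends on the quantity being meta-analyzed:\n",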
+    "\n",
+    "- **For $\\tau^*_C$ meta**: $SE_i$ is the jackknife SE from the per-block $\\tau^*_C$ values.\n",
+    "- **For $E_C$ meta**: $SE_i$ is the polyfun-reported `Enrichment_std_error`.\n",
+    "- **For EnrichStat meta**: $SE_i$ is the back-solved SE from polyfun's `Enrichment_p`.\n",
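+    "\n",
+    "A small worked example with two hypothetical traits (illustrative numbers; $\\hat\\sigma^2$ uses the standard DerSimonian-Laird moment estimator, which we assume matches what `rmeta` computes): $\\hat\\theta = (0.2,\\, 0.6)$ with $SE = (0.1,\\, 0.1)$. The fixed-effect weights are $v_i = 1/SE_i^2 = 100$, giving a fixed-effect mean of $0.4$ and $Q = \\sum_i v_i (\\hat\\theta_i - 0.4)^2 = 8$ on $k - 1 = 1$ degree of freedom:\n",
+    "\n",
+    "$$\\hat\\sigma^2 \\;=\\; \\max\\!\\left(0,\\; \\frac{Q - (k-1)}{\\sum_i v_i - \\sum_i v_i^2 / \\sum_i v_i}\\right) \\;=\\; \\frac{8 - 1}{200 - 100} \\;=\\; 0.07, \\qquad w_i \\;=\\; \\frac{1}{0.01 + 0.07} \\;=\\; 12.5, \\qquad \\hat\\theta_{\\mathrm{meta}} \\;=\\; 0.4, \\qquad SE_{\\mathrm{meta}} \\;=\\; \\sqrt{\\tfrac{1}{25}} \\;=\\; 0.2$$\n",
+    "\n",
+    "The random-effects SE ($0.2$) is wider than the fixed-effect SE ($\\sqrt{1/200} \\approx 0.07$) because the two traits disagree by more than their sampling error would explain.\n",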
+    "\n",
+    "For binary-annotation enrichment reporting we use a two-channel meta: the **effect size and SE** come from the meta on $E_C$ (interpretable on the original enrichment-fold scale), while the **p-value** comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.\n",
+    "\n",
+    "$$Z_{\\mathrm{meta}} \\;=\\; \\frac{\\hat\\theta_{\\mathrm{meta}}}{SE_{\\mathrm{meta}}}, \\qquad p \\;=\\; 2\\,\\Phi(-|Z_{\\mathrm{meta}}|)$$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Minimal Working Example (MWE)\n",
+    "\n",
+    "The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the `sos run` command to execute it.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 1. `make_annotation_files_ldscore`\n",
+    "\n",
+    "*Annotation preparation and LD-score computation (polyfun).* This step accepts one annotation file for a single-tau analysis or several annotation files for a joint-tau analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "#### **Inputs**\n",
+    "\n",
+    "##### 1. Target Annotation File\n",
+    "\n",
+    "- **Purpose**: Specifies the user-provided (\"target\") genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.\n",
+    "- **Formats**:\n",
+    "    - Text file (`.txt`) listing per-chromosome paths to annotation files. Annotation files can be `.rds`/`.tsv`/`.txt`.\n",
+    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
+    "    - **Multiple target annotations** are supported in one input file (one column per annotation, prefixed `path`, `path1`, `path2`, ...). Single-target and joint-target analyses are produced automatically in one pipeline pass.\n",
+    "    - **Format** (the score column is optional; if absent, score is set to 1):\n",
+    "        - `is_range = False`:\n",
+    "        ```\n",
+    "        chr   pos   score\n",
+    "        1    10001   1\n",
+    "        1    10002   1\n",
+    "        ```\n",
+    "        - `is_range = True`:\n",
+    "        ```\n",
+    "        chr   start   end   score\n",
+    "        1    10001  20001  1\n",
+    "        1    30001  40001  1\n",
+    "        ```\n",
+    "\n",
+    "##### 2. Reference Annotation File (baseline-LD)\n",
+    "\n",
+    "- **Purpose**: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in `.annot.gz` format for each chromosome. The baseline conditions every regression.\n",
+    "- **Formats**:\n",
+    "    - Text file listing baseline annotation files for all chromosomes.\n",
+    "    - Alternatively, files for specific chromosomes can be provided directly.\n",
+    "\n",
+    "##### 3. Genome Reference File\n",
+    "\n",
+    "- **Purpose**: PLINK-format `.bed/.bim/.fam` files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. **Do not mix files derived from different panels** (e.g. 1000G vs ADSP).\n",
+    "- **Formats**:\n",
+    "    - Text file listing per-chromosome reference files, or files for specific chromosomes.\n",
+    "\n",
+    "##### 4. SNP List\n",
+    "\n",
+    "- **Purpose**: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).\n",
+    "- **Format**: A list of `rsid`s, one per line.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mmake_annotation_files_ldscore\u001b[0m: \n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=3) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=5) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=6) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=4) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=7) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=9) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=10) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=8) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=11) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=14) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=13) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=12) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=15) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=18) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=16) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=17) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=19) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=21) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m (index=20) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmake_annotation_files_ldscore\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... (66 items in 22 groups)\u001b[0m\n",
+      "INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \\\n",
+    "  --annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \\\n",
+    "  --reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \\\n",
+    "  --genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \\\n",
+    "  --annotation_name protocol_example \\\n",
+    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
+    "  --python_exec python \\\n",
+    "  --polyfun_path polyfun \\\n",
+    "  --cwd output/sldsc_ldscore -j 4\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Munge summary statistics (preprocessing, run before Step 2)\n",
+    "\n",
+    "Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by `get_heritability`. Run `munge_sumstats_polyfun` once per trait; the munged files are then collected in the directory passed to `get_heritability` via `--sumstat_dir`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \\\n",
+    "#     --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \\\n",
+    "#     --n 0 \\\n",
+    "#     --min-info 0.6 \\\n",
+    "#     --min-maf 0.001 \\\n",
+    "#     --chi2-cutoff 30 \\\n",
+    "#     --polyfun_path data/github/polyfun \\\n",
+    "#     --cwd data/polyfun_new/example_data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 2. `get_heritability`\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "#### **Inputs**\n",
+    "\n",
+    "##### 1. Allele Frequency Files (`.frq`, our panel)\n",
+    "\n",
+    "- **Purpose**: PLINK `.frq` files for the reference panel, used to enforce the MAF cutoff. **Required** when `maf_cutoff > 0` (default `0.05`); the pipeline fails if missing unless `maf_cutoff = 0` is explicitly set.\n",
+    "\n",
+    "##### 2. GWAS Summary Statistics\n",
+    "\n",
+    "- **Purpose**: One munged sumstats file per trait, listed in a text file (`all_traits_file`). The pipeline runs the regression once per trait per single/joint mode.\n",
+    "- **Format**:\n",
+    "    ```\n",
+    "    CAD_META.filtered.sumstats.gz\n",
+    "    UKB.Lym.BOLT.sumstats.gz\n",
+    "    ```\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mget_heritability\u001b[0m: \n",
+      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
+      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
+      "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\n",
+      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
+      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
+      "python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m (index=1) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m (index=0) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m (index=2) is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mget_heritability\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... (6 items in 3 groups)\u001b[0m\n",
+      "INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb get_heritability \\\n",
+    "  --target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \\\n",
+    "  --all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
+    "  --sumstat_dir input/enrichment/sldsc \\\n",
+    "  --baseline_ld_dir input/enrichment/sldsc \\\n",
+    "  --weights_dir input/enrichment/sldsc \\\n",
+    "  --plink_name reference. --baseline_name annotations. --weight_name weights. \\\n",
+    "  --annotation_name protocol_example --python_exec python \\\n",
+    "  --polyfun_path ../polyfun \\\n",
+    "  --maf_cutoff 0 --cwd output/sldsc_heritability -j 4\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 3. Post-processing (`pecotmr`) and meta-analysis\n",
+    "\n",
+    "*Post-Processing (`pecotmr::sldsc_postprocessing_pipeline`)*\n",
+    "\n",
+    "A single R function call consumes all polyfun outputs for the run and produces the final tables:\n",
+    "\n",
+    "- Reads each polyfun output and extracts $\\tau$, $E$, $h^2_g$, EnrichStat p-value, and per-block jackknife $\\tau$ values.\n",
+    "- Computes annotation $sd_C$ and $M_{\\mathrm{ref}}$ over the same MAF $>$ cutoff SNP set as the regression.\n",
+    "- Standardizes $\\tau \\to \\tau^*$ for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.\n",
+    "- Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.\n",
+    "- Reports the number and names of baseline annotations encountered (via `message()`) for transparency.\n",
+    "- Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: $\\tau^*$ (cross-type comparable), $E$ (within-binary), and EnrichStat (within-type).\n",
+    "\n",
+    "Outputs are returned as an R list with two top-level entries: `per_trait` (one tidy data frame per trait, single + joint estimates side-by-side per target) and `meta` (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an `is_binary` flag).\n",
+    "\n",
+    "The `[postprocess]` step reads all polyfun outputs under `heritability_cwd`\n",
+    "(which contains the $N$ single-target subdirectories and optionally the\n",
+    "joint subdirectory) and calls `pecotmr::sldsc_postprocessing_pipeline()`\n",
+    "to produce per-trait standardized tables and the default random-effects\n",
+    "meta across all traits.\n",
+    "\n",
+    "Use `--target-categories-label` (same order as `--target-categories`) to give the target annotations friendly names in the output — e.g. `--target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL` makes the `target` column read `quantile_eQTL` / `eQTL` instead of `ANNOT_1_0` / `ANNOT_2_0` (the original names are kept in `params$target_categories_orig`). Omit it to keep the polyfun `.results` names.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mpostprocess\u001b[0m: \n",
+      "INFO: \u001b[32mpostprocess\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mpostprocess\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds\u001b[0m\n",
+      "INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb postprocess \\\n",
+    "  --traits_file input/enrichment/sldsc/sumstats_test_all.txt \\\n",
+    "  --heritability_cwd output/sldsc_heritability \\\n",
+    "  --target_categories ANNOT_0 --target_categories_label protocol_example_annotation \\\n",
+    "  --target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \\\n",
+    "  --annotation_name protocol_example --python_exec python \\\n",
+    "  --polyfun_path ../polyfun \\\n",
+    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Step 4. Subset meta-analysis (`pecotmr::meta_sldsc_random`) (optional)\n",
+    "\n",
+    "The default meta-analysis in Step 3 pools all traits the user supplied. To re-run it on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:\n",
+    "\n",
+    "\n",
+    "```r\n",
+    "res <- readRDS(\"sldsc_results.rds\")\n",
+    "neuro <- c(\"AD_GWAX\", \"PD_meta\", \"ALS_meta\")\n",
+    "meta_neuro_taustar <- pecotmr::meta_sldsc_random(\n",
+    "  res$per_trait[neuro], category = \"my_target_anno\", quantity = \"tau_star\"\n",
+    ")\n",
+    "```\n",
+    "\n",
+    "This step is lightweight and can be run interactively.\n",
+    "\n",
+    "\n",
+    "The `[meta_subset]` workflow runs the same subset meta from the command line. It reads the cached `.sldsc_postprocess.rds` output of `[postprocess]`, so neither the regression nor the per-trait standardization is re-run, and it can be launched interactively or in batch.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
+      "  import pkg_resources\n",
+      "INFO: Running \u001b[32mmeta_subset\u001b[0m: \n",
+      "INFO: \u001b[32mmeta_subset\u001b[0m is \u001b[32mcompleted\u001b[0m.\n",
+      "INFO: \u001b[32mmeta_subset\u001b[0m output:   \u001b[32m/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds\u001b[0m\n",
+      "INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.\n"
+     ]
+    }
+   ],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb meta_subset \\\n",
+    "  --postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \\\n",
+    "  --subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \\\n",
+    "  --subset_name category1 --target_categories ANNOT_0 \\\n",
+    "  --annotation_name protocol_example --python_exec python \\\n",
+    "  --polyfun_path ../polyfun \\\n",
+    "  --maf_cutoff 0 --cwd output/sldsc_postprocess -j 4\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Output\n",
+    "\n",
+    "### Output summary\n",
+    "\n",
+    "| Stage | Cached on disk | Recomputable from | Purpose |\n",
+    "|---|---|---|---|\n",
+    "| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |\n",
+    "| polyfun `.results` per (trait, mode) | yes | regression run | $\\tau$, $E$, EnrichStat |\n",
+    "| Per-trait standardized table | yes (RDS) | polyfun outputs + $sd_C$ + $M_{\\mathrm{ref}}$ | reporting + meta |\n",
+    "| Default meta tables | yes (RDS) | per-trait standardized | headline figures |\n",
+    "| Subset meta | re-run on demand | per-trait standardized | custom analyses |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Per-stage outputs\n",
+    "\n",
+    "Each workflow writes into its `--cwd`:\n",
+    "\n",
+    "- **make_annotation_files_ldscore** — polyfun `.annot.gz` files plus per-annotation LD-score directories (`.l2.ldscore.{gz,parquet}`, `.l2.M`, `.l2.M_5_50`). One single-target directory per annotation, plus (when more than one annotation) a joint directory.\n",
+    "- **get_heritability** — per trait and per target directory, the S-LDSC regression outputs `<trait>.{results,log,part_delete}`. The `.results` `Category` column carries the annotation name with a `_<ref-ld-index>` suffix.\n",
+    "- **postprocess** — a single `<annotation_name>.sldsc_postprocess.rds` containing per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).\n",
+    "- **meta_subset** — a re-meta of the cached `.sldsc_postprocess.rds` over a user-defined trait subset (lightweight; no regression re-run).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Anticipated Results\n",
+    "\n",
+    "The pipeline produces per-annotation enrichment statistics ($\\tau^*$, enrichment $E$, and the EnrichStat p-value) from stratified LD score regression, both per trait and meta-analyzed across traits."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Command interface\n",
+    "\n",
+    "List all workflows and their options:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/sldsc_enrichment.ipynb -h"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Workflow implementation\n",
+    "\n",
+    "The cells below are the pipeline definition (preserved from the original notebook): the `[global]` parameter block and the workflow step bodies.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[global]\n",
+    "# Path to the work directory of the analysis.\n",
+    "parameter: cwd = path('output')\n",
+    "# Prefix for the analysis output\n",
+    "parameter: annotation_name = str\n",
+    "parameter: python_exec = \"python\" # e.g. \"/home/you/.conda/envs/polyfun/bin/python\"\n",
+    "parameter: polyfun_path   = path # e.g. \"/home/you/tools/polyfun\"\n",
+    "\n",
+    "# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable\n",
+    "# and HapMap3-style regression weights are common-variant by construction).\n",
+    "# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).\n",
+    "# Other values would require recomputing LD scores at that cutoff.\n",
+    "parameter: maf_cutoff = 0.05\n",
+    "\n",
+    "# for make_annotation_files_ldscore workflow:\n",
+    "parameter: annotation_file = path()\n",
+    "parameter: reference_anno_file = path()\n",
+    "parameter: genome_ref_file = path() # with .bed\n",
+    "parameter: chromosome = []\n",
+    "parameter: snp_list = path()\n",
+    "parameter: ld_wind_kb = 0 # LD window in kb; used instead of ld_wind_cm when > 0\n",
+    "parameter: ld_wind_cm = 1.0 # LD window in cM; the default window\n",
+    "\n",
+    "# for get_heritability workflow.\n",
+    "# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,\n",
+    "# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.\n",
+    "# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).\n",
+    "parameter: all_traits_file = path() # txt file listing one munged GWAS summary statistics file name per row, e.g. CAD_META.filtered.sumstats.gz\n",
+    "parameter: sumstat_dir = path() # Directory containing GWAS summary statistics\n",
+    "parameter: target_anno_dir = path()  # Directory containing target annotation files: output of ldscore\n",
+    "parameter: baseline_ld_dir = path()  # Directory containing baseline LD score files (computed against our panel)\n",
+    "parameter: frqfile_dir = path()  # Directory containing allele frequency files (.frq, our panel)\n",
+    "parameter: plink_name = \"ADSP_chr\"\n",
+    "parameter: weights_dir = path()  # Directory containing LD weights (computed against our panel)\n",
+    "parameter: baseline_name = \"baseline_chr\"  # Prefix of baseline annotation files\n",
+    "parameter: weight_name = \"weights_chr\"  # Prefix of LD weights files\n",
+    "parameter: n_blocks = 200\n",
+    "\n",
+    "# Number of threads\n",
+    "parameter: numThreads = 16\n",
+    "# For cluster jobs, number of commands to run per job\n",
+    "parameter: job_size = 1\n",
+    "parameter: walltime = '12h'\n",
+    "parameter: mem = '16G'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "Python 3 (ipykernel)"
+   },
+   "source": [
+    "## Make Annotation File"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[make_annotation_files_ldscore]\n",
+    "# Annotation preparation. Takes one annotation_file with N target annotations\n",
+    "# and produces, in one invocation, any combination of:\n",
+    "#   - N single-target LD-score directories (when compute_single = TRUE, default)\n",
+    "#   - 1 joint LD-score directory containing all N (when compute_joint = TRUE\n",
+    "#     and N >= 2, default)\n",
+    "#\n",
+    "# Outputs per chromosome <chr>:\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz   (i in 1..N, when compute_single)\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M\n",
+    "#   <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50  (when .frq present)\n",
+    "#\n",
+    "#   <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...}                (when compute_joint and N>=2)\n",
+    "#\n",
+    "# Workflows:\n",
+    "#   - Workflow A (\"all at once\"): compute_single=TRUE, compute_joint=TRUE (defaults).\n",
+    "#     Produces both; suits the case where you have already chosen the joint set.\n",
+    "#   - Workflow B (\"exploratory then conditional\"):\n",
+    "#       Step 1: compute_single=TRUE, compute_joint=FALSE.\n",
+    "#               Run on N candidate annotations -> N single-target dirs.\n",
+    "#               Inspect single-target results, identify K significant ones.\n",
+    "#       Step 2: compute_single=FALSE, compute_joint=TRUE.\n",
+    "#               Run on a NEW annotation_file with the K selected annotations\n",
+    "#               -> 1 joint dir with the conditional model.\n",
+    "\n",
+    "#\n",
+    "# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,\n",
+    "#     column name, and the CM requirement ---\n",
+    "#   --snp_list given  -> ldsc.py --l2 --print-snps   -> output .l2.ldscore.gz\n",
+    "#   --snp_list absent -> compute_ldscores.py         -> output .l2.ldscore.parquet\n",
+    "#\n",
+    "#   LD-score column name (this is what becomes the .results \"Category\" in\n",
+    "#   [get_heritability], with a \"_<ref-ld-index>\" suffix appended there):\n",
+    "#     * compute_ldscores.py  ALWAYS keeps the annot column name(s):\n",
+    "#         single annot column \"ANNOT\"          -> ldscore column \"ANNOT\"\n",
+    "#         joint  annot columns \"ANNOT_1\",\"ANNOT_2\",...  -> \"ANNOT_1\",\"ANNOT_2\",...\n",
+    "#     * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it\n",
+    "#       HARD-CODES the ldscore column name to \"L2\" and DROPS the annot's original\n",
+    "#       column name. With >=2 annotations it uses \"<annot_name>L2\"\n",
+    "#       (\"ANNOT_1L2\",\"ANNOT_2L2\",...).\n",
+    "#     => a single-target snplist run reports \"L2_0\" in .results, while a\n",
+    "#        single-target no-snplist run reports \"ANNOT_0\".  [postprocess] auto-\n",
+    "#        detects either; only matters if you pass --target-categories explicitly.\n",
+    "#\n",
+    "#   CM column requirement for snplist:  ldsc.py --l2 --print-snps requires the\n",
+    "#   target annot to (a) carry a \"CM\" (centimorgan) column and (b) line up with\n",
+    "#   the plink .bim (same SNP set, same row order). This step handles both\n",
+    "#   internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands\n",
+    "#   the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST\n",
+    "#   carry genetic-map (cM) positions when using --ld-wind-cm (the default);\n",
+    "#   if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.\n",
+    "#\n",
+    "parameter: compute_single = True\n",
+    "parameter: compute_joint = True\n",
+    "parameter: score_column = 3\n",
+    "parameter: is_range = False\n",
+    "\n",
+    "import pandas as pd\n",
+    "import os\n",
+    "\n",
+    "if not (compute_single or compute_joint):\n",
+    "    raise ValueError(\"[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be True\")\n",
+    "\n",
+    "def adapt_file_path(file_path, reference_file):\n",
+    "    reference_path = os.path.dirname(reference_file)\n",
+    "    if os.path.isfile(file_path):\n",
+    "        return file_path\n",
+    "    file_name = os.path.basename(file_path)\n",
+    "    if os.path.isfile(file_name):\n",
+    "        return file_name\n",
+    "    file_in_ref_dir = os.path.join(reference_path, file_name)\n",
+    "    if os.path.isfile(file_in_ref_dir):\n",
+    "        return file_in_ref_dir\n",
+    "    file_prefixed = os.path.join(reference_path, file_path)\n",
+    "    if os.path.isfile(file_prefixed):\n",
+    "        return file_prefixed\n",
+    "    raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n",
+    "\n",
+    "\n",
+    "# ---- Parse inputs and determine N ----\n",
+    "if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and\n",
+    "    str(reference_anno_file).endswith('annot.gz')):\n",
+    "    # Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.\n",
+    "    target_files_direct = str(annotation_file).split(',')\n",
+    "    N_targets = len(target_files_direct)\n",
+    "    target_names = [f\"target_{i+1}\" for i in range(N_targets)]\n",
+    "    input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]\n",
+    "    if len(chromosome) > 0:\n",
+    "        input_chroms = [int(x) for x in chromosome]\n",
+    "    else:\n",
+    "        input_chroms = [0]\n",
+    "else:\n",
+    "    # Case 2: txt list with #id and one or more 'path' columns\n",
+    "    target_files_df = pd.read_csv(annotation_file, sep=\"\\t\")\n",
+    "    reference_files = pd.read_csv(reference_anno_file, sep=\"\\t\")\n",
+    "    genome_ref_files = pd.read_csv(genome_ref_file, sep=\"\\t\")\n",
+    "\n",
+    "    target_files_df[\"#id\"] = [x.replace(\"chr\", \"\") for x in target_files_df[\"#id\"].astype(str)]\n",
+    "    reference_files[\"#id\"]  = [x.replace(\"chr\", \"\") for x in reference_files[\"#id\"].astype(str)]\n",
+    "    genome_ref_files[\"#id\"] = [x.replace(\"chr\", \"\") for x in genome_ref_files[\"#id\"].astype(str)]\n",
+    "\n",
+    "    path_columns = [c for c in target_files_df.columns if c.startswith('path')]\n",
+    "    N_targets = len(path_columns)\n",
+    "    target_names = path_columns[:]   # 'path', 'path1', 'path2', ...\n",
+    "\n",
+    "    for col in path_columns:\n",
+    "        target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))\n",
+    "    reference_files[\"path\"] = reference_files[\"path\"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))\n",
+    "    genome_ref_files[\"path\"] = genome_ref_files[\"path\"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))\n",
+    "\n",
+    "    merged = target_files_df.merge(reference_files, on=\"#id\").merge(genome_ref_files, on=\"#id\")\n",
+    "    if len(chromosome) > 0:\n",
+    "        merged = merged[merged[\"#id\"].isin([str(c) for c in chromosome])]\n",
+    "\n",
+    "    rows = merged.values.tolist()\n",
+    "    input_chroms = [r[0] for r in rows]\n",
+    "    input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]\n",
+    "\n",
+    "# ---- Determine output format ----\n",
+    "use_print_snps = snp_list.is_file()\n",
+    "ldscore_ext = \"l2.ldscore.gz\" if use_print_snps else \"l2.ldscore.parquet\"\n",
+    "\n",
+    "if ld_wind_kb > 0:\n",
+    "    use_kb_window = True\n",
+    "    ld_window_param = ld_wind_kb\n",
+    "    ld_window_flag = \"--ld-wind-kb\"\n",
+    "else:\n",
+    "    use_kb_window = False\n",
+    "    ld_window_param = ld_wind_cm\n",
+    "    ld_window_flag = \"--ld-wind-cm\"\n",
+    "\n",
+    "emit_single = compute_single\n",
+    "emit_joint  = compute_joint and N_targets >= 2\n",
+    "\n",
+    "# ---- Build per-chromosome output list ----\n",
+    "def chrom_outputs(chrom):\n",
+    "    outs = []\n",
+    "    if emit_single:\n",
+    "        for i in range(N_targets):\n",
+    "            name = f\"{annotation_name}_single_{i+1}\"\n",
+    "            prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
+    "            outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
+    "    if emit_joint:\n",
+    "        name = f\"{annotation_name}_joint\"\n",
+    "        prefix = f\"{cwd:a}/{name}/{name}.{chrom}\"\n",
+    "        outs += [f\"{prefix}.annot.gz\", f\"{prefix}.{ldscore_ext}\", f\"{prefix}.l2.M\"]\n",
+    "    return outs\n",
+    "\n",
+    "input: input_files, group_by = N_targets + 2, group_with = \"input_chroms\"\n",
+    "\n",
+    "output: chrom_outputs(input_chroms[_index])\n",
+    "\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step A: write the requested .annot files for this chromosome.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "R: expand = \"${ }\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
+    "    library(data.table)\n",
+    "\n",
+    "    clean_chr <- function(x) as.numeric(gsub(\"^chr\", \"\", x))\n",
+    "\n",
+    "    process_range_data <- function(data, chr_value) {\n",
+    "        data$chr <- clean_chr(data$chr)\n",
+    "        data <- data[data$chr == chr_value,]\n",
+    "        if (nrow(data) == 0) return(NULL)\n",
+    "        expanded <- lapply(seq_len(nrow(data)), function(j) {\n",
+    "            row <- data[j,]\n",
+    "            pos_seq <- seq(row$start, row$end - 1)\n",
+    "            result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)\n",
+    "            if (ncol(data) > 3) {\n",
+    "                for (col in 4:ncol(data))\n",
+    "                    result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))\n",
+    "            }\n",
+    "            result\n",
+    "        })\n",
+    "        unique(rbindlist(expanded))\n",
+    "    }\n",
+    "\n",
+    "    process_annotation <- function(target_anno, ref_anno, score_column_value) {\n",
+    "        target_anno <- as.data.frame(target_anno)\n",
+    "        ref_anno    <- as.data.frame(ref_anno)\n",
+    "        target_anno$chr <- clean_chr(target_anno$chr)\n",
+    "        ref_anno$CHR    <- clean_chr(ref_anno$CHR)\n",
+    "        chr_value <- unique(ref_anno$CHR)\n",
+    "        anno_scores <- rep(0, nrow(ref_anno))\n",
+    "        match_pos <- match(target_anno$pos, ref_anno$BP)\n",
+    "        valid_pos <- as.numeric(na.omit(match_pos))\n",
+    "        if (score_column_value <= ncol(target_anno)) {\n",
+    "            anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]\n",
+    "        } else {\n",
+    "            anno_scores[valid_pos] <- 1\n",
+    "            print(\"Warning: score column does not exist; setting scores to 1\")\n",
+    "        }\n",
+    "        anno_scores\n",
+    "    }\n",
+    "\n",
+    "    read_target_anno <- function(file_path, ref_anno) {\n",
+    "        if (endsWith(file_path, \"rds\")) {\n",
+    "            target_anno <- readRDS(file_path)\n",
+    "            return(process_annotation(target_anno, ref_anno, ${score_column}))\n",
+    "        }\n",
+    "        target_anno <- fread(file_path)\n",
+    "        if (${\"TRUE\" if is_range else \"FALSE\"}) {\n",
+    "            names(target_anno)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
+    "            target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))\n",
+    "            if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))\n",
+    "        } else {\n",
+    "            names(target_anno)[1:2] <- c(\"chr\", \"pos\")\n",
+    "        }\n",
+    "        process_annotation(target_anno, ref_anno, ${score_column})\n",
+    "    }\n",
+    "\n",
+    "    # ---- Read reference annotation ----\n",
+    "    ref_anno <- as.data.frame(fread(${_input[-2]:ar}))\n",
+    "    if (\"ANNOT\" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == \"ANNOT\")]\n",
+    "\n",
+    "    # ---- Compute per-target annotation scores ----\n",
+    "    target_files <- c(${\",\".join('\"%s\"' % str(p.absolute()) for p in _input[:-2])})\n",
+    "    N_local <- length(target_files)\n",
+    "    score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)\n",
+    "\n",
+    "    emit_single_local <- ${\"TRUE\" if emit_single else \"FALSE\"}\n",
+    "    emit_joint_local  <- ${\"TRUE\" if emit_joint  else \"FALSE\"}\n",
+    "    use_print_snps_local <- ${\"TRUE\" if use_print_snps else \"FALSE\"}\n",
+    "    bfile_prefix         <- \"${_input[-1]:na}\"\n",
+    "\n",
+    "    # Reshape annot to match .bim panel for ldsc.py --l2 --print-snps\n",
+    "    # (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).\n",
+    "    normalize_for_ldsc <- function(df) {\n",
+    "        if (!use_print_snps_local) return(df)\n",
+    "        df <- df[, !names(df) %in% c(\"A1\", \"A2\", \"MAF\", \"CM\"), drop = FALSE]\n",
+    "        annot_cols <- setdiff(names(df), c(\"CHR\", \"BP\", \"SNP\"))\n",
+    "        bim <- as.data.frame(fread(paste0(bfile_prefix, \".bim\"), header = FALSE,\n",
+    "                                   col.names = c(\"CHR\", \"SNP\", \"CM\", \"BP\", \"A1\", \"A2\")))\n",
+    "        bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)\n",
+    "        idx <- match(bim$SNP, df$SNP)\n",
+    "        out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,\n",
+    "                          stringsAsFactors = FALSE)\n",
+    "        for (col in annot_cols) {\n",
+    "            v <- rep(0, nrow(bim))\n",
+    "            non_na <- !is.na(idx)\n",
+    "            v[non_na] <- df[[col]][idx[non_na]]\n",
+    "            out[[col]] <- v\n",
+    "        }\n",
+    "        out\n",
+    "    }\n",
+    "\n",
+    "    # ---- Write N single-target .annot files (when requested) ----\n",
+    "    if (emit_single_local) {\n",
+    "        for (i in seq_len(N_local)) {\n",
+    "            out_anno <- ref_anno\n",
+    "            out_anno$ANNOT <- score_list[[i]]\n",
+    "            out_anno <- normalize_for_ldsc(out_anno)\n",
+    "            name <- paste0(\"${annotation_name}\", \"_single_\", i)\n",
+    "            out_path_gz  <- file.path(\"${cwd:a}\", name, paste0(name, \".${input_chroms[_index]}.annot.gz\"))\n",
+    "            out_path_tsv <- sub(\"\\\\.gz$\", \"\", out_path_gz)\n",
+    "            dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)\n",
+    "            fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
+    "        }\n",
+    "    }\n",
+    "\n",
+    "    # ---- Optionally write joint .annot ----\n",
+    "    if (emit_joint_local) {\n",
+    "        joint_anno <- ref_anno\n",
+    "        for (i in seq_len(N_local)) {\n",
+    "            joint_anno[[paste0(\"ANNOT_\", i)]] <- score_list[[i]]\n",
+    "        }\n",
+    "        joint_anno <- normalize_for_ldsc(joint_anno)\n",
+    "        joint_name   <- paste0(\"${annotation_name}\", \"_joint\")\n",
+    "        joint_out_gz <- file.path(\"${cwd:a}\", joint_name, paste0(joint_name, \".${input_chroms[_index]}.annot.gz\"))\n",
+    "        joint_out_tsv <- sub(\"\\\\.gz$\", \"\", joint_out_gz)\n",
+    "        dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)\n",
+    "        fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = \"\\t\")\n",
+    "    }\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step B: gzip all annot files. Uses expand=\"$[ ]\" so bash ${var} survives.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "bash: expand = \"$[ ]\", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'\n",
+    "    set -e\n",
+    "    annots=()\n",
+    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
+    "        for i in $(seq 1 $[N_targets]); do\n",
+    "            annots+=(\"$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot\")\n",
+    "        done\n",
+    "    fi\n",
+    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
+    "        annots+=(\"$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot\")\n",
+    "    fi\n",
+    "    for a in \"${annots[@]}\"; do\n",
+    "        gzip -f \"$a\"\n",
+    "    done\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step C: run polyfun's LD-score computation for each emitted annotation file.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "bash: expand = \"$[ ]\", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'\n",
+    "    set -e\n",
+    "    chrom=\"$[input_chroms[_index]]\"\n",
+    "\n",
+    "    run_polyfun() {\n",
+    "        local annot=\"$1\"\n",
+    "        local out_prefix=\"$2\"\n",
+    "        if [ \"$[str(use_print_snps)]\" = \"True\" ]; then\n",
+    "            $[python_exec] $[polyfun_path]/ldsc.py \\\n",
+    "                --print-snps $[snp_list] \\\n",
+    "                $[ld_window_flag] $[ld_window_param] \\\n",
+    "                --out \"$out_prefix\" \\\n",
+    "                --bfile $[_input[-1]:nar] \\\n",
+    "                --yes-really \\\n",
+    "                --annot \"$annot\" \\\n",
+    "                --l2\n",
+    "        else\n",
+    "            $[python_exec] $[polyfun_path]/compute_ldscores.py \\\n",
+    "                --annot \"$annot\" \\\n",
+    "                --bfile $[_input[-1]:nar] \\\n",
+    "                $[ld_window_flag] $[ld_window_param] \\\n",
+    "                --out \"${out_prefix}.$[ldscore_ext]\" \\\n",
+    "                --allow-missing\n",
+    "        fi\n",
+    "    }\n",
+    "\n",
+    "    if [ \"$[str(emit_single)]\" = \"True\" ]; then\n",
+    "        for i in $(seq 1 $[N_targets]); do\n",
+    "            name=\"$[annotation_name]_single_$i\"\n",
+    "            annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
+    "            prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
+    "            run_polyfun \"$annot\" \"$prefix\"\n",
+    "        done\n",
+    "    fi\n",
+    "    if [ \"$[str(emit_joint)]\" = \"True\" ]; then\n",
+    "        name=\"$[annotation_name]_joint\"\n",
+    "        annot=\"$[cwd:a]/$name/$name.$chrom.annot.gz\"\n",
+    "        prefix=\"$[cwd:a]/$name/$name.$chrom\"\n",
+    "        run_polyfun \"$annot\" \"$prefix\"\n",
+    "    fi\n",
+    "\n",
+    "# ----------------------------------------------------------------------------\n",
+    "# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.\n",
+    "# ----------------------------------------------------------------------------\n",
+    "R: expand = \"${ }\", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'\n",
+    "    suppressPackageStartupMessages({ library(data.table); library(dplyr) })\n",
+    "    use_print_snps <- ${str(use_print_snps).upper()}\n",
+    "\n",
+    "    chrom <- \"${input_chroms[_index]}\"\n",
+    "    # Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).\n",
+    "    frq_file <- file.path(\"${frqfile_dir}\", paste0(\"${plink_name}\", chrom, \".frq\"))\n",
+    "    has_frq  <- file.exists(frq_file)\n",
+    "    frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL\n",
+    "\n",
+    "    write_M_files <- function(annot_path, ldscore_path, m_path) {\n",
+    "        if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, \"_5_50\"))) {\n",
+    "            cat(\"M files already exist for\", m_path, \"\\n\"); return(invisible())\n",
+    "        }\n",
+    "        ldscore_dt <- if (endsWith(ldscore_path, \".parquet\")) {\n",
+    "            suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)\n",
+    "        } else fread(ldscore_path)\n",
+    "        annot_dt <- fread(annot_path)\n",
+    "        annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]\n",
+    "        merged <- if (has_frq) merge(annot_filtered, frq_dt, by = \"SNP\", all.x = TRUE) else annot_filtered\n",
+    "        std_cols <- c(\"CHR\", \"SNP\", \"BP\", \"CM\", \"A1\", \"A2\", if (has_frq) \"MAF\")\n",
+    "        annot_cols <- setdiff(names(merged), std_cols)\n",
+    "        if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- \"ANNOT\" }\n",
+    "        M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
+    "        writeLines(paste(as.numeric(M), collapse = \" \"), m_path)\n",
+    "        if (has_frq) {\n",
+    "            common <- merged[!is.na(MAF) & MAF > 0.05, ]\n",
+    "            M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]\n",
+    "            writeLines(paste(as.numeric(M5), collapse = \" \"), paste0(m_path, \"_5_50\"))\n",
+    "        }\n",
+    "    }\n",
+    "\n",
+    "    targets <- c()\n",
+    "    if (${\"TRUE\" if emit_single else \"FALSE\"}) {\n",
+    "        for (i in seq_len(${N_targets})) {\n",
+    "            targets <- c(targets, paste0(\"${annotation_name}\", \"_single_\", i))\n",
+    "        }\n",
+    "    }\n",
+    "    if (${\"TRUE\" if emit_joint else \"FALSE\"}) {\n",
+    "        targets <- c(targets, paste0(\"${annotation_name}\", \"_joint\"))\n",
+    "    }\n",
+    "    for (name in targets) {\n",
+    "        annot_path   <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".annot.gz\"))\n",
+    "        ldscore_path <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".${ldscore_ext}\"))\n",
+    "        m_path       <- file.path(\"${cwd:a}\", name, paste0(name, \".\", chrom, \".l2.M\"))\n",
+    "        write_M_files(annot_path, ldscore_path, m_path)\n",
+    "    }\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "Python 3 (ipykernel)"
+   },
+   "source": [
+    "## Calculate Functional Enrichment Using Annotations"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[get_heritability]\n",
+    "# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:\n",
+    "# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to\n",
+    "# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.\n",
+    "#\n",
+    "# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:\n",
+    "# typically N _single_<i> directories plus optionally one _joint directory.\n",
+    "\n",
+    "#\n",
+    "# --- about the \".results\" Category column and the \"_0 / _1\" suffix ---\n",
+    "# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`\n",
+    "# always gets exactly two LD-score sources, in this order:\n",
+    "#     \"<target_dir>/<target>.\"   (index 0)  ,  \"<baseline_dir>/<baseline>\"   (index 1)\n",
+    "# With --overlap-annot, every annotation column in the .results \"Category\" is\n",
+    "# named  <ldscore_column_name>_<ref-ld-index>:\n",
+    "#     index 0 = the target file   -> \"ANNOT_0\"  (no-snplist; compute_ldscores.py keeps the annot col name)\n",
+    "#                                  -> \"L2_0\"    (snplist + single annot; ldsc.py hard-codes \"L2\", see below)\n",
+    "#                                  -> \"ANNOT_1_0\",\"ANNOT_2_0\"      (no-snplist joint dir, N>=2 annot cols)\n",
+    "#                                  -> \"ANNOT_1L2_0\",\"ANNOT_2L2_0\"  (snplist joint dir, N>=2 -> \"<name>L2\")\n",
+    "#     index 1 = the baseline file -> \"base_1\",\"Coding_UCSC_1\", ...  (the 97 baseline annots)\n",
+    "# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would\n",
+    "# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.\n",
+    "# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's\n",
+    "#  \"n_annot == 1 -> column name 'L2'\" quirk vs compute_ldscores.py keeping the annot\n",
+    "#  column name.)  [postprocess] auto-detects the target Category; if you instead pass\n",
+    "# --target-categories, the names must match this column exactly.\n",
+    "#\n",
+    "parameter: target_anno_dirs = paths()\n",
+    "parameter: all_traits = []\n",
+    "\n",
+    "import os\n",
+    "\n",
+    "with open(all_traits_file, 'r') as f:\n",
+    "    trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]\n",
+    "\n",
+    "# Build (trait, target_dir) Cartesian product as parallel flat lists.\n",
+    "input_list  = []\n",
+    "target_meta = []\n",
+    "for td in target_anno_dirs:\n",
+    "    for t in trait_paths:\n",
+    "        input_list.append(t)\n",
+    "        target_meta.append(str(td))\n",
+    "\n",
+    "input: input_list, group_by = 1, group_with = \"target_meta\"\n",
+    "\n",
+    "output: f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log\",  \\\n",
+    "        f\"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results\"\n",
+    "\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
+    "\n",
+    "bash: expand = \"${ }\"\n",
+    "    target_dir=\"${target_meta[_index]}\"\n",
+    "    target_name=\"$(basename \"${target_meta[_index]}\")\"\n",
+    "    trait=\"$(basename \"${_input[0]}\")\"\n",
+    "    output_dir=\"${cwd:a}/$target_name\"\n",
+    "    mkdir -p \"$output_dir\"\n",
+    "\n",
+    "    # MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;\n",
+    "    # other values would require recomputing LD scores at that cutoff.\n",
+    "    frq_file_check=\"${frqfile_dir}/${plink_name}22.frq\"\n",
+    "    if [ \"${maf_cutoff}\" = \"0\" ] || [ \"${maf_cutoff}\" = \"0.0\" ]; then\n",
+    "        echo \"maf_cutoff = 0: skipping MAF filter (--not-M-5-50)\"\n",
+    "        frq_option=\"--not-M-5-50\"\n",
+    "    elif [ \"${maf_cutoff}\" = \"0.05\" ]; then\n",
+    "        if [ -f \"$frq_file_check\" ]; then\n",
+    "            echo \"maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)\"\n",
+    "            frq_option=\"--frqfile-chr ${frqfile_dir}/${plink_name}\"\n",
+    "        else\n",
+    "            echo \"ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,\"\n",
+    "            echo \"       but none found at ${frqfile_dir}/${plink_name}*.frq.\"\n",
+    "            echo \"       Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended).\"\n",
+    "            exit 1\n",
+    "        fi\n",
+    "    else\n",
+    "        echo \"ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or\"\n",
+    "        echo \"       0.05 (sLDSC default) are accepted. Other values would require\"\n",
+    "        echo \"       recomputing LD scores at that cutoff.\"\n",
+    "        exit 1\n",
+    "    fi\n",
+    "\n",
+    "    run_ldsc() {\n",
+    "        local extra_args=\"$1\"\n",
+    "        ${python_exec} ${polyfun_path}/ldsc.py \\\n",
+    "            --h2 \"${sumstat_dir}/$trait\" \\\n",
+    "            --ref-ld-chr \"$target_dir/$target_name.\",\"${baseline_ld_dir}/${baseline_name}\" \\\n",
+    "            --out \"$output_dir/$trait\" \\\n",
+    "            --overlap-annot \\\n",
+    "            --w-ld-chr \"${weights_dir}/${weight_name}\" \\\n",
+    "            $frq_option \\\n",
+    "            --print-coefficients \\\n",
+    "            --print-delete-vals \\\n",
+    "            --n-blocks ${n_blocks} \\\n",
+    "            $extra_args\n",
+    "    }\n",
+    "\n",
+    "    run_ldsc \"\"\n",
+    "    log_file=\"$output_dir/$trait.log\"\n",
+    "\n",
+    "    # Retry ladder for FloatingPointError: rerun with progressively stricter --chisq-max (30, then 20, then 10).\n",
+    "    for max in 30 20 10; do\n",
+    "        if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
+    "            echo \"FloatingPointError detected, retrying with --chisq-max $max...\"\n",
+    "            run_ldsc \"--chisq-max $max\"\n",
+    "        else\n",
+    "            break\n",
+    "        fi\n",
+    "    done\n",
+    "\n",
+    "    if [ -f \"$log_file\" ] && grep -q \"FloatingPointError\\|invalid value encountered in sqrt\" \"$log_file\"; then\n",
+    "        echo \"ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10\"\n",
+    "        echo \"This trait may have severe numerical instability issues in the summary statistics.\"\n",
+    "    fi\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[munge_sumstats_polyfun]\n",
+    "parameter: sumstats = path\n",
+    "parameter: n        = 0\n",
+    "parameter: min_info = 0.6\n",
+    "parameter: min_maf  = 0.001\n",
+    "parameter: keep_hla = False\n",
+    "parameter: chi2_cut = 30\n",
+    "input: sumstats\n",
+    "output: f\"{_input:n}.munged.parquet\"\n",
+    "bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'\n",
+    "    {python_exec} {polyfun_path}/munge_polyfun_sumstats.py \\\n",
+    "        --sumstats {_input} \\\n",
+    "        --out {_output} \\\n",
+    "        {'--n {}'.format(n) if n>0 else ''} \\\n",
+    "        {'--min-info {}'.format(min_info)} \\\n",
+    "        {'--min-maf {}'.format(min_maf)} \\\n",
+    "        {'--chi2-cutoff {}'.format(chi2_cut)} \\\n",
+    "        {'--keep-hla' if keep_hla else ''} \\\n",
+    "        --remove-strand-ambig"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[postprocess]\n",
+    "# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.\n",
+    "# Reads .results / .log / .part_delete for all traits in `traits_file`, both\n",
+    "# single-target and (when present) joint-target runs, computes Gazal-style\n",
+    "# tau*, EnrichStat with back-solved jackknife SE, and runs the default\n",
+    "# DerSimonian-Laird random-effects meta across all supplied traits. Writes\n",
+    "# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).\n",
+    "\n",
+    "parameter: traits_file = path()             # text file: one trait sumstats filename per line\n",
+    "parameter: heritability_cwd = path()        # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)\n",
+    "parameter: target_categories = []           # target annotation names. Auto-detected from the joint-run results if empty.\n",
+    "parameter: target_categories_label = []     # optional display names, same order as target_categories;\n",
+    "                                            # when given, every \"target\" column / tau*-block colname in\n",
+    "                                            # the output RDS is renamed to these (params$target_categories\n",
+    "                                            # holds the labels, params$target_categories_orig the originals).\n",
+    "parameter: target_anno_dir = path()         # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)\n",
+    "\n",
+    "input: traits_file\n",
+    "output: f\"{cwd:a}/{annotation_name}.sldsc_postprocess.rds\"\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
+    "\n",
+    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    library(pecotmr)\n",
+    "\n",
+    "    traits <- readLines(\"${traits_file}\")\n",
+    "    target_cats <- c(${\",\".join('\"%s\"' % c for c in target_categories)})\n",
+    "    target_lab  <- c(${\",\".join('\"%s\"' % c for c in target_categories_label)})\n",
+    "\n",
+    "    # Auto-detect single-target and joint-target output directories.\n",
+    "    her_root  <- \"${heritability_cwd}\"\n",
+    "    all_subdirs <- list.dirs(her_root, recursive = FALSE)\n",
+    "    single_pattern <- paste0(\"^\", \"${annotation_name}\", \"_single_([0-9]+)$\")\n",
+    "    joint_name     <- paste0(\"${annotation_name}\", \"_joint\")\n",
+    "    single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]\n",
+    "    single_indices <- as.integer(sub(single_pattern, \"\\\\1\", basename(single_dirs)))\n",
+    "    single_dirs <- single_dirs[order(single_indices)]\n",
+    "    joint_dir   <- file.path(her_root, joint_name)\n",
+    "    has_joint   <- dir.exists(joint_dir)\n",
+    "\n",
+    "    message(sprintf(\"Detected %d single-target dirs%s\",\n",
+    "                    length(single_dirs),\n",
+    "                    if (has_joint) \"; joint-target dir present\" else \"; no joint-target dir\"))\n",
+    "\n",
+    "    # Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>\n",
+    "    # (polyfun appends .results / .log / .part_delete).\n",
+    "    trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))\n",
+    "    names(trait_single_prefixes) <- traits\n",
+    "\n",
+    "    if (has_joint) {\n",
+    "        trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)\n",
+    "    } else {\n",
+    "        trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)\n",
+    "    }\n",
+    "\n",
+    "    res <- sldsc_postprocessing_pipeline(\n",
+    "        trait_single_prefixes = trait_single_prefixes,\n",
+    "        trait_joint_prefix    = trait_joint_prefix,\n",
+    "        target_anno_dir       = \"${target_anno_dir}\",\n",
+    "        frqfile_dir          = \"${frqfile_dir}\",\n",
+    "        plink_name           = \"${plink_name}\",\n",
+    "        maf_cutoff           = ${maf_cutoff},\n",
+    "        target_categories    = if (length(target_cats) > 0) target_cats else NULL,\n",
+    "        target_labels        = if (length(target_lab)  > 0) target_lab  else NULL\n",
+    "    )\n",
+    "\n",
+    "    saveRDS(res, \"${_output[0]}\")\n",
+    "    message(\"S-LDSC post-processing complete; results written to ${_output[0]}\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[meta_subset]\n",
+    "# Optional: re-run random-effects meta on a user-defined subset of traits, using\n",
+    "# the cached per-trait standardized results from [postprocess]. No regression rerun.\n",
+    "\n",
+    "parameter: postprocess_rds = path()           # output of [postprocess]\n",
+    "parameter: subset_traits_file = path()        # text file: one trait id per line, subset of those passed to [postprocess]\n",
+    "parameter: subset_name = str                  # label used in the output filename\n",
+    "parameter: target_categories = []             # target annotation names to meta on; if empty, uses all from postprocess output\n",
+    "# If [postprocess] was run with --target-categories-label, the cached RDS already\n",
+    "# carries the display names (params$target_categories = the labels), so leave\n",
+    "# --target-categories empty here (or pass the labels, not the original ANNOT_* names).\n",
+    "\n",
+    "input: postprocess_rds, subset_traits_file\n",
+    "output: f\"{cwd:a}/{annotation_name}.{subset_name}.meta.rds\"\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
+    "\n",
+    "R: expand = \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    library(pecotmr)\n",
+    "\n",
+    "    res <- readRDS(\"${postprocess_rds}\")\n",
+    "    subset_traits <- readLines(\"${subset_traits_file}\")\n",
+    "    target_cats <- c(${\",\".join([f'\"{c}\"' for c in target_categories])})\n",
+    "    if (length(target_cats) == 0) target_cats <- res$params$target_categories\n",
+    "\n",
+    "    subset_per_trait <- res$per_trait[subset_traits]\n",
+    "\n",
+    "    # Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.\n",
+    "    view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"single\")\n",
+    "    view_joint  <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, \"joint\")\n",
+    "\n",
+    "    out <- list(\n",
+    "        tau_star_single = setNames(lapply(target_cats, function(tc) meta_sldsc_random(view_single, tc, \"tau_star\")),   target_cats),\n",
+    "        tau_star_joint  = setNames(lapply(target_cats, function(tc) meta_sldsc_random(view_joint,  tc, \"tau_star\")),   target_cats),\n",
+    "        enrichment      = setNames(lapply(target_cats, function(tc) meta_sldsc_random(view_single, tc, \"enrichment\")), target_cats),\n",
+    "        enrichstat      = setNames(lapply(target_cats, function(tc) meta_sldsc_random(view_single, tc, \"enrichstat\")), target_cats)\n",
+    "    )\n",
+    "\n",
+    "    saveRDS(out, \"${_output[0]}\")\n",
+    "    message(\"Subset meta complete; results written to ${_output[0]}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "SoS",
+   "language": "sos",
+   "name": "sos"
+  },
+  "language_info": {
+   "codemirror_mode": "sos",
+   "file_extension": ".sos",
+   "mimetype": "text/x-sos",
+   "name": "sos",
+   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
+   "pygments_lexer": "sos"
+  },
+  "sos": {
+   "kernels": [
+    [
+     "Bash",
+     "calysto_bash",
+     "Bash",
+     "#E6EEFF",
+     "shell"
+    ],
+    [
+     "R",
+     "ir",
+     "R",
+     "#DCDCDA",
+     "r"
+    ],
+    [
+     "SoS",
+     "sos",
+     "",
+     "",
+     "sos"
+    ]
+   ],
+   "panel": {
+    "displayed": true,
+    "height": 0
+   },
+   "version": "0.22.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}