From e3450af6beeb84da861f6304b355c960a8acd159 Mon Sep 17 00:00:00 2001
From: Daniel Nachun <daniel.nachun@gmail.com>
Date: Wed, 24 Jun 2026 22:11:29 -0700
Subject: [PATCH 1/3] colocboost fixes

---
 .../mnm_methods/colocboost_mnm.ipynb          | 702 ++++++++++++++++++
 code/script/pecotmr_integration/colocboost.R  |  52 +-
 .../pecotmr_integration/colocboost_manifest.R | 249 +++++++
 .../gwas_sumstats_construct.R                 |  34 +-
 .../qtl_dataset_construct.R                   |  12 +-
 5 files changed, 1032 insertions(+), 17 deletions(-)
 create mode 100644 code/SoS/mnm_analysis/mnm_methods/colocboost_mnm.ipynb
 create mode 100644 code/script/pecotmr_integration/colocboost_manifest.R

diff --git a/code/SoS/mnm_analysis/mnm_methods/colocboost_mnm.ipynb b/code/SoS/mnm_analysis/mnm_methods/colocboost_mnm.ipynb
new file mode 100644
index 00000000..ad046432
--- /dev/null
+++ b/code/SoS/mnm_analysis/mnm_methods/colocboost_mnm.ipynb
@@ -0,0 +1,702 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "# Multi-trait colocalization using ColocBoost"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "Requires fine-mapping outputs (`.susie.rds` files) from `mnm_regression` or `rss_analysis`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Description"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "We use ColocBoost to perform colocalization analyses [cf. Cao et al (2025)](https://www.medrxiv.org/content/10.1101/2025.04.17.25326042v1). In short, ColocBoost performs colocalization while accounting for multiple causal variants within a genomic region of interest and can scale to hundreds of traits. We allow users to include or omit GWAS summary statistics. Required inputs are individual xQTL data from the same cohort (multiple phenotypes, same genotype). Linkage disequilibrium reference data is required if GWAS summary statistics are used."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS",
+    "tags": []
+   },
+   "source": [
+    "## Input\n",
+    "\n",
+    "1. A list of regions to be analyzed (optional); the last column of this file should be region name.\n",
+    "2. Either a list of per chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK `bed` format. \n",
+    "3. Vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` with index in `bed.gz.tbi` formats.\n",
+    "4. Vector of covariate files corresponding to the lists above.\n",
+    "5. Customized association windows file for variants (cis or trans). If it is not provided, a fixed sized window will be used around the region (a cis-window)\n",
+    "6. Optionally a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3` separated with whitespace.\n",
+    "7. Optionally summary statistics meta-data file and LD reference meta-data file.\n",
+    "\n",
+    "Input 2 and 3 should be outputs from `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. 4 should be output of `covariate_preprocessing` pipeline that contains genotype PC, phenotypic hidden confounders and fixed covariates."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS",
+    "tags": []
+   },
+   "source": [
+    "### Example genotype, phenotype and association analysis windows \n",
+    "\n",
+    "See [this page](mnm_regression) for example inputs of these information."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Example summary statistics and LD reference\n",
+    "\n",
+    "See [rss_analysis](rss_analysis) for LD reference formats (pre-computed blocks and PLINK genotype files),\n",
+    "per-study LD reference via the `ld_meta_data` column in GWAS meta-data, and mixture LD panels.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### About indels\n",
+    "\n",
+    "Use `--no-indel` to remove indel variants from analysis. This applies to both individual-level genotype loading and summary statistics QC. Default keeps indels.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true,
+    "kernel": "SoS",
+    "tags": []
+   },
+   "source": [
+    "## Output\n",
+    "\n",
+    "For each analysis region the pipeline fits ColocBoost and saves the model in RDS format — one RDS per gene of interest (and per GWAS, when GWAS integration is used). Each RDS contains:\n",
+    "\n",
+    "- `ucos_summary` — a summary table of the colocalization events.\n",
+    "- `vpa` — the variable colocalized probability for each variant (the probability of a variant being colocalized with at least one trait).\n",
+    "- `data_info` — information on the input data.\n",
+    "- `model_info` — information on the fitted ColocBoost model.\n",
+    "- `ucos_details` — trait-specific (uncolocalized) effects information.\n",
+    "- `region_info` — information on the region(s) of interest.\n",
+    "- `computing_time` — time taken for each step (loading, QC and analysis).\n",
+    "\n",
+    "**Note:** the suggested output naming convention is `cohort_...`.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Minimal Working Example Steps"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "Timing: ~X min"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### xQTL only analysis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "Below we duplicate the examples for phenotype and covariates to demonstrate that when there are multiple phenotypes for the same genotype it is possible to use this pipeline to analyze all of them (more than two is accepted as well).\n",
+    "\n",
+    "Here using `--region-name` we focus the analysis on 3 genes. In practice if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. It is possible for some of the regions there are no genotype data, in which case the pipeline will output RDS files with a warning message to indicate the lack of genotype data to analyze."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "**Note:** Suggested output naming convention is cohort_modality, eg ROSMAP_snRNA_pseudobulk."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### **Step 1.** ColocBoost with xQTL only\n",
+    "\n",
+    "This is the minimal working example and **runs end-to-end on the toy gene-expression data**, producing a `*.cb_xqtl.rds` colocalization result for the region. The same command also accepts peaks or any other molecular phenotype - just swap in the phenotype `.bed.gz`/manifest."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS",
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Toy gene-expression example (run from the toy_example root).\n",
+    "# xQTL-only ColocBoost: builds one QtlDataset, then per-gene ColocBoost into\n",
+    "# {name}.{gene}.colocboost.rds.\n",
+    "sos run pipeline/colocboost.ipynb colocboost \\\n",
+    "    --name colocboost_xqtl --cwd output/colocboost_xqtl \\\n",
+    "    --genoFile input/colocboost/example.chr22.bed \\\n",
+    "    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
+    "    --covFile input/colocboost/example_covariates.tsv --transpose-covariates \\\n",
+    "    --customized-association-windows input/colocboost/association_windows.bed \\\n",
+    "    --region-name ENSG00000130538 \\\n",
+    "    --no-separate-gwas --xqtl-coloc -j1\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### **Step 2.** ColocBoost with GWAS\n",
+    "\n",
+    "Adds GWAS summary statistics (via `--gwas-meta-data`) and an LD reference (via `--ld-meta-data`). The xQTL ColocBoost completes, but the separate GWAS-xQTL colocalization is **still under development** with the installed package (see the note in the command cell)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "# Toy example with GWAS (separate-focal coloc per study), from toy_example root.\n",
+    "# Per gene a region GwasSumStats is built, then xQTL + separate-GWAS coloc are\n",
+    "# written to one combined {name}.{gene}.colocboost.rds.\n",
+    "sos run pipeline/colocboost.ipynb colocboost \\\n",
+    "    --name colocboost_gwas --cwd output/colocboost_gwas \\\n",
+    "    --genoFile input/colocboost/example.chr22.bed \\\n",
+    "    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
+    "    --covFile input/colocboost/example_covariates.tsv --transpose-covariates \\\n",
+    "    --customized-association-windows input/colocboost/association_windows.bed \\\n",
+    "    --gwas-meta-data input/colocboost/gwas_meta.txt \\\n",
+    "    --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \\\n",
+    "    --region-name ENSG00000130538 \\\n",
+    "    --separate-gwas --xqtl-coloc -j1\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### **Step 3.** ColocBoost with multiple LD references (multi-ancestry)\n",
+    "\n",
+    "Uses a per-study LD reference declared in the GWAS meta-data. It shares the separate-GWAS code path with Step 2 and is therefore **still under development** on the toy dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "# Multi-ancestry template: each study uses the per-study LD reference named in\n",
+    "# the gwas meta ld_meta_data column (gwas_sumstats_construct.R resolves a single\n",
+    "# shared panel per region, so genuinely distinct per-study panels are not merged\n",
+    "# here). Same separate-GWAS code path as the example above.\n",
+    "sos run pipeline/colocboost.ipynb colocboost \\\n",
+    "    --name colocboost_multi_ld --cwd output/colocboost_multi_ld \\\n",
+    "    --genoFile input/colocboost/example.chr22.bed \\\n",
+    "    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \\\n",
+    "    --covFile input/colocboost/example_covariates.tsv --transpose-covariates \\\n",
+    "    --customized-association-windows input/colocboost/association_windows.bed \\\n",
+    "    --gwas-meta-data input/colocboost/gwas_meta.txt \\\n",
+    "    --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \\\n",
+    "    --region-name ENSG00000130538 \\\n",
+    "    --separate-gwas --xqtl-coloc -j1\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Command interface"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/colocboost.ipynb -h\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS",
+    "tags": []
+   },
+   "source": [
+    "```\n",
+    "usage: sos run /restricted/projectnb/xqtl/xqtl_protocol/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.ipynb\n",
+    "               [workflow_name | -t targets] [options] [workflow_options]\n",
+    "  workflow_name:        Single or combined workflows defined in this script\n",
+    "  targets:              One or more targets to generate\n",
+    "  options:              Single-hyphen sos parameters (see \"sos run -h\" for details)\n",
+    "  workflow_options:     Double-hyphen workflow-specific parameters\n",
+    "\n",
+    "Workflows:\n",
+    "  get_analysis_regions\n",
+    "  get_rss_analysis_regions\n",
+    "  colocboost\n",
+    "\n",
+    "Global Workflow Options:\n",
+    "  --name VAL (as str, required)\n",
+    "                        It is required to input the name of the analysis\n",
+    "  --cwd output (as path)\n",
+    "  --genoFile VAL (as path, required)\n",
+    "                        A list of file paths for genotype data, or the genotype data itself.\n",
+    "  --phenoFile  paths\n",
+    "\n",
+    "                        One or multiple lists of file paths for phenotype data.\n",
+    "  --phenoIDFile  paths()\n",
+    "\n",
+    "                        One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.\n",
+    "  --covFile  paths\n",
+    "\n",
+    "                        Covariate file path\n",
+    "  --gwas-meta-data . (as path)\n",
+    "                        Summary statistics interface, see `rss_analysis.ipynb` for details\n",
+    "  --ld-meta-data . (as path)\n",
+    "  --gwas-name  (as list)\n",
+    "  --gwas-data  (as list)\n",
+    "  --column-mapping  (as list)\n",
+    "  --region-list . (as path)\n",
+    "                        Optional: if a region list is provide the analysis will be focused on provided region. The LAST column of this list will contain the ID of regions to focus on when\n",
+    "                        region_name is given Otherwise, all regions with both genotype and phenotype files will be analyzed\n",
+    "  --region-name  (as list)\n",
+    "                        Optional: if a region name is provided the analysis would be focused on the union of provides region list and region names\n",
+    "  --keep-samples . (as path)\n",
+    "                        Only focus on a subset of samples\n",
+    "  --keep-variants . (as path)\n",
+    "                        Only focus on a subset of variants\n",
+    "  --customized-association-windows . (as path)\n",
+    "                        An optional list documenting the custom association window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID). If this list is not\n",
+    "                        provided, the default `window` parameter (see below) will be used.\n",
+    "  --cis-window -1 (as int)\n",
+    "                        Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp When this is set to negative, we will rely on using\n",
+    "                        customized_association_windows\n",
+    "  --[no-]save-data (default to False)\n",
+    "                        save data object or not\n",
+    "  --phenotype-names  [f'{x:bn}' for x in phenoFile]\n",
+    "\n",
+    "                        Name of phenotypes\n",
+    "  --[no-]trans-analysis (default to False)\n",
+    "                        And indicator whether it is trans-analysis, ie, not using phenotypic coordinate information\n",
+    "  --imiss 1.0 (as float)\n",
+    "                        remove a variant if it has more than imiss missing individual level data\n",
+    "  --maf 0.0025 (as float)\n",
+    "                        MAF and variance of X cutoff\n",
+    "  --xvar-cutoff 0.0 (as float)\n",
+    "  --mac 5 (as int)\n",
+    "                        MAC cutoff, on top of MAF cutoff\n",
+    "  --[no-]indel (default to True)\n",
+    "                        Remove indels if indel = False\n",
+    "  --event-filter-rules . (as path)\n",
+    "  --skip-analysis-pip-cutoff  (as list)\n",
+    "                        If this value is not 0, then an initial single effect analysis will be performed to determine if follow up analysis will be continued or to simply return NULL If this is\n",
+    "                        negative we use a default way to determine this cutoff which is conservative but still useful\n",
+    "  --skip-sumstats-analysis-pip-cutoff -1.0 (as float)\n",
+    "  --[no-]xqtl-coloc (default to True)\n",
+    "                        Perform xQTL colocalization\n",
+    "  --[no-]joint-gwas (default to False)\n",
+    "                        Perform joint colocalization with many traits\n",
+    "  --[no-]separate-gwas (default to True)\n",
+    "                        Perform separate GWAS targeted anlaysis one at a time\n",
+    "  --[no-]impute (default to False)\n",
+    "                        Whether to impute the sumstat for all the snp in LD but not in sumstat.\n",
+    "  --rcond 0.01 (as float)\n",
+    "  --lamb 0.01 (as float)\n",
+    "  --R2-threshold 0.6 (as float)\n",
+    "  --minimum-ld 5 (as int)\n",
+    "  --qc-method NULL\n",
+    "                        summary stats QC methods: slalom, dentist\n",
+    "  --extract-sumstats-region-name NULL\n",
+    "  --sumstats-region-name-col NULL\n",
+    "                        Either an index number in str format or \"NULL\" as a string\n",
+    "  --comment-string NULL\n",
+    "  --ld-data-name  (as list)\n",
+    "  --container ''\n",
+    "                        Analysis environment settings\n",
+    "  --job-size 200 (as int)\n",
+    "                        For cluster jobs, number commands to run per job\n",
+    "  --walltime 1h\n",
+    "                        Wall clock time expected\n",
+    "  --mem 20G\n",
+    "                        Memory expected\n",
+    "  --numThreads 1 (as int)\n",
+    "                        Number of threads\n",
+    "\n",
+    "Sections\n",
+    "  get_analysis_regions:\n",
+    "  get_rss_analysis_regions:\n",
+    "  colocboost:\n",
+    "  ```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Workflow implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[global]\n",
+    "parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir\n",
+    "# It is required to input the name of the analysis\n",
+    "parameter: name = str\n",
+    "parameter: cwd = path(\"output\")\n",
+    "# PLINK1 genotype: pass the .bed file (.bim/.fam are read automatically).\n",
+    "parameter: genoFile = path\n",
+    "# QtlDataset phenotype manifest TSV. Columns: ID, #chr, start, end, path,\n",
+    "# cond, and an optional cov_path (per-context covariates). The first path is\n",
+    "# used. Replaces the legacy list-of-phenotype-BEDs input.\n",
+    "parameter: phenoFile = paths\n",
+    "# Legacy phenotype ID-mapping file(s); accepted for CLI compatibility but not\n",
+    "# consumed by the manifest-based flow (the manifest carries the IDs).\n",
+    "parameter: phenoIDFile = paths()\n",
+    "# Uniform covariate file applied across all contexts (QTLtools-format\n",
+    "# rows=covariates supported via --transpose-covariates). Per-context\n",
+    "# covariates instead come from the manifest cov_path column.\n",
+    "parameter: covFile = paths()\n",
+    "# GWAS summary-statistics meta TSV. Columns: study_id, chrom, file_path;\n",
+    "# optional n_sample, n_case, n_control, ld_meta_data, column_mapping_file.\n",
+    "parameter: gwas_meta_data = path()\n",
+    "# Default LD-reference meta (#chr, start, end, path). Per-study overrides come\n",
+    "# from the gwas meta ld_meta_data column.\n",
+    "parameter: ld_meta_data = path()\n",
+    "# Optional region list whose LAST column lists region/gene IDs to analyze.\n",
+    "parameter: region_list = path()\n",
+    "# Region/gene IDs to analyze (space separated on the CLI).\n",
+    "parameter: region_name = []\n",
+    "# Optional per-gene association windows (BED: #chr start end ID) overriding the\n",
+    "# default cis_window for the GWAS extraction window.\n",
+    "parameter: customized_association_windows = path()\n",
+    "# cis window (bp) on each side of the gene for fine-mapping and the default\n",
+    "# GWAS extraction window.\n",
+    "parameter: cis_window = 1000000\n",
+    "# Save the per-gene colocboost input bundle (currently unused; reserved).\n",
+    "parameter: save_data = False\n",
+    "# QtlDataset study label and covariate handling (legacy CLI retained).\n",
+    "parameter: study = name\n",
+    "parameter: transpose_covariates = False\n",
+    "parameter: genotype_covariates = path('.')\n",
+    "# Remove a variant if it has more than imiss fraction missing genotypes.\n",
+    "parameter: imiss = 1.0\n",
+    "# MAF and variance-of-X cutoffs.\n",
+    "parameter: maf = 0.0025\n",
+    "parameter: xvar_cutoff = 0.0\n",
+    "# MAC cutoff, on top of the MAF cutoff.\n",
+    "parameter: mac = 5\n",
+    "# Keep indels when True.\n",
+    "parameter: indel = True\n",
+    "# Only keep a subset of samples / variants (whitespace/newline-separated IDs).\n",
+    "parameter: keep_samples = path()\n",
+    "parameter: keep_variants = path()\n",
+    "# Per-context single-effect skip cutoff passed to colocboost.R\n",
+    "# (--pip-cutoff-to-skip): a scalar applied to every context, or one or more\n",
+    "# context=value items (negative -> data-driven 3/ncol(X); empty -> 0).\n",
+    "parameter: skip_analysis_pip_cutoff = []\n",
+    "# Summary-stats single-effect skip cutoff passed to summaryStatsQc via\n",
+    "# gwas_sumstats_construct.R.\n",
+    "parameter: skip_sumstats_analysis_pip_cutoff = -1.0\n",
+    "# ColocBoost variants to run.\n",
+    "parameter: xqtl_coloc = True\n",
+    "parameter: joint_gwas = False\n",
+    "parameter: separate_gwas = True\n",
+    "# Summary-stats LD-mismatch QC method (none | slalom | dentist).\n",
+    "parameter: qc_method = \"none\"\n",
+    "# Analysis environment settings.\n",
+    "parameter: container = \"\"\n",
+    "import re\n",
+    "parameter: entrypoint= ('micromamba run -a \"\" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])) if container else \"\"\n",
+    "# For cluster jobs, number of commands to run per job.\n",
+    "parameter: job_size = 200\n",
+    "parameter: walltime = \"1h\"\n",
+    "parameter: mem = \"20G\"\n",
+    "parameter: numThreads = 1\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[colocboost_1]\n",
+    "# Build one pecotmr::QtlDataset from the phenotype manifest + shared genotype\n",
+    "# and serialize to RDS (mirrors mnm_regression's qtl_dataset_construct). No\n",
+    "# fan-out: the per-gene [colocboost_3] step loads this single RDS and selects\n",
+    "# the focal gene/contexts at analysis time. Per-context covariates come from\n",
+    "# the manifest cov_path column; a uniform --covFile (or --genotype-covariates)\n",
+    "# applies across all contexts. ColocBoost runs on unscaled residuals\n",
+    "# (--no-scale-residuals), matching the legacy pipeline.\n",
+    "cov_arg = covFile[0] if (len(covFile) > 0 and covFile[0].is_file()) else genotype_covariates\n",
+    "input: None\n",
+    "output: f\"{cwd:a}/qtl_dataset/{study}.qtl_dataset.rds\"\n",
+    "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n",
+    "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container, entrypoint = entrypoint\n",
+    "    Rscript ${modular_script_dir}/pecotmr_integration/qtl_dataset_construct.R \\\n",
+    "        --study ${study} \\\n",
+    "        --genotype-prefix ${genoFile:n} \\\n",
+    "        --phenotype-manifest ${phenoFile[0]} \\\n",
+    "        --genotype-covariates ${cov_arg if cov_arg.is_file() else '\"\"'} \\\n",
+    "        ${'--transpose-covariates' if transpose_covariates else ''} \\\n",
+    "        --maf-cutoff ${maf} \\\n",
+    "        --xvar-cutoff ${xvar_cutoff} \\\n",
+    "        --mac-cutoff ${mac} \\\n",
+    "        --imiss-cutoff ${imiss} \\\n",
+    "        ${('--keep-samples ' + str(keep_samples)) if keep_samples.is_file() else ''} \\\n",
+    "        ${('--keep-variants ' + str(keep_variants)) if keep_variants.is_file() else ''} \\\n",
+    "        ${'' if indel else '--drop-indel'} \\\n",
+    "        --no-scale-residuals \\\n",
+    "        --output ${_output}\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[colocboost_2]\n",
+    "# Resolve per-gene analysis units (gene -> ld_block + grouped GWAS sources)\n",
+    "# into one manifest TSV via colocboost_manifest.R, so the next step can fan out\n",
+    "# over its rows with inline csv.DictReader (no notebook-local Python parsing).\n",
+    "input: None\n",
+    "output: f\"{cwd:a}/colocboost/{name}.colocboost_manifest.tsv\"\n",
+    "task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n",
+    "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container, entrypoint = entrypoint\n",
+    "    Rscript ${modular_script_dir}/pecotmr_integration/colocboost_manifest.R \\\n",
+    "        --pheno-manifest ${phenoFile[0]} \\\n",
+    "        --cis-window ${cis_window} \\\n",
+    "        ${('--customized-association-windows ' + str(customized_association_windows)) if customized_association_windows.is_file() else ''} \\\n",
+    "        ${('--region-name ' + ','.join(region_name)) if len(region_name) > 0 else ''} \\\n",
+    "        ${('--region-list ' + str(region_list)) if region_list.is_file() else ''} \\\n",
+    "        ${('--gwas-meta ' + str(gwas_meta_data)) if gwas_meta_data.is_file() else ''} \\\n",
+    "        ${('--ld-meta ' + str(ld_meta_data)) if ld_meta_data.is_file() else ''} \\\n",
+    "        --output ${_output}\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## ColocBoost analysis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "[colocboost_3]\n",
+    "# Per-gene ColocBoost over the pre-built QtlDataset. Fans out over the manifest\n",
+    "# rows; per gene (1) builds a region GwasSumStats via gwas_sumstats_construct.R\n",
+    "# when the gene has GWAS studies, then (2) runs colocboost.R (xQTL /\n",
+    "# joint-GWAS / separate-GWAS variants) into ONE combined RDS. Replaces the\n",
+    "# legacy load_multitask_regional_data + colocboost_analysis_pipeline path.\n",
+    "import csv\n",
+    "manifest = f\"{cwd:a}/colocboost/{name}.colocboost_manifest.tsv\"\n",
+    "jobs = list(csv.DictReader(open(manifest), delimiter='\\t'))\n",
+    "stop_if(len(jobs) == 0, \"colocboost: empty manifest; check --region-name / phenotype manifest.\")\n",
+    "input: f\"{cwd:a}/qtl_dataset/{name}.qtl_dataset.rds\", for_each = \"jobs\"\n",
+    "output: f\"{cwd:a}/colocboost/{name}.{_jobs['region_id']}.colocboost.rds\"\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n",
+    "bash: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container, entrypoint = entrypoint\n",
+    "    set -e\n",
+    "    gene=${_jobs['gene_id']}\n",
+    "    studies=\"${_jobs['studies']}\"\n",
+    "    gss_arg=\"\"\n",
+    "    if [ -n \"$studies\" ]; then\n",
+    "        gss=\"${_output:nn}.gwas_sumstats.rds\"\n",
+    "        Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \\\n",
+    "            --study \"$studies\" \\\n",
+    "            --gwas-tsv \"${_jobs['gwas_tsvs']}\" \\\n",
+    "            --ld-block \"${_jobs['ld_block']}\" \\\n",
+    "            --ld-meta \"${_jobs['ld_meta']}\" \\\n",
+    "            --n-case \"${_jobs['n_cases']}\" \\\n",
+    "            --n-control \"${_jobs['n_controls']}\" \\\n",
+    "            --pip-cutoff-to-skip ${skip_sumstats_analysis_pip_cutoff} \\\n",
+    "            --qc-method ${qc_method} \\\n",
+    "            ${('--column-mapping \"' + _jobs['column_mappings'] + '\"') if _jobs['column_mappings'] else ''} \\\n",
+    "            --output \"$gss\"\n",
+    "        gss_arg=\"--gwas-sumstats $gss\"\n",
+    "    fi\n",
+    "    Rscript ${modular_script_dir}/pecotmr_integration/colocboost.R \\\n",
+    "        --qtl-dataset ${_input} \\\n",
+    "        --gene-id $gene \\\n",
+    "        --cis-window ${cis_window} \\\n",
+    "        $gss_arg \\\n",
+    "        ${'' if xqtl_coloc else '--no-xqtl-coloc'} \\\n",
+    "        ${'--joint-gwas' if joint_gwas else ''} \\\n",
+    "        ${'--separate-gwas' if separate_gwas else ''} \\\n",
+    "        --pip-cutoff-to-skip \"${\",\".join([str(x) for x in skip_analysis_pip_cutoff])}\" \\\n",
+    "        --output ${_output}\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Troubleshooting"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "| Step | Substep | Problem | Possible Reason | Solution |\n",
+    "|------|---------|---------|------------------|---------|\n",
+    "|  |  |  |  |  |\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Anticipated Results\n",
+    "\n",
+    "For the toy chr22 dataset, produces `protocol_example.colocboost.rds` with ColocBoost posteriors. Expect a convergence message and sparse result for the single-region toy example."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "SoS",
+   "language": "sos",
+   "name": "sos"
+  },
+  "language_info": {
+   "codemirror_mode": "sos",
+   "file_extension": ".sos",
+   "mimetype": "text/x-sos",
+   "name": "sos",
+   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
+   "pygments_lexer": "sos"
+  },
+  "sos": {
+   "kernels": [
+    [
+     "Bash",
+     "calysto_bash",
+     "Bash",
+     "#E6EEFF",
+     "shell"
+    ],
+    [
+     "Markdown",
+     "markdown",
+     "markdown",
+     "",
+     ""
+    ],
+    [
+     "SoS",
+     "sos",
+     "",
+     "",
+     "sos"
+    ]
+   ],
+   "version": "0.24.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/code/script/pecotmr_integration/colocboost.R b/code/script/pecotmr_integration/colocboost.R
index ba40c6da..23c1a8da 100644
--- a/code/script/pecotmr_integration/colocboost.R
+++ b/code/script/pecotmr_integration/colocboost.R
@@ -30,6 +30,13 @@
 #   --no-xqtl-coloc  Flag: disable xQTL-only cross-context coloc
 #   --joint-gwas     Flag: run xQTL+GWAS joint coloc
 #   --separate-gwas  Flag: run per-GWAS-study separate-focal coloc
+#   --pip-cutoff-to-skip
+#                    Optional per-context single-effect skip cutoff passed to
+#                    colocboostPipeline(pipCutoffToSkip = ). Either a single
+#                    number applied to every context, or comma-separated
+#                    context=value pairs (e.g. "context1=0.1,context2=0").
+#                    A negative value selects the data-driven 3/ncol(X)
+#                    default. Empty (default) disables the skip (0).
 #   --output         Output RDS path (one colocboost-pipeline list)
 
 suppressPackageStartupMessages({
@@ -64,10 +71,27 @@ parser <- add_argument(parser, "--joint-gwas",
 parser <- add_argument(parser, "--separate-gwas",
                        help = "Run per-GWAS-study separate-focal coloc (requires --gwas-sumstats)",
                        flag = TRUE)
+parser <- add_argument(parser, "--pip-cutoff-to-skip",
+                       help = "Per-context single-effect skip cutoff: a scalar or comma-separated context=value pairs (negative = data-driven default; empty = 0)",
+                       type = "character", default = "")
 parser <- add_argument(parser, "--output",
                        help = "Output RDS path", type = "character")
 argv <- parse_args(parser)
 
+# Parse --pip-cutoff-to-skip into the scalar / context-named-vector shape
+# colocboostPipeline(pipCutoffToSkip = ) accepts. Mirrors susie_twas.R.
+parse_pip_cutoff <- function(s) {
+  if (is.null(s) || !nzchar(s)) return(0)
+  pairs <- strsplit(s, ",")[[1L]]
+  if (any(grepl("=", pairs, fixed = TRUE))) {
+    setNames(as.numeric(sub(".*=", "", pairs)),
+             gsub("'", "", sub("=.*", "", pairs)))
+  } else {
+    as.numeric(pairs[[1L]])
+  }
+}
+pip_cutoff_to_skip <- parse_pip_cutoff(argv$pip_cutoff_to_skip)
+
 parse_region <- function(s) {
   m <- regmatches(s, regexec("^([^:]+):([0-9]+)-([0-9]+)$", s))[[1L]]
   if (length(m) != 4L)
@@ -103,22 +127,24 @@ gss <- if (has_gwas) readRDS(gwas_path) else NULL
 res <- if (has_region) {
   colocboostPipeline(
     qd,
-    gwasSumStats = gss,
-    region       = parse_region(argv$region),
-    cisWindow    = argv$cis_window,
-    xqtlColoc    = xqtl_coloc,
-    jointGwas    = joint_gwas,
-    separateGwas = separate_gwas)
+    gwasSumStats    = gss,
+    region          = parse_region(argv$region),
+    cisWindow       = argv$cis_window,
+    xqtlColoc       = xqtl_coloc,
+    jointGwas       = joint_gwas,
+    separateGwas    = separate_gwas,
+    pipCutoffToSkip = pip_cutoff_to_skip)
 } else {
   colocboostPipeline(
     qd,
-    gwasSumStats = gss,
-    traitId      = argv$gene_id,
-    cisWindow    = argv$cis_window,
-    focalTrait   = argv$gene_id,
-    xqtlColoc    = xqtl_coloc,
-    jointGwas    = joint_gwas,
-    separateGwas = separate_gwas)
+    gwasSumStats    = gss,
+    traitId         = argv$gene_id,
+    cisWindow       = argv$cis_window,
+    focalTrait      = argv$gene_id,
+    xqtlColoc       = xqtl_coloc,
+    jointGwas       = joint_gwas,
+    separateGwas    = separate_gwas,
+    pipCutoffToSkip = pip_cutoff_to_skip)
 }
 
 dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE)
diff --git a/code/script/pecotmr_integration/colocboost_manifest.R b/code/script/pecotmr_integration/colocboost_manifest.R
new file mode 100644
index 00000000..d6dde2e0
--- /dev/null
+++ b/code/script/pecotmr_integration/colocboost_manifest.R
@@ -0,0 +1,249 @@
+#!/usr/bin/env Rscript
+# colocboost_manifest.R
+#
+# Resolve `colocboost_mnm.ipynb`'s per-gene analysis units into a single
+# manifest TSV, so the downstream [colocboost] step can fan out over its
+# rows via inline csv.DictReader without any notebook-local Python parsing.
+# One row per gene (= one colocboost focal-trait task).
+#
+# Gene coordinates come from the QtlDataset phenotype manifest (the same
+# file passed to qtl_dataset_construct.R), aggregated across its context
+# rows. The GWAS analysis window (`ld_block`) defaults to the gene body
+# +/- --cis-window; an optional --customized-association-windows BED
+# overrides it per gene. GWAS studies whose chromosome matches the gene's
+# (a meta `chrom` of 0 matches every chromosome) are grouped onto the
+# gene's row as comma-separated lists, ready for one multi-study
+# gwas_sumstats_construct.R call.
+#
+# Inputs:
+#   --pheno-manifest   QtlDataset phenotype manifest TSV. Columns:
+#                        ID            gene / trait identifier
+#                        #chr|chrom    chromosome
+#                        start, end    gene-body coordinates
+#                      (cond/path/cov_path are ignored here). Multiple
+#                      rows per gene (one per context) are aggregated:
+#                      start = min(start), end = max(end).
+#   --cis-window       bp added on each side of the gene body to form the
+#                      default ld_block. Default 1000000.
+#   --customized-association-windows
+#                      Optional BED (#chr start end ID). When a gene is
+#                      present, its row defines ld_block directly,
+#                      overriding --cis-window.
+#   --region-name      Optional comma-separated gene IDs to restrict to.
+#   --region-list      Optional file whose LAST column lists gene IDs to
+#                      restrict to (header / comment lines skipped).
+#   --gwas-meta        Optional colocboost GWAS meta TSV. Columns:
+#                        study_id, chrom, file_path; optional n_sample,
+#                        n_case, n_control, ld_meta_data,
+#                        column_mapping_file. Relative paths resolve
+#                        against the meta file's own directory.
+#   --ld-meta          Optional default LD-meta path used when a study's
+#                      ld_meta_data is absent/empty.
+#   --output           Output manifest TSV path.
+#
+# Output TSV columns (one row per gene):
+#   region_id, gene_id, chr, start, end, ld_block, studies, gwas_tsvs,
+#   column_mappings, n_cases, n_controls, ld_meta
+# The grouped GWAS columns are empty when no study covers the gene's
+# chromosome (xQTL-only coloc).
+
+suppressPackageStartupMessages({
+  library(argparser)
+})
+
+parser <- arg_parser("Resolve colocboost per-gene analysis units into a manifest TSV")
+parser <- add_argument(parser, "--pheno-manifest",
+                       help = "QtlDataset phenotype manifest TSV (ID, #chr, start, end)",
+                       type = "character")
+parser <- add_argument(parser, "--cis-window",
+                       help = "bp added on each side of the gene body for the default ld_block",
+                       type = "numeric", default = 1000000)
+parser <- add_argument(parser, "--customized-association-windows",
+                       help = "Optional BED (#chr start end ID) overriding ld_block per gene",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--region-name",
+                       help = "Optional comma-separated gene IDs to restrict to",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--region-list",
+                       help = "Optional file whose last column lists gene IDs to restrict to",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--gwas-meta",
+                       help = "Optional colocboost GWAS meta TSV",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--ld-meta",
+                       help = "Optional default LD-meta path",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--output",
+                       help = "Output manifest TSV path", type = "character")
+argv <- parse_args(parser)
+
+# ----- Small helpers --------------------------------------------------------
+norm_chr <- function(x) {
+  x <- as.character(x)
+  ifelse(is.na(x) | !nzchar(x), x,
+         ifelse(startsWith(x, "chr"), x, paste0("chr", x)))
+}
+resolve_against <- function(p, dir) {
+  if (is.na(p) || !nzchar(p)) return("")
+  if (startsWith(p, "/")) return(p)
+  file.path(dir, p)
+}
+read_tsv <- function(path) {
+  read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE,
+             check.names = FALSE, comment.char = "")
+}
+
+# ----- Gene coordinates from the phenotype manifest -------------------------
+if (is.null(argv$pheno_manifest) || !nzchar(argv$pheno_manifest) ||
+    !file.exists(argv$pheno_manifest))
+  stop("--pheno-manifest is required and must exist: ", argv$pheno_manifest)
+pm <- read_tsv(argv$pheno_manifest)
+id_col    <- intersect(c("ID", "gene_id", "phenotype_id"), names(pm))[1L]
+chr_col   <- intersect(c("#chr", "chrom", "chr"),          names(pm))[1L]
+start_col <- intersect(c("start", "Start"),                names(pm))[1L]
+end_col   <- intersect(c("end", "End"),                    names(pm))[1L]
+if (any(is.na(c(id_col, chr_col, start_col, end_col))))
+  stop("--pheno-manifest needs ID, #chr/chrom, start, end columns (got: ",
+       paste(names(pm), collapse = ", "), ").")
+
+genes  <- as.character(pm[[id_col]])
+gchr   <- norm_chr(pm[[chr_col]])
+gstart <- suppressWarnings(as.integer(pm[[start_col]]))
+gend   <- suppressWarnings(as.integer(pm[[end_col]]))
+uniqGenes <- unique(genes)
+coords <- lapply(uniqGenes, function(g) {
+  idx <- which(genes == g)
+  list(chr = gchr[idx][[1L]],
+       start = min(gstart[idx], na.rm = TRUE),
+       end   = max(gend[idx],   na.rm = TRUE))
+})
+names(coords) <- uniqGenes
+
+# ----- Optional customized association windows ------------------------------
+custom <- list()
+if (nzchar(argv$customized_association_windows) &&
+    argv$customized_association_windows != "." &&
+    file.exists(argv$customized_association_windows)) {
+  caw <- read.table(argv$customized_association_windows, header = FALSE,
+                    sep = "", stringsAsFactors = FALSE, comment.char = "#")
+  # columns: chr start end ID
+  for (i in seq_len(nrow(caw))) {
+    g <- as.character(caw[[4L]][[i]])
+    custom[[g]] <- list(chr = norm_chr(caw[[1L]][[i]]),
+                        start = as.integer(caw[[2L]][[i]]),
+                        end   = as.integer(caw[[3L]][[i]]))
+  }
+}
+
+# ----- Resolve the gene list ------------------------------------------------
+selected <- character(0)
+rn <- trimws(strsplit(argv$region_name, ",")[[1L]])
+rn <- rn[nzchar(rn)]
+if (length(rn) > 0L) {
+  selected <- rn
+} else if (nzchar(argv$region_list) && argv$region_list != "." &&
+           file.exists(argv$region_list)) {
+  for (line in readLines(argv$region_list)) {
+    line <- trimws(line)
+    if (!nzchar(line) || startsWith(line, "#")) next
+    parts <- strsplit(line, "\\s+")[[1L]]
+    selected <- c(selected, parts[[length(parts)]])
+  }
+  selected <- unique(selected)
+} else {
+  selected <- uniqGenes
+}
+
+# Coordinate lookup: customized window wins, else gene body +/- cis-window.
+cis <- as.integer(argv$cis_window)
+gene_window <- function(g) {
+  if (!is.null(custom[[g]])) {
+    w <- custom[[g]]
+    return(list(chr = w$chr, gstart = NA_integer_, gend = NA_integer_,
+                wstart = max(w$start, 0L), wend = w$end))
+  }
+  if (!is.null(coords[[g]])) {
+    c0 <- coords[[g]]
+    return(list(chr = c0$chr, gstart = c0$start, gend = c0$end,
+                wstart = max(c0$start - cis, 0L), wend = c0$end + cis))
+  }
+  stop("Gene '", g, "' not found in --pheno-manifest or ",
+       "--customized-association-windows; cannot determine its window.")
+}
+
+# ----- GWAS studies from the meta -------------------------------------------
+gwasRows <- list()
+if (nzchar(argv$gwas_meta) && argv$gwas_meta != "." &&
+    file.exists(argv$gwas_meta)) {
+  gm <- read_tsv(argv$gwas_meta)
+  req <- c("study_id", "chrom", "file_path")
+  miss <- setdiff(req, names(gm))
+  if (length(miss) > 0L)
+    stop("--gwas-meta missing required column(s): ",
+         paste(miss, collapse = ", "), " (got: ",
+         paste(names(gm), collapse = ", "), ").")
+  metaDir <- dirname(normalizePath(argv$gwas_meta))
+  getcol <- function(nm) if (nm %in% names(gm)) as.character(gm[[nm]]) else NULL
+  cmCol  <- getcol("column_mapping_file")
+  ncCol  <- getcol("n_case")
+  nnCol  <- getcol("n_control")
+  ldCol  <- getcol("ld_meta_data")
+  for (i in seq_len(nrow(gm))) {
+    gwasRows[[length(gwasRows) + 1L]] <- list(
+      study   = as.character(gm$study_id[[i]]),
+      chrom   = as.character(gm$chrom[[i]]),  # "0" = all chromosomes
+      tsv     = resolve_against(as.character(gm$file_path[[i]]), metaDir),
+      cmap    = if (!is.null(cmCol)) resolve_against(cmCol[[i]], metaDir) else "",
+      ncase   = if (!is.null(ncCol)) ncCol[[i]] else "",
+      ncontrol= if (!is.null(nnCol)) nnCol[[i]] else "",
+      ldmeta  = if (!is.null(ldCol)) resolve_against(ldCol[[i]], metaDir) else "")
+  }
+}
+default_ld <- if (nzchar(argv$ld_meta) && argv$ld_meta != ".") argv$ld_meta else ""
+
+studies_for_chr <- function(chr) {
+  Filter(function(r) r$chrom == "0" || norm_chr(r$chrom) == chr, gwasRows)
+}
+
+# ----- Build the per-gene manifest ------------------------------------------
+rows <- list()
+for (g in selected) {
+  w <- gene_window(g)
+  hits <- studies_for_chr(w$chr)
+  joinf <- function(field) paste(vapply(hits, function(r) {
+    v <- r[[field]]; if (is.null(v) || is.na(v)) "" else as.character(v)
+  }, character(1)), collapse = ",")
+  # One LD-meta per region (gwas_sumstats_construct.R takes a single panel):
+  # first non-empty study ld_meta, else the --ld-meta default.
+  ldset <- unique(Filter(nzchar, vapply(hits, function(r) r$ldmeta, character(1))))
+  ldmeta <- if (length(ldset) >= 1L) ldset[[1L]] else default_ld
+  if (length(ldset) > 1L)
+    message("NOTE: gene '", g, "' has ", length(ldset),
+            " distinct study LD-meta files; using the first (",
+            ldmeta, ") for the shared region panel.")
+  rows[[length(rows) + 1L]] <- data.frame(
+    region_id       = gsub("[^A-Za-z0-9_]", "_", g),
+    gene_id         = g,
+    chr             = w$chr,
+    start           = if (is.na(w$gstart)) w$wstart else w$gstart,
+    end             = if (is.na(w$gend))   w$wend   else w$gend,
+    ld_block        = sprintf("%s:%d-%d", w$chr, w$wstart, w$wend),
+    studies         = joinf("study"),
+    gwas_tsvs       = joinf("tsv"),
+    column_mappings = joinf("cmap"),
+    n_cases         = joinf("ncase"),
+    n_controls      = joinf("ncontrol"),
+    ld_meta         = if (length(hits) > 0L) ldmeta else "",
+    stringsAsFactors = FALSE)
+}
+if (length(rows) == 0L)
+  stop("No genes selected; check --region-name / --region-list against the ",
+       "phenotype manifest.")
+out <- do.call(rbind, rows)
+
+dir.create(dirname(argv$output), showWarnings = FALSE, recursive = TRUE)
+write.table(out, file = argv$output, sep = "\t", quote = FALSE,
+            row.names = FALSE, na = "")
+nWithGwas <- sum(nzchar(out$studies))
+cat(sprintf("Wrote colocboost manifest with %d gene(s) (%d with GWAS) to %s\n",
+            nrow(out), nWithGwas, argv$output))
diff --git a/code/script/pecotmr_integration/gwas_sumstats_construct.R b/code/script/pecotmr_integration/gwas_sumstats_construct.R
index 360df60a..077057d1 100644
--- a/code/script/pecotmr_integration/gwas_sumstats_construct.R
+++ b/code/script/pecotmr_integration/gwas_sumstats_construct.R
@@ -78,6 +78,15 @@ parser <- add_argument(parser, "--genome",
 parser <- add_argument(parser, "--column-mapping",
                        help = "Optional YAML mapping file(s), comma-separated (none / one-for-all / one-per-study)",
                        type = "character", default = "")
+parser <- add_argument(parser, "--n-case",
+                       help = "Optional per-study case counts, comma-separated (one per study; NA for quantitative). Stored on GwasSumStats for case/control effective-N downstream.",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--n-control",
+                       help = "Optional per-study control counts, comma-separated (one per study; NA for quantitative).",
+                       type = "character", default = "")
+parser <- add_argument(parser, "--pip-cutoff-to-skip",
+                       help = "Skip a study whose single-trait max PIP is below this cutoff (summaryStatsQc pipCutoffToSkip); 0 disables, <0 uses 3/n_variants.",
+                       type = "numeric", default = 0)
 parser <- add_argument(parser, "--qc-method",
                        help = "LD-mismatch QC: 'none' (default), 'slalom', or 'dentist'",
                        type = "character", default = "none")
@@ -124,6 +133,22 @@ mappings <- if (length(mappings) == 0L) {
        " for ", length(studies), " studies).")
 }
 
+# ----- Optional per-study case/control counts -------------------------------
+# Comma-separated, one per study; "NA"/"" entries allowed for quantitative
+# studies in a mixed collection. NULL (unset) leaves the columns off entirely.
+parseCounts <- function(s, nm) {
+  v <- splitCsv(s)
+  if (length(v) == 0L) return(NULL)
+  num <- suppressWarnings(as.numeric(v))
+  if (length(num) == 1L) num <- rep(num, length(studies))
+  if (length(num) != length(studies))
+    stop("--", nm, " must supply one value per study (got ", length(num),
+         " for ", length(studies), " studies).")
+  num
+}
+nCase    <- parseCounts(argv$n_case, "n-case")
+nControl <- parseCounts(argv$n_control, "n-control")
+
 # ----- Parse --qc-args JSON -------------------------------------------------
 qc_extra <- if (nzchar(argv$qc_args) && argv$qc_args != "." &&
                 argv$qc_args != "{}") {
@@ -299,14 +324,17 @@ gss <- GwasSumStats(
   study    = studies,
   entry    = entries,
   genome   = argv$genome,
-  ldSketch = ld_handle)
+  ldSketch = ld_handle,
+  nCase    = nCase,
+  nControl = nControl)
 gss_out <- if (argv$skip_qc) {
   message("--skip-qc set; serialising raw GwasSumStats without summaryStatsQc().")
   gss
 } else {
   qc_call_args <- c(list(gss,
-                          zMismatchQc = qc_method,
-                          impute      = isTRUE(argv$impute)),
+                          zMismatchQc     = qc_method,
+                          impute          = isTRUE(argv$impute),
+                          pipCutoffToSkip = argv$pip_cutoff_to_skip),
                     qc_extra)
   do.call(summaryStatsQc, qc_call_args)
 }
diff --git a/code/script/pecotmr_integration/qtl_dataset_construct.R b/code/script/pecotmr_integration/qtl_dataset_construct.R
index 606b024d..cc387386 100644
--- a/code/script/pecotmr_integration/qtl_dataset_construct.R
+++ b/code/script/pecotmr_integration/qtl_dataset_construct.R
@@ -31,6 +31,12 @@
 #                          columns are samples.
 #   --maf-cutoff           Pass-through MAF cutoff for QtlDataset().
 #   --xvar-cutoff          Pass-through variance cutoff for QtlDataset().
+#   --mac-cutoff           Pass-through minor-allele-count cutoff.
+#   --imiss-cutoff         Pass-through per-variant missingness cutoff.
+#   --keep-samples         Optional file of sample IDs to keep.
+#   --keep-variants        Optional file of variant IDs to keep.
+#   --drop-indel           Drop indels (QtlDataset keepIndel = FALSE).
+#   --no-scale-residuals   Disable residual scaling (scaleResiduals = FALSE).
 #   --output               Output RDS path.
 
 suppressPackageStartupMessages({
@@ -76,6 +82,9 @@ parser <- add_argument(parser, "--keep-samples",
 parser <- add_argument(parser, "--keep-variants",
                        help = "Path to a whitespace-delimited file of variant IDs to restrict to (QtlDataset keepVariants)",
                        type = "character", default = "")
+parser <- add_argument(parser, "--no-scale-residuals",
+                       help = "Do not scale residuals (QtlDataset scaleResiduals = FALSE; default scales)",
+                       flag = TRUE)
 parser <- add_argument(parser, "--output",
                        help = "Output RDS path", type = "character")
 argv <- parse_args(parser)
@@ -222,7 +231,8 @@ qd_args <- list(
   macCutoff          = argv$mac_cutoff,
   xvarCutoff         = argv$xvar_cutoff,
   imissCutoff        = argv$imiss_cutoff,
-  keepIndel          = !isTRUE(argv$drop_indel))
+  keepIndel          = !isTRUE(argv$drop_indel),
+  scaleResiduals     = !isTRUE(argv$no_scale_residuals))
 # keepSamples / keepVariants only when a file was given (else the constructor
 # default = keep all).
 if (length(keep_samples)  > 0L) qd_args$keepSamples  <- keep_samples

From 7ccbca180affc2cc32593b9ae3b216f3f37300b7 Mon Sep 17 00:00:00 2001
From: Daniel Nachun <daniel.nachun@gmail.com>
Date: Wed, 24 Jun 2026 23:33:53 -0700
Subject: [PATCH 2/3] remove irrelevant colocboost.ipynb

---
 code/SoS/pecotmr_integration/colocboost.ipynb | 100 ------------------
 1 file changed, 100 deletions(-)
 delete mode 100644 code/SoS/pecotmr_integration/colocboost.ipynb

diff --git a/code/SoS/pecotmr_integration/colocboost.ipynb b/code/SoS/pecotmr_integration/colocboost.ipynb
deleted file mode 100644
index c66822cf..00000000
--- a/code/SoS/pecotmr_integration/colocboost.ipynb
+++ /dev/null
@@ -1,100 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "# Per-gene / per-region colocboost across contexts (and optionally GWAS) in a single study\n\n## Description\n\nFor a single study's `QtlDataset` RDS, run `pecotmr::colocboostPipeline()` per fan-out unit (gene or region). Three pipeline variants can be enabled independently:\n\n- `--xqtl-coloc` (default ON; opt out with `--no-xqtl-coloc=true` or `xqtl_coloc=False` in SoS): within-study cross-context QTL coloc.\n- `--joint-gwas` (default OFF): xQTL + GWAS joint coloc with the focal outcome from the GWAS. Requires `--gwas-sumstats`.\n- `--separate-gwas` (default OFF): one focal-GWAS coloc per GWAS study. Requires `--gwas-sumstats`.\n\nWhen `--gwas-sumstats` is supplied it must point at a QC'd `GwasSumStats` RDS (output of `summaryStatsQc()` upstream).\n\n## Inputs\n\n- `--qtl-dataset` &mdash; QtlDataset RDS produced by `qtl_dataset.ipynb`.\n- `--genes ID1 ID2 \u2026` **or** `--regions chr:start-end \u2026` (mutually exclusive).\n- `--cis-window` &mdash; bp window (gene mode default 1e6).\n- `--gwas-sumstats` &mdash; QC'd GwasSumStats RDS path (required for joint/separate variants).\n- `--xqtl-coloc` / `--joint-gwas` / `--separate-gwas` &mdash; pipeline-variant toggles.\n- `--modular-script-dir` &mdash; directory containing the per-gene worker R scripts. Default `code/script`.\n\n## Output\n\n- `{cwd}/{study}.{gene|region}.colocboost.rds` per fan-out unit. The RDS holds the colocboostPipeline list with slots `xqtl_coloc`, `joint_gwas`, `separate_gwas`, `computing_time`; unused-variant slots are NULL.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Example\n\n```bash\n# xQTL-only (default)\nsos run pipeline/colocboost.ipynb colocboost \\\n    --cwd output --modular-script-dir /path/to/code/script \\\n    --study TEST_STUDY \\\n    --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n    --genes ENSG00000060237\n\n# xQTL + joint-GWAS\nsos run pipeline/colocboost.ipynb colocboost \\\n    --cwd output --modular-script-dir /path/to/code/script \\\n    --study TEST_STUDY \\\n    --qtl-dataset output/TEST_STUDY.qtl_dataset.rds \\\n    --gwas-sumstats output/GWAS.qcd.rds \\\n    --joint-gwas True \\\n    --genes ENSG00000060237\n```\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[global]\nparameter: cwd = path('output')\nparameter: study = str\nparameter: qtl_dataset = path\nparameter: modular_script_dir = path('code/script')\n# Mutually exclusive fan-out sources.\nparameter: genes = []\nparameter: regions = []\nparameter: cis_window = 1000000\n# GWAS-side input: a QC'd GwasSumStats RDS (build it upstream).\nparameter: gwas_sumstats = path('.')\n# Pipeline variants.\nparameter: xqtl_coloc = True       # within-study cross-context (default ON)\nparameter: joint_gwas = False      # xQTL + GWAS joint (requires gwas_sumstats)\nparameter: separate_gwas = False   # per-GWAS-study separate-focal (requires gwas_sumstats)\nparameter: container = ''\nparameter: job_size = 1\nparameter: walltime = '30m'\nparameter: mem = '8G'\nparameter: numThreads = 1\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[colocboost]\n",
-    "if bool(genes) == bool(regions):\n",
-    "    raise ValueError(\"Specify exactly one of --genes (trait IDs) or --regions (chr:start-end strings).\")\n",
-    "if (joint_gwas or separate_gwas) and not gwas_sumstats.is_file():\n",
-    "    raise ValueError(\"--joint-gwas / --separate-gwas require --gwas-sumstats pointing at a QC'd GwasSumStats RDS.\")\n",
-    "if not (xqtl_coloc or joint_gwas or separate_gwas):\n",
-    "    raise ValueError(\"Enable at least one of xqtl_coloc, joint_gwas, separate_gwas.\")\n",
-    "fanout_items = genes if genes else regions\n",
-    "fanout_kind  = 'gene' if genes else 'region'\n",
-    "input: qtl_dataset, for_each = 'fanout_items'\n",
-    "output: f\"{cwd}/{study}.{_fanout_items.replace(':', '_').replace('-', '_')}.colocboost.rds\"\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f\"{step_name}_{_output:bn}\"\n",
-    "bash: expand = '${ }', stderr = f\"{_output}.stderr\", stdout = f\"{_output}.stdout\", container = container\n",
-    "    Rscript ${modular_script_dir}/pecotmr_integration/colocboost.R \\\n",
-    "        --qtl-dataset ${_input} \\\n",
-    "        ${('--gene-id ' + _fanout_items + ' --cis-window ' + str(cis_window)) if fanout_kind == 'gene' else ('--region ' + _fanout_items)} \\\n",
-    "        ${'--gwas-sumstats ' + str(gwas_sumstats) if gwas_sumstats.is_file() else ''} \\\n",
-    "        ${'--no-xqtl-coloc' if not xqtl_coloc else ''} \\\n",
-    "        ${'--joint-gwas' if joint_gwas else ''} \\\n",
-    "        ${'--separate-gwas' if separate_gwas else ''} \\\n",
-    "        --output ${_output}\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "SoS",
-   "language": "sos",
-   "name": "sos"
-  },
-  "language_info": {
-   "codemirror_mode": "sos",
-   "file_extension": ".sos",
-   "mimetype": "text/x-sos",
-   "name": "sos",
-   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
-   "pygments_lexer": "sos"
-  },
-  "sos": {
-   "kernels": [
-    [
-     "Bash",
-     "bash",
-     "Bash",
-     "#E6EEFF",
-     "shell"
-    ],
-    [
-     "SoS",
-     "sos",
-     "",
-     "",
-     "sos"
-    ]
-   ],
-   "version": "0.24.3"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
\ No newline at end of file

From c0ba7bba781c600ab2de2c18c6f8909b9d3a2426 Mon Sep 17 00:00:00 2001
From: Daniel Nachun <daniel.nachun@gmail.com>
Date: Wed, 24 Jun 2026 23:34:42 -0700
Subject: [PATCH 3/3] fix notebooks links

---
 pipeline/colocboost.ipynb     | 1 -
 pipeline/colocboost_mnm.ipynb | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)
 delete mode 120000 pipeline/colocboost.ipynb
 create mode 120000 pipeline/colocboost_mnm.ipynb

diff --git a/pipeline/colocboost.ipynb b/pipeline/colocboost.ipynb
deleted file mode 120000
index 739d2890..00000000
--- a/pipeline/colocboost.ipynb
+++ /dev/null
@@ -1 +0,0 @@
-../code/SoS/pecotmr_integration/colocboost.ipynb
\ No newline at end of file
diff --git a/pipeline/colocboost_mnm.ipynb b/pipeline/colocboost_mnm.ipynb
new file mode 120000
index 00000000..1b3d4288
--- /dev/null
+++ b/pipeline/colocboost_mnm.ipynb
@@ -0,0 +1 @@
+../code/SoS/mnm_analysis/mnm_methods/colocboost_mnm.ipynb
\ No newline at end of file