
GPU-Accelerated QX Test for Population Genetics



Fast Pairwise Qₓ Statistic (GPU)

Fast implementation of the pairwise Qₓ polygenic selection statistic introduced by Berg & Coop (2014) for detecting directional polygenic adaptation between population pairs. Achieves a 13-31× speedup over the CPU baseline using NVIDIA GPU acceleration with CuPy.

Method

FastQx implements the pairwise Qₓ statistic (Berg & Coop, 2014, PLoS Genetics) for detecting signatures of directional polygenic adaptation between two populations, using the Hudson (1992) F_ST estimator reported by Bhatia et al. (2013).

$$ \boxed{Q_x = \frac{\left( \sum_i \beta_i (p_{1i} - p_{2i}) \right)^2}{\sum_i \beta_i^2 \cdot 2 p_{1i} p_{2i} \cdot \hat{F}_{\text{ST},i}}} $$

where:

  • Numerator: squared difference in polygenic scores

  • Denominator: expected variance under neutral drift

  • $$\hat{F}_{\text{ST},i}$$ = Hudson (1992) estimator as per Bhatia et al. (2013, Genome Research, Eq. S7):

    $$\hat{F}_{\text{ST},i} = \frac{(p_{1i} - p_{2i})^2 - \dfrac{p_{1i}(1 - p_{1i})}{n_1} - \dfrac{p_{2i}(1 - p_{2i})}{n_2}}{p_{1i}(1 - p_{2i}) + p_{2i}(1 - p_{1i})}$$
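
For reference, here is a minimal CPU sketch of the statistic exactly as written above (plain R, no GPU). It is an illustration, not the package's exported API; the arguments maf1, maf2, beta, n1 and n2 simply mirror the inputs used by the functions later in this README.

# Minimal CPU sketch of the pairwise Qx statistic defined above
# maf1, maf2: allele frequencies of the trait-associated SNPs in the two populations
# beta:       GWAS effect sizes (same SNP order)
# n1, n2:     optional sample sizes for the correction terms in Hudson's F_ST
qx_pairwise <- function(maf1, maf2, beta, n1 = NULL, n2 = NULL) {
  fst_num <- (maf1 - maf2)^2
  if (!is.null(n1) && !is.null(n2)) {
    fst_num <- fst_num - maf1 * (1 - maf1) / n1 - maf2 * (1 - maf2) / n2
  }
  fst <- fst_num / (maf1 * (1 - maf2) + maf2 * (1 - maf1))

  numerator   <- sum(beta * (maf1 - maf2), na.rm = TRUE)^2          # squared polygenic score difference
  denominator <- sum(beta^2 * 2 * maf1 * maf2 * fst, na.rm = TRUE)  # expected variance under drift
  numerator / denominator
}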

v1.1.1

  • Fixed support for neutral SNPs when the n1 and n2 sample sizes are provided

v1.1

  • Added the sample-size correction terms to Hudson's F_ST, as in Eq. 10 of Bhatia et al. (2013)

v1.0

  • Optional neutral SNP panel → global neutral F_ST replaces per-locus F_ST
  • Eliminates ascertainment bias from GWAS SNPs
  • Closer to the full multivariate Qₓ, even for closely related populations
  • Batched version still 13-31× faster than the CPU baseline (CUDA), depending on the dataset

v0.9 behavior preserved (when no neutral SNPs provided):

  • Falls back to per-locus Hudson F_ST from GWAS SNPs (fast screening mode)

Empirical p-values via effect-size permutation null (sample(beta)) — robust to GWAS stratification bias (Sohail et al., 2019, eLife).
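
The permutation null can be sketched in a few lines of plain R: shuffle the effect sizes with sample(beta), recompute the statistic under each shuffle, and take the empirical p-value as the fraction of permuted statistics at least as large as the observed one. The sketch below reuses the hypothetical qx_pairwise() helper from the Method section (per-locus F_ST from the GWAS SNPs, i.e., the fallback mode); the +1 correction is a common convention, not necessarily what the package uses internally.

# Sketch of the effect-size permutation null (sample(beta))
qx_perm_pvalue <- function(maf1, maf2, beta, n_perm = 10000, seed = 123) {
  set.seed(seed)
  qx_obs  <- qx_pairwise(maf1, maf2, beta)
  qx_perm <- replicate(n_perm, qx_pairwise(maf1, maf2, sample(beta)))
  # Empirical p-value with a +1 correction so it is never exactly zero
  (sum(qx_perm >= qx_obs) + 1) / (n_perm + 1)
}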

📊 Performance Summary

Original dataset: 300 tests with roughly 500 SNPs each

| Method | Time (300 tests, 10k perms) | Speedup (vs CPU) |
|--------|------------------------------|------------------|
| CPU baseline | 105 seconds | 1x |
| Version 1 (sequential GPU) | ~11 seconds | ~10x |
| Version 2 (batch GPU) | ~8 sec | ~13x |

Out-of-memory stress test comparing the sequential and batched GPU versions

| Method | Time (303 tests: the same 300 + 3 problematic tests, 10k perms) | Speedup | Batch size for a 16 GB 5060 Ti |
|--------|------------------------------------------------------------------|---------|--------------------------------|
| Version 1 (sequential GPU) | 210 seconds | 1x | n/a |
| Version 2 (batch GPU) | ~77 seconds | ~3.7x | 7 |

Stress test dataset: 100 tests with 2000-5000 SNPs each, 10,000 permutations

| Method | Time (100 tests, 10k perms) | Speedup | Batch size for a 16 GB 5060 Ti |
|--------|------------------------------|---------|--------------------------------|
| Version 1 (sequential GPU) | ~603.85 seconds | 1x | with out-of-memory errors |
| Version 2 (batch GPU) | ~19 sec | ~32x | 7 |

✨ Key Features

Two Optimized GPU Implementations

Version 1: Sequential GPU (adaptive). Recommended for small datasets and debugging.

  • Processes tests one-by-one on GPU
  • Each test gets perfectly-sized permutation matrix (n_snps × n_perm)
  • Adaptive OOM handling: Automatically excludes tests too large for GPU
  • Best for: small datasets, debugging phase.

Version 2: Batch GPU 🚀. Fastest for uniform datasets.

  • Processes tests in batches (configurable batch size)
  • Each test gets its own matrix (no memory waste!)
  • True parallel compute: Queues all operations, single sync per batch
  • The faster option on most datasets
  • Best for: Large uniform datasets, sufficient GPU memory (16GB+)

Key Innovations

  • Zero memory waste: per-test matrix generation (not max-size matrices)
  • Pure GPU computation: permutation matrices generated directly on the GPU
  • Adaptive OOM recovery: automatic exclusion of problematic tests
  • Numerical accuracy: < 1e-5 difference from the CPU baseline


🚀 Quick Start

Installation

# Install R packages
install.packages("reticulate")

# Install CuPy (GPU computing library)
reticulate::py_install("cupy-cuda12x")  # For CUDA 12.x
# OR
reticulate::py_install("cupy-cuda11x")  # For CUDA 11.x
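
Before running anything, it is worth confirming that CuPy can actually see a CUDA device from R. A minimal check (it only uses cupy.cuda.is_available() and cupy.cuda.runtime.getDeviceCount()):

# Sanity check: CuPy must see at least one CUDA device
library(reticulate)
cp <- import("cupy")
stopifnot(cp$cuda$is_available())                      # fails if no usable GPU is found
cat("CUDA devices found:", cp$cuda$runtime$getDeviceCount(), "\n")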

Basic Usage

# Load library and import CuPy
library(reticulate)
cp <- import("cupy")

# Run version 1 (sequential GPU) - memory-efficient
result <- compute_Qx_sequential_gpu_pure_adaptive(
  maf1 = population1_allele_freqs,
  maf2 = population2_allele_freqs,
  beta = gwas_effect_sizes,
  n_perm = 10000,
  seed = 123
)

# Run version 2 (batch GPU) - maximum speed
results <- compute_Qx_batch_gpu_adaptive(
  maf1_list = list_of_maf1_vectors,
  maf2_list = list_of_maf2_vectors,
  beta_list = list_of_beta_vectors,
  n_perm = 10000,
  seed = 123,
  batch_size = 8  # Tune based on GPU memory
)

📈 Batch Size Optimization

Critical parameter: batch_size must fit in GPU memory to avoid OOM thrashing.

Memory Estimation Formula

GPU Memory per batch ≈ batch_size × avg_n_snps × n_perm × 4 bytes × 7 (overhead)
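
If you want a starting point rather than a manual calculation, the formula can be wrapped in a small helper; the 7× overhead factor and the ~15% headroom below are rough assumptions, so cross-check the result against the table and the OOM warning that follow.

# Rough batch-size estimate from the formula above (rule of thumb, not a guarantee)
suggest_batch_size <- function(avg_n_snps, n_perm = 10000,
                               gpu_mem_gb = 16, overhead = 7) {
  bytes_per_test <- avg_n_snps * n_perm * 4 * overhead   # float32 values x overhead factor
  budget <- gpu_mem_gb * 1024^3 * 0.85                   # keep ~15% headroom (assumption)
  max(1L, floor(budget / bytes_per_test))
}

suggest_batch_size(avg_n_snps = 3500)   # compare against the table below before committing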

Recommended Batch Sizes (16GB GPU, 10k permutations)

| Avg SNPs per test | Recommended batch size | Memory usage |
|-------------------|------------------------|--------------|
| 400-600 | 60 | ~1.5 GB |
| 1000-2000 | 20-30 | ~8-12 GB |
| 2000-5000 | 7-8 | ~12-14 GB |
| 5000-10000 | 3-5 | ~10-15 GB |

⚠️ Warning: Using too large a batch size causes severe performance degradation due to OOM retry loops!

Performance Cliff Example

Stress test (300 tests, 2000-5000 SNPs, 16GB GPU):

| Batch size | Time | Result |
|------------|------|--------|
| 7-8 | 17 sec | ✅ Optimal |
| 9+ | 60+ min | ❌ OOM thrashing |

🎯 When to Use Each Version

| Scenario | Recommended version | Why |
|----------|---------------------|-----|
| Small dataset with mixed test sizes (100-5000 SNPs) | Version 1 | Handles variety well, adaptive OOM |
| Uniform test sizes + large dataset | Version 2 | Maximum speed via batching |
| Speed critical + GPU with large enough memory | Version 2 | Fastest option (tune batch size!) |
| Testing/debugging | Version 1 | Simpler, easier to debug |

🧹 GPU Memory Management

CuPy caches GPU memory for performance. Use gpu_cleanup() between runs:

gpu_cleanup <- function() {
  # Wait for all queued GPU work to finish before freeing memory
  cp$cuda$Stream$null$synchronize()
  # Release cached device memory held by CuPy's default memory pool
  mempool <- cp$get_default_memory_pool()
  mempool$free_all_blocks()
  # Release cached pinned (page-locked) host memory
  pinned_mempool <- cp$get_default_pinned_memory_pool()
  pinned_mempool$free_all_blocks()
  # Trigger R garbage collection so reticulate drops stale Python references
  gc()
}

# Use between function calls
results1 <- compute_Qx_batch_gpu_adaptive(...)
gpu_cleanup()
results2 <- compute_Qx_batch_gpu_adaptive(...)

🔍 Verification

All GPU implementations produce results that are numerically equivalent to the CPU baseline:

  • Qx (observed statistic): < 1e-5 difference (floating-point precision)
  • Fst: Exact match
  • p-values: extremely close (the permutations use different random seeds, so small differences are expected)
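
A quick way to check these tolerances on your own data is sketched below. compute_Qx_cpu() is a placeholder name for whatever CPU baseline you compare against, and the result fields $Qx and $Fst are assumptions about the returned structure, not a documented interface.

# Hypothetical verification sketch: compare GPU and CPU results on identical inputs
gpu <- compute_Qx_sequential_gpu_pure_adaptive(maf1, maf2, beta, n_perm = 10000, seed = 123)
cpu <- compute_Qx_cpu(maf1, maf2, beta, n_perm = 10000, seed = 123)  # placeholder CPU baseline

stopifnot(abs(gpu$Qx - cpu$Qx) < 1e-5)   # observed statistic agrees to floating-point precision
all.equal(gpu$Fst, cpu$Fst)              # F_ST should match exactly
# p-values are compared only approximately: the permutation streams differ by design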

📁 Repository Structure

.
├── gpu_qx_demo.ipynb      # Main demonstration notebook
├── README.md              # This file
└── gpu_qx_functions.R     # R script with the functions

⚠️ Important Notes

NA Handling

Following the original implementation:

  1. Pre-filter: remove entire SNPs (rows) containing NA using na.omit()
  2. This ensures the observed and permuted calculations use identical SNP sets
  3. Safety: na.rm = TRUE in the sums as a failsafe (see the sketch below)
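
As a concrete illustration of steps 1-3 (a sketch; building a data frame is just one convenient way to keep the three vectors aligned):

# Drop every SNP (row) with an NA before testing, so the observed and
# permuted statistics are computed on exactly the same SNP set
dat <- na.omit(data.frame(maf1 = maf1, maf2 = maf2, beta = beta))

result <- compute_Qx_sequential_gpu_pure_adaptive(
  maf1 = dat$maf1,
  maf2 = dat$maf2,
  beta = dat$beta,
  n_perm = 10000,
  seed = 123
)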

Random Seeds

  • CPU and GPU use different random number generators
  • Qx (observed) values match across methods (to floating-point precision)
  • p-values will slightly differ (different permutation samples)
  • This is expected behavior, not a bug!

Demo Data

The notebook uses simulated data to demonstrate the GPU optimization technique. Real genomic data is subject to data use agreements and is not included in this repository.


Limitations of the two-population approximation

The implementations in FastQx use a pairwise (two-population) simplification of the full multivariate Qₓ statistic (Berg & Coop, 2014). While mathematically valid and widely used for exploratory scans, this approximation has known limitations:

| Limitation | Explanation | Practical consequence |
|------------|-------------|-----------------------|
| Ignores hierarchical population structure | The full Qₓ uses a complete covariance matrix F estimated from neutral SNPs across all populations. The pairwise version implicitly assumes a star phylogeny (no shared drift between the two populations after their split). | Overestimates drift variance (and thus underestimates Qₓ) when the two populations are closely related (e.g., French vs. British, Tuscans vs. Spaniards). Signals can be diluted or lost. |
| Less powerful than the full multivariate Qₓ for >2 populations | With many populations, the full Qₓ (or PCA-based versions) extracts more signal by jointly modeling all covariances. | Pairwise scans require multiple-testing correction across ~1,200 pairs (Bonferroni ≈ 4×10⁻⁵), reducing sensitivity compared to a single multivariate test. |
| Permutation null assumes exchangeable effect sizes | Shuffling betas destroys true signal but assumes the distribution of ascertained effect sizes is representative, a standard and accepted assumption (Sohail et al., 2019). | Remains robust to the main GWAS stratification biases that plagued early Qₓ studies. |

When the approximation works well

  • Distant population pairs (e.g., Yoruba vs. French, Han vs. Papuan) → shared drift is negligible → pairwise ≈ full Qₓ
  • Large allele frequency differences (strong clines: height, pigmentation, lactose persistence)
  • Exploratory genome-wide or trait-wide screens where speed is critical

Bottom line
FastQx is designed as a high-throughput screening tool. Significant pairwise hits should ideally be confirmed with the full multivariate Qₓ (e.g., using the original Berg & Coop code, Colbran’s Julia implementation, or PCA-based tests) before claiming strong evidence of polygenic adaptation.

License

MIT License

Citation

If you use this tool in published research, please cite:

Berg JJ, Coop G (2014). A population genetic signal of polygenic adaptation. PLoS Genetics 10(8):e1004412. https://doi.org/10.1371/journal.pgen.1004412
Sohail M, et al. (2019). Polygenic adaptation on height is overestimated due to uncorrected stratification. eLife 8:e39702. https://doi.org/10.7554/eLife.39702

If the software contributed substantially, please also consider citing:

Pasquale Caterino (2025). FastQx: High-performance GPU implementation of the pairwise Qₓ statistic. https://github.com/pasq-cat/FastQx


🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📧 Contact

For questions or issues, please open an issue on GitHub or contact the repository maintainer.


🙏 Acknowledgments

  • Original QX test implementation: #toadd
  • GPU optimization: This implementation
