This tutorial is a guide to perform TWAS with the data used in Torres‐Rodríguez, J. Vladimir, et al. "Population‐level gene expression can repeatedly link genes to functions in maize." The Plant Journal (2024).

corn2code/TWAStutorial

Transcriptome-Wide Association Study (TWAS) Tutorial in maize

From RNA-seq processing to TWAS

Quick Navigation

Section | Focus                                    | Main outputs
Part 1  | RNA-seq QC, trimming, and quantification | abundance.tsv, merged_gene_tpms.csv
Part 2  | Association testing and visualization    | TWAS.CMLM_*.csv, associatedGenes.csv

Reference

This tutorial is based on data used in: Torres-Rodriguez, J. Vladimir, et al. (2024), "Population-level gene expression can repeatedly link genes to functions in maize," The Plant Journal.

For missing files or questions, contact vladimir.torres@unl.edu.

Prerequisites

  • Familiarity with Bash, Python, and R.
  • Access to fastqc, multiqc, trimmomatic, and kallisto.
  • Optional HPC/Slurm environment for parallel runs.
  • Large supporting datasets on Figshare: TWAS_tutorial

Part 1: Obtain gene expression levels from RNA-seq

Step 1: Setup working folders

Use a consistent project directory name to avoid path confusion:

export TWAS_WORK="$HOME/TWAS_tutorial"
mkdir -p "$TWAS_WORK"/{scripts,input,output,log.out,fasta,fasta.trimmed}
cd "$TWAS_WORK/scripts"

Copy helper files from this repository into scripts/:

cp /path/to/TWAStutorial/downloadSamples.sh .
cp /path/to/TWAStutorial/samples.txt .

Step 2: Download example FASTQ files (12 files, 6 individuals)

Data source: ENA study PRJEB67964.

chmod +x downloadSamples.sh
./downloadSamples.sh > download_log.txt 2>&1

# Verify the download: expect 12 gzipped FASTQ files
ls -lh ../fasta
ls ../fasta/*.fastq.gz | wc -l

# Peek into one file
zcat ../fasta/2369_1.fastq.gz | head
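Read counts can also be sanity-checked with a short script. This minimal Python sketch (the helper name is illustrative) counts FASTQ records by dividing the line count by four, since each record spans exactly four lines:

```python
import gzip

def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4
```

For paired-end data, the _1 and _2 files of a sample should report identical counts.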

Step 3: Quality control with FastQC and MultiQC

Create report folders:

mkdir -p ../fasta/fastqc_reports ../fasta/multiqc_report

Run a single file first:

ml fastqc/0.12   # or module load fastqc/0.12
fastqc ../fasta/2369_1.fastq.gz -o ../fasta/fastqc_reports

Run all 12 files:

fastqc ../fasta/*.fastq.gz -o ../fasta/fastqc_reports

Optional Slurm array example (fastqc.slurm):

#!/bin/sh
#SBATCH --array=1-12
#SBATCH --job-name=fastqc
#SBATCH --time=3-00:00:00
#SBATCH --mem-per-cpu=10GB
#SBATCH --output=../log.out/%x_%a.out
#SBATCH --partition=schnablelab,batch
#SBATCH --mail-type=ALL

ml fastqc/0.12

SAMPLE_FILE="samples.txt"
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${SAMPLE_FILE}")
echo "${SAMPLE}"

fastqc "${SAMPLE}" -o ../fasta/fastqc_reports

Submit:

sbatch fastqc.slurm

Aggregate reports:

ml multiqc/py37/1.8   # or module load multiqc
multiqc ../fasta/fastqc_reports -o ../fasta/multiqc_report

Step 4: Remove adapters and low-quality bases (Trimmomatic)

Trimmomatic adapter files are available here: Trimmomatic adapters

Create file prefixes for paired-end files:

find ../fasta -name "*_1.fastq.gz" | sed 's/_1\.fastq\.gz$//' > files.path.txt
mkdir -p ../fasta.trimmed

Create trimmomatic.slurm:

#!/bin/sh
#SBATCH --array=1-6
#SBATCH --job-name=trimm
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=10GB
#SBATCH --output=../log.out/%x_%a.out
#SBATCH --partition=schnablelab,batch
#SBATCH --mail-type=ALL

ml trimmomatic/0.33
ml java/12

samplesheet="files.path.txt"
f=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${samplesheet}" | awk '{print $1}')
o=$(echo "${f}" | cut -d'/' -f3-)   # strip the leading ../fasta/ to get the sample name

mkdir -p ../fasta.trimmed/"${o}"

java -jar "${TM_HOME}/trimmomatic.jar" PE -phred33 \
  "${f}_1.fastq.gz" "${f}_2.fastq.gz" \
  ../fasta.trimmed/"${o}"/"${o}"_1_paired.fastq.gz \
  ../fasta.trimmed/"${o}"/"${o}"_1_unpaired.fastq.gz \
  ../fasta.trimmed/"${o}"/"${o}"_2_paired.fastq.gz \
  ../fasta.trimmed/"${o}"/"${o}"_2_unpaired.fastq.gz \
  ILLUMINACLIP:../TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:35

Submit:

sbatch trimmomatic.slurm
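The SLIDINGWINDOW:4:15 step scans 4-base windows from the 5' end and cuts the read once a window's mean quality drops below 15. A toy Python illustration of that rule (an approximation of the idea, not Trimmomatic's exact implementation):

```python
def sliding_window_cut(quals, window=4, min_mean=15):
    """Return the position at which a read would be cut under a
    SLIDINGWINDOW-style rule: the start of the first window whose
    mean quality falls below the threshold."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean:
            return i  # keep quals[:i]
    return len(quals)  # no window failed; keep the whole read
```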

Step 5: QC after trimming

mkdir -p ../fasta.trimmed/fastqc_reports ../fasta.trimmed/multiqc_report
find ../fasta.trimmed -name "*_paired.fastq.gz" > samples.trimmed.txt

Create fastqc2.slurm:

#!/bin/sh
#SBATCH --array=1-12
#SBATCH --job-name=fastqc2
#SBATCH --time=3-00:00:00
#SBATCH --mem-per-cpu=10GB
#SBATCH --output=../log.out/%x_%a.out
#SBATCH --partition=schnablelab,batch
#SBATCH --mail-type=ALL

ml fastqc/0.12

SAMPLE_FILE="samples.trimmed.txt"
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${SAMPLE_FILE}")
echo "${SAMPLE}"

fastqc "${SAMPLE}" -o ../fasta.trimmed/fastqc_reports

Submit and aggregate:

sbatch fastqc2.slurm
ml multiqc
multiqc ../fasta.trimmed/fastqc_reports -o ../fasta.trimmed/multiqc_report

Step 6: Quantify expression with Kallisto

Kallisto manual: https://pachterlab.github.io/kallisto/manual

Download transcript FASTA: Zmays_833_Zm-B73-REFERENCE-NAM-5.0.55.transcript_primaryTranscriptOnly.fa.gz

Build index (example in input/):

cd ../input
ml kallisto/0.46
kallisto index \
  -i Zm-B73-REFERENCE-NAM-5.0.55.transcript_primaryTranscriptOnly.idx \
  Zmays_833_Zm-B73-REFERENCE-NAM-5.0.55.transcript_primaryTranscriptOnly.fa.gz

Back in scripts/, create kallisto.slurm:

#!/bin/sh
#SBATCH --array=1-6
#SBATCH --job-name=quant
#SBATCH --time=1-00:00:00
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=10GB
#SBATCH --output=../log.out/%x_%a.out
#SBATCH --partition=schnablelab,batch
#SBATCH --mail-type=ALL

ml kallisto/0.46

samplesheet="files.path.txt"
f=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${samplesheet}" | awk '{print $1}')
o=$(echo "${f}" | cut -d'/' -f3-)

mkdir -p ../input/out.kallisto/"${o}"
kallisto quant --threads=10 \
  -i ../input/Zm-B73-REFERENCE-NAM-5.0.55.transcript_primaryTranscriptOnly.idx \
  -o ../input/out.kallisto/"${o}" \
  ../fasta.trimmed/"${o}"/"${o}"_1_paired.fastq.gz \
  ../fasta.trimmed/"${o}"/"${o}"_2_paired.fastq.gz

Submit:

sbatch kallisto.slurm

Each sample folder in ../input/out.kallisto/ contains:

  1. abundance.tsv (transcript-level estimates, including TPM)
  2. abundance.h5 (binary format for downstream tools)
  3. run_info.json (run metadata)
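Before merging, it is worth checking each sample's pseudoalignment rate in run_info.json. A small sketch for that check (the field names n_processed and p_pseudoaligned match recent kallisto releases, but verify against your own output; the 70% threshold is an arbitrary example):

```python
import json

def check_kallisto_run(run_info_path, min_pct=70.0):
    """Summarize a kallisto run_info.json and flag low pseudoalignment rates."""
    with open(run_info_path) as fh:
        info = json.load(fh)
    return {
        "reads": info["n_processed"],            # total reads processed
        "pct_aligned": info["p_pseudoaligned"],  # percent pseudoaligned
        "ok": info["p_pseudoaligned"] >= min_pct,
    }
```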

Step 7: Build a merged TPM matrix

Create make_rna_ss.py:

import os
import sys

mydir = sys.argv[1]
if not os.path.isdir(mydir):
    sys.exit(f"{mydir} is not a valid directory")

gene_exp_dict = {}   # gene ID -> {sample: TPM}
sample_list = []

for asample in os.listdir(mydir):
    sample_dir = os.path.join(mydir, asample)
    if not os.path.isdir(sample_dir):
        continue
    abundance = os.path.join(sample_dir, "abundance.tsv")
    if not os.path.exists(abundance):
        continue

    sample_list.append(asample)
    with open(abundance) as fh:
        fh.readline()  # skip the abundance.tsv header line
        for x in fh:
            y = x.strip().split("\t")
            mygene = y[0]
            mytpm = float(y[-1])  # last column of abundance.tsv is TPM
            if mygene not in gene_exp_dict:
                gene_exp_dict[mygene] = {}
            gene_exp_dict[mygene][asample] = mytpm

with open("merged_gene_tpms.csv", "w") as fh:
    myheader = ["GeneID"] + sorted(sample_list)
    fh.write(",".join(myheader) + "\n")
    for agene in sorted(gene_exp_dict):
        plist = [agene]
        for asample in sorted(sample_list):
            # .get() guards against a transcript missing from one sample's output
            plist.append(gene_exp_dict[agene].get(asample, "NA"))
        fh.write(",".join(map(str, plist)) + "\n")

Run:

python3 make_rna_ss.py ../input/out.kallisto

Output: merged_gene_tpms.csv

Full expression matrix on Figshare: merged_gene_tpms_longestT_csv_zip
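A quick dimension check on the merged matrix can catch missing samples early. This sketch (file name as produced above) uses only the standard library:

```python
import csv

def matrix_summary(path):
    """Return the genes x samples dimensions of the merged TPM matrix."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)          # GeneID followed by sample names
        n_genes = sum(1 for _ in reader)
    return {"samples": len(header) - 1, "genes": n_genes}
```

For this tutorial's six individuals, you would expect samples == 6.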


Part 2: Run TWAS with GAPIT

Step 1: Pre-process data

To save time, pre-processed files are provided in InputFiles.zip on Figshare: TWAS_tutorial

Required files:

  • Filtered_gene_expression: rows are taxa, columns are genes.
  • Gene_information: gene ID, chromosome, and position.
  • Phenotype: trait values.
  • Covariates (optional): additional model covariates.

Filtering criteria used in the original paper:

  1. Remove outlier samples based on PCA from gene expression.
  2. Remove genes with TPM < 0.1 in at least 50% of samples.
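Criterion 2 can be sketched in Python as follows (a minimal illustration, not the paper's exact code; the dict-of-lists input format is an assumption):

```python
def filter_low_expression(tpm_by_gene, min_tpm=0.1, max_low_frac=0.5):
    """Drop genes whose TPM is below min_tpm in at least max_low_frac
    of the samples (criterion 2 above; the PCA outlier removal of
    criterion 1 is applied to samples, separately)."""
    kept = {}
    for gene, tpms in tpm_by_gene.items():
        low_frac = sum(t < min_tpm for t in tpms) / len(tpms)
        if low_frac < max_low_frac:
            kept[gene] = tpms
    return kept
```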

Step 2: Run TWAS

setwd("/Users/vladimir/TWAS_tutorial")

source("https://zzlab.net/GAPIT/gapit_functions.txt")

# install.packages("data.table")
library(data.table)

# load phenotype data
phe <- read.table("pheno_693.txt", header = TRUE)
trait <- which(colnames(phe) == "Anthesis.sp.NE")
myY <- phe[, c(1, trait)]

# covariates
myCV <- read.csv("sampling_693.order.csv", header = TRUE)

# load filtered expression
counts <- fread("counts.NE2020.693.filtered.txt", data.table = FALSE)
row.names(counts) <- counts$taxa

# count taxa shared between the expression and phenotype tables
NROW(merge(counts[, 1, drop = FALSE], phe, by = 1))

# winsorized rescale: map each gene to 0-2 using its 5th/95th percentiles
Quantile <- apply(
  counts[, -1], 2,
  function(A) {
    min_x <- as.numeric(quantile(A, 0.05))
    max_x <- as.numeric(quantile(A, 0.95))
    out <- (2 * (A - min_x) / (max_x - min_x))
    out[out > 2] <- 2
    out[out < 0] <- 0
    out
  }
)

Quantile.t <- as.data.frame(Quantile)
Quantile.t$taxa <- row.names(Quantile.t)
myGD <- Quantile.t[, c(ncol(Quantile.t), 1:(ncol(Quantile.t) - 1))]

myGM <- read.table("gene_info_693.txt", header = TRUE)
unique(myGM$chr)  # sanity check the chromosome values

myGAPIT <- GAPIT(
  Y = myY,
  GD = myGD,
  GM = myGM,
  CV = myCV,
  PCA.total = 3,
  model = "CMLM",
  SNP.MAF = 0,
  file.output = FALSE
)

values <- data.frame(myGAPIT$GWAS)
values$FDR <- p.adjust(values$P.value, method = "BH")
write.csv(values, paste0("TWAS.CMLM_", colnames(phe[trait]), ".csv"), row.names = FALSE)

Step 3: Manhattan plot and significant genes

setwd("/Users/vladimir/TWAS_tutorial")

library(data.table)
library(dplyr)
library(ggplot2)
library(ggrastr)

theme_set(theme_classic(base_size = 19))
theme_update(
  axis.text.x = element_text(colour = "black"),
  axis.text.y = element_text(colour = "black"),
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5)
)

trait <- "Anthesis.sp.NE"
gwas.datTWAS <- fread(paste0("TWAS.CMLM_", trait, ".csv"), data.table = FALSE)

gwas.datTWAS$BPcum <- NA
s <- 0
nbp <- c()
for (i in sort(unique(gwas.datTWAS$Chromosome))) {
  nbp[i] <- max(gwas.datTWAS[gwas.datTWAS$Chromosome == i, ]$Position.)
  gwas.datTWAS[gwas.datTWAS$Chromosome == i, "BPcum"] <- gwas.datTWAS[gwas.datTWAS$Chromosome == i, "Position."] + s
  s <- s + nbp[i]
}

axis.set <- gwas.datTWAS %>%
  group_by(Chromosome) %>%
  summarize(center = (max(BPcum) + min(BPcum)) / 2, maxBP = max(BPcum))

gTWAS.anthesis.NE <- ggplot(
  data = gwas.datTWAS,
  aes(BPcum, -log10(FDR), colour = factor(Chromosome, levels = c(1:10)))
) +
  geom_point(size = 2) +
  geom_hline(yintercept = -log10(0.05), linetype = 2) +
  scale_color_manual(values = c("#03193F", "#28708C", "#BF930F", "#0f3bbf", "#295E52", "#e31a1c", "#fdbf6f", "#ff7f00", "#cab2d6", "#6a3d9a")) +
  annotate("text", label = "Anthesis Nebraska", y = 19.5, x = 100, size = 5, hjust = "inward") +
  scale_x_continuous(label = axis.set$Chromosome, breaks = axis.set$center) +
  theme(legend.position = "none", axis.title.x = element_blank()) +
  ylab(expression(-log[10](FDR))) +
  xlab("Chromosome") +
  scale_y_continuous(limits = c(0, 20), breaks = seq(1, 20, 2))

gTWAS.anthesis.NE

associatedGenes <- gwas.datTWAS[which(gwas.datTWAS$FDR < 0.05), ]
fwrite(associatedGenes, "associatedGenes.csv")

[Figure: Manhattan plot of TWAS results for anthesis in Nebraska]

Step 4: Annotate associated genes

Use IDs from associatedGenes.csv (for example, Zm00001eb057540) to search maize genome databases for functional annotations.

In maize, a gene can have multiple identifiers across reference genome versions, so check identifier synonyms when comparing annotations.
