QCFlowR is a comprehensive bioinformatics workflow for RNA-seq data quality control and analysis. It automates the entire pipeline from raw FASTQ file quality assessment through read trimming, genome alignment, and gene quantification.
QCFlowR performs the following sequential steps on your RNA-seq data:
- Pre-Trimming Quality Control - Analyzes raw FASTQ files using FastQC to assess read quality before processing
- Read Trimming - Removes low-quality bases and short reads using FASTP
  - Trims bases until average quality score ≥ 20
  - Discards reads shorter than 30 bp
- Post-Trimming Quality Control - Re-evaluates FASTQ quality after trimming
- Genome Alignment - Aligns reads to a reference genome using HISAT2
  - Automatically builds or reuses a HISAT2 index
  - Generates sorted BAM files with index files
- Gene Quantification - Counts aligned reads per gene using featureCounts
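
For orientation, the steps above correspond roughly to the commands below for one single-end sample. This is a sketch under stated assumptions, not the contents of `bin/qcflowr.sh`: the function name `print_sample_commands`, the use of fastp's `--cut_right` sliding window for quality trimming, and the `gene_counts.txt` output name are all assumptions. The function only prints the commands, so it can be inspected without any of the tools installed.

```bash
# Print (not run) the commands for one single-end sample.
# Usage: print_sample_commands <fastq> <out_dir> <index_prefix> <gtf>
print_sample_commands() {
    fastq=$1; out=$2; index=$3; gtf=$4
    name=$(basename "$fastq" .fastq)

    # 1. Pre-trimming QC
    echo "fastqc -o $out $fastq"
    # 2. Trimming: sliding-window cut at mean quality 20, drop reads < 30 bp
    #    (--cut_right is an assumed choice of fastp trimming mode)
    echo "fastp -i $fastq -o $out/${name}_trimmed.fastq --cut_right --cut_mean_quality 20 --length_required 30"
    # 3. Post-trimming QC
    echo "fastqc -o $out $out/${name}_trimmed.fastq"
    # 4. Alignment with 8 threads, then coordinate sort and index
    echo "hisat2 -p 8 -x $index -U $out/${name}_trimmed.fastq 2> $out/${name}_hisat2.log | samtools sort -o $out/${name}_sorted.bam -"
    echo "samtools index $out/${name}_sorted.bam"
    # 5. Gene-level counts (output file name is hypothetical)
    echo "featureCounts -a $gtf -o $out/gene_counts.txt $out/${name}_sorted.bam"
}
```

Calling `print_sample_commands ./fastq_samples/sample1.fastq ./results ./reference/genome_index/genome ./reference/annotation.gtf` prints six command lines that can be compared against what the workflow actually runs.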
The following bioinformatics tools must be installed and available in your system PATH:
| Tool | Purpose | Installation |
|---|---|---|
| FastQC | Quality control for FASTQ files | conda install -c bioconda fastqc or download from FastQC |
| FASTP | FASTQ file trimming and filtering | conda install -c bioconda fastp |
| HISAT2 | Sequence alignment to reference genome | conda install -c bioconda hisat2 |
| SAMtools | BAM file processing and indexing | conda install -c bioconda samtools |
| featureCounts | Gene quantification from BAM files | conda install -c bioconda subread |
| R | Statistical analysis (for quality summarization) | conda install -c conda-forge r-base or install from R Project |
conda install -c bioconda fastqc fastp hisat2 samtools subread
conda install -c conda-forge r-base

- Clone the repository:
git clone https://github.com/mlatinov/qcflowr.git
cd qcflowr

- Make the main script executable:
chmod +x bin/qcflowr.sh

- Verify external tools are installed:
fastqc --version
fastp --version
hisat2 --version
samtools --version
featureCounts -v
Rscript --version

./bin/qcflowr.sh -i <input_dir> -o <output_dir> -r <ref_dir> -g <gtf_file>

| Flag | Parameter | Required | Description |
|---|---|---|---|
| -i | input_dir | Yes | Directory containing input FASTQ files |
| -o | output_dir | Yes | Directory for output files (will be created if it doesn't exist) |
| -r | ref_dir | Yes | Directory containing reference genome FASTA file (*.fasta or *.fa) |
| -g | gtf_file | Yes | GTF annotation file for gene quantification |
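
The four required flags could be parsed with the shell's built-in `getopts`, as in the minimal sketch below. The function name `parse_args` and the variable names `INPUT_DIR`, `OUTPUT_DIR`, `REF_DIR`, and `GTF_FILE` are illustrative assumptions; the real script may handle its arguments differently.

```bash
# Parse -i/-o/-r/-g into INPUT_DIR, OUTPUT_DIR, REF_DIR, GTF_FILE.
# Returns non-zero if an unknown flag is seen or a required flag is missing.
parse_args() {
    INPUT_DIR=""; OUTPUT_DIR=""; REF_DIR=""; GTF_FILE=""
    OPTIND=1  # reset so the function can be called more than once
    while getopts "i:o:r:g:" opt; do
        case "$opt" in
            i) INPUT_DIR=$OPTARG ;;
            o) OUTPUT_DIR=$OPTARG ;;
            r) REF_DIR=$OPTARG ;;
            g) GTF_FILE=$OPTARG ;;
            *) return 1 ;;
        esac
    done
    # All four flags are required
    if [ -z "$INPUT_DIR" ] || [ -z "$OUTPUT_DIR" ] || [ -z "$REF_DIR" ] || [ -z "$GTF_FILE" ]; then
        echo "usage: qcflowr.sh -i <input_dir> -o <output_dir> -r <ref_dir> -g <gtf_file>" >&2
        return 1
    fi
}
```

Inside a script this would be invoked as `parse_args "$@" || exit 1`, which stops early with a usage message when a flag is missing.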
# Prepare your directories
mkdir -p results
mkdir -p reference
# Run the workflow
./bin/qcflowr.sh \
-i ./fastq_samples \
-o ./results \
-r ./reference \
-g ./reference/annotation.gtf

fastq_samples/
├── sample1.fastq
├── sample2.fastq
└── sample3.fastq
- FASTQ files should have the .fastq extension
- Raw, uncompressed FASTQ format
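
If your reads are compressed or use the `.fq` extension, they can be staged into a directory that meets these requirements first. The helper below is a sketch; `stage_fastq` is a hypothetical name and not part of QCFlowR. It leaves the original files untouched and writes plain `.fastq` copies.

```bash
# Copy or decompress every FASTQ-like file in <src_dir> into <dest_dir>
# so that each result ends in a plain ".fastq" extension.
# Usage: stage_fastq <src_dir> <dest_dir>
stage_fastq() {
    src=$1; dest=$2
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        base=$(basename "$f")
        case "$base" in
            *.fastq.gz) gzip -dc "$f" > "$dest/${base%.gz}" ;;
            *.fq.gz)    gzip -dc "$f" > "$dest/${base%.fq.gz}.fastq" ;;
            *.fq)       cp "$f" "$dest/${base%.fq}.fastq" ;;
            *.fastq)    cp "$f" "$dest/$base" ;;
        esac
    done
}
```

For example, `stage_fastq ./raw_reads ./fastq_samples` would prepare a directory that can be passed to `-i`.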
reference/
├── genome.fa # or genome.fasta
└── annotation.gtf        # (location is optional; pass its path with -g)
- Genome FASTA file (supports *.fasta or *.fa extensions)
- HISAT2 index will be auto-generated in reference/genome_index/
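
The build-or-reuse behaviour for the index can be pictured with the sketch below. It assumes an index prefix of `genome_index/genome` inside the reference directory; that prefix and the function names `index_exists` and `ensure_index` are assumptions for illustration, so check the script for the names it actually uses.

```bash
# Succeeds if a HISAT2 index with the given prefix already exists.
# hisat2-build writes <prefix>.1.ht2 ... <prefix>.8.ht2
# (or .ht2l files for very large genomes).
index_exists() {
    [ -f "$1.1.ht2" ] || [ -f "$1.1.ht2l" ]
}

# Build the index only when it is missing.
# Usage: ensure_index <ref_dir>; prints the index prefix on success.
ensure_index() {
    ref_dir=$1
    # Pick the first *.fa or *.fasta file in the reference directory
    fasta=$(ls "$ref_dir"/*.fa "$ref_dir"/*.fasta 2>/dev/null | head -n 1)
    [ -n "$fasta" ] || { echo "no .fa/.fasta file in $ref_dir" >&2; return 1; }
    prefix="$ref_dir/genome_index/genome"   # assumed prefix
    if ! index_exists "$prefix"; then
        mkdir -p "$ref_dir/genome_index"
        hisat2-build "$fasta" "$prefix" >&2 || return 1
    fi
    echo "$prefix"
}
```

Because the existence check comes first, a second run against the same reference directory skips the build entirely.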
The workflow generates the following output files in your output directory:
| File | Description |
|---|---|
| pre_trimed_* | FastQC reports for raw FASTQ files |
| *_trimmed.fastq | Trimmed FASTQ files after FASTP |
| post_trimed_* | FastQC reports for trimmed FASTQ files |
| *_sorted.bam | Aligned BAM files (sorted by position) |
| *_sorted.bam.bai | BAM index files |
| *_hisat2.log | HISAT2 alignment logs |
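
As a quick check on the alignment step, the overall alignment rate can be pulled out of each log. The sketch below assumes the `*_hisat2.log` files contain HISAT2's standard summary, whose last line reads `NN.NN% overall alignment rate`; the function name `alignment_rates` is hypothetical.

```bash
# Print "<sample><TAB><overall alignment rate>" for each *_hisat2.log
# found in the given output directory.
alignment_rates() {
    for log in "$1"/*_hisat2.log; do
        [ -f "$log" ] || continue
        sample=$(basename "$log" _hisat2.log)
        # The first field of the summary line is the percentage
        rate=$(awk '/overall alignment rate/ {print $1}' "$log")
        printf '%s\t%s\n' "$sample" "$rate"
    done
}
```

Running `alignment_rates ./results` gives one line per sample, which makes unusually low rates easy to spot.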
Raw FASTQ Files
↓
[FastQC] ← Pre-trimming QC
↓
[FASTP] ← Trimming
↓
[FastQC] ← Post-trimming QC
↓
[HISAT2] ← Alignment to Reference
↓
[featureCounts] ← Gene Quantification
↓
Count Matrix
- OS: Linux or macOS
- Memory: Minimum 4GB RAM (8GB+ recommended for large files)
- Disk Space: Sufficient space for input FASTQ + output BAM files (typically 2-5x input size)
- Bash: Version 4.0+
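
The 2-5x disk space rule of thumb can be turned into a rough estimate before a run. This is a small sketch; `space_range_mb` is a hypothetical helper that takes the input size in kilobytes and prints the low and high estimates in megabytes.

```bash
# Print "<low_mb> <high_mb>": 2x and 5x the given input size.
# Usage: space_range_mb <input_size_in_kb>
space_range_mb() {
    input_kb=$1
    echo "$(( input_kb * 2 / 1024 )) $(( input_kb * 5 / 1024 ))"
}

# Example (not run here): feed it the size of the input directory
#   space_range_mb "$(du -sk ./fastq_samples | cut -f1)"
```

Comparing the result with the free space reported by `df -m` for the output directory gives an early warning before a long alignment fails on a full disk.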
Solution: Ensure all tools are installed and in your PATH. If using conda:
conda activate your_env_name
./bin/qcflowr.sh [options]

Solution: Verify that the input directory contains files with the .fastq extension (not .fq or compressed .gz)
Solution: Ensure reference directory contains a file with .fa or .fasta extension
Solution: Ensure reference FASTA is valid and you have write permissions to the reference directory
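
Most of the problems above can be caught before a run with a short pre-flight check. The helpers below are a sketch; `missing_tools` and `has_file_with_ext` are hypothetical names and are not part of QCFlowR.

```bash
# Print the names of any required tools that are not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeeds if the directory holds at least one file with any given suffix.
# Usage: has_file_with_ext <dir> <suffix> [<suffix> ...]
has_file_with_ext() {
    dir=$1; shift
    for ext in "$@"; do
        for f in "$dir"/*"$ext"; do
            [ -f "$f" ] && return 0
        done
    done
    return 1
}

# Example checks (not run here):
#   missing_tools fastqc fastp hisat2 samtools featureCounts Rscript
#   has_file_with_ext ./fastq_samples .fastq || echo "no .fastq files found"
#   has_file_with_ext ./reference .fa .fasta || echo "no reference FASTA found"
#   [ -w ./reference ] || echo "reference directory is not writable"
```

An empty result from `missing_tools` means every tool was found; any name it prints needs to be installed or added to PATH.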
- The workflow creates intermediate files and directories automatically
- BAM alignment uses 8 parallel threads (configurable in source code)
- Quality control output is summarized using R scripts located in the R/ directory
- FastQC HTML reports are cleaned up after processing to save space