A Nextflow pipeline for long-read phasing in trio datasets, leveraging multiple k-mers and a random forest classifier.
This pipeline is containerized and integrates the following third-party software components:
- KMC: An ultra-fast k-mer counter (GitHub)
- FastKM: An efficient k-mer matcher (GitHub)
- Seqtk: Tools for fastq/fasta manipulation (GitHub)
- R-base: Implements the random forest classifier for phasing.
To run RFhap, ensure that Nextflow and Singularity are properly installed. Use the following command:
nextflow run rfhap/rfhap.nf \
--paternal_reads data/reads/hg002/father.txt \
--maternal_reads data/reads/hg002/maternal.txt \
--child_reads data/reads/hg002/child.txt \
--num_cases 400000 \
--outdir hg002hap -profile leftraru
Each input file should list the paths to sequencing reads in the following format:
path
/<path>/trio_data/short-reads/paternal.fwd.fasta.gz
/<path>/trio_data/short-reads/paternal.rev.fasta.gz
path
/<path>/trio_data/short-reads/maternal.fwd.fasta.gz
/<path>/trio_data/short-reads/maternal.rev.fasta.gz
path
/<path>/trio_data/long-reads/kidONT.R94.p1p2.fasta.gz
/<path>/trio_data/long-reads/kidONT.R103.p1p2.fasta.gz
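Read lists in this format can be generated rather than written by hand. The following is a minimal sketch, not part of RFhap: the helper name make_read_list is hypothetical, and it assumes only what the examples above show, namely a first line containing the word path followed by one read file per line.

```shell
# Hypothetical helper (not shipped with RFhap): write a read-list file.
# Usage: make_read_list OUTPUT_FILE READS_1 [READS_2 ...]
make_read_list() {
    out="$1"
    shift
    {
        # Header line, as in the example lists above
        printf 'path\n'
        # One read file per line, in the order given
        printf '%s\n' "$@"
    } > "$out"
}
```

For example, calling make_read_list with child.txt and the two long-read files of the child would write a three-line list that can be passed to --child_reads.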
- --kmers specifies the k-mer sizes used for counting/matching.
- --num_cases specifies the size of the training dataset [def: 150000, max: 400000].
- --outdir specifies the output directory.
RFhap produces the following outputs under --outdir:
- ukmers_per_haplotype: unique k-mers for each parent, stored in uk24/uk27/uk30/uk33 hapA_only_kmers.txt and hapB_only_kmers.txt, respectively.
- fastkm: directory containing matrix results of matching paternal and maternal k-mers on the long-reads ("*_fastkm_kbz.txt"), for each long-read provided for the child.
- rf: Random forest models and classification of reads into haplotypes. Three subdirectories are created: training (train), prediction (prediction), and haplotype read classification (haplotypes).
- hap_fastq: FASTQ files for each long-read set of the child, split into *.hapA.fq.gz (haplotype A), *.hapB.fq.gz (haplotype B), and *.hapU.fq.gz (unknown or common haplotype).
- pipeline_info: report of execution time and memory consumption for each step.
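A quick sanity check after a run is to count how many reads landed in each partition under hap_fastq. The sketch below is an illustration only: the helper name count_hap_reads is hypothetical, and it assumes the files are gzipped FASTQ with standard four-line records.

```shell
# Hypothetical helper (not shipped with RFhap): count reads per haplotype file.
# Usage: count_hap_reads <outdir>/hap_fastq
count_hap_reads() {
    dir="$1"
    for fq in "$dir"/*.hap[ABU].fq.gz; do
        # Skip the literal pattern when the glob matches nothing
        [ -e "$fq" ] || continue
        # Four lines per FASTQ record (assumed, see above)
        lines=$(gzip -dc "$fq" | wc -l | tr -d ' ')
        printf '%s\t%d\n' "$(basename "$fq")" "$((lines / 4))"
    done
}
```

Each output line holds a file name and its read count, separated by a tab. A very large hapU count relative to hapA and hapB may be worth inspecting before assembly.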
After long-read haplotype partitioning, each read partition can be assembled in haploid mode with hifiasm. The recommended options for hifiasm (version 0.24.0-r702) are the following:
ULA= reads haplotype A
ULB= reads haplotype B
ULU= reads haplotype U # unknown haplotype
CPU=44 # number of CPUs to use
hifiasm -o asm.hifiasm.hapA.ont -l 3 --ont ${ULA} ${ULU} -t ${CPU}
hifiasm -o asm.hifiasm.hapB.ont -l 3 --ont ${ULB} ${ULU} -t ${CPU}
- Damariz Gonzalez
- Alex Di Genova