Random Forest for the Classification of Paternal and Maternal Haplotypes from Long-Read Sequencing Data.

GabrielCabas/rfhap

 
 


RFhap

A Nextflow pipeline for long-read phasing in trio datasets, leveraging multiple k-mers and a random forest classifier.

This pipeline is containerized and integrates the following third-party software components:

  1. KMC: An ultra-fast k-mer counter (GitHub)
  2. FastKM: An efficient k-mer matcher (GitHub)
  3. Seqtk: Tools for fastq/fasta manipulation (GitHub)
  4. R-base: Implements the random forest classifier for phasing.
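As an illustration of the idea behind the pipeline, the following minimal Python sketch shows how parent-specific k-mers at several k sizes can be turned into per-read counts of the kind a classifier could use as features. It is a toy under stated assumptions, not RFhap's implementation: the pipeline delegates counting to KMC and matching to FastKM, and the function names, the tiny sequences, and the k values used here are hypothetical. Reverse complements are ignored for brevity.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(
    paternal: Iterable[str], maternal: Iterable[str], k: int
) -> Tuple[Set[str], Set[str]]:
    """Return (paternal-only, maternal-only) k-mer sets."""
    pat = set().union(*(kmers(s, k) for s in paternal))
    mat = set().union(*(kmers(s, k) for s in maternal))
    return pat - mat, mat - pat


def read_features(
    read: str,
    paternal: Sequence[str],
    maternal: Sequence[str],
    ks: Sequence[int],
) -> Dict[str, int]:
    """Count parent-specific k-mer hits in a read for each k size.

    Illustrative only: the feature names are hypothetical and do not
    reflect the columns of RFhap's actual matrices.
    """
    features: Dict[str, int] = {}
    for k in ks:
        pat_only, mat_only = parent_specific_kmers(paternal, maternal, k)
        read_kmers = kmers(read, k)
        features[f"pat_k{k}"] = len(read_kmers & pat_only)
        features[f"mat_k{k}"] = len(read_kmers & mat_only)
    return features
```

A read that carries many paternal-only k-mers and few maternal-only ones, consistently across k sizes, is a natural candidate for the paternal haplotype; reads with few hits on either side are the ones that are hard to assign.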

Running RFhap

To run RFhap, ensure that Nextflow and Singularity are properly installed. Use the following command:

nextflow run rfhap/rfhap.nf \
           --paternal_reads data/reads/hg002/father.txt \
           --maternal_reads data/reads/hg002/maternal.txt \
           --child_reads data/reads/hg002/child.txt \
           --num_cases 400000 \
           --outdir hg002hap -profile leftraru

Input File Format

Each input file should list the paths to sequencing reads in the following format:

paternal.txt (Short reads from the father)

path
/<path>/trio_data/short-reads/paternal.fwd.fasta.gz
/<path>/trio_data/short-reads/paternal.rev.fasta.gz

maternal.txt (Short reads from the mother)

path
/<path>/trio_data/short-reads/maternal.fwd.fasta.gz
/<path>/trio_data/short-reads/maternal.rev.fasta.gz

child.txt (Long reads from the child, any sequencing technology)

path
/<path>/trio_data/long-reads/kidONT.R94.p1p2.fasta.gz
/<path>/trio_data/long-reads/kidONT.R103.p1p2.fasta.gz
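The list files shown above can also be generated and checked programmatically. The sketch below is a minimal Python helper that assumes the format is exactly what the examples show: a first line containing the word path, followed by one read file per line. The function names are hypothetical and are not part of RFhap.

```python
from pathlib import Path
from typing import List, Sequence


def write_read_list(paths: Sequence[str], list_file: str) -> None:
    """Write a read-list file: a 'path' header, then one path per line."""
    lines = ["path", *paths]
    Path(list_file).write_text("\n".join(lines) + "\n")


def read_read_list(list_file: str) -> List[str]:
    """Read a read-list file back, checking the 'path' header."""
    lines = [
        line.strip()
        for line in Path(list_file).read_text().splitlines()
        if line.strip()
    ]
    if not lines or lines[0] != "path":
        raise ValueError(f"{list_file}: first line must be 'path'")
    return lines[1:]
```

Reading the file back before launching the pipeline is a cheap way to catch a missing header or an empty list.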

Additional Options

  • --kmers specifies the k-mer sizes used for counting/matching.
  • --num_cases specifies the size of the training dataset (default: 150000; maximum: 400000).
  • --outdir specifies the output directory.
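When RFhap is launched from a script, the options above can be assembled and checked before Nextflow starts. The following Python sketch builds the argument list for the command shown earlier and rejects a --num_cases value above the documented maximum. The function name and the lower bound of 1 are assumptions made for illustration; --kmers is left out because its value format is not described here.

```python
from typing import List, Optional

DEFAULT_NUM_CASES = 150_000  # documented default
MAX_NUM_CASES = 400_000      # documented maximum


def build_rfhap_command(
    paternal_reads: str,
    maternal_reads: str,
    child_reads: str,
    outdir: str,
    num_cases: int = DEFAULT_NUM_CASES,
    profile: Optional[str] = None,
    script: str = "rfhap/rfhap.nf",
) -> List[str]:
    """Return the nextflow argument list for an RFhap run."""
    # Lower bound of 1 is an assumption; only the maximum is documented.
    if not 1 <= num_cases <= MAX_NUM_CASES:
        raise ValueError(
            f"num_cases must be between 1 and {MAX_NUM_CASES}, got {num_cases}"
        )
    cmd = [
        "nextflow", "run", script,
        "--paternal_reads", paternal_reads,
        "--maternal_reads", maternal_reads,
        "--child_reads", child_reads,
        "--num_cases", str(num_cases),
        "--outdir", outdir,
    ]
    if profile is not None:
        cmd += ["-profile", profile]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True); keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.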

Output Directory

RFhap produces the following outputs under --outdir:

  1. ukmers_per_haplotype: unique k-mers for each parent, stored under uk24/uk27/uk30/uk33 in hapA_only_kmers.txt and hapB_only_kmers.txt, respectively.
  2. fastkm: directory containing the matrices that result from matching paternal and maternal k-mers against the long reads ("*_fastkm_kbz.txt"), one for each long-read file provided for the child.
  3. rf: random forest models and the classification of reads into haplotypes. Three subdirectories are created: training (train), prediction (prediction), and haplotype read classification (haplotypes).
  4. hap_fastq: FASTQ files for each long-read set of the child, split into *.hapA.fq.gz (haplotype A), *.hapB.fq.gz (haplotype B), and *.hapU.fq.gz (haplotype U: unknown or common).
  5. pipeline_info: report of the execution time and memory consumption of each step.
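For downstream steps it is often convenient to collect the hap_fastq files by haplotype label. The Python sketch below groups file names using the *.hapA.fq.gz, *.hapB.fq.gz and *.hapU.fq.gz suffixes listed above. The function name is hypothetical, and the sketch assumes only that the suffixes are exactly as documented.

```python
import re
from typing import Dict, Iterable, List

# Matches the documented suffixes .hapA.fq.gz, .hapB.fq.gz and .hapU.fq.gz
HAP_SUFFIX = re.compile(r"\.hap([ABU])\.fq\.gz$")


def group_by_haplotype(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group FASTQ file names by haplotype label (A, B or U)."""
    groups: Dict[str, List[str]] = {"A": [], "B": [], "U": []}
    for name in filenames:
        match = HAP_SUFFIX.search(name)
        if match:  # files without a haplotype suffix are skipped
            groups[match.group(1)].append(name)
    return groups
```

In practice the file names could come from something like sorted(str(p) for p in Path(outdir, "hap_fastq").glob("*.fq.gz")), where outdir is the directory given to --outdir.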

Haplotype assembly

After the long reads are partitioned by haplotype, each partition can be assembled in haploid mode with hifiasm. The recommended options for hifiasm (version 0.24.0-r702) are the following:

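ULA="/<path>/hap_fastq/<name>.hapA.fq.gz" # reads assigned to haplotype A
ULB="/<path>/hap_fastq/<name>.hapB.fq.gz" # reads assigned to haplotype B
ULU="/<path>/hap_fastq/<name>.hapU.fq.gz" # reads of unknown haplotype
CPU=44 # number of CPUs to use
hifiasm -o asm.hifiasm.hapA.ont -l 3 --ont ${ULA} ${ULU} -t ${CPU}
hifiasm -o asm.hifiasm.hapB.ont -l 3 --ont ${ULB} ${ULU} -t ${CPU}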
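Each of the two commands above pairs one haplotype-specific read set with the reads of unknown haplotype, so those shared reads contribute to both assemblies. A minimal Python sketch of the same pairing follows. It only reproduces the options shown above, and the function name is hypothetical.

```python
from typing import List


def hifiasm_command(
    hap: str, hap_reads: str, unknown_reads: str, cpus: int = 44
) -> List[str]:
    """Build the hifiasm argument list for one haplotype (A or B)."""
    if hap not in ("A", "B"):
        raise ValueError("hap must be 'A' or 'B'")
    return [
        "hifiasm",
        "-o", f"asm.hifiasm.hap{hap}.ont",
        "-l", "3",
        "--ont",
        hap_reads,       # reads assigned to this haplotype
        unknown_reads,   # reads of unknown haplotype, shared by both runs
        "-t", str(cpus),
    ]
```

Each list can be run with subprocess.run(cmd, check=True), one call per haplotype.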

Developers

  1. Damariz Gonzalez
  2. Alex Di Genova
