HapFold: Efficient and Accurate T2T-level Haplotype Reconstruction

Description

HapFold is the first hybrid scaffolding framework that combines the complementary strengths of graph-based and sequence-based approaches to achieve chromosome-scale, near-T2T haplotype reconstruction of diploid genomes.

By integrating the topological accuracy of assembly graphs with proximity-guided sequence contiguity, HapFold eliminates the structural errors and chromosomal misassignments common in traditional pipelines. Notably, HapFold accelerates computation by an order of magnitude while delivering superior assembly quality. Even when working with standard Oxford Nanopore Technologies (ONT) simplex reads, it enables the reconstruction of a greater number of near-T2T assemblies, providing a robust and scalable solution for high-fidelity diploid genome assembly.

Installation and Dependencies

Prerequisites

g++ (version 9.4 or later)
zlib

1. Install via Bioconda (Recommended)

HapFold is officially available on Bioconda. This is the fastest and easiest way to install the tool:

# Create and activate a new environment
conda create -n hapfold
conda activate hapfold

# Install HapFold
conda install -c bioconda hapfold

# Alternatively, you can use mamba for faster dependency resolution:
# mamba install -c bioconda hapfold

2. Install from Source Code

If you prefer to compile from source:

git clone https://github.com/LuoGroup2023/HapFold.git
cd HapFold

# Compile the source code
make -j8

🚀 Quick Start & Workflow

HapFold uses a two-step workflow: mapping (`HapFold mapping`) followed by haplotype resolution (`HapFold scaffolding`).

🎯 Primary Application

HapFold is primarily designed for diploid genome scaffolding using Hi-C data. It seamlessly integrates with hifiasm outputs and requires three primary GFA files from the initial assembly:

Unphased unitig graph (*.p_utg.gfa)
Haplotype 1 contig graph (*.hap1.p_ctg.gfa)
Haplotype 2 contig graph (*.hap2.p_ctg.gfa)
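
Before starting, it can help to confirm that all three graphs are present. The following is a minimal sketch, not part of HapFold: the function name `check_hifiasm_inputs` and the idea of passing a shared output prefix are assumptions for illustration. hifiasm may insert an extra tag between the prefix and the suffix depending on how it was run, so the check uses a glob rather than an exact file name.

```shell
# Hypothetical helper (not part of HapFold): check that a hifiasm output
# prefix has the three GFA files listed above.
check_hifiasm_inputs() {
    prefix="$1"
    missing=0
    for suffix in p_utg.gfa hap1.p_ctg.gfa hap2.p_ctg.gfa; do
        found=0
        # The glob stays literal when nothing matches, so test each candidate.
        for f in "$prefix"*."$suffix"; do
            if [ -e "$f" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: ${prefix}*.${suffix}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_hifiasm_inputs /path/to/asm` returns a non-zero status and names each missing file pattern on standard error.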

General Usage

Usage: HapFold <command> <arguments> <inputs>

Commands:
  scaffolding            use Hi-C/Pore-C data to resolve haplotypes
  mapping                map Hi-C/Pore-C data to sequences in the graph
  version                print version number

Step 1: Hi-C Mapping (`mapping`)

Before mapping, you need to extract the node sequences from your hifiasm unitig graph into a FASTA file:

awk '/^S/{print ">"$2;print $3}' hifiasm_p_utg.gfa > hifiasm_p_utg.fa
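
As an optional sanity check on the extraction, the FASTA should contain exactly one record per segment (`S`) line in the GFA. The sketch below is an illustration only; the function name `check_fasta_matches_gfa` is hypothetical and not part of HapFold.

```shell
# Hypothetical check (not part of HapFold): compare the number of segment
# lines in the GFA with the number of FASTA headers.
check_fasta_matches_gfa() {
    gfa="$1"
    fasta="$2"
    # n+0 forces a numeric 0 when no line matches.
    n_segments=$(awk '/^S/{n++} END{print n+0}' "$gfa")
    n_records=$(awk '/^>/{n++} END{print n+0}' "$fasta")
    if [ "$n_segments" -ne "$n_records" ]; then
        echo "mismatch: $n_segments segments in GFA, $n_records records in FASTA" >&2
        return 1
    fi
    if [ "$n_records" -eq 0 ]; then
        echo "no sequences found in $fasta" >&2
        return 1
    fi
}
```

With the file names used in this README, the call would be `check_fasta_matches_gfa hifiasm_p_utg.gfa hifiasm_p_utg.fa`.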

Then, map the raw Hi-C reads to these node sequences:

HapFold mapping -t 32 -1 hic.R1.fastq.gz -2 hic.R2.fastq.gz -o mapping.txt hifiasm_p_utg.fa

Key Options for mapping:

-1 FILE, -2 FILE: (Required) Paths to the Hi-C forward (R1) and reverse (R2) reads.
-t INT: Number of worker threads [32].
-o FILE: Output file for the mapping results (e.g., mapping.txt).
-k INT: k-mer size [31].

Step 2: Haplotype Resolution (`scaffolding`)

Once the mapping is complete, use the mapping results alongside the GFA files to resolve haplotypes and build chromosome-scale scaffolds.

HapFold scaffolding -t 32 -n 78 -u utg_ctg_mappings.csv -i true -1 hap1.p_ctg.gfa -2 hap2.p_ctg.gfa mapping.txt hifiasm_p_utg.gfa output_dir

(Positional arguments: <mapping_result> <unitig.gfa> <output_directory>)
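
The two steps can be chained so that scaffolding only runs when mapping succeeds. The wrapper below is a sketch under stated assumptions, not part of HapFold: the names `run` and `hapfold_two_step`, the `DRY_RUN`, `THREADS`, and `MAPPING_OUT` variables, and the argument order are all hypothetical. It uses only the flags shown in this README and leaves out the optional `-u` and `-i` options.

```shell
# Hypothetical wrapper (not part of HapFold). With DRY_RUN=1 the commands
# are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

hapfold_two_step() {
    utg_gfa="$1"   # unitig graph (*.p_utg.gfa)
    utg_fa="$2"    # node sequences extracted from the unitig graph
    hap1="$3"      # haplotype 1 contig graph
    hap2="$4"      # haplotype 2 contig graph
    r1="$5"        # Hi-C forward reads
    r2="$6"        # Hi-C reverse reads
    n_chr="$7"     # expected number of chromosomes
    out_dir="$8"   # output directory
    threads="${THREADS:-32}"
    map_out="${MAPPING_OUT:-mapping.txt}"

    # Step 1: mapping. Step 2 runs only if step 1 exits successfully.
    run HapFold mapping -t "$threads" -1 "$r1" -2 "$r2" -o "$map_out" "$utg_fa" &&
    run HapFold scaffolding -t "$threads" -n "$n_chr" -1 "$hap1" -2 "$hap2" \
        "$map_out" "$utg_gfa" "$out_dir"
}
```

Running it once with `DRY_RUN=1` shows the exact commands before committing compute time to them.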

Key Options for scaffolding:

| Option | Description |
| --- | --- |
| `-t INT` | Number of threads [8]. |
| `-n INT` | Expected number of chromosomes (e.g., `46` for human, `78` for chicken) [0]. |
| `-1 FILE` | (Required) Path to the haplotype 1 GFA file (`*.hap1.p_ctg.gfa`). |
| `-2 FILE` | (Required) Path to the haplotype 2 GFA file (`*.hap2.p_ctg.gfa`). |
| `-u FILE` | Path to the `utg_to_ctg` relationship file. Highly recommended for accurate graph traversal. |
| `-i BOOL` | Enable the identity check on contigs (`true`/`false`) [false]. |
| `-f FILE` | Path to a precomputed identity file; if omitted while `-i true` is set, the identity check runs automatically. |
| `-e STR` | Comma-separated restriction enzyme sequences (e.g., `GATC,GANTC`) [none]. |
| `-c FILE` | Path to `contig_hap_nodes.txt` (required for specific Hi-C phasing modes). |
| `-p` | Enable plant mode (uses alternative phasing algorithms optimized for complex genomes). |
| `-d`, `--debug` | Enable debug mode, which runs internal test functions. |
| `--hic_scaffold_threshold_ratio FLOAT` | Threshold ratio for sequence-based Hi-C scaffolding extensions [0.60]. |

Citation

If you use HapFold in your research, please cite:
