ATMPI: ASV-to-MAG Integration Pipeline

1. Biological & Computational Problem

The Problem: In microbiome research, we often have two powerful but disconnected datasets:

16S rRNA Amplicon (ASVs): High-resolution taxonomic fingerprints that tell us who is there, but not what they are doing.
Metagenomics (MAGs): Draft genomes that tell us the functional potential (what they can do), but often lack 16S markers because rRNA genes are highly conserved and frequently multi-copy, which makes them difficult to assemble and bin correctly.

The Solution: ATMPI bridges this gap by systematically matching ASVs to MAGs. It solves the "missing 16S" problem in metagenomics by using a dual-search strategy: finding matches in predicted 16S genes AND searching the entire genomic content of the MAGs.

2. Quick Start: How to Run

Step 1: Environment Setup

Ensure you have Conda or Mamba installed. Activate the project-specific environment:

# Recommended: Using mamba for speed
mamba activate env-atmpi

# Alternatively: Using conda
conda activate env-atmpi

Step 2: Prepare Input Tables

The pipeline requires two tab-separated files: asv_table.tsv and mag_table.tsv.

asv_table.tsv format:

asv_id	asv_seq
ASV_1	/path/to/ASV_1.fasta
ASV_2	/path/to/ASV_2.fasta

mag_table.tsv format:

mag_id	mag_seq
MAG_A	/path/to/MAG_A.fa.gz
MAG_B	/path/to/MAG_B.fa.gz

Note: MAG files can be gzipped (.gz) or plain text.
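
Before launching the pipeline, it can help to check both tables by hand. The sketch below is a stand-alone check, not the pipeline's own `validate_inputs` rule. It assumes the column names shown above; the helper name `check_table` is hypothetical.

```python
import csv
import os
import tempfile


def check_table(table_path, id_col, seq_col):
    """Return a list of problems found in a tab-separated input table."""
    problems = []
    with open(table_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        for col in (id_col, seq_col):
            if col not in header:
                problems.append(f"missing column: {col}")
        if problems:
            return problems
        seen = set()
        for row in reader:
            if row[id_col] in seen:
                problems.append(f"duplicate id: {row[id_col]}")
            seen.add(row[id_col])
            # Short rows yield None for missing fields; treat that as no path.
            seq_path = row[seq_col] or ""
            if not os.path.isfile(seq_path):
                problems.append(f"file not found: {seq_path}")
    return problems


# Demo on a throwaway table: one existing FASTA and one missing path.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "ASV_1.fasta")
    missing = os.path.join(tmp, "ASV_2.fasta")
    with open(fasta, "w") as fh:
        fh.write(">ASV_1\nACGT\n")
    table = os.path.join(tmp, "asv_table.tsv")
    with open(table, "w") as fh:
        fh.write("asv_id\tasv_seq\n")
        fh.write(f"ASV_1\t{fasta}\n")
        fh.write(f"ASV_2\t{missing}\n")
    demo_problems = check_table(table, "asv_id", "asv_seq")
print(demo_problems)
```

On real data, the equivalent calls would be `check_table("asv_table.tsv", "asv_id", "asv_seq")` and `check_table("mag_table.tsv", "mag_id", "mag_seq")`. An empty list means no problems were found.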

Step 3: Configuration

Review config.yaml to ensure the column names match your TSV files and adjust thresholds if necessary.

Step 4: Execute the Pipeline

Run the pipeline using Snakemake. It is recommended to perform a dry-run first.

# 1. Dry-run (checks logic and file paths without running)
snakemake -n --cores 16 --configfile config.yaml

# 2. Real execution (using 16 cores)
snakemake --cores 16 --configfile config.yaml

# 3. Force rerun of a specific part (e.g., if you changed filtering logic)
snakemake --cores 16 --configfile config.yaml -R process_blast_results

3. Technical Workflow (Rule-by-Rule)

| Step | Rule | Tool | Action |
|------|------|------|--------|
| 0 | `validate_inputs` | `Python` | Ensures your ASV/MAG tables are formatted correctly and all sequence files exist. |
| 1 | `extract_16s_from_mags` | `Barrnap` | Scans MAGs using HMMs to find ribosomal RNA genes (5S, 16S, 23S). |
| 2 | `build_mag_16s_db` | `makeblastdb` | Indexes the predicted 16S genes for high-speed searching. |
| 3 | `blast_asv_to_mag_16s` | `blastn` | Matches ASVs against the predicted rRNA genes (The "Gold Standard" match). |
| 4 | `blast_asv_to_mag_contigs` | `blastn` | Matches ASVs against all contigs in the MAG. This finds 16S sequences that were assembled but not "recognized" by gene predictors. |
| 5 | `process_blast_results` | `Pandas` | The Brain: Merges both search results, prioritizes the best hits, and applies filters (97% identity, 80% ASV coverage). |
| 6 | `filter_target_species` | `Python` | Subsets results for specific taxa of interest defined in your config. |
| 7 | `generate_summary` | `Python` | Produces an HTML dashboard and TSV statistics. |
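
The merge-and-filter logic of step 5 can be illustrated with a small sketch. This is not the pipeline's Pandas code. It assumes that thresholds are inclusive, that the best hit per ASV/MAG pair is the one with the highest identity, and that ties go to the predicted 16S hit. The function name `select_assignments` and the key `asv_coverage` are hypothetical.

```python
def select_assignments(hits, min_identity=97.0, min_asv_coverage=80.0):
    """Filter BLAST hits and keep one best hit per (ASV, MAG) pair.

    Each hit is a dict with keys: asv_id, mag_id, identity, asv_coverage,
    and source ("16s" or "contig").
    """
    source_rank = {"16s": 0, "contig": 1}  # assumption: 16s wins identity ties
    best = {}
    for hit in hits:
        # Assumption: thresholds are inclusive (>=).
        if hit["identity"] < min_identity or hit["asv_coverage"] < min_asv_coverage:
            continue
        key = (hit["asv_id"], hit["mag_id"])
        # Lower score is better: higher identity first, then preferred source.
        score = (-hit["identity"], source_rank[hit["source"]])
        if key not in best or score < best[key][0]:
            best[key] = (score, hit)
    return [entry[1] for entry in best.values()]


hits = [
    {"asv_id": "ASV_1", "mag_id": "MAG_A", "identity": 99.2,
     "asv_coverage": 100.0, "source": "contig"},
    {"asv_id": "ASV_1", "mag_id": "MAG_A", "identity": 99.2,
     "asv_coverage": 100.0, "source": "16s"},
    {"asv_id": "ASV_2", "mag_id": "MAG_B", "identity": 95.0,
     "asv_coverage": 100.0, "source": "16s"},      # fails the identity filter
    {"asv_id": "ASV_2", "mag_id": "MAG_A", "identity": 98.0,
     "asv_coverage": 60.0, "source": "contig"},    # fails the coverage filter
]
kept = select_assignments(hits)
print(kept)
```

In this toy input only the ASV_1/MAG_A pair survives, and the 16S hit is kept over the equally good contig hit.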

3. Understanding the Outputs

`assignments/asv_to_mag_assignments.tsv`

This is the most important file. Each row represents a high-confidence link between an ASV and a MAG.

identity: Percent identity of the alignment. Values above 97% are typical for species-level matches, and above 99% for strain-level matches.
source:
- 16s: The ASV matched a predicted rRNA gene.
- contig: The ASV matched a raw genomic fragment (essential for fragmented MAGs).

`assignments/asv_strain_resolution.tsv`

Identifies MAGs that matched multiple different ASVs.

Meaning: This suggests the MAG contains multiple 16S operons or represents a population of closely related strains (microdiversity) that were binned together.
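
As a rough illustration of what this file captures, the sketch below groups assignment rows by MAG and reports those linked to two or more distinct ASVs. It assumes rows keyed by `asv_id` and `mag_id`. The function name is hypothetical and the pipeline's own logic may differ.

```python
from collections import defaultdict


def mags_with_multiple_asvs(assignments):
    """Map each MAG matched by two or more distinct ASVs to its sorted ASV list."""
    asvs_per_mag = defaultdict(set)
    for row in assignments:
        asvs_per_mag[row["mag_id"]].add(row["asv_id"])
    return {mag: sorted(asvs) for mag, asvs in asvs_per_mag.items() if len(asvs) > 1}


rows = [
    {"asv_id": "ASV_1", "mag_id": "MAG_A"},
    {"asv_id": "ASV_2", "mag_id": "MAG_A"},
    {"asv_id": "ASV_2", "mag_id": "MAG_B"},
]
multi = mags_with_multiple_asvs(rows)
print(multi)  # {'MAG_A': ['ASV_1', 'ASV_2']}
```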

4. Analysis & Visualization in R

Once the pipeline has finished, you can load the results into R to link taxonomy to function.

Loading and Plotting Identity

library(ggplot2)
library(readr)

# Load assignments
mapping <- read_tsv("assignments/asv_to_mag_assignments.tsv")

# Plot the distribution of identity scores
ggplot(mapping, aes(x=identity, fill=source)) +
  geom_histogram(binwidth=0.2, alpha=0.8) +
  theme_minimal() +
  labs(title="ASV-MAG Match Quality", x="Identity (%)", y="Count")

Linking to Abundance

If you have an ASV abundance table (asv_counts.tsv), you can see how much of your community is captured in your MAGs:

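library(dplyr)  # provides inner_join(), group_by(), summarise(), and %>%

# Load per-ASV abundances; assumes columns asv_id and RelativeAbundance
counts <- read_tsv("asv_counts.tsv")
merged <- inner_join(mapping, counts, by="asv_id")

# Summarize abundance by MAG
mag_abundance <- merged %>%
  group_by(mag_id) %>%
  summarise(Total_RelAbund = sum(RelativeAbundance))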
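
A per-MAG sum counts an ASV once for every MAG it matched, so it does not directly give the share of the community that is captured. A minimal Python sketch of that overall fraction follows. It assumes a single sample and a plain mapping of ASV to abundance; the function name `captured_fraction` is hypothetical.

```python
def captured_fraction(asv_abundance, assigned_asvs):
    """Return the share of total abundance carried by ASVs linked to any MAG."""
    total = sum(asv_abundance.values())
    if total == 0:
        return 0.0
    # Iterate over a set so an ASV linked to several MAGs is counted once.
    captured = sum(asv_abundance.get(asv, 0.0) for asv in set(assigned_asvs))
    return captured / total


abundance = {"ASV_1": 0.5, "ASV_2": 0.25, "ASV_3": 0.25}
links = ["ASV_1", "ASV_1", "ASV_3"]  # ASV_1 matched two MAGs
fraction = captured_fraction(abundance, links)
print(fraction)  # 0.75
```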

5. Finding Matches for Specific ASVs

If you are interested in a specific ASV (e.g., a biomarker), you can query the results directly:

Via Terminal:

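# Search for ASV_1003 (whole-word match, so ASV_10030 is excluded) and sort by best identity.
# Assumes identity is the third tab-separated column; adjust -k to match your file.
grep -w "ASV_1003" assignments/asv_to_mag_assignments.tsv | sort -t$'\t' -k3,3rn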

Understanding the match:

Check the identity: Is it high enough to trust the assignment?
Check the source: If it's 16s, the match is very robust.
Check coverage: In blast_results/, you can see if the entire ASV length aligned.
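
The same lookup can be done in Python with an exact ID match instead of a text search. This is a sketch that assumes the assignments file has a header row with `asv_id`, `mag_id`, `identity`, and `source` columns. The function name `matches_for_asv` is hypothetical.

```python
import csv
import io


def matches_for_asv(tsv_handle, asv_id):
    """Return assignment rows for one ASV, best identity first."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    rows = [row for row in reader if row["asv_id"] == asv_id]  # exact ID match
    return sorted(rows, key=lambda row: float(row["identity"]), reverse=True)


# Demo on an in-memory table; with real data, pass
# open("assignments/asv_to_mag_assignments.tsv", newline="") instead.
demo = io.StringIO(
    "asv_id\tmag_id\tidentity\tsource\n"
    "ASV_1003\tMAG_A\t98.4\tcontig\n"
    "ASV_10030\tMAG_B\t99.9\t16s\n"
    "ASV_1003\tMAG_C\t99.6\t16s\n"
)
found = matches_for_asv(demo, "ASV_1003")
print([row["mag_id"] for row in found])  # ['MAG_C', 'MAG_A']
```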

6. Configuration

Edit config.yaml to adjust:

identity_threshold: Default 97. Increase to 99 for strain-level matching.
target_species: List species you want to specifically track.
threads: Number of CPU cores to use.
