This repository contains the datasets used in the following paper:
- Y. Tabatabaee, C. Zhang, T. Warnow, S. Mirarab, Phylogenomic branch length estimation using quartets, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i185–i193, https://doi.org/10.1093/bioinformatics/btad221
The experiments in this study use a collection of simulated and biological datasets with incomplete lineage sorting (ILS). We generated a new quartet dataset and regenerated species trees with substitution-unit branch lengths for previously published datasets from Zhang et al. (2018) and Mai et al. (2017). We also analyzed the mammalian biological dataset from Song et al. (2012).
All datasets are available on Dryad at https://doi.org/10.5061/dryad.pg4f4qs3q.
Quartet simulations
This dataset has six model conditions that differ in the level of ILS (controlled by population size) and in the rate heterogeneity multipliers, each with 200 replicates. The model conditions start from a strict molecular clock with no rate variation (Homogeneous, no_variation) and become successively more complex. Next, we add rate variation across species tree branches only (-hs option, only_hs), creating a model (Sp) akin to MSC + Substitution mentioned in the paper. We then create models that have rate variation only across genes but not species (Loc, using -hl, only_hl) and both across species and across genes (Sp, Loc, using -hs -hl, hs_hl). Finally, we add rate variation specific to each branch of each gene tree (Sp, Loc, Sp/Loc, using -hs -hl -hg, hs_hl_hg), which creates heterotachy. The final condition (hs_hl_hg_highr) has similar rate variation to hs_hl_hg but with a smaller population size and therefore higher level of ILS. The raw dataset is available in quartet_simulations.zip and includes species trees, true gene trees and other SimPhy outputs. Results and intermediate data from the experiments in the paper are available in quartets_results.tar.gz.
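The six conditions and the SimPhy heterogeneity options named above can be summarized in a small lookup table. This is only a sketch: the names `CONDITION_FLAGS` and `conditions_with` are hypothetical, and the option strings are copied from the description above rather than from the full SimPhy command lines, which are not reproduced in this README.

```python
# Hypothetical lookup table: condition label -> SimPhy rate heterogeneity
# options, exactly as listed in the description above.
CONDITION_FLAGS = {
    "no_variation": (),
    "only_hs": ("-hs",),
    "only_hl": ("-hl",),
    "hs_hl": ("-hs", "-hl"),
    "hs_hl_hg": ("-hs", "-hl", "-hg"),
    # Same options as hs_hl_hg; this condition differs in population size.
    "hs_hl_hg_highr": ("-hs", "-hl", "-hg"),
}


def conditions_with(flag):
    """Return the condition labels whose option list includes `flag`."""
    return [name for name, flags in CONDITION_FLAGS.items() if flag in flags]


if __name__ == "__main__":
    # Conditions with gene-by-lineage rate variation (heterotachy).
    print(conditions_with("-hg"))
```

A table like this makes it easy to group results by the kind of rate variation present when iterating over the condition directories.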
Below is a description of files in each directory.
- s_tree.trees: true species tree in substitution units
- truegenetrees: true gene trees in substitution units
- ultrametric-genetrees.tre: ultrametric gene trees
- s_tree.ralpha: mutation rates for species tree branches in pre-order traversal
- g_trees_ralpha.txt: mutation rates for gene tree branches in pre-order traversal
- l_trees.trees: locus trees in generation units
- ad.txt: average RF distance between the model species tree and true gene trees
- castles_truegenetrees_s_tree.trees: true species tree furnished with CASTLES SU branch lengths
- patristic_[MODE]_truegenetrees.mat: patristic distance matrix calculated using minimum (min), average (avg) or all (all) pairwise distances
- erable_patristic_all_truegenetrees_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths
- fastme_BalLS_patristic_avg_truegenetrees_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with average distances
- fastme_BalLS_patristic_min_truegenetrees_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum distances
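We read the [MODE] part of the patristic matrix filenames as the way distances from individual gene trees are combined for each pair of taxa. As an illustration only, the sketch below pools per-gene distances with a minimum or an average. The function name `aggregate_patristic`, the dictionary-based input and the toy numbers are all hypothetical; this is not the script used to produce the `.mat` files, and the `all` mode is not covered.

```python
from statistics import mean


def aggregate_patristic(per_gene, mode):
    """Combine per-gene patristic distances into one value per taxon pair.

    per_gene: list of dicts mapping (taxon_a, taxon_b) -> distance,
              one dict per gene tree (hypothetical input layout).
    mode:     "min" or "avg".
    """
    if mode not in ("min", "avg"):
        raise ValueError(f"unsupported mode: {mode!r}")
    combine = min if mode == "min" else mean

    # Pool distances per unordered taxon pair; a gene that lacks one of
    # the two taxa simply contributes nothing for that pair.
    pooled = {}
    for distances in per_gene:
        for pair, value in distances.items():
            pooled.setdefault(frozenset(pair), []).append(value)

    return {pair: combine(values) for pair, values in pooled.items()}


if __name__ == "__main__":
    # Toy distances for two gene trees (made-up numbers).
    genes = [
        {("A", "B"): 0.10, ("A", "C"): 0.30},
        {("B", "A"): 0.20, ("A", "C"): 0.25},
    ]
    for mode in ("min", "avg"):
        print(mode, aggregate_patristic(genes, mode))
```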
30-taxon simulations
This dataset has six model conditions with varying levels of deviation from the molecular clock and inclusion of an outgroup, each with 100 replicates. The model conditions are specified as outgroup.[has-OG].species.[DEV].genes.[DEV] where [has-OG] is 1 when the dataset has an outgroup and 0 otherwise, and [DEV] shows the level of deviation from the clock (parameter α of the gamma distribution) that is set to 5 (low), 1.5 (medium), or 0.15 (high). The original dataset is from Mai et al. (2017) and is available at https://uym2.github.io/MinVar-Rooting/. Species trees with SU branch lengths are available in MVRoot_SU.tar.gz, and results and intermediate data from the experiments in the paper are available in MVRoot_results.tar.xz.
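The naming scheme above can be decoded programmatically when looping over condition directories. The following is a minimal sketch, assuming directory names follow the outgroup.[has-OG].species.[DEV].genes.[DEV] pattern literally; the function `parse_condition` and the example name are hypothetical, and which of the possible combinations actually exist in the archive is not stated here.

```python
import re

# Assumed pattern, taken from the naming scheme described above.
_CONDITION_RE = re.compile(
    r"^outgroup\.([01])\.species\.(\d+(?:\.\d+)?)\.genes\.(\d+(?:\.\d+)?)$"
)

# Gamma alpha value -> deviation label, as stated above.
_DEVIATION = {"5": "low", "1.5": "medium", "0.15": "high"}


def parse_condition(name):
    """Split a model condition name into its outgroup flag and alpha values."""
    match = _CONDITION_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized model condition: {name!r}")
    has_og, species_alpha, genes_alpha = match.groups()
    return {
        "has_outgroup": has_og == "1",
        "species_alpha": float(species_alpha),
        "species_deviation": _DEVIATION.get(species_alpha, "unknown"),
        "genes_alpha": float(genes_alpha),
        "genes_deviation": _DEVIATION.get(genes_alpha, "unknown"),
    }


if __name__ == "__main__":
    # Hypothetical example name built from the pattern above.
    print(parse_condition("outgroup.1.species.1.5.genes.0.15"))
```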
Below is a description of files in each directory.
- s_tree.trees: true species tree in substitution units
- s_tree.ralpha: mutation rates for species tree branches in pre-order traversal
- g_trees_ralpha.txt: mutation rates for gene tree branches in pre-order traversal
- truegenetrees: true gene trees in substitution units
- estimatedgenetre.gtr: estimated gene trees
- ad.txt: average RF distance between the model species tree and true gene trees
- gtee_gtr.txt: average RF distance between true and estimated gene trees
- all-genes.phylip and concat_align.fasta: concatenation of all gene alignments
- castles_truegenetrees_s_tree.trees: true species tree furnished with CASTLES SU branch lengths run on true gene trees
- castles_estimatedgenetre.gtr_s_tree.trees: true species tree furnished with CASTLES SU branch lengths run on estimated gene trees
- erable_patristic_all_truegenetrees_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths on true gene trees
- erable_patristic_all_estimatedgenetre.gtr_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths on estimated gene trees
- fastme_BalLS_patristic_[MODE]_truegenetrees_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum or average distances run on true gene trees
- fastme_BalLS_patristic_[MODE]_estimatedgenetre.gtr_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum or average distances run on estimated gene trees
- patristic_[MODE]_truegenetrees.mat: patristic distance matrix calculated using minimum (min), average (avg) or all (all) pairwise distances for true gene trees
- patristic_[MODE]_estimatedgenetre.gtr.mat: patristic distance matrix calculated using minimum (min), average (avg) or all (all) pairwise distances for estimated gene trees
- RAxML_result.concat_align_s_tree.trees: true species tree furnished with RAxML SU branch lengths
101-taxon simulations
This dataset has four model conditions with varying sequence lengths (1600bp, 800bp, 400bp, 200bp) corresponding to different levels of gene tree estimation error (23%, 31%, 42%, and 55%), each with 50 replicates. The original dataset is from Zhang et al. (2018) and is available at https://gitlab.com/esayyari/ASTRALIII/. Species trees with SU branch lengths are available in ASTRALIII_SU.tar.gz. Results and intermediate data from the experiments in the paper are in ASTRALIII_SU_results.tar.xz. Below is a description of files in each directory.
- s_tree.trees: true species tree in substitution units
- s_tree.ralpha: mutation rates for species tree branches in pre-order traversal
- g_trees_ralpha.txt: mutation rates for gene tree branches in pre-order traversal
- truegenetrees: true gene trees in substitution units
- fasttree_genetrees_[seq-len]_non: gene trees estimated using FastTree2 from alignments with length [seq-len]bp
- erable_patristic_all_fasttree_genetrees_[seq-len]_non_s_tree.trees.derooted.length.nwk: true species tree furnished with ERaBLE SU branch lengths on gene trees estimated from alignments with length [seq-len]bp
- patristic_[MODE]_fasttree_genetrees_[seq-len]_non.mat: patristic distance matrix calculated using minimum (min), average (avg) or all (all) pairwise distances for gene trees estimated from alignments with length [seq-len]bp
- concat_for_fasttree_[seq-len].fasta or all-genes_for_fasttree.phylip: concatenation of all gene alignments with length [seq-len]bp
- RAxML_result.concat_for_fasttree_[seq-len]_s_tree.trees: true species tree furnished with RAxML SU branch lengths from concatenation of alignments with length [seq-len]bp
- castles_fasttree_genetrees_[seq-len]_non_s_tree.trees: true species tree furnished with CASTLES SU branch lengths run on gene trees estimated from alignments with length [seq-len]bp
- fastme_BalLS_patristic_[MODE]_fasttree_genetrees_[seq-len]_non_s_tree.trees.derooted: true species tree furnished with FastME SU branch lengths with minimum or average distances run on gene trees estimated from alignments with length [seq-len]bp
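The filename templates above can be expanded for a given sequence length when scripting over replicates. This is a sketch under one assumption that should be checked against the archive: that [seq-len] is replaced by the bare number (for example 400, not 400bp). The names `GTEE_BY_SEQ_LEN` and `expected_tree_files` are hypothetical.

```python
# Sequence length (bp) -> gene tree estimation error (%), as listed above.
GTEE_BY_SEQ_LEN = {1600: 23, 800: 31, 400: 42, 200: 55}


def expected_tree_files(seq_len, modes=("min", "avg")):
    """Expand the species tree filename templates for one sequence length.

    Assumes [seq-len] is substituted as a bare integer in filenames.
    """
    if seq_len not in GTEE_BY_SEQ_LEN:
        raise ValueError(f"unknown sequence length: {seq_len}")
    gene_trees = f"fasttree_genetrees_{seq_len}_non"
    names = [
        f"castles_{gene_trees}_s_tree.trees",
        f"erable_patristic_all_{gene_trees}_s_tree.trees.derooted.length.nwk",
        f"RAxML_result.concat_for_fasttree_{seq_len}_s_tree.trees",
    ]
    # FastME trees exist for minimum and average distances.
    names += [
        f"fastme_BalLS_patristic_{mode}_{gene_trees}_s_tree.trees.derooted"
        for mode in modes
    ]
    return names


if __name__ == "__main__":
    for seq_len, gtee in GTEE_BY_SEQ_LEN.items():
        print(f"{seq_len}bp (about {gtee}% gene tree error)")
        for name in expected_tree_files(seq_len):
            print("  ", name)
```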
The preprocessed mammalian dataset (in which genes with mismatching names are removed) is available at https://drive.google.com/drive/folders/0B0lcoFFOYQf8SlZvQmlOSkFJaEE?resourcekey=0-ClOa-cr-C3TeBWQlQuxmZw and includes an estimated ASTRAL species tree, gene trees and alignments. The files generated for our analysis are available in biological-mammalian.tar.gz. Below is a description of files in each directory.
- genetrees.tre: gene trees after filtering
- concat-align.fasta: concatenated alignment of all filtered genes
- astralv5.7.8.tre: main species tree with CU branch lengths estimated using ASTRAL
- RAxML_log.no_GAL.tre: ASTRAL species tree furnished with RAxML SU branch lengths after removing the outgroup Chicken (GAL)
- castles_rooted.tre: ASTRAL species tree furnished with CASTLES SU branch lengths after removing the outgroup Chicken (GAL)
- fastme_[MODE].tre: ASTRAL species tree furnished with FastME SU branch lengths with minimum or average distances
- erable.tre.length.nwk: ASTRAL species tree furnished with ERaBLE SU branch lengths
- patristic_[MODE].mat: patristic distance matrix calculated using minimum (min), average (avg) or all (all) pairwise distances for gene trees
- mammals-namemap.txt: name mapping for taxa
- mammalian-trees.nexus: FigTree nexus file comparing CASTLES and Concatenation branch lengths