A project to develop a method for predicting the optimal growth pH of microorganisms from 'omics data.
- R folder contains the code for the bootstrapping process used to calculate the optimal pH for a strain using biogeographical information (pH_preference_bootstrapping_Panama_example.R).
- Python folder contains the machine learning (ML) code for prediction of pH preferences using genomic information.
- Gather datasets with 16S rDNA (or other marker gene data) that cover broad ranges in pH (or any environmental factor of interest)
- Identify the pH where each ASV achieves its highest relative abundance (statistical inference of pH preference)
- Identify associations between genes and pH preference
- Match the 16S rDNA reads to reference genomes
- Annotate the genomes with functions
- Build a ML model that uses the presence/absence of genes to predict pH preference
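As a minimal sketch of the feature table such a model could consume, the snippet below turns per-genome annotations into a binary presence/absence matrix. The input format (a dict mapping genome IDs to sets of Pfam accessions), the function name `presence_absence_matrix`, and the toy genome IDs are assumptions made for illustration; they are not taken from the code in the Python folder.

```python
def presence_absence_matrix(genome_to_pfams):
    """Return (genome_ids, pfam_ids, matrix), where matrix[i][j] is 1
    if genome i carries Pfam j and 0 otherwise."""
    genome_ids = sorted(genome_to_pfams)
    # Union of all annotated Pfams defines the feature columns.
    pfam_ids = sorted(set().union(*genome_to_pfams.values()))
    matrix = [
        [1 if pfam in genome_to_pfams[genome] else 0 for pfam in pfam_ids]
        for genome in genome_ids
    ]
    return genome_ids, pfam_ids, matrix


# Hypothetical toy annotations (not real project data).
annotations = {
    "genome_A": {"PF00001", "PF00005"},
    "genome_B": {"PF00005"},
    "genome_C": {"PF00001", "PF00072"},
}
genomes, pfams, X = presence_absence_matrix(annotations)
for genome, row in zip(genomes, X):
    print(genome, row)
```

Each row of `X` can then be paired with the estimated pH preference of the matching genome to form a training set.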
- /data/ramonedaj/Panama/mapfile_panama.txt
- /data/ramonedaj/Panama/otu_table_panama.txt
- /data/ramonedaj/Panama/refseq.txt
- /data/ramonedaj/pH_optima/bac120_metadata_r207.tsv
- /data/mhoffert/genomes/GTDB_r207/pfam/
- Generate 1000 randomized distributions of the relative abundance of each ASV across samples with replacement (i.e. where each relative abundance value can be sampled more than once).
- Calculate the maximum value for each of these distributions.
- Obtain 95% confidence intervals of these relative abundance maxima using the boot package (v1.3.28) in R.
- Match the extremes of these confidence intervals to the pH of the samples where the ASV achieved those relative abundance values. This gives the pH range in which a given ASV consistently achieves maximal abundance across randomizations.
- Remove ASVs whose inferred pH preference range is greater than 0.5 pH units.
- Take the midpoint of the pH range as the estimated pH preference for a given ASV.
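The steps above can be sketched for a single ASV as follows. This is an illustrative Python approximation, not the R script in this repository: it uses a simple percentile interval rather than the boot package, and it assumes that "matching the extremes" means taking the pH of the samples whose relative abundance equals the lower or upper interval bound. The function names, the `seed` parameter, and the toy data are hypothetical.

```python
import random


def percentile(sorted_vals, q):
    """Nearest-rank percentile of a pre-sorted list, with q in [0, 1]."""
    idx = min(len(sorted_vals) - 1, max(0, round(q * (len(sorted_vals) - 1))))
    return sorted_vals[idx]


def ph_preference(abundances, ph_values, n_boot=1000, max_range=0.5, seed=0):
    """Estimate the pH preference of one ASV, or return None if the
    inferred pH range is wider than max_range."""
    rng = random.Random(seed)
    n = len(abundances)
    # Resample abundances with replacement and keep each resample's maximum.
    maxima = sorted(max(rng.choices(abundances, k=n)) for _ in range(n_boot))
    # 95% percentile interval of the bootstrap maxima (assumption: the
    # original analysis may use a different interval type from boot).
    lower, upper = percentile(maxima, 0.025), percentile(maxima, 0.975)
    # pH of the samples where the ASV reached the interval extremes.
    phs = [ph for a, ph in zip(abundances, ph_values) if a == lower or a == upper]
    ph_min, ph_max = min(phs), max(phs)
    # Discard ASVs whose pH range is too wide to be informative.
    if ph_max - ph_min > max_range:
        return None
    # Midpoint of the pH range is the estimated preference.
    return (ph_min + ph_max) / 2


# Hypothetical toy data: relative abundance of one ASV and the sample pH.
abund = [0.00, 0.01, 0.02, 0.30, 0.28, 0.01]
ph = [4.0, 4.5, 5.0, 5.5, 5.7, 6.5]
print(ph_preference(abund, ph))
```

Because every bootstrap maximum is one of the observed abundance values, both interval bounds always correspond to at least one real sample, so the pH lookup never comes back empty.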
wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_reps/bac120_ssu_reps_r207.tar.gz
tar -xzvf bac120_ssu_reps_r207.tar.gz
vsearch --makeudb_usearch bac120_ssu_reps_r207.fna --output bac120_ssu_reps_r207.udb
vsearch --usearch_global refseq.fasta --db bac120_ssu_reps_r207.udb --strand both --notrunclabels --iddef 0 --id 0.99 --maxrejects 100 --maxaccepts 100 --blast6out aligned_ssu.tsv --threads 16
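With `--maxaccepts 100`, a single ASV can have many hits in `aligned_ssu.tsv`. The sketch below shows one possible way to reduce the tab-separated blast6out table (query, target, percent identity, then nine further columns) to a single reference per ASV. Keeping the highest-identity hit, with the first one winning ties, is an assumption made here for illustration; the function name `best_hits` and the example rows are hypothetical.

```python
import csv
import io


def best_hits(handle):
    """Map each query label to (target label, percent identity) for its
    highest-identity hit; on ties the first hit encountered is kept."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, target, identity = row[0], row[1], float(row[2])
        if query not in best or identity > best[query][1]:
            best[query] = (target, identity)
    return best


# Hypothetical rows in blast6out layout (12 tab-separated fields).
example = io.StringIO(
    "ASV_1\tgenome_X\t99.2\t250\t2\t0\t1\t250\t1\t250\t-1\t0\n"
    "ASV_1\tgenome_Y\t100.0\t250\t0\t0\t1\t250\t1\t250\t-1\t0\n"
    "ASV_2\tgenome_Z\t99.6\t250\t1\t0\t1\t250\t1\t250\t-1\t0\n"
)
for asv, (genome, identity) in best_hits(example).items():
    print(asv, genome, identity)
```

To run it on the real output, pass an open file instead of the in-memory example, for instance `best_hits(open("aligned_ssu.tsv", newline=""))`.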