Add CHAMPS pathogen reference genomes to chromosomes.tsv #75
- Add 104 new entries covering 93 organisms from the CHAMPS panel
- Fix pre-existing NBCI-REF typo to NCBI-REF
- Add 12 organisms requiring accession lookup to chromosomes-todo.tsv

Organisms added include: Aeromonas hydrophila, Bordetella holmesii/parapertussis, Brucella abortus/melitensis, Candida albicans, CCHF virus, Chikungunya, Corynebacterium ulcerans, Dengue 1-4, Enterovirus A71, Hepatitis E, HSV 1/2, human adenoviruses 2/40/41, astrovirus, human betaherpesvirus 5 (CMV), human coronaviruses 229E/NL63/OC43/HKU1, metapneumovirus, parainfluenzavirus 1-4, parechovirus 1, rhinovirus A, Influenza A H1N1, Japanese encephalitis virus, Lassa virus, MERS, mumps, Mycoplasma pneumoniae, Nipah virus, norovirus GI/GII, Orientia tsutsugamushi, parvovirus B19, RSV-A, Rift Valley fever virus, Rotavirus A, Rubella virus, Salmonella Typhi/Paratyphi A, sapovirus GI, SARS-CoV-1, Treponema pallidum, Ureaplasma urealyticum, varicella zoster virus, West Nile virus, Yellow fever virus, Zika virus

Co-authored-by: lskatz <14798+lskatz@users.noreply.github.com>
- SARS-CoV-1: AY278491 → AY274119 (Tor2 BCCA, INSDC source for NC_004718)
- Dengue virus 3: AY099336 → M93130 (H87 prototype, INSDC source for NC_001475)
- JEV: AF217620 (vaccine strain) → D90194 (Nakayama, INSDC source for NC_001437)
- RSV-A: M74568 (shorter) → AF013254 (complete Long strain, INSDC source for NC_001781)
- Rotavirus A: replaced 11 old SA11 segments with EF672557-EF672567 (SA11-H96, INSDC sources for NC_011503-NC_011513)
- Corynebacterium ulcerans: fix taxid 38289 → 1714 (species level)

Co-authored-by: lskatz <14798+lskatz@users.noreply.github.com>
Agent-Logs-Url: https://github.com/lskatz/Kalamari/sessions/3d3860c7-c251-44b8-837d-761e5da02c29
… (GenBank)

D90194 is a DDBJ-origin accession. Analysis of the CI logs across all 16 download chunks confirmed that NCBI's efetch never downloaded it. AF217620 is a standard NCBI GenBank accession (SA14-14-2 complete genome, 10977 bp) that was confirmed to download in CI for the first commit.

Co-authored-by: lskatz <14798+lskatz@users.noreply.github.com>
Agent-Logs-Url: https://github.com/lskatz/Kalamari/sessions/5d4c9d40-a1b7-454b-8e36-85804296c8ea
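The check described in this commit message boils down to comparing the accessions that were requested with the ones that actually arrived. The following is only a minimal sketch of that idea, not the CI's actual method (which relied on log analysis). It assumes the TSV has a header row with the accession in column 2, and that the downloads are available as a FASTA file whose headers start with the accession; the function name `missing_accessions` is hypothetical.

```shell
# List accessions from a TSV (column 2, header row skipped) that have
# no matching header in a FASTA file. Version suffixes such as ".1"
# are ignored on both sides.
# Usage: missing_accessions chromosomes.tsv downloaded.fasta
missing_accessions() {
  awk -F'\t' '
    # First file argument: the FASTA. Record each header accession.
    FILENAME == ARGV[1] {
      if (/^>/) {
        id = substr($1, 2)          # drop the leading ">"
        sub(/[ \t].*/, "", id)      # keep only the first word
        sub(/\.[0-9]+$/, "", id)    # strip the version suffix
        seen[id] = 1
      }
      next
    }
    # Second file argument: the TSV. Report accessions never seen.
    FNR > 1 {
      acc = $2
      sub(/\.[0-9]+$/, "", acc)
      if (!(acc in seen)) print acc
    }
  ' "$2" "$1"
}
```

Any accession it prints would be a candidate for the kind of substitution made here, such as falling back from D90194 to AF217620.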
I'll have to check each of the accessions to see if it
Checking taxonomy one by one (maybe there's a better way):
$ esearch -db nuccore -query CP000462 | elink -target taxonomy | efetch -format xml | xtract -format | grep -EC 4 "<Rank>genus|<Rank>species"
</Taxon>
<Taxon>
<TaxId>642</TaxId>
<ScientificName>Aeromonas</ScientificName>
<Rank>genus</Rank>
</Taxon>
<Taxon>
<TaxId>644</TaxId>
<ScientificName>Aeromonas hydrophila</ScientificName>
<Rank>species</Rank>
</Taxon>
<Taxon>
<TaxId>196023</TaxId>
<ScientificName>Aeromonas hydrophila subsp. hydrophila</ScientificName> |
|
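The grep context above can be condensed to one row per rank. This is a rough sketch that assumes the pretty-printed XML keeps `<TaxId>`, `<ScientificName>`, and `<Rank>` on separate lines in that order within each `<Taxon>`, as in the output shown; the function name `rank_rows` is hypothetical.

```shell
# Read pretty-printed taxonomy XML on stdin and print
# "rank<TAB>taxid<TAB>name" for genus- and species-level taxa.
# Intended use (needs EDirect and network access):
#   esearch -db nuccore -query CP000462 | elink -target taxonomy | \
#     efetch -format xml | xtract -format | rank_rows
rank_rows() {
  awk '
    # Remember the most recent TaxId and ScientificName seen.
    /<TaxId>/ {
      gsub(/.*<TaxId>|<\/TaxId>.*/, "")
      id = $0
    }
    /<ScientificName>/ {
      gsub(/.*<ScientificName>|<\/ScientificName>.*/, "")
      name = $0
    }
    # A Rank line closes the triple; print it if the rank is of interest.
    /<Rank>/ {
      gsub(/.*<Rank>|<\/Rank>.*/, "")
      if ($0 == "genus" || $0 == "species") print $0 "\t" id "\t" name
    }
  '
}
```

The two rows it emits correspond to the parent taxid (genus) and taxid (species) for an accession, which is easier to compare by eye than the raw XML.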
@copilot can you suggest other edirect commands to verify contig completeness, taxid, parent taxid, and inclusion in the NCBI reference genomes list?
Here are edirect commands for each verification type. The key fields from

1. Verify accession, taxid, completeness in one shot:

acc="AF013254"
esearch -db nuccore -query "${acc}[accn]" | efetch -format docsum | \
  xtract -pattern DocumentSummary -element Caption TaxId Length Completeness Genome Organism

2. Verify taxid and get parent taxid:

acc="AF013254"
esearch -db nuccore -query "${acc}[accn]" | elink -target taxonomy | \
  efetch -format xml | \
  xtract -pattern Taxon -element TaxId ScientificName -block LineageEx/Taxon -tab "\n" -element TaxId ScientificName Rank

The parent taxid is the last entry before the species in the lineage.

3. Check if it's an NCBI reference genome (RefSeq_category = "reference genome"):

acc="AF013254"
esearch -db nuccore -query "${acc}[accn]" | elink -target assembly | \
  efetch -format docsum | \
  xtract -pattern DocumentSummary -element AssemblyAccession AssemblyStatus RefSeq_category Organism

If nothing comes back, the accession has no assembly record (common for viral sequences). For those, check manually at https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=11250

4. Batch-check all CHAMPS accessions from chromosomes.tsv:

# esearch reads from /dev/null so it cannot consume the loop's input
tail -n +2 src/chromosomes.tsv | awk -F'\t' '{print $1"\t"$2"\t"$3"\t"$4}' | \
  while IFS=$'\t' read -r name acc taxid parent; do
    result=$(esearch -db nuccore -query "${acc}[accn]" < /dev/null | efetch -format docsum | \
      xtract -pattern DocumentSummary -element Caption TaxId Length Completeness)
    echo -e "$name\t$acc\t$taxid\t$parent\t$result"
  done

5. Completeness field values to look for:

6. Verify the NCBI RefSeq reference genome for a given taxid:

taxid="11250" # RSV-A
esearch -db assembly -query "txid${taxid}[Organism:exp] AND \"reference genome\"[RefSeq Category]" | \
  efetch -format docsum | \
  xtract -pattern DocumentSummary -element AssemblyAccession SpeciesName RefSeq_category

This returns the RefSeq assembly record; you can then trace back to the INSDC accession from there.
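The batch output from step 4 can be filtered so that only rows needing attention remain. This is a sketch under the assumption that each row has the layout produced by step 4's `echo`: name, accession, recorded taxid, and recorded parent in columns 1 to 4, followed by NCBI's Caption and TaxId in columns 5 and 6. The function name `flag_rows` and the labels it prints are hypothetical.

```shell
# Read step-4 style rows on stdin and print only the problem cases:
#   NO_RESULT       nothing came back from NCBI for the accession
#   TAXID_MISMATCH  the recorded taxid differs from the TaxId NCBI returned
flag_rows() {
  awk -F'\t' '
    # Fewer than six fields means the docsum lookup returned nothing.
    NF < 6 { print $2 "\tNO_RESULT"; next }
    # Column 3 is the recorded taxid, column 6 is the TaxId from NCBI.
    $3 != $6 { print $2 "\tTAXID_MISMATCH\trecorded=" $3 "\tncbi=" $6 }
  '
}
```

A mismatch is not necessarily an error: the file may deliberately record a species-level taxid while NCBI reports a strain-level one, so flagged rows are best treated as a review list.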
Adds complete-genome INSDC accessions for pathogens in the CHAMPS child mortality surveillance panel that were missing from Kalamari, with a preference for accessions from NCBI-designated reference genomes.
New entries
Accession selection
All accessions are INSDC (GenBank), not RefSeq. Where an NCBI-designated reference genome exists, its INSDC source accession is used:
Zika is represented by KX369547 (PRVABC59, 2015 epidemic strain) rather than AY632535 (1947 mouse-adapted MR766), because the epidemic strain is more relevant for surveillance.
Also fixed
Original prompt
This section details the original issue you should resolve
<issue_title>CHAMPS organisms</issue_title>
<issue_description>There is a multinational collaboration called CHAMPS that tests samples from deaths of very young or stillborn people. Experts determine the cause of death, and samples are tested postmortem to identify the pathogens. There is a strong ability to link cause of death to individual pathogens. There is also a test for more than 140 pathogens. I think it would be great to get the reference genomes of these pathogens and compare against CHAMPS results in the future.