
Add CHAMPS pathogen reference genomes to chromosomes.tsv#75

Draft
Copilot wants to merge 4 commits into master from
copilot/add-champs-organisms-reference-genomes

Contributor

Copilot AI commented Mar 20, 2026

Adds complete-genome INSDC accessions for pathogens in the CHAMPS child mortality surveillance panel that were missing from Kalamari, with a preference for accessions from NCBI-designated reference genomes.

New entries

  • Bacteria: Actinobacillus pleuropneumoniae, Aeromonas hydrophila, Bordetella holmesii/parapertussis, Brucella abortus/melitensis (2 chromosomes each), Corynebacterium ulcerans, Mycoplasma pneumoniae, Orientia tsutsugamushi, Salmonella Typhi/Paratyphi A, Treponema pallidum, Ureaplasma urealyticum
  • Single-segment viruses: CMV, HSV-1/2, VZV, HCoV-229E/NL63/OC43/HKU1, MERS-CoV, SARS-CoV-1, EV-A71, Rhinovirus A, Parechovirus 1, hMPV, HPIV 1–4, Mumps, Nipah, RSV-A, DENV 1–4, JEV, WNV, YFV, Zika, CHIKV, Rubella, HEV, HAdV-2/40/41, HAstV-1, Norovirus GI/GII, Sapovirus GI, Parvovirus B19
  • Multi-segment viruses: Influenza A H1N1 pdm09 (8 segments: CY121673–CY121680), CCHFV (3), Lassa (2), RVFV (3), Rotavirus A SA11-H96 (11 segments: EF672557–EF672567)
  • Fungi: Candida albicans SC5314 (8 chromosomes: CP017623–CP017630)
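
The accession ranges above (for example CY121673–CY121680) have to be written out one accession per row in chromosomes.tsv. A minimal sketch of that expansion follows; the function name `expand_accession_range` is hypothetical and not part of Kalamari, and it assumes both ends of the range share the same letter prefix and digit width.

```shell
# Hypothetical helper: expand an accession range into one accession per line.
# Assumes both ends share the same letter prefix and the same number of digits.
expand_accession_range() {
  awk -v first="$1" -v last="$2" 'BEGIN {
    match(first, /[0-9]+$/)                 # locate the numeric tail
    prefix = substr(first, 1, RSTART - 1)   # e.g. "CY"
    width  = RLENGTH                        # keep zero padding
    start  = substr(first, RSTART) + 0
    match(last, /[0-9]+$/)
    end    = substr(last, RSTART) + 0
    fmt = "%s%0" width "d\n"
    for (i = start; i <= end; i++) printf fmt, prefix, i
  }'
}

# Influenza A H1N1 pdm09 segments listed in this PR
expand_accession_range CY121673 CY121680
```

The same call with `EF672557 EF672567` lists the 11 rotavirus segments.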

Accession selection

All accessions are INSDC (GenBank), not RefSeq. Where an NCBI-designated reference genome exists, its INSDC source accession is used:

| Organism | Accession | Notes |
| --- | --- | --- |
| SARS-CoV-1 | AY274119 | NC_004718 source (Tor2/BCCA); replaces BJ01 isolate AY278491 |
| Dengue virus 3 | M93130 | NC_001475 source (H87 prototype) |
| JEV | AF217620 | NCBI GenBank complete genome; D90194 (Nakayama/DDBJ) excluded because it is not retrievable via efetch |
| RSV-A | AF013254 | NC_001781 source (complete Long strain); replaces shorter M74568 |
| Rotavirus A | EF672557–EF672567 | SA11-H96 strain, NC_011503–NC_011513 sources; replaces 11 disparate 1980s–2000s segments |
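
Since the rule is "INSDC, not RefSeq", a quick format check can catch a RefSeq accession that slipped into the list. The sketch below relies only on accession shape: RefSeq accessions carry a two-letter prefix followed by an underscore, while classic INSDC nucleotide accessions are 1 letter + 5 digits or 2 letters + 6 or 8 digits. The function name `classify_accession` is hypothetical, WGS-style accessions are not covered, and a well-formed accession is not proof that the record exists.

```shell
# Hypothetical helper: classify an accession by its format only.
# Prints "refseq", "insdc", or "unknown".
classify_accession() {
  acc="${1%.*}"   # drop a trailing version such as ".1", if present
  if printf '%s\n' "$acc" | grep -Eq '^[A-Z]{2}_[A-Z0-9]+$'; then
    echo refseq
  elif printf '%s\n' "$acc" | grep -Eq '^([A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}|[A-Z]{2}[0-9]{8})$'; then
    echo insdc
  else
    echo unknown
  fi
}

# Accessions taken from the table above
for a in AY274119 M93130 NC_004718; do
  printf '%s\t%s\n' "$a" "$(classify_accession "$a")"
done
```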

Zika is represented by KX369547 (PRVABC59, 2015 epidemic strain) rather than AY632535 (1947 mouse-adapted MR766), because the epidemic strain is more relevant for surveillance.

Also fixed

  • Corynebacterium ulcerans taxid corrected: strain-level 38289 → species-level 1714
Original prompt

This section details the original issue you should resolve

<issue_title>CHAMPS organisms</issue_title>
<issue_description>There is a multinational collaboration called CHAMPS that tests samples from deaths of very young or stillborn children. Experts determine the cause of death, and samples are tested postmortem to identify the pathogens. There is a strong ability to link cause of death to individual pathogens. There is also a test for more than 140 pathogens. I think it would be great to get the reference genomes of these pathogens and compare against CHAMPS results in the future.

Code Assay Respiratory V1/V2 Respiratory V3 Enteric Tier 1 (Blood/CSF) Tier 2 (Blood/CSF)
ABAU_1 Acinetobacter baumannii X X X
ADEV_1 Adenovirus X X X X
BPAR_1 Bordetella parapertussis, Bordetella bronchiseptica (Insertion sequence IS1001) X X
BOP2_1 Bordetella pertussis (pertussis toxin) X X
BOP1_1 Bordetella pertussis, Bordetella holmesii (Insertion sequence IS481) X X
BUPS_1 Burkholderia pseudomallei X X X
CHTR_1 Chlamydia trachomatis X X
CODI_1 Corynebacterium diphtheriae X X
COUP_1 Corynebacterium ulcerans, Corynebacterium pseudotuberculosis X X
CTOX_1 Corynebacterium spp. (tox gene) X X
CYMV_1 Cytomegalovirus (CMV) X X X
ENTV_4 Enterovirus X X X X
GAST_2 Group A Streptococcus X X X
GBST_1 Group B Streptococcus X X X
HIAT_1 Haemophilus influenzae X X X
HITB_2 Haemophilus influenzae type B X X X
HCV1_1 Human coronavirus 1 (229E) X X
HCV2_1 Human coronavirus 2 (NL63) X X
HCV3_1 Human coronavirus 3 (OC43) X X
HCV4_1 Human coronavirus 4 (HKU1) X X
HMPV_1 Human metapneumovirus (HMPV) X ...

- Add 104 new entries covering 93 organisms from the CHAMPS panel
- Fix pre-existing NBCI-REF typo to NCBI-REF
- Add 12 organisms requiring accession lookup to chromosomes-todo.tsv

Organisms added include: Aeromonas hydrophila, Bordetella holmesii/parapertussis,
Brucella abortus/melitensis, Candida albicans, CCHF virus, Chikungunya,
Corynebacterium ulcerans, Dengue 1-4, Enterovirus A71, Hepatitis E, HSV 1/2,
human adenoviruses 2/40/41, astrovirus, human betaherpesvirus 5 (CMV),
human coronaviruses 229E/NL63/OC43/HKU1, metapneumovirus, parainfluenzavirus 1-4,
parechovirus 1, rhinovirus A, Influenza A H1N1, Japanese encephalitis virus,
Lassa virus, MERS, mumps, Mycoplasma pneumoniae, Nipah virus, norovirus GI/GII,
Orientia tsutsugamushi, parvovirus B19, RSV-A, Rift Valley fever virus,
Rotavirus A, Rubella virus, Salmonella Typhi/Paratyphi A, sapovirus GI,
SARS-CoV-1, Treponema pallidum, Ureaplasma urealyticum, varicella zoster virus,
West Nile virus, Yellow fever virus, Zika virus

Co-authored-by: lskatz <14798+lskatz@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add reference genomes for CHAMPS pathogens" to "Add CHAMPS panel pathogen reference genomes" Mar 20, 2026
Copilot AI requested a review from lskatz March 20, 2026 13:59
- SARS-CoV-1: AY278491 → AY274119 (Tor2 BCCA, INSDC source for NC_004718)
- Dengue virus 3: AY099336 → M93130 (H87 prototype, INSDC source for NC_001475)
- JEV: AF217620 (vaccine strain) → D90194 (Nakayama, INSDC source for NC_001437)
- RSV-A: M74568 (shorter) → AF013254 (complete Long strain, INSDC source for NC_001781)
- Rotavirus A: replaced 11 old SA11 segments with EF672557-EF672567 (SA11-H96, INSDC sources for NC_011503-NC_011513)
- Corynebacterium ulcerans: fix taxid 38289 → 1714 (species level)

Co-authored-by: lskatz <14798+lskatz@users.noreply.github.com>
Agent-Logs-Url: https://github.com/lskatz/Kalamari/sessions/3d3860c7-c251-44b8-837d-761e5da02c29
Copilot AI changed the title from "Add CHAMPS panel pathogen reference genomes" to "Add CHAMPS pathogen panel reference genomes to chromosomes.tsv" Mar 20, 2026
… (GenBank)

D90194 is a DDBJ-origin accession. CI log analysis across all 16 download
chunks confirmed it was never downloaded by NCBI's efetch. AF217620 is a
standard NCBI GenBank accession (SA14-14-2 complete genome, 10977 bp) that
was confirmed downloading in CI for the first commit.
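
The check described here was done by reading CI logs. As an alternative sketch, the same question ("which requested accessions never arrived?") can be answered by comparing the accession list against the headers of the downloaded FASTA. The function name `missing_accessions` and both file names are hypothetical, and the sketch assumes efetch-style headers of the form `>ACCESSION.VERSION description`.

```shell
# Hypothetical helper: list accessions from LIST that have no record in FASTA.
# Usage: missing_accessions LIST FASTA
missing_accessions() {
  awk '
    FILENAME == ARGV[1] {            # first file: the downloaded FASTA
      if (/^>/) {
        id = substr($1, 2)           # strip the leading ">"
        sub(/\.[0-9]+$/, "", id)     # strip the version suffix
        seen[id] = 1
      }
      next
    }
    NF && !($1 in seen) { print $1 } # second file: one accession per line
  ' "$2" "$1"
}

# Demo with placeholder sequence data (not real downloads)
demo_dir=$(mktemp -d)
printf 'AF217620\nD90194\n' > "$demo_dir/accessions.txt"
printf '>AF217620.1 placeholder header\nACGT\n' > "$demo_dir/downloaded.fasta"
missing_accessions "$demo_dir/accessions.txt" "$demo_dir/downloaded.fasta"
rm -r "$demo_dir"
```

With the placeholder files above, only D90194 is reported as missing.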

Co-authored-by: lskatz <14798+lskatz@users.noreply.github.com>
Agent-Logs-Url: https://github.com/lskatz/Kalamari/sessions/5d4c9d40-a1b7-454b-8e36-85804296c8ea
Copilot AI changed the title Add CHAMPS pathogen panel reference genomes to chromosomes.tsv Add CHAMPS pathogen reference genomes to chromosomes.tsv Mar 21, 2026
@lskatz
Owner

lskatz commented Mar 22, 2026

I'll have to check each of the accessions to see if it

  • is a legitimate accession
  • is on the CHAMPS list (or why is it important?)
  • is a complete genome
  • has a correct taxid
  • has a correct parent taxid
  • is recommended somehow (NCBI reference genome? CDC SME? Other SME?)

Checking taxonomy one by one (maybe there's a better way)

$ esearch -db nuccore -query CP000462 | elink -target taxonomy | efetch -format xml | xtract -format | grep -EC 4 "<Rank>genus|<Rank>species"
      </Taxon>
      <Taxon>
        <TaxId>642</TaxId>
        <ScientificName>Aeromonas</ScientificName>
        <Rank>genus</Rank>
      </Taxon>
      <Taxon>
        <TaxId>644</TaxId>
        <ScientificName>Aeromonas hydrophila</ScientificName>
        <Rank>species</Rank>
      </Taxon>
      <Taxon>
        <TaxId>196023</TaxId>
        <ScientificName>Aeromonas hydrophila subsp. hydrophila</ScientificName>
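
One way to make the grep output easier to compare against chromosomes.tsv is to reduce the XML to one tab-separated line per rank. This is a rough sketch, not an edirect feature: the function name `lineage_ranks` is hypothetical, and it assumes the pretty-printed layout shown above, with one element per line and TaxId and ScientificName preceding Rank inside each Taxon.

```shell
# Hypothetical helper: read pretty-printed taxonomy XML on stdin and print
# rank, taxid, and scientific name for the genus and species entries.
lineage_ranks() {
  awk '
    /<TaxId>/          { gsub(/.*<TaxId>|<\/TaxId>.*/, ""); taxid = $0; next }
    /<ScientificName>/ { gsub(/.*<ScientificName>|<\/ScientificName>.*/, ""); name = $0; next }
    /<Rank>(genus|species)<\/Rank>/ {
      gsub(/.*<Rank>|<\/Rank>.*/, "")
      print $0 "\t" taxid "\t" name
    }
  '
}

# Demo on the fragment shown above
lineage_ranks <<'EOF'
      <Taxon>
        <TaxId>642</TaxId>
        <ScientificName>Aeromonas</ScientificName>
        <Rank>genus</Rank>
      </Taxon>
      <Taxon>
        <TaxId>644</TaxId>
        <ScientificName>Aeromonas hydrophila</ScientificName>
        <Rank>species</Rank>
      </Taxon>
EOF
```

In practice the function would sit at the end of the pipeline in place of the grep, reading the output of `xtract -format`.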

@lskatz
Owner

lskatz commented Mar 22, 2026

@copilot can you suggest other edirect commands to verify contig completeness, taxid, parent taxid, and inclusion in the NCBI reference genomes list?

Contributor Author

Copilot AI commented Mar 22, 2026

@copilot can you suggest other edirect commands to verify contig completeness, taxid, parent taxid, and inclusion in the NCBI reference genomes list?

Here are edirect commands for each verification type. The key fields from efetch -format docsum on the nuccore database are TaxId, Slen (sequence length), Completeness, and Genome (for example chromosome or plasmid).

1. Verify accession, taxid, completeness in one shot:

acc="AF013254"
esearch -db nuccore -query "${acc}[accn]" | efetch -format docsum | \
  xtract -pattern DocumentSummary -element Caption TaxId Slen Completeness Genome Organism

2. Verify taxid and get parent taxid:

acc="AF013254"
esearch -db nuccore -query "${acc}[accn]" | elink -target taxonomy | \
  efetch -format xml | \
  xtract -pattern Taxon -element TaxId ScientificName -block LineageEx/Taxon -tab "\n" -element TaxId ScientificName Rank

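The parent taxid is the last lineage entry listed before the species (typically the genus). The taxonomy XML also reports each record's immediate parent directly in its ParentTaxId element.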

3. Check if it's an NCBI reference genome (RefSeq_category = "reference genome"):

acc="AF013254"
esearch -db nuccore -query "${acc}[accn]" | elink -target assembly | \
  efetch -format docsum | \
  xtract -pattern DocumentSummary -element AssemblyAccession AssemblyStatus RefSeq_category Organism

If nothing comes back, the accession has no assembly record (common for viral sequences). For those, check manually at https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=11250

4. Batch-check all CHAMPS accessions from chromosomes.tsv:

# Columns 1-4 are assumed to be name, accession, taxid, and parent taxid.
tail -n +2 src/chromosomes.tsv | awk -F'\t' '{print $1"\t"$2"\t"$3"\t"$4}' | \
while IFS=$'\t' read -r name acc taxid parent; do
  # Redirect stdin so esearch does not swallow the remaining loop input.
  result=$(esearch -db nuccore -query "${acc}[accn]" < /dev/null | efetch -format docsum | \
    xtract -pattern DocumentSummary -element Caption TaxId Slen Completeness)
  echo -e "$name\t$acc\t$taxid\t$parent\t$result"
done
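
The batch output still has to be read row by row. A small filter can surface only the rows that need attention, assuming the layout the loop above prints: recorded taxid in column 3 and the TaxId returned by NCBI in column 6. The function name `flag_taxid_mismatches` and the demo rows are hypothetical. A mismatch is a prompt to look rather than proof of an error, since NCBI may return a strain-level taxid where the table deliberately stores the species.

```shell
# Hypothetical helper: read the batch-check output on stdin and print only
# rows with no NCBI result or with a taxid that differs from the table.
flag_taxid_mismatches() {
  awk -F'\t' '
    NF < 6 || $6 == "" { print "NO_RESULT\t" $0; next }
    $3 != $6           { print "TAXID_MISMATCH\t" $0 }
  '
}

# Demo rows with made-up names, accessions, and taxids
printf 'Organism A\tXX000001\t111\t11\tXX000001\t111\t5000\tcomplete\n'  > /dev/null
printf '%b' \
  'Organism A\tXX000001\t111\t11\tXX000001\t111\t5000\tcomplete\n' \
  'Organism B\tXX000002\t222\t22\tXX000002\t223\t6000\tcomplete\n' \
  'Organism C\tXX000003\t333\t33\t\n' | flag_taxid_mismatches
```

In the demo, Organism A passes silently, Organism B is flagged as TAXID_MISMATCH, and Organism C as NO_RESULT.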

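5. Completeness field values to look for: in the nuccore docsum, complete marks a sequence annotated as complete, and an empty value means completeness is not asserted, so the record may be incomplete. Assembly-level terms such as Complete Genome, Chromosome, or Contig appear in the AssemblyStatus field of the assembly docsum (command 3), not in the nuccore Completeness field.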

6. Verify the NCBI RefSeq reference genome for a given taxid:

taxid="11250"  # RSV-A
esearch -db assembly -query "txid${taxid}[Organism:exp] AND \"reference genome\"[RefSeq Category]" | \
  efetch -format docsum | \
  xtract -pattern DocumentSummary -element AssemblyAccession SpeciesName RefSeq_category

This returns the RefSeq assembly record; you can then trace back to the INSDC accession from there.

Development

Successfully merging this pull request may close these issues.

CHAMPS organisms
