
Archive original JMB 2025 toxprot dataset for backwards compatibility#52

Merged
tsenoner merged 3 commits into main from feat/restore-jmb-2025-toxprot
May 28, 2026

@tsenoner tsenoner commented May 28, 2026

Owner

Restores the venom-toxin (ToxProt) dataset behind the original ProtSpace JMB 2025 figures into data/jmb_2025/toxprot/ for backwards compatibility. Recovered from commit 7c0442e (removed in the Oct 2025 cleanup).

Dataset: 5,181 Swiss-Prot venom toxins, ProtT5 embeddings on mature (signal-peptide-stripped) sequences.

Includes: toxins.json (+_style) with PCA/UMAP/PaCMAP projections, toxins.csv (accessions + curated protein_category), both reconstructed FASTAs (full + mature), and rebuild_mature_fasta.py. README documents the DR parameters.

Notes:

  • FASTA was never committed → reconstructed from UniProt (5,179/5,181; D5KR58/Q2PE51 now obsolete).
  • toxins_prott5.h5 stays gitignored (reproducible via protspace embed -e prot_t5).
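
The mature-sequence reconstruction mentioned in the first note can be sketched roughly as below. This is an illustration only, not the contents of rebuild_mature_fasta.py: the helper names `strip_signal_peptide` and `to_fasta` are hypothetical, and the signal-peptide end position is assumed to be taken from each entry's UniProt signal-peptide annotation.

```python
from typing import Optional


def strip_signal_peptide(sequence: str, signal_end: Optional[int]) -> str:
    """Return the mature sequence with the signal peptide removed.

    signal_end is the 1-based, inclusive end position of the signal peptide
    (an assumed input; None means no signal peptide is annotated).
    """
    if signal_end is None:
        return sequence
    if not 0 <= signal_end < len(sequence):
        raise ValueError(f"signal_end {signal_end} is outside the sequence")
    # A 1-based inclusive end equals the 0-based index of the first mature residue.
    return sequence[signal_end:]


def to_fasta(accession: str, sequence: str, width: int = 60) -> str:
    """Format a single FASTA record, wrapping the sequence at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{accession}\n" + "\n".join(lines) + "\n"
```

Writing the full and mature FASTAs would then amount to calling `to_fasta` once with the original sequence and once with the stripped one for each recovered accession.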

🤖 Generated with Claude Code

tsenoner and others added 3 commits May 28, 2026 13:36
Restore the venom-toxin (ToxProt) dataset behind the original ProtSpace
JMB 2025 figures (from commit 7c0442e, removed in the Oct 2025 cleanup)
into data/jmb_2025/toxprot/ for backwards compatibility: ProtSpace JSONs
(embedding + sequence-similarity projections), ProtT5 embeddings, and
annotation CSVs.

The input FASTA was never committed, so rebuild_mature_fasta.py
reconstructs both full and signal-peptide-stripped sequences by
re-fetching the 5,181 accessions from UniProt (5,179 recovered; 2 now
obsolete). README documents the dataset and the exact DR parameters used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep the 22 MB ProtT5 .h5 out of git (it's reproducible from the mature
FASTA via `protspace embed`). The rebuild script now reads the accession
list from the tracked toxins.csv instead of the .h5, so the archive stays
self-contained without the embeddings. README clarifies the .h5 is
untracked and documents the toxins.csv vs toxins_all.csv column split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

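
Reading the accession list from the tracked CSV rather than the .h5, as this commit describes, might look something like the sketch below. The column name `accession` and both function names are assumptions made for illustration; the actual header in toxins.csv and the actual script logic may differ.

```python
import csv
from typing import Iterable, List


def read_accessions(lines: Iterable[str], column: str = "accession") -> List[str]:
    """Collect unique accessions from CSV lines, preserving file order.

    `column` is an assumed header name; pass the real one used by toxins.csv.
    """
    seen = set()
    accessions = []
    for row in csv.DictReader(lines):
        acc = (row.get(column) or "").strip()
        if acc and acc not in seen:
            seen.add(acc)
            accessions.append(acc)
    return accessions


def missing_accessions(requested: List[str], recovered: Iterable[str]) -> List[str]:
    """Return requested accessions that were not recovered (e.g. obsolete entries)."""
    recovered_set = set(recovered)
    return [acc for acc in requested if acc not in recovered_set]
```

Because `csv.DictReader` accepts any iterable of lines, an open file handle for toxins.csv can be passed directly, and `missing_accessions` gives a quick way to list entries such as the two obsolete ones noted in the PR description.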
Drop the sequence-similarity projection JSONs (toxins_seq_sim*.json) and
the supplementary toxins_all.csv. The archive keeps the ProtT5
embedding-based ProtSpace JSONs, toxins.csv (accessions + curated
protein_category), and the reconstructed FASTAs. README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tsenoner tsenoner merged commit bd1c55d into main May 28, 2026
4 checks passed
@tsenoner tsenoner deleted the feat/restore-jmb-2025-toxprot branch May 28, 2026 11:49