Archive original JMB 2025 toxprot dataset for backwards compatibility #52
Merged
Restore the venom-toxin (ToxProt) dataset behind the original ProtSpace JMB 2025 figures (from commit 7c0442e, removed in the Oct 2025 cleanup) into data/jmb_2025/toxprot/ for backwards compatibility: ProtSpace JSONs (embedding + sequence-similarity projections), ProtT5 embeddings, and annotation CSVs. The input FASTA was never committed, so rebuild_mature_fasta.py reconstructs both full and signal-peptide-stripped sequences by re-fetching the 5,181 accessions from UniProt (5,179 recovered; 2 now obsolete). README documents the dataset and the exact DR parameters used.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep the 22 MB ProtT5 .h5 out of git (it's reproducible from the mature FASTA via `protspace embed`). The rebuild script now reads the accession list from the tracked toxins.csv instead of the .h5, so the archive stays self-contained without the embeddings. README clarifies the .h5 is untracked and documents the toxins.csv vs toxins_all.csv column split.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the sequence-similarity projection JSONs (toxins_seq_sim*.json) and the supplementary toxins_all.csv. The archive keeps the ProtT5 embedding-based ProtSpace JSONs, toxins.csv (accessions + curated protein_category), and the reconstructed FASTAs. README updated to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores the venom-toxin (ToxProt) dataset behind the original ProtSpace JMB 2025 figures into `data/jmb_2025/toxprot/` for backwards compatibility. Recovered from commit `7c0442e` (removed in the Oct 2025 cleanup).

Dataset: 5,181 Swiss-Prot venom toxins, ProtT5 embeddings on mature (signal-peptide-stripped) sequences.

Includes: `toxins.json` (+ `_style`) with PCA/UMAP/PaCMAP projections, `toxins.csv` (accessions + curated `protein_category`), both reconstructed FASTAs (full + mature), and `rebuild_mature_fasta.py`. README documents the DR parameters.

Notes:
- The input FASTA was never committed, so the sequences were re-fetched from UniProt: 5,179 of the 5,181 accessions were recovered (`D5KR58`/`Q2PE51` now obsolete).
- `toxins_prott5.h5` stays gitignored (reproducible via `protspace embed -e prot_t5`).

🤖 Generated with Claude Code
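
For readers who want a feel for the reconstruction step, here is a minimal sketch of how mature sequences could be derived once the UniProt entries are in hand. It is not the contents of `rebuild_mature_fasta.py`. Everything in it is an assumption made for illustration: the function names, the `identifier` column name for accessions in `toxins.csv`, and the entry layout (a `sequence.value` string plus a `features` list whose `Signal` items carry 1-based inclusive `location.end.value` positions, modelled on UniProt's JSON output). Fetching is deliberately left out so the sketch runs offline.

```python
import csv
import io


def read_accessions(csv_text, column="identifier"):
    """Read accessions from CSV text. The column name is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


def mature_sequence(entry):
    """Return the sequence with an annotated signal peptide removed.

    Assumes a UniProt-style entry dict; feature positions are treated as
    1-based and inclusive, so a signal ending at position N leaves seq[N:].
    """
    seq = entry["sequence"]["value"]
    for feature in entry.get("features", []):
        if feature.get("type") == "Signal":
            end = feature["location"]["end"].get("value")
            if end is not None:
                return seq[end:]
    return seq  # no usable signal annotation: keep the full sequence


def split_recovered(accessions, entries):
    """Partition accessions into those with a sequence and those without."""
    recovered = [a for a in accessions if "sequence" in entries.get(a, {})]
    missing = [a for a in accessions if "sequence" not in entries.get(a, {})]
    return recovered, missing


def to_fasta(records, width=60):
    """Format (accession, sequence) pairs as wrapped FASTA text."""
    lines = []
    for accession, seq in records:
        lines.append(f">{accession}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data only; these are not real accessions or sequences.
    csv_text = "identifier,protein_category\nTOY1,example\nTOY2,example\n"
    entries = {
        "TOY1": {
            "sequence": {"value": "MKTLLAAAGCD"},
            "features": [
                {"type": "Signal",
                 "location": {"start": {"value": 1}, "end": {"value": 4}}}
            ],
        },
        # TOY2 stands in for an obsolete accession with no sequence.
    }
    accessions = read_accessions(csv_text)
    recovered, missing = split_recovered(accessions, entries)
    print(to_fasta((a, mature_sequence(entries[a])) for a in recovered))
    print("missing:", missing)
```

Keeping the accession list in the tracked CSV, as the second commit does, means a step like `read_accessions` needs nothing from the untracked `.h5` file.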