feat: protlabel EAT engine + protspace transfer subcommand #55
Open
tsenoner wants to merge 22 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `add_overlay_columns()` in `src/protspace/data/io/predictions.py` that appends three aligned Arrow columns (`COL__pred_value`, `COL__pred_confidence`, `COL__pred_source`) from a list of `protlabel.Prediction` objects, leaving the curated column untouched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements Task 9: the EAT orchestration core (`run_transfer`) and the `protspace transfer` Typer CLI command, wiring classification, nearest-neighbour lookup (`protlabel.eat`), and overlay-column writing into a single pipeline for filling missing annotation values from pLM embedding space.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
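A minimal sketch of the nearest-neighbour transfer step this commit describes, using the reliability index RI = 0.5/(0.5+d) quoted in the PR summary. It is not the `run_transfer` implementation: the function names (`reliability_index`, `transfer_nearest`), the plain-dict inputs, and the sorted-id tie-break are assumptions made for illustration, and it uses only the standard library rather than numpy.

```python
import math


def reliability_index(distance: float) -> float:
    """RI = 0.5 / (0.5 + d), the formula quoted in the PR summary."""
    return 0.5 / (0.5 + distance)


def euclidean(a, b) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_nearest(queries, references, labels):
    """Hypothetical k=1 transfer: each query takes the label of its nearest
    labelled reference, together with the RI of that distance."""
    predictions = {}
    for qid, qvec in queries.items():
        best_id, best_d = None, math.inf
        # Iterating over sorted ids with a strict '<' keeps ties deterministic
        # (assumption: the real tie-break rule may differ).
        for rid in sorted(references):
            if labels.get(rid) is None:
                continue  # unlabelled references cannot donate a value
            d = euclidean(qvec, references[rid])
            if d < best_d:
                best_id, best_d = rid, d
        if best_id is not None:
            predictions[qid] = (labels[best_id], reliability_index(best_d), best_id)
    return predictions


refs = {"R1": [0.0, 0.0], "R2": [3.0, 4.0]}
labels = {"R1": "kinase", "R2": "protease"}
queries = {"Q1": [0.0, 0.5]}
result = transfer_nearest(queries, refs, labels)
```

Here `Q1` lies at distance 0.5 from `R1`, so it receives the label `kinase` with RI = 0.5/(0.5+0.5) = 0.5. Distances are taken on the embedding vectors themselves, matching the PR's point that the 2-D/3-D projection is not used.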
…rrors

- Normalize `protein_id` → `identifier` before `run_transfer` and rename it back afterwards, so real bundles (produced by `protspace prepare`) no longer raise `KeyError`.
- Raise `ValueError` when no bundle proteins match any embedding key.
- Correct a misleading comment in `test_run_transfer_predicts_for_query_with_missing_value`.
- Add an end-to-end regression test exercising the `protein_id` rename path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ty, robustness

Resolve issues found in code review of the EAT transfer backend (PR #55):

- predictions: make the overlay idempotent by dropping existing `<col>__pred_*` columns before re-appending, so re-running transfer replaces them instead of producing a duplicate-column bundle that can no longer be read back
- bundle: atomic writes (temp file + `os.replace`) in `write_bundle` and the `replace_*` helpers, so an interrupted in-place overwrite (`-b X -o X`) can no longer destroy the bundle; reject the reserved delimiter in serialized parts
- backends: replace scipy's `cdist` with a pure-numpy BLAS GEMM path and recompute the surviving top-k distances in float64 (precise for near-identical vectors); guard cosine against zero-norm NaN
- lookup: store float32 + unicode arrays, load with `allow_pickle=False` (no pickle/RCE surface; lossless round-trip)
- transfer/classification: materialize only the needed columns (no full `to_pylist`); deterministic RI tie-break; translate input errors to `BadParameter`
- cli: colon/Windows-safe `-e`/`-i` parsing via a shared `split_h5_spec` helper
- docs/notebook: qualify the reliability-index formula per metric and k

Adds tests for the protlabel engine, overlay idempotency, atomic write, spec parsing, and CLI error handling. Full suite: 572 passed; ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
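The "temp file + `os.replace`" pattern named in the bundle item can be sketched as follows. This is an illustration of the general technique, not the code of `write_bundle`; the helper name `atomic_write_bytes` and the temp-file naming are assumptions.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Hypothetical helper: write `data` to `path` so that readers see either
    the old file or the complete new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so os.replace stays on
    # one filesystem (a cross-device rename would not be atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomically swaps in the finished file
    except BaseException:
        # On any failure, remove the temp file and leave the original untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the original file is only replaced after the new content is fully written, an interruption during an in-place overwrite leaves the previous bundle intact.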
…onfidence

The per-cell prediction overlay now writes only `<col>__pred_value` and `<col>__pred_confidence`. The reference id (source) is noise as a colour feature, so it is dropped from the bundle; it remains available on protlabel's `Prediction`. A legacy `<col>__pred_source` is dropped on re-run so older bundles are cleaned up.

Keeping confidence as a separate numeric column lets the web frontend colour and threshold by reliability (gradient legend), which inline `label|score` values do not enable (those render tooltip-only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
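A rough sketch of the overlay behaviour described in this commit: drop any existing `<col>__pred_*` columns (including a legacy `<col>__pred_source`), then append the two overlay columns aligned to the id column. The real code works on Arrow tables via `add_overlay_columns()`; the dict-of-lists table, the function name `add_overlay`, and the `(value, confidence)` tuple shape here are assumptions for illustration only.

```python
def add_overlay(columns, col, predictions, id_column="protein_id"):
    """Hypothetical overlay writer on a dict-of-lists table.

    `predictions` maps protein id -> (value, confidence). The curated `col`
    is never modified; only <col>__pred_* columns are replaced.
    """
    prefix = f"{col}__pred_"
    # Dropping earlier overlay columns first makes re-runs idempotent and
    # also cleans up a legacy <col>__pred_source column.
    out = {name: values for name, values in columns.items()
           if not name.startswith(prefix)}
    ids = out[id_column]
    out[f"{col}__pred_value"] = [
        predictions[i][0] if i in predictions else None for i in ids
    ]
    out[f"{col}__pred_confidence"] = [
        predictions[i][1] if i in predictions else None for i in ids
    ]
    return out


table = {
    "protein_id": ["P1", "P2"],
    "family": ["kinase", None],
    "family__pred_source": [None, "P1"],  # legacy column from an older bundle
}
preds = {"P2": ("kinase", 0.8)}
once = add_overlay(table, "family", preds)
twice = add_overlay(once, "family", preds)
```

Running the overlay twice yields the same table as running it once, and the legacy source column is gone after the first pass.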
Summary
Backend for Embedding Annotation Transfer (EAT) — the engine from #54, packaged so the conference users' proximity-mining workflow becomes a thin layer on top rather than a parallel reimplementation.
- `protlabel` package (numpy/scipy/h5py only, strict no-protspace-imports boundary): kNN in true pLM embedding space + goPredSim reliability index (RI = 0.5/(0.5+d), Eq. 5) + a persistable `.npz` lookup sidecar. Ships as a second top-level package in this repo (built into the wheel); a future standalone PyPI split is mechanical.
- `protspace transfer` subcommand: classifies query vs reference proteins (ID-prefix / `col~substr`, no hardcoded biology), transfers each query's missing annotation value from its nearest annotated reference, and writes a per-cell overlay into the bundle.
- `<col>__pred_value` (string), `<col>__pred_confidence` (float32, RI in [0,1]), `<col>__pred_source` (string); the curated `<col>` is left untouched, and the bundle keeps its `protein_id` id column, so existing web readers stay compatible.
- `--metric`), `k=1`. Distances are computed in the original embedding space (HDF5), not in the 2-D/3-D projection (DR is non-isometric).

Design & scope
- `docs/superpowers/specs/2026-06-11-eat-annotation-transfer-design.md`
- `docs/superpowers/plans/2026-06-11-eat-transfer-backend.md`
- `protspace_web` PR: a value-level "predicted-by-transfer" layer orthogonal to PR #272's column-level badge), optional gating/consensus/EDD elbow, neighborhood mining, HTML report, faiss-cpu accelerator, ProtTucker learned distance.

Test plan
- `uv run pytest tests/ -m "not slow"` → 545 passed
- `protlabel` boundary: no `protspace` imports
- `uv run ruff check src/ tests/` clean
- `protein_id` bundle round-trip through the CLI (`load_h5` → transfer → write): overlay values correct, projection + settings parts preserved byte-for-byte

🤖 Generated with Claude Code