
feat: protlabel EAT engine + protspace transfer subcommand#55

Open
tsenoner wants to merge 22 commits into main from feat/eat-transfer-backend

Conversation

@tsenoner

Owner

Summary

Backend for Embedding Annotation Transfer (EAT) — the engine from #54, packaged so the conference users' proximity-mining workflow becomes a thin layer on top rather than a parallel reimplementation.

  • New protlabel package (numpy/scipy/h5py only, strict no-protspace-imports boundary): kNN in true pLM embedding space + goPredSim reliability index (RI = 0.5/(0.5+d), Eq. 5) + a persistable .npz lookup sidecar. Ships as a second top-level package in this repo (built into the wheel); a future standalone PyPI split is mechanical.
  • New protspace transfer subcommand: classifies query vs reference proteins (ID-prefix / col~substr, no hardcoded biology), transfers each query's missing annotation value from its nearest annotated reference, and writes a per-cell overlay into the bundle.
  • Overlay format: appends <col>__pred_value (string), <col>__pred_confidence (float32, RI in [0,1]), <col>__pred_source (string) — the curated <col> is left untouched, and the bundle keeps its protein_id id column, so existing web readers stay compatible.
  • Defaults: Euclidean distance (cosine is opt-in via --metric), k=1. Distances are computed in the original embedding space (HDF5), not in the 2-D/3-D projection, because dimensionality reduction (DR) is non-isometric.
  • Storage: the reference matrix is a rebuildable sidecar, never shipped in the bundle (sizing/feasibility in the spec); brute-force kNN is laptop-feasible to full Swiss-Prot, with adaptive per-chunk memory bounding.
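
For reviewers who want the core idea in a few lines: the sketch below shows k=1 Euclidean transfer with the reliability index RI = 0.5/(0.5+d) quoted above. It is a minimal standard-library illustration, not the protlabel implementation; the names `reliability_index` and `transfer_nearest` and the toy 3-D embeddings are assumptions made for this example.

```python
import math


def reliability_index(distance: float) -> float:
    """RI = 0.5 / (0.5 + d): 1.0 at d = 0, decreasing towards 0 as d grows."""
    return 0.5 / (0.5 + distance)


def transfer_nearest(query, references):
    """Return (label, ri, ref_id) taken from the nearest annotated reference.

    `references` maps a reference id to an (embedding, label) pair.
    This is a hypothetical helper for illustration only.
    """
    best_id, best_dist = None, math.inf
    # Iterating over sorted ids makes the winner deterministic on exact ties.
    for ref_id in sorted(references):
        embedding, _ = references[ref_id]
        d = math.dist(query, embedding)  # Euclidean distance
        if d < best_dist:
            best_id, best_dist = ref_id, d
    label = references[best_id][1]
    return label, reliability_index(best_dist), best_id


# Toy data: two annotated references and one unannotated query.
refs = {
    "ref_a": ([0.0, 0.0, 0.0], "kinase"),
    "ref_b": ([3.0, 4.0, 0.0], "protease"),
}
label, ri, source = transfer_nearest([0.0, 0.5, 0.0], refs)
print(label, ri, source)
```

The query sits at distance 0.5 from `ref_a`, so it inherits that reference's label with RI = 0.5/(0.5+0.5) = 0.5. Real embeddings have far more than three dimensions, which is why the PR computes distances in the original space and not in the projection.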

Design & scope

  • Spec: docs/superpowers/specs/2026-06-11-eat-annotation-transfer-design.md
  • Plan: docs/superpowers/plans/2026-06-11-eat-transfer-backend.md
  • Out of scope (follow-ups): the web frontend rendering (separate protspace_web PR — a value-level "predicted-by-transfer" layer orthogonal to PR #272's column-level badge), optional gating/consensus/EDD elbow, neighborhood mining, HTML report, faiss-cpu accelerator, ProtTucker learned distance.
  • Implements the backend scope of [FEATURE] EAT — Embedding Annotation Transfer (protlabel lookup table) #54.

Test plan

  • uv run pytest tests/ -m "not slow": 545 passed
  • protlabel boundary: no protspace imports
  • uv run ruff check src/ tests/: clean
  • End-to-end: real protein_id bundle round-trip through the CLI (load_h5 → transfer → write) — overlay values correct, projection + settings parts preserved byte-for-byte
  • Reviewer: sanity-check on a real ProtT5 dataset (RI is ProtT5-calibrated; monotone-but-uncalibrated for other embedders)

🤖 Generated with Claude Code

tsenoner and others added 19 commits June 11, 2026 19:17
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `add_overlay_columns()` in `src/protspace/data/io/predictions.py`
that appends three aligned Arrow columns (`COL__pred_value`,
`COL__pred_confidence`, `COL__pred_source`) from a list of
`protlabel.Prediction` objects, leaving the curated column untouched.
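
The column layout described in this commit message can be sketched without Arrow. The example below uses a plain dict of lists as a stand-in for the table and a hypothetical `Prediction` dataclass; the real `protlabel.Prediction` fields and the real `add_overlay_columns()` signature are not shown in this PR page, so everything here is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """Hypothetical stand-in for protlabel.Prediction (fields are assumed)."""
    value: str
    confidence: float
    source: str


def add_overlay_columns_sketch(table: dict, col: str,
                               predictions: list[Optional[Prediction]]) -> dict:
    """Return a new table with three overlay columns appended for `col`.

    `predictions` must be row-aligned with the table; None means the row
    received no prediction. The curated `col` column is not modified.
    """
    if len(predictions) != len(table[col]):
        raise ValueError("predictions must align with the table rows")
    out = dict(table)  # new dict; existing column lists are reused unchanged
    out[f"{col}__pred_value"] = [p.value if p else None for p in predictions]
    out[f"{col}__pred_confidence"] = [p.confidence if p else None for p in predictions]
    out[f"{col}__pred_source"] = [p.source if p else None for p in predictions]
    return out


table = {"protein_id": ["P1", "P2"], "family": ["kinase", None]}
preds = [None, Prediction(value="kinase", confidence=0.8, source="P1")]
overlay = add_overlay_columns_sketch(table, "family", preds)
print(sorted(overlay))
```

Keeping predictions in separate columns means a reader that only knows `family` sees exactly the curated values it saw before.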

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements Task 9: the EAT orchestration core (run_transfer) and the
'protspace transfer' Typer CLI command, wiring classification, nearest-
neighbour lookup (protlabel.eat), and overlay-column writing into a single
pipeline for filling missing annotation values from pLM embedding space.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rrors

- Normalize protein_id→identifier before run_transfer and rename back after
  so real bundles (produced by protspace prepare) no longer KeyError.
- Add ValueError when no bundle proteins match any embedding key.
- Correct misleading comment in test_run_transfer_predicts_for_query_with_missing_value.
- Add end-to-end regression test exercising the protein_id rename path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tsenoner and others added 2 commits June 16, 2026 17:20
…ty, robustness

Resolve issues found in code review of the EAT transfer backend (PR #55):

- predictions: make the overlay idempotent — drop existing <col>__pred_* columns
  before re-appending, so re-running transfer replaces them instead of producing
  a duplicate-column bundle that can no longer be read back
- bundle: atomic writes (temp file + os.replace) in write_bundle and the
  replace_* helpers, so an interrupted in-place overwrite (-b X -o X) can no
  longer destroy the bundle; reject the reserved delimiter in serialized parts
- backends: replace scipy.cdist with a pure-numpy BLAS GEMM path and recompute
  the surviving top-k distances in float64 (precise for near-identical vectors);
  guard cosine against zero-norm NaN
- lookup: store float32 + unicode arrays, load with allow_pickle=False
  (no pickle/RCE surface; lossless round-trip)
- transfer/classification: materialize only the needed columns (no full
  to_pylist); deterministic RI tie-break; translate input errors to BadParameter
- cli: colon/Windows-safe -e/-i parsing via a shared split_h5_spec helper
- docs/notebook: qualify the reliability-index formula per metric and k

Adds tests for protlabel engine, overlay idempotency, atomic write, spec
parsing, and CLI error handling. Full suite: 572 passed; ruff clean.
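
The "temp file + os.replace" pattern named in the bundle bullet can be sketched as follows. This is a generic standard-library illustration of the technique, not the code in `write_bundle`; the helper name `atomic_write_bytes` is hypothetical.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write `data` to `path` so readers see either the old or the new file.

    The temp file is created in the target directory because os.replace is
    only atomic when source and destination are on the same filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # On any failure, remove the partial temp file; the original is intact.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


# Demo in a throwaway directory: overwrite a file in place.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "bundle.bin")
    atomic_write_bytes(target, b"first version")
    atomic_write_bytes(target, b"second version")
    with open(target, "rb") as fh:
        final_bytes = fh.read()
    leftover = os.listdir(workdir)
print(final_bytes, leftover)
```

If the process dies before `os.replace`, the destination still holds its previous contents, which is the property the in-place overwrite case (-b X -o X) needs.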

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…onfidence

The per-cell prediction overlay now writes only <col>__pred_value and
<col>__pred_confidence. The reference id (source) is noise as a colour feature,
so it is dropped from the bundle; it remains available on protlabel's Prediction.
A legacy <col>__pred_source is dropped on re-run so older bundles are cleaned up.

Keeping confidence as a separate numeric column lets the web frontend colour and
threshold by reliability (gradient legend) — which inline label|score values do
not enable (those render tooltip-only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@tsenoner tsenoner marked this pull request as draft June 16, 2026 17:37
@tsenoner tsenoner requested review from peymanvahidi and t03i June 17, 2026 13:52
@tsenoner tsenoner marked this pull request as ready for review June 17, 2026 18:26