feat(clam): CHAODA multi-method ensemble — clears the synthetic PROBE-CHAODA-1000G bar (AUC 0.62 -> 0.99)#220
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: edaffe25e4
    for b in (a + 1)..n_leaves {
        let nb = &self.nodes[leaves[b]];
        let d = self.dist(ca, center(leaves[b]));
Guard the quadratic leaf-overlap build
When this runs on trees built with a small min_cluster_size, n_leaves can approach the number of data points, and this inner loop computes a full distance for every leaf pair before any scoring. On production-sized fingerprint/genomics corpora this makes the new public ensemble API O(L² * vec_len) and can hang or exhaust memory where the existing anomaly_scores path was linear; please add pruning/bounds for overlap construction or explicitly guard this method to small fixtures.
Fixed in f1f99b8. Real concern — the connected-component term needs an O(L^2 * vec_len) leaf-overlap graph, and with small min_cluster_size the leaf count approaches the point count.
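For context, the cost in question comes from comparing every pair of leaves. The following is a minimal sketch of such a build, not the code in this PR: the `Leaf` struct, the Euclidean `euclid` helper, and the overlap rule (centers closer than the sum of the radii) are assumptions made for illustration. It returns, for each leaf, the total cardinality of its connected component.

```rust
/// Hypothetical leaf summary; the real tree stores its own cluster fields.
pub struct Leaf {
    pub center: Vec<f64>,
    pub radius: f64,
    pub cardinality: usize,
}

/// Euclidean distance, standing in for the tree's `dist()`.
fn euclid(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Union-find lookup with path halving.
fn find(parent: &mut [usize], mut x: usize) -> usize {
    while parent[x] != x {
        let grandparent = parent[parent[x]];
        parent[x] = grandparent;
        x = grandparent;
    }
    x
}

/// For every leaf, the summed cardinality of its connected component in the
/// leaf-overlap graph. The double loop is the O(L^2 * vec_len) part.
pub fn component_cardinalities(leaves: &[Leaf]) -> Vec<usize> {
    let n = leaves.len();
    let mut parent: Vec<usize> = (0..n).collect();
    for a in 0..n {
        for b in (a + 1)..n {
            let d = euclid(&leaves[a].center, &leaves[b].center);
            // Assumed overlap rule: the two balls intersect.
            if d <= leaves[a].radius + leaves[b].radius {
                let ra = find(&mut parent, a);
                let rb = find(&mut parent, b);
                if ra != rb {
                    parent[ra] = rb;
                }
            }
        }
    }
    // Accumulate cardinality per component root, then read it back per leaf.
    let mut total = vec![0usize; n];
    for i in 0..n {
        let root = find(&mut parent, i);
        total[root] += leaves[i].cardinality;
    }
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let root = find(&mut parent, i);
        out.push(total[root]);
    }
    out
}
```

With L leaves this performs L * (L - 1) / 2 distance evaluations, which is why the term needs a cap once L approaches the point count.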
Guarded it: the method is split so the parent-child path-minority signal (the dominant one) is always computed in O(L * depth), and the overlap-graph + component term runs only when n_leaves <= graph_budget:
- ensemble_anomaly_scores_budgeted(.., graph_budget): explicit cap; 0 forces path-only, usize::MAX always includes the graph.
- ensemble_anomaly_scores(..): wrapper using ENSEMBLE_GRAPH_BUDGET = 4096; above that it degrades to path-minority alone, so the public API never runs the quadratic build on production-sized corpora.
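The gating described above can be sketched roughly as follows. This is an illustration only: `combine_budgeted` and `combine_default` are hypothetical free functions, the per-leaf score vectors and the closure standing in for the graph build are assumptions, and the real methods live on `ClamTree` with their own signatures.

```rust
/// Default cap; the value mirrors the one quoted in this thread.
pub const ENSEMBLE_GRAPH_BUDGET: usize = 4096;

/// Sketch of the budget gate (hypothetical, not the actual ClamTree method).
/// `path_scores` holds one path-minority score per leaf and is always used.
/// `component_scores` stands in for the quadratic overlap-graph build and is
/// called only when the leaf count fits within `graph_budget`.
pub fn combine_budgeted<F>(
    path_scores: Vec<f64>,
    graph_budget: usize,
    component_scores: F,
) -> Vec<f64>
where
    F: FnOnce(usize) -> Vec<f64>,
{
    let n_leaves = path_scores.len();
    if n_leaves > graph_budget {
        // Over budget: degrade to the linear path-minority signal alone.
        return path_scores;
    }
    let comp = component_scores(n_leaves);
    assert_eq!(comp.len(), n_leaves, "one component score per leaf");
    // Within budget: average the two signals leaf by leaf.
    path_scores
        .iter()
        .zip(&comp)
        .map(|(p, c)| 0.5 * (p + c))
        .collect()
}

/// Convenience wrapper that applies the default budget.
pub fn combine_default<F>(path_scores: Vec<f64>, component_scores: F) -> Vec<f64>
where
    F: FnOnce(usize) -> Vec<f64>,
{
    combine_budgeted(path_scores, ENSEMBLE_GRAPH_BUDGET, component_scores)
}
```

Passing the expensive step as a closure keeps the fallback branch from touching the graph at all, so a budget of 0 on a non-empty tree never pays for a single pairwise distance.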
Crucially, the fallback is validated, not assumed. New measurement on the synthetic fixture (graph_budget = 0 forces path-only):
| signal | AUC |
|---|---|
| single-LFD | 0.6240 |
| path-only (the fallback) | 0.9938 |
| full ensemble | 0.9906 |
Path-minority alone clears the 0.85 bar (slightly above the combined — the component term is a marginal refinement), so degrading at scale is safe. The test now asserts path-only AUC >= 0.85 so the guard can never silently degrade large-corpus accuracy.
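For readers unfamiliar with the metric behind the 0.85 assertion, here is a minimal sketch of ROC-AUC computed by direct pairwise comparison (the Mann-Whitney formulation). The function name `roc_auc` and its signature are assumptions for illustration; the spike harness may compute the AUC differently.

```rust
/// ROC-AUC as the probability that a randomly chosen anomaly scores higher
/// than a randomly chosen inlier, counting ties as half a win.
/// Returns `None` when either class is empty.
/// This is O(n_pos * n_neg), which is fine for a small synthetic fixture.
pub fn roc_auc(scores: &[f64], is_anomaly: &[bool]) -> Option<f64> {
    assert_eq!(scores.len(), is_anomaly.len(), "one label per score");
    let mut wins = 0.0_f64;
    let mut pairs = 0_u64;
    for (i, &s_pos) in scores.iter().enumerate() {
        if !is_anomaly[i] {
            continue;
        }
        for (j, &s_neg) in scores.iter().enumerate() {
            if is_anomaly[j] {
                continue;
            }
            pairs += 1;
            if s_pos > s_neg {
                wins += 1.0;
            } else if s_pos == s_neg {
                wins += 0.5;
            }
        }
    }
    if pairs == 0 {
        None
    } else {
        Some(wins / pairs as f64)
    }
}
```

A guard in the spirit of the one described would then compare the returned value against the bar, for example by asserting it is at least 0.85 for the path-only scores.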
(Also corrected a stale doc comment in the same commit — it still described the abandoned leaf-cardinality/degree method set; rewritten to match the path-minority implementation. format/stable is now green; clippy clean; 53 clam tests pass.)
…t doc + rustfmt

Addresses the Codex P2 on PR #220 (quadratic leaf-overlap build) and a doc-comment inconsistency I introduced, and fixes the format/stable CI.

(1) Quadratic-build guard (Codex P2). The connected-component term needs an O(L^2 * vec_len) leaf-overlap graph; on production corpora with small min_cluster_size, L approaches the point count and the public API could hang. Split into:
- ensemble_anomaly_scores_budgeted(.., graph_budget): computes the linear O(L*depth) parent-child path-minority signal always, and only builds the overlap graph + component term when n_leaves <= graph_budget.
- ensemble_anomaly_scores(..): convenience wrapper using the default ENSEMBLE_GRAPH_BUDGET = 4096; above that it degrades to path-minority alone, so the public API never runs the quadratic build at scale.

(2) Path-only fallback is validated, not assumed. New measurement on the synthetic fixture (graph_budget = 0 forces the fallback):
  single-LFD 0.6240 | path-only 0.9938 | full ensemble 0.9906
Path-minority alone clears the 0.85 bar (slightly above the combined; the component term is a marginal refinement), so degrading at scale is safe. The test now asserts path-only AUC >= 0.85 so the guard can never silently degrade large-corpus accuracy.

(3) Doc-comment correction. When the scoring pivoted to path-minority + component, the method doc still described the abandoned relative-cardinality / vertex-degree set and listed parent-child ratio as "deferred" when it is in fact the dominant shipped signal. Rewritten to match the implementation.

(4) rustfmt: format/stable was red; the new code is now rustfmt-clean (changes confined to the added ensemble method + tests; no pre-existing code touched). clippy --lib clean; full hpc::clam suite green (53 tests).

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
…HAODA-1000G synthetic bar (AUC 0.62 -> 0.99)

Increment 1 of D-GEN-CHAODA-ENSEMBLE (lance-graph genetics-probes-v1.md). Adds ClamTree::ensemble_anomaly_scores as a NEW scoring entry point alongside the unchanged single-method anomaly_scores baseline.

The spike (#219) measured single-method leaf-LFD at ROC-AUC 0.624 on a synthetic 5-lane Gaussian mixture, below the 0.85 bar. Mechanical cause: leaf LFD measures intra-leaf geometry, not inter-leaf isolation.

This ensemble combines isolation-sensitive CHAODA signals:
- parent-child path-minority ratio (dominant): walking a leaf to the root, the minimum child/parent cardinality ratio is tiny for a point that split off as a minority (isolated outlier) and moderate for a point that always stayed in the majority (dense-cluster member). Immune to the leaf-fragmentation that defeats raw leaf cardinality.
- connected-component cardinality over the leaf-overlap graph (small components are anomalous).
Averaged into one score; every point inherits its leaf's score.

A first attempt using raw leaf cardinality + vertex degree + component size scored AUC 0.621 (no lift) because the tree fragments dense blobs into many tiny leaves that mimic isolated outliers under those metrics; the path-minority signal is what actually separates. Leaf degree and raw leaf cardinality were dropped as fragmentation noise. The remaining CHAODA methods (random-walk stationary distribution) are deferred.

MEASURED (deterministic synthetic mixture, same fixture as #219):
  single-LFD AUC = 0.6240
  ensemble AUC = 0.9906 (lift +0.3667, clears the 0.85 bar)

This is the synthetic SMOKE TEST only. It proves the ensemble approach captures isolation where single-LFD does not; it does NOT prove genomic novelty detection. PROBE-CHAODA-1000G on real corpora remains gated on D-GEN-1 + D-GEN-2 (VCF -> feature-vector pipeline).

Tests: full hpc::clam suite green (53 incl. the new ensemble test); ensemble is deterministic (bit-exact rebuild) and built purely from shipped tree fields + the public dist().

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
f1f99b8 to a630d77
Summary
Increment 1 of D-GEN-CHAODA-ENSEMBLE (lance-graph genetics-probes-v1.md, merged in lance-graph #505). Adds ClamTree::ensemble_anomaly_scores as a new scoring entry point alongside the unchanged single-method anomaly_scores baseline.
The spike (#219) measured single-method leaf-LFD at ROC-AUC 0.624 on a synthetic 5-lane Gaussian mixture, below the 0.85 bar, because leaf LFD measures intra-leaf geometry, not inter-leaf isolation. This ensemble closes that gap by scoring isolation directly.
Result (deterministic, same fixture as #219)
| signal | AUC |
|---|---|
| single-LFD | 0.6240 |
| ensemble | 0.9906 |
The ensemble clears the 0.85 PROBE-CHAODA-1000G bar on the synthetic smoke test.
Method
Isolation-sensitive CHAODA signals (Ishaq et al. 2021), averaged:
- Parent-child path-minority ratio (dominant): walking a leaf to the root, the minimum child_cardinality / parent_cardinality ratio is tiny for a point that split off as a minority (an isolated outlier) and moderate for one that always stayed in the majority (a dense-cluster member). Immune to the leaf-fragmentation that defeats raw leaf cardinality.
- Connected-component cardinality over the leaf-overlap graph (small components are anomalous).
A first attempt using raw leaf cardinality + vertex degree + component size scored AUC 0.621 (no lift): the tree fragments dense blobs into many tiny leaves that mimic isolated outliers under those metrics. The path-minority signal is what actually separates; leaf degree + raw leaf cardinality were dropped as fragmentation noise. Random-walk stationary distribution is deferred to a later increment.
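The path-minority walk can be illustrated with a small sketch. It is not the shipped implementation: the `Node` struct with an explicit `parent` index is a hypothetical layout, and turning the ratio into a score as `1.0 - ratio` is an assumption chosen only so that higher means more anomalous.

```rust
/// Hypothetical tree node; the real tree has its own representation.
pub struct Node {
    pub parent: Option<usize>,
    pub cardinality: usize,
}

/// Walk from `leaf` to the root and return the smallest
/// child_cardinality / parent_cardinality ratio seen on the way.
/// Costs O(depth) per leaf. Assumes every parent has non-zero cardinality.
pub fn min_path_ratio(nodes: &[Node], leaf: usize) -> f64 {
    let mut min_ratio = 1.0_f64;
    let mut current = leaf;
    while let Some(parent) = nodes[current].parent {
        let ratio = nodes[current].cardinality as f64 / nodes[parent].cardinality as f64;
        min_ratio = min_ratio.min(ratio);
        current = parent;
    }
    min_ratio
}

/// Assumed mapping from ratio to score: a tiny minimum ratio (the point
/// split off as a minority somewhere on its path) gives a score near 1.
pub fn path_minority_score(nodes: &[Node], leaf: usize) -> f64 {
    1.0 - min_path_ratio(nodes, leaf)
}
```

For example, in a tree whose root of 100 points splits 98 / 2 and whose 98-point child splits again 50 / 48, the 2-point leaf has a minimum ratio of 0.02, while both halves of the dense cluster stay near 0.5 however finely they are split further.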
Honest scope
This is the synthetic smoke test only. It proves the ensemble approach captures isolation where single-LFD does not; it does not prove genomic novelty detection.
PROBE-CHAODA-1000G on real corpora remains gated on D-GEN-1 + D-GEN-2 (VCF -> feature-vector pipeline).
Test plan
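- cargo test --lib hpc::clam::tests: 53 passed (51 pre-existing + spike + ensemble), no warnings.
- Ensemble is deterministic (bit-exact rebuild, compared via f64::to_bits).
- Built purely from shipped tree fields (nodes, reordered, Cluster fields) + public dist(); no new tree state.
- anomaly_scores baseline unchanged.
Stacking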
Based on claude/chaoda-outlier-spike-v1 (#219, the spike harness this reuses). GitHub will retarget to master when #219 merges.
https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
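The bit-exact determinism check mentioned in the test plan can be sketched as below. The helper name `bit_identical` is hypothetical; only `f64::to_bits` comes from the standard library. The idea is to build the scores twice and compare raw bit patterns instead of using `==`.

```rust
/// True when both score vectors have the same length and every pair of
/// entries has an identical IEEE-754 bit pattern.
pub fn bit_identical(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.to_bits() == y.to_bits())
}
```

Comparing bits is stricter than `==` in one direction and looser in the other: it distinguishes 0.0 from -0.0, and it treats two NaNs with the same payload as equal, so a rebuild that reproduces a NaN still counts as deterministic.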