feat(clam): CHAODA multi-method ensemble — clears the synthetic PROBE-CHAODA-1000G bar (AUC 0.62 -> 0.99)#220

Merged: AdaWorldAPI merged 2 commits into claude/chaoda-outlier-spike-v1 from claude/chaoda-ensemble-v1 on Jun 16, 2026

@AdaWorldAPI AdaWorldAPI commented Jun 16, 2026

Owner

Summary

Increment 1 of D-GEN-CHAODA-ENSEMBLE (lance-graph genetics-probes-v1.md, merged in lance-graph #505). Adds ClamTree::ensemble_anomaly_scores as a new scoring entry point alongside the unchanged single-method anomaly_scores baseline.

The spike (#219) measured single-method leaf-LFD at ROC-AUC 0.624 on a synthetic 5-lane Gaussian mixture — below the 0.85 bar — because leaf LFD measures intra-leaf geometry, not inter-leaf isolation. This ensemble fixes that.

Result (deterministic, same fixture as #219)

| signal | ROC-AUC |
| --- | --- |
| single-method leaf-LFD (baseline) | 0.6240 |
| multi-method ensemble | 0.9906 |
| lift | +0.3667 |

The ensemble clears the 0.85 PROBE-CHAODA-1000G bar on the synthetic smoke test.
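
For readers unfamiliar with the metric, ROC-AUC is the probability that a randomly chosen outlier scores higher than a randomly chosen inlier, with ties counting half. The sketch below computes it by direct pairwise comparison. It is an illustration only: the function name `roc_auc` and its signature are assumptions, not the harness code from #219.

```rust
/// ROC-AUC by pairwise comparison (the Mann-Whitney U statistic, normalised).
/// `scores[i]` is an anomaly score and `is_outlier[i]` its ground-truth label.
/// Returns `None` when either class is empty, because AUC is undefined then.
/// Hypothetical helper for illustration; not the project's test harness.
fn roc_auc(scores: &[f64], is_outlier: &[bool]) -> Option<f64> {
    assert_eq!(scores.len(), is_outlier.len());

    // Split the scores by label.
    let mut pos = Vec::new();
    let mut neg = Vec::new();
    for (&s, &o) in scores.iter().zip(is_outlier.iter()) {
        if o {
            pos.push(s);
        } else {
            neg.push(s);
        }
    }
    if pos.is_empty() || neg.is_empty() {
        return None;
    }

    // Count outlier-over-inlier wins; a tie counts as half a win.
    let mut wins = 0.0_f64;
    for &p in &pos {
        for &n in &neg {
            if p > n {
                wins += 1.0;
            } else if p == n {
                wins += 0.5;
            }
        }
    }
    Some(wins / (pos.len() * neg.len()) as f64)
}
```

The pairwise loop is quadratic in the number of points, which is fine for a small fixture; a sort-based rank sum gives the same value for larger inputs.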

Method

Isolation-sensitive CHAODA signals (Ishaq et al. 2021), averaged:

  • parent-child path-minority ratio (dominant) — walking a leaf up to the root, the minimum child_cardinality / parent_cardinality ratio is tiny for a point that split off as a minority (an isolated outlier) and moderate for one that always stayed in the majority (a dense-cluster member). Immune to the leaf-fragmentation that defeats raw leaf cardinality.
  • connected-component cardinality over the leaf-overlap graph — small components are anomalous.
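
The path-minority signal described above can be sketched on a toy tree. Everything here is an assumption made for illustration: the `Node` struct with an explicit `parent` index, the function names, and the `1 - ratio` mapping to a score are not taken from `ClamTree`, which the PR says is scored from its existing fields.

```rust
/// Hypothetical minimal node holding only what this sketch needs.
/// The real tree may not store parent links in this form.
struct Node {
    parent: Option<usize>,
    cardinality: usize,
}

/// Minimum child/parent cardinality ratio on the path from `leaf` to the root.
/// Returns 1.0 for the root itself, which has no parent to compare against.
fn path_minority_ratio(nodes: &[Node], leaf: usize) -> f64 {
    let mut min_ratio = 1.0_f64;
    let mut cur = leaf;
    while let Some(p) = nodes[cur].parent {
        let parent_card = nodes[p].cardinality;
        if parent_card > 0 {
            let ratio = nodes[cur].cardinality as f64 / parent_card as f64;
            min_ratio = min_ratio.min(ratio);
        }
        cur = p;
    }
    min_ratio
}

/// Assumed mapping to an anomaly score: a smaller ratio gives a higher score.
fn path_minority_score(nodes: &[Node], leaf: usize) -> f64 {
    1.0 - path_minority_ratio(nodes, leaf)
}
```

With a root of 100 points that splits 98/2, the 2-point child has ratio 0.02 however the 98-point side is later subdivided, while a leaf on the majority side keeps a ratio near 0.5. That is why fragmentation of dense blobs does not inflate this signal.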

A first attempt using raw leaf cardinality + vertex degree + component size scored AUC 0.621 (no lift) — the tree fragments dense blobs into many tiny leaves that mimic isolated outliers under those metrics. The path-minority signal is what actually separates; leaf degree + raw leaf cardinality were dropped as fragmentation noise. Random-walk stationary distribution is deferred to a later increment.

Honest scope

This is the synthetic smoke test only. It proves the ensemble approach captures isolation where single-LFD does not; it does not prove genomic novelty detection. PROBE-CHAODA-1000G on real corpora remains gated on D-GEN-1 + D-GEN-2 (VCF -> feature-vector pipeline).

Test plan

  • cargo test --lib hpc::clam::tests — 53 passed (51 pre-existing + spike + ensemble), no warnings.
  • Determinism: ensemble rebuild + rescore bit-exact (f64::to_bits).
  • Built purely from shipped tree fields (nodes, reordered, Cluster fields) + public dist(); no new tree state.
  • anomaly_scores baseline unchanged.
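
The bit-exact determinism check mentioned above can be expressed as a comparison of `f64::to_bits` values. This is a minimal sketch; the helper name `bit_exact` is an assumption, not the name used in the test suite.

```rust
/// True when both score vectors have the same length and every pair of
/// elements has an identical IEEE-754 bit pattern.
/// Hypothetical helper for illustration.
fn bit_exact(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```

Comparing bit patterns is stricter than `==` on floats: it distinguishes `0.0` from `-0.0`, and it treats two NaNs with the same payload as equal, so a rebuild that changes summation order or sign of zero is caught.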

Stacking

Based on claude/chaoda-outlier-spike-v1 (#219, the spike harness this reuses). GitHub will retarget to master when #219 merges.

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v

coderabbitai Bot commented Jun 16, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: aad35e0e-2acc-4c41-9355-4856c3628162

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edaffe25e4

Comment thread src/hpc/clam.rs Outdated
Comment on lines +1629 to +1631
    for b in (a + 1)..n_leaves {
        let nb = &self.nodes[leaves[b]];
        let d = self.dist(ca, center(leaves[b]));

P2 Badge Guard the quadratic leaf-overlap build

When this runs on trees built with a small min_cluster_size, n_leaves can approach the number of data points, and this inner loop computes a full distance for every leaf pair before any scoring. On production-sized fingerprint/genomics corpora this makes the new public ensemble API O(L² * vec_len) and can hang or exhaust memory where the existing anomaly_scores path was linear; please add pruning/bounds for overlap construction or explicitly guard this method to small fixtures.
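
To make the cost concrete, here is a minimal sketch of an all-pairs leaf-overlap build followed by connected-component sizing with union-find. Several things are assumptions rather than facts from the PR: the `Leaf` struct, the Euclidean metric, the overlap rule (two leaves overlap when their center distance is at most the sum of their radii), and counting component size in leaves rather than points.

```rust
/// Hypothetical leaf summary: a center vector and a covering radius.
struct Leaf {
    center: Vec<f64>,
    radius: f64,
}

/// Euclidean distance, standing in for the tree's own metric. Costs O(vec_len).
fn euclid(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f64>().sqrt()
}

/// Union-find root lookup with path halving.
fn find(parent: &mut [usize], mut i: usize) -> usize {
    while parent[i] != i {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    i
}

/// For each leaf, the number of leaves in its connected component of the
/// overlap graph. The double loop evaluates one distance per leaf pair,
/// which is the O(L^2 * vec_len) cost flagged in the review.
fn component_sizes(leaves: &[Leaf]) -> Vec<usize> {
    let n = leaves.len();
    let mut parent: Vec<usize> = (0..n).collect();
    for a in 0..n {
        for b in (a + 1)..n {
            let d = euclid(&leaves[a].center, &leaves[b].center);
            // Assumed overlap rule: the two covering balls intersect.
            if d <= leaves[a].radius + leaves[b].radius {
                let ra = find(&mut parent, a);
                let rb = find(&mut parent, b);
                if ra != rb {
                    parent[ra] = rb;
                }
            }
        }
    }
    // Tally members per root, then report each leaf's component size.
    let mut size = vec![0usize; n];
    for i in 0..n {
        let r = find(&mut parent, i);
        size[r] += 1;
    }
    (0..n).map(|i| size[find(&mut parent, i)]).collect()
}
```

The union-find part is nearly linear; the expense is entirely in the pairwise distance loop, so any pruning has to avoid visiting most pairs rather than speed up the merge step.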

Owner Author

Fixed in f1f99b8. Real concern — the connected-component term needs an O(L^2 * vec_len) leaf-overlap graph, and with small min_cluster_size the leaf count approaches the point count.

Guarded it: the method is split so the parent-child path-minority signal (the dominant one) is always computed in O(L * depth), and the overlap-graph + component term runs only when n_leaves <= graph_budget:

  • ensemble_anomaly_scores_budgeted(.., graph_budget) — explicit cap; 0 forces path-only, usize::MAX always includes the graph.
  • ensemble_anomaly_scores(..) — wrapper using ENSEMBLE_GRAPH_BUDGET = 4096; above that it degrades to path-minority alone, so the public API never runs the quadratic build on production-sized corpora.
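
The shape of that guard can be sketched as follows. This is not the merged implementation: the function names carry a `_sketch` suffix to mark them as hypothetical, the per-leaf inputs are passed in as plain slices, and the equal-weight average is an assumption based on the PR describing the signals as "averaged". Only the budget value 4096 and the budget semantics come from the comment above.

```rust
/// Default cap on the leaf count for which the quadratic graph term is built.
/// The value is the one quoted in the comment above.
const ENSEMBLE_GRAPH_BUDGET: usize = 4096;

/// `path` holds the per-leaf path-minority scores, which are always computed.
/// `component` lazily builds the per-leaf component scores; it is invoked
/// only when the leaf count fits within `graph_budget`.
/// Hypothetical signature for illustration.
fn ensemble_budgeted_sketch<F>(path: &[f64], graph_budget: usize, component: F) -> Vec<f64>
where
    F: FnOnce() -> Vec<f64>,
{
    // Over budget: skip the quadratic build and fall back to path-only.
    if path.len() > graph_budget {
        return path.to_vec();
    }
    let comp = component();
    assert_eq!(comp.len(), path.len());
    // Assumed combination rule: an unweighted mean of the two signals.
    path.iter().zip(&comp).map(|(p, c)| 0.5 * (p + c)).collect()
}

/// Convenience wrapper that applies the default budget.
fn ensemble_sketch<F>(path: &[f64], component: F) -> Vec<f64>
where
    F: FnOnce() -> Vec<f64>,
{
    ensemble_budgeted_sketch(path, ENSEMBLE_GRAPH_BUDGET, component)
}
```

Passing the graph term as a closure keeps the expensive work out of the over-budget path entirely: with `graph_budget = 0` and any non-empty input the closure is never called, and with `usize::MAX` it always is.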

Crucially, the fallback is validated, not assumed. New measurement on the synthetic fixture (graph_budget = 0 forces path-only):

| signal | AUC |
| --- | --- |
| single-LFD | 0.6240 |
| path-only (the fallback) | 0.9938 |
| full ensemble | 0.9906 |

Path-minority alone clears the 0.85 bar (slightly above the combined score; the component term is a marginal refinement), so on this fixture degrading to path-only costs no accuracy. The test now asserts path-only AUC >= 0.85, so a regression in the fallback fails the suite instead of silently degrading scores above the budget.

(Also corrected a stale doc comment in the same commit — it still described the abandoned leaf-cardinality/degree method set; rewritten to match the path-minority implementation. format/stable is now green; clippy clean; 53 clam tests pass.)

AdaWorldAPI pushed a commit that referenced this pull request Jun 16, 2026
…t doc + rustfmt

Addresses the Codex P2 on PR #220 (quadratic leaf-overlap build) and a
doc-comment inconsistency I introduced, and fixes the format/stable CI.

(1) Quadratic-build guard (Codex P2). The connected-component term needs an
O(L^2 * vec_len) leaf-overlap graph; on production corpora with small
min_cluster_size, L approaches the point count and the public API could
hang. Split into:
  - ensemble_anomaly_scores_budgeted(.., graph_budget): computes the linear
    O(L*depth) parent-child path-minority signal always, and only builds the
    overlap graph + component term when n_leaves <= graph_budget.
  - ensemble_anomaly_scores(..): convenience wrapper using the default
    ENSEMBLE_GRAPH_BUDGET = 4096; above that it degrades to path-minority
    alone, so the public API never runs the quadratic build at scale.

(2) Path-only fallback is validated, not assumed. New measurement on the
synthetic fixture (graph_budget = 0 forces the fallback):
    single-LFD 0.6240 | path-only 0.9938 | full ensemble 0.9906
Path-minority alone clears the 0.85 bar (slightly above the combined — the
component term is a marginal refinement), so degrading at scale is safe. The
test now asserts path-only AUC >= 0.85 so the guard can never silently
degrade large-corpus accuracy.

(3) Doc-comment correction. When the scoring pivoted to path-minority +
component, the method doc still described the abandoned relative-cardinality
/ vertex-degree set and listed parent-child ratio as "deferred" when it is in
fact the dominant shipped signal. Rewritten to match the implementation.

(4) rustfmt: format/stable was red; the new code is now rustfmt-clean
(changes confined to the added ensemble method + tests; no pre-existing code
touched). clippy --lib clean; full hpc::clam suite green (53 tests).

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
claude added 2 commits June 16, 2026 09:26
…HAODA-1000G synthetic bar (AUC 0.62 -> 0.99)

Increment 1 of D-GEN-CHAODA-ENSEMBLE (lance-graph genetics-probes-v1.md).
Adds ClamTree::ensemble_anomaly_scores as a NEW scoring entry point
alongside the unchanged single-method anomaly_scores baseline.

The spike (#219) measured single-method leaf-LFD at ROC-AUC 0.624 on a
synthetic 5-lane Gaussian mixture, below the 0.85 bar. Mechanical cause:
leaf LFD measures intra-leaf geometry, not inter-leaf isolation.

This ensemble combines isolation-sensitive CHAODA signals:
  - parent-child path-minority ratio (dominant): walking a leaf to the
    root, the minimum child/parent cardinality ratio is tiny for a point
    that split off as a minority (isolated outlier) and moderate for a
    point that always stayed in the majority (dense-cluster member).
    Immune to the leaf-fragmentation that defeats raw leaf cardinality.
  - connected-component cardinality over the leaf-overlap graph (small
    components are anomalous).
Averaged into one score; every point inherits its leaf's score.

A first attempt using raw leaf cardinality + vertex degree + component
size scored AUC 0.621 (no lift) because the tree fragments dense blobs
into many tiny leaves that mimic isolated outliers under those metrics;
the path-minority signal is what actually separates. Leaf degree and raw
leaf cardinality were dropped as fragmentation noise. The remaining
CHAODA methods (random-walk stationary distribution) are deferred.

MEASURED (deterministic synthetic mixture, same fixture as #219):
  single-LFD AUC = 0.6240
  ensemble  AUC = 0.9906   (lift +0.3667, clears the 0.85 bar)

This is the synthetic SMOKE TEST only. It proves the ensemble approach
captures isolation where single-LFD does not; it does NOT prove genomic
novelty detection. PROBE-CHAODA-1000G on real corpora remains gated on
D-GEN-1 + D-GEN-2 (VCF -> feature-vector pipeline).

Tests: full hpc::clam suite green (53 incl. the new ensemble test);
ensemble is deterministic (bit-exact rebuild) and built purely from
shipped tree fields + the public dist().

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
…t doc + rustfmt
@AdaWorldAPI AdaWorldAPI force-pushed the claude/chaoda-ensemble-v1 branch from f1f99b8 to a630d77 Compare June 16, 2026 09:26
@AdaWorldAPI AdaWorldAPI merged commit 2ef18ed into claude/chaoda-outlier-spike-v1 Jun 16, 2026
18 checks passed