Skip to content

fix(node): SET with $vector_param now populates HNSW (closes silent-data-loss surfaced by KMSmcp ch#202)#410

Merged
ryaker merged 3 commits into
mainfrom
fix/spa-set-vector-param-hnsw
May 7, 2026
Merged

fix(node): SET with $vector_param now populates HNSW (closes silent-data-loss surfaced by KMSmcp ch#202)#410
ryaker merged 3 commits into
mainfrom
fix/spa-set-vector-param-hnsw

Conversation

@ryaker
Copy link
Copy Markdown
Owner

@ryaker ryaker commented May 7, 2026

Summary

  • Root cause: execute_match_mutate_with_params (the MATCH … SET n.embedding = $emb handler) stored the property via tx.set_property but never called idx.insert on the HNSW vector index. The MERGE write-path had the correct logic at db.rs:1396–1420; the MATCH+SET path was simply missing it. Result: HNSW file size delta = 0 bytes, vectorSearch returns nothing, no error thrown — silent data loss.
  • Fix (Option A): after tx.commit() in execute_match_mutate_with_params, iterate SET mutations; for any SET { value: Param(p) } where p resolves to a vector and (label, prop) has a registered HNSW index, call idx.insert + idx.save for all matched node IDs. Symmetric with the existing MERGE write-path. ~50 LOC, no structural refactoring needed.
  • New API: addToVectorIndex(label, property, nodeId, vector) — explicit HNSW backfill for nodes created before the index existed.

Reproduction

db.createVectorIndex('Memory', 'embedding', 4, 'cosine')
db.execute("CREATE (n:Memory {id: 'k1'})")
db.executeWithParams('MATCH (n:Memory {id: $id}) SET n.embedding = $emb', { id: 'k1', emb: [0.1, 0.2, 0.3, 0.4] })

// Before fix: returns []
// After fix:  returns [{ id: '...', score: 1.0 }]
db.vectorSearch('Memory', 'embedding', new Float32Array([0.1, 0.2, 0.3, 0.4]), 5)

Before/After

Scenario Before After
MATCH … SET n.embedding = $emb Property stored, HNSW unchanged (0-byte delta) Property stored AND HNSW updated
vectorSearch after SET Returns [] Returns the inserted node
addToVectorIndex(label, prop, id, vec) Not available Inserts directly into HNSW; throws RangeError on missing index/node

Test plan

  • New Rust integration tests in crates/sparrowdb/tests/vector_index.rs: set_vector_param_populates_hnsw (in-memory HNSW roundtrip) and set_vector_param_hnsw_roundtrip_survives_reopen (persistence after DB restart) — both pass
  • npm/sparrowdb/test/integration.test.js: existing SET tests strengthened to assert vectorSearch returns the node; 3 new addToVectorIndex tests (happy path, missing node, missing index) — all 34 JS tests pass
  • cargo fmt --all and cargo clippy -p sparrowdb -p sparrowdb-node -- -D warnings clean
  • Pre-existing failures (create_vector_index_ddl, spa191_*) confirmed not introduced by this PR
  • index.d.ts regenerated via napi build, SparrowDb → SparrowDB rename applied, HNSW docstring updated

Surfaced by: KMSmcp inter-session channel message #202.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added addToVectorIndex method to manually backfill vectors into an HNSW index.
  • Bug Fixes

    • Vector indexes now properly update when setting vector properties via parameterized queries.
  • Documentation

    • Updated API documentation to clarify HNSW index update behavior and new backfill method usage.

…data-loss surfaced by KMSmcp ch#202)

Root cause: `execute_match_mutate_with_params` (the handler for
`MATCH (n:L {id: \$id}) SET n.embedding = \$emb`) called `tx.set_property`
for SET mutations but never wrote to the HNSW vector index.  The MERGE
write-path at db.rs:1396-1420 had the correct `idx.insert` + `idx.save`
logic, but it was only reached from `execute_merge_with_params`, not from
`execute_match_mutate_with_params`.  The result: the property was stored on
disk but the HNSW file size delta was 0 bytes — the node was invisible to
`vectorSearch` after a `SET \$param`.  No error was thrown.

Fix (Option A): after `tx.commit()` in `execute_match_mutate_with_params`,
iterate the mutations; for each `SET { value: Param(p) }` where `p` resolves
to a vector value and `(label, prop)` has a registered HNSW index, call
`idx.insert(node_id, &vec)` + `idx.save(...)` for all matched node IDs.  The
label is extracted from the first node pattern in the MATCH clause.  This is
symmetric with the MERGE write-path and requires no structural refactoring.

New API: `addToVectorIndex(label, property, nodeId, vector: Float32Array)` on
`SparrowDB` (Node binding) — explicit backfill API requested by KMSmcp.
Resolves the user-facing string `id` to an internal u64 via a parameterized
MATCH query, then writes into HNSW directly.  Useful for populating the index
on nodes that were created before the index existed, without DETACH DELETE +
recreate.

Tests added:
- `crates/sparrowdb/tests/vector_index.rs`: `set_vector_param_populates_hnsw`
  (in-memory roundtrip) and `set_vector_param_hnsw_roundtrip_survives_reopen`
  (persistence across DB close/open).
- `npm/sparrowdb/test/integration.test.js`: strengthened existing SET tests to
  assert `vectorSearch` returns the node (not just that the property is non-null);
  added `addToVectorIndex` happy-path and error-path tests.
- `npm/sparrowdb/index.d.ts`: regenerated via napi build; `executeWithParams`
  docstring updated to document the HNSW behaviour; `addToVectorIndex`
  declaration added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codeant-ai
Copy link
Copy Markdown

codeant-ai Bot commented May 7, 2026

User does not have a PR Review subscription.

Go to Team management and add this email to the PR Review subscription.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Warning

Rate limit exceeded

@ryaker has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 57 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2eb4f4d0-0f31-4981-86e8-54baed5c6811

📥 Commits

Reviewing files that changed from the base of the PR and between fffea36 and 9e946e8.

📒 Files selected for processing (5)
  • crates/sparrowdb-execution/src/engine/mutation.rs
  • crates/sparrowdb-node/src/lib.rs
  • crates/sparrowdb/src/db.rs
  • crates/sparrowdb/tests/vector_index.rs
  • npm/sparrowdb/test/integration.test.js
📝 Walkthrough

Walkthrough

This PR adds automatic HNSW vector index updates when using parameterized MATCH...SET mutations with vector properties, plus a new N-API backfill method addToVectorIndex that inserts nodes directly into existing vector indexes. Both paths include parameterized node lookup, dimension validation, index write locking, and on-disk persistence.

Changes

Vector Index Update on SET Mutations and Backfill API

Layer / File(s) Summary
Type Declarations
npm/sparrowdb/index.d.ts
Added addToVectorIndex(label, property, nodeId, vector: Float32Array): void method with JSDoc. Updated executeWithParams docs to note HNSW writes now occur on SET n.prop = $vec.
Core HNSW Write Logic
crates/sparrowdb/src/db.rs
Enhanced execute_match_mutate_with_params to detect when a SET mutation assigns a vector parameter value, resolve the (label, property) HNSW index, insert the vector for all matched nodes, and persist the index to disk once per (label, prop).
N-API Backfill Method
crates/sparrowdb-node/src/lib.rs
Implemented add_to_vector_index that validates the vector index exists (RangeError), checks Float32Array length against expected dimensions (TypeError), looks up the user-facing nodeId via parameterized Cypher, inserts the vector under index write lock, and persists to disk.
Regression Tests
crates/sparrowdb/tests/vector_index.rs
Added two tests: one verifying SET n.embedding = $param populates HNSW and vectorSearch finds the node, and one confirming the index survives database reopening.
Integration Tests & Assertions
npm/sparrowdb/test/integration.test.js
Extended executeWithParams tests to assert HNSW updates on vector SET operations. Added addToVectorIndex positive test (node backfill is discoverable) and negative tests (RangeError on missing node or missing index).

Sequence Diagram

sequenceDiagram
    participant User
    participant NAPI as N-API Layer
    participant GraphDb
    participant HNSWIdx as HNSW Index
    participant Disk

    rect rgba(100, 150, 200, 0.5)
    Note over User,Disk: SET Mutation with Vector Parameter
    User->>GraphDb: execute_with_params("MATCH (n) SET n.emb=$vec", {vec: Float32Array})
    GraphDb->>GraphDb: Execute MATCH...SET mutations
    GraphDb->>HNSWIdx: Detect vector assignment in SET clause
    GraphDb->>HNSWIdx: Resolve (label, property) HNSW index
    GraphDb->>HNSWIdx: Insert vector for each matched node
    GraphDb->>Disk: Persist updated HNSW index
    GraphDb-->>User: QueryResult
    end

    rect rgba(150, 100, 200, 0.5)
    Note over User,Disk: Direct Backfill via addToVectorIndex
    User->>NAPI: addToVectorIndex(label, property, nodeId, Float32Array)
    NAPI->>GraphDb: Validate vector index exists (RangeError if missing)
    NAPI->>GraphDb: Validate vector dimensions match index (TypeError if mismatch)
    NAPI->>GraphDb: Execute parameterized Cypher: MATCH (n:label {id: $id}) RETURN id(n)
    GraphDb-->>NAPI: Internal node u64 id
    NAPI->>HNSWIdx: Acquire write lock
    NAPI->>HNSWIdx: Insert vector for resolved node id
    NAPI->>Disk: Persist updated HNSW index
    NAPI-->>User: void (or error)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

  • ryaker/SparrowDB#374: Modified the MATCH...SET mutation execution paths; this PR extends those same paths to wire vector index updates.
  • ryaker/SparrowDB#398: Introduced HNSW vector index creation, search, and insertion on MERGE; this PR builds on that foundation by adding SET-based writes and the backfill API.
  • ryaker/SparrowDB#402: Added N-API vector search and shared error mapping helpers; this PR reuses those patterns for the new backfill method.

Suggested labels

size:L, feature, bug-fix

Poem

🐰 A rabbit hops through vectors fine,
Now SET mutations align—
Embeddings flow where HNSW trees grow,
Backfill brings data back in line! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main fix: SET with $vector_param now populates HNSW, addressing the silent data loss issue. It's specific, clear, and captures the core change.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/spa-set-vector-param-hnsw

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new add_to_vector_index API to allow manual backfilling of vector data into the HNSW index and fixes a bug where SET operations on vector properties failed to update the index, leading to data loss. The feedback identifies a potential Cypher injection vulnerability in the new API due to direct string interpolation of the label and suggests a more robust label resolution strategy in the SET mutation logic to handle anonymous or complex match patterns correctly.

Comment on lines +419 to +422
// 3. Resolve the user-facing string id to the internal u64 node slot.
// We do this with a simple MATCH+RETURN id(n) query so we don't
// need to replicate the id-property lookup logic inline.
let cypher = format!("MATCH (n:{label} {{id: $id}}) RETURN id(n) AS nid");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Directly interpolating the label string into a Cypher query string poses a security risk (Cypher injection), as label originates from user input in the Node.js binding. While this is a local database, it is best practice to validate the label name to ensure it only contains alphanumeric characters and underscores before interpolation.

        // 3. Resolve the user-facing string id to the internal u64 node slot.
        //    We do this with a simple MATCH+RETURN id(n) query so we don't
        //    need to replicate the id-property lookup logic inline.
        if !label.chars().all(|c| c.is_alphanumeric() || c == '_') {
            return Err(napi::Error::new(
                napi::Status::InvalidArg,
                format!("TypeError: invalid label name '{label}'"),
            ));
        }
        let cypher = format!("MATCH (n:{label} {{id: $id}}) RETURN id(n) AS nid");

Comment thread crates/sparrowdb/src/db.rs Outdated
Comment on lines +1547 to +1552
let label_opt: Option<String> = mm
.match_patterns
.first()
.and_then(|pp| pp.nodes.first())
.and_then(|np| np.labels.first())
.cloned();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current logic derives the label name solely from the first node of the first MATCH pattern. This approach has two significant limitations:

  1. Anonymous Matches: Queries like MATCH (n) SET n.embedding = $emb will have an empty labels list in the AST, causing the HNSW update to be skipped entirely. This means the 'silent data loss' bug persists for anonymous matches.
  2. Multi-label/Complex Matches: In queries involving multiple labels or variables, the code might attempt to update the wrong index or skip valid ones because it assumes all matching_ids belong to the first label found in the AST.

Since NodeId encodes the label ID in its upper 32 bits, a more robust approach would be to resolve the label name for each node_id in matching_ids using the catalog, and then check for a corresponding vector index for that specific label.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/sparrowdb/src/db.rs (1)

1525-1526: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the HNSW update inside the same write critical section.

tx.commit()? completes before the index insert/save runs, so the property write and HNSW write are no longer atomic. Another writer can delete the node or drop/recreate the index in that gap, and a later idx.save(...) failure leaves the row committed but the vector index stale again. This should stay under the same writer-locked path as the primary mutation.

Also applies to: 1567-1574

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sparrowdb/src/db.rs` around lines 1525 - 1526, The HNSW index update
must happen before calling tx.commit() so the primary row mutation and
idx.save/insert occur atomically under the same writer lock: move the
idx.save(...) / index insert code to execute while the write transaction/lock is
still held (i.e., before tx.commit()), propagate any error from idx.save so the
transaction can roll back instead of committing a partial state, and only call
self.invalidate_catalog() after a successful commit; apply the same change for
the second block referenced (around idx.save at 1567-1574).
🧹 Nitpick comments (1)
npm/sparrowdb/test/integration.test.js (1)

345-399: ⚡ Quick win

Strengthen these vector assertions to avoid cross-test false positives.

All three tests share the same database and only assert hits.length > 0. Once an earlier case has inserted any vector, later cases can still pass even if the current SET/addToVectorIndex call never touched HNSW. Please either isolate the DB per test or assert that vectorSearch returns the specific node created in that case.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@npm/sparrowdb/test/integration.test.js` around lines 345 - 399, The tests
currently only assert hits.length > 0 which allows cross-test pollution; update
each test (the ones referencing db.executeWithParams, db.execute,
db.vectorSearch, and db.addToVectorIndex) to either use an isolated DB instance
per test or, more simply, assert that the vectorSearch results contain the
specific node id created in that test (e.g., for the first test ensure the hits
include id 'k1', the second includes 'k2', and the third includes 'k3') by
mapping hits to ids and checking inclusion rather than just length > 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sparrowdb-node/src/lib.rs`:
- Around line 387-459: The add_to_vector_index path mutates persisted index
state but currently does node lookup and then updates/saves the HNSW outside the
database write discipline; obtain the same database write lock used by other
durable mutations (call begin_write() to get a WriteGuard before the MATCH
lookup) so the node lookup, idx.insert(...) and idx.save(...) happen while
holding the WriteGuard, and drop/other concurrent writers cannot create TOCTOU
races (mirror the lock scope used by other durable write paths such as
drop_vector_index()).
- Around line 433-447: The code only checks for zero rows and then uses
result.rows[0], risking silent selection when multiple rows match; update the
match validation to require exactly one row: replace the if
result.rows.is_empty() block with a guard that returns an napi::Error (via
napi::Error::new or to_napi) when result.rows.len() != 1 and provide distinct
error messages for zero vs. multiple matches (e.g., "no node with id=... found"
and "multiple nodes with id=... found"); keep the subsequent internal_id
extraction logic as-is (the match on result.rows[0][0]) and ensure callers like
addToVectorIndex() rely on the unique internal_id rather than an ambiguous
result.

---

Outside diff comments:
In `@crates/sparrowdb/src/db.rs`:
- Around line 1525-1526: The HNSW index update must happen before calling
tx.commit() so the primary row mutation and idx.save/insert occur atomically
under the same writer lock: move the idx.save(...) / index insert code to
execute while the write transaction/lock is still held (i.e., before
tx.commit()), propagate any error from idx.save so the transaction can roll back
instead of committing a partial state, and only call self.invalidate_catalog()
after a successful commit; apply the same change for the second block referenced
(around idx.save at 1567-1574).

---

Nitpick comments:
In `@npm/sparrowdb/test/integration.test.js`:
- Around line 345-399: The tests currently only assert hits.length > 0 which
allows cross-test pollution; update each test (the ones referencing
db.executeWithParams, db.execute, db.vectorSearch, and db.addToVectorIndex) to
either use an isolated DB instance per test or, more simply, assert that the
vectorSearch results contain the specific node id created in that test (e.g.,
for the first test ensure the hits include id 'k1', the second includes 'k2',
and the third includes 'k3') by mapping hits to ids and checking inclusion
rather than just length > 0.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf7130d7-c7db-4b1d-969a-a359fffc79e8

📥 Commits

Reviewing files that changed from the base of the PR and between bd744d3 and fffea36.

📒 Files selected for processing (5)
  • crates/sparrowdb-node/src/lib.rs
  • crates/sparrowdb/src/db.rs
  • crates/sparrowdb/tests/vector_index.rs
  • npm/sparrowdb/index.d.ts
  • npm/sparrowdb/test/integration.test.js

Comment thread crates/sparrowdb-node/src/lib.rs
Comment thread crates/sparrowdb-node/src/lib.rs
Comment thread crates/sparrowdb/src/db.rs Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

Benchmark Results

(no benchmarks found)

✅ No regressions detected

…rd, write lock, anonymous MATCH, label derivation

- Issue 1 (HIGH SECURITY, sparrowdb-node/src/lib.rs): add validate_cypher_identifier()
  that rejects any label not matching [A-Za-z_][A-Za-z0-9_]* with TypeError before
  the label is interpolated into the Cypher MATCH format string
- Issue 3 (MAJOR CORRECTNESS, sparrowdb-node/src/lib.rs): after rows.is_empty() check,
  require rows.len() == 1; >1 matches returns RangeError with count and hint
- Issue 5 (MAJOR CONCURRENCY, sparrowdb-node/src/lib.rs): acquire WriteGuard via
  begin_write() before node MATCH and hold through idx.insert + idx.save, eliminating
  TOCTOU window between node lookup and HNSW mutation
- Issue 2 (HIGH FUNCTIONAL, sparrowdb-execution/src/engine/mutation.rs): scan_match_mutate
  now handles anonymous MATCH (no label) by iterating all label_ids from catalog.list_labels()
  instead of bailing early on catalog.get_label("") returning None
- Issue 2 (HIGH FUNCTIONAL, sparrowdb/src/db.rs): execute_match_mutate_with_params now
  passes runtime params to Engine via .with_params() so $param WHERE clauses evaluate
  correctly during anonymous MATCH scans
- Issue 4 (MAJOR CORRECTNESS, sparrowdb/src/db.rs): HNSW write block now derives label
  from each matched NodeId (upper 32 bits = label_id) rather than AST first-pattern node,
  preventing cross-label pollution when multiple labels share a vector property
- Regression tests: validate_cypher_identifier_rejects_injection_strings,
  add_to_vector_index_duplicate_id_detected_via_engine (sparrowdb-node);
  anonymous_match_set_vector_populates_hnsw,
  set_vector_hnsw_label_derived_from_node_id (sparrowdb/tests/vector_index.rs)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

Benchmark Results

(no benchmarks found)

✅ No regressions detected

- crates/sparrowdb/src/db.rs: move HNSW idx.insert/idx.save block to
  before tx.commit() so the property write and vector index update are
  atomic under the same WriteGuard; if idx.save fails the transaction
  is not committed, keeping disk state consistent (was flagged as major
  atomicity gap by CodeRabbit review)
- npm/sparrowdb/test/integration.test.js: strengthen all three vectorSearch
  assertions to resolve each test's node internal ID via id(n) and assert
  the specific node appears in hits, rather than just hits.length > 0
  which allowed cross-test false positives (CodeRabbit nitpick)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

Benchmark Results

(no benchmarks found)

✅ No regressions detected

@ryaker ryaker merged commit 82d85b7 into main May 7, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant