fix(node): SET with $vector_param now populates HNSW (closes silent-data-loss surfaced by KMSmcp ch#202) by ryaker · Pull Request #410 · ryaker/SparrowDB

ryaker · 2026-05-07T01:07:12Z

Summary

Root cause: execute_match_mutate_with_params (the MATCH … SET n.embedding = $emb handler) stored the property via tx.set_property but never called idx.insert on the HNSW vector index. The MERGE write-path had the correct logic at db.rs:1396–1420; the MATCH+SET path was simply missing it. Result: HNSW file size delta = 0 bytes, vectorSearch returns nothing, no error thrown — silent data loss.
Fix (Option A): after tx.commit() in execute_match_mutate_with_params, iterate SET mutations; for any SET { value: Param(p) } where p resolves to a vector and (label, prop) has a registered HNSW index, call idx.insert + idx.save for all matched node IDs. Symmetric with the existing MERGE write-path. ~50 LOC, no structural refactoring needed.
New API: addToVectorIndex(label, property, nodeId, vector) — explicit HNSW backfill for nodes created before the index existed.

Reproduction

db.createVectorIndex('Memory', 'embedding', 4, 'cosine')
db.execute("CREATE (n:Memory {id: 'k1'})")
db.executeWithParams('MATCH (n:Memory {id: $id}) SET n.embedding = $emb', { id: 'k1', emb: [0.1, 0.2, 0.3, 0.4] })

// Before fix: returns []
// After fix:  returns [{ id: '...', score: 1.0 }]
db.vectorSearch('Memory', 'embedding', new Float32Array([0.1, 0.2, 0.3, 0.4]), 5)

Before/After

Scenario	Before	After
`MATCH … SET n.embedding = $emb`	Property stored, HNSW unchanged (0-byte delta)	Property stored AND HNSW updated
`vectorSearch` after SET	Returns `[]`	Returns the inserted node
`addToVectorIndex(label, prop, id, vec)`	Not available	Inserts directly into HNSW; throws `RangeError` on missing index/node

Test plan

New Rust integration tests in crates/sparrowdb/tests/vector_index.rs: set_vector_param_populates_hnsw (in-memory HNSW roundtrip) and set_vector_param_hnsw_roundtrip_survives_reopen (persistence after DB restart) — both pass
npm/sparrowdb/test/integration.test.js: existing SET tests strengthened to assert vectorSearch returns the node; 3 new addToVectorIndex tests (happy path, missing node, missing index) — all 34 JS tests pass
cargo fmt --all and cargo clippy -p sparrowdb -p sparrowdb-node -- -D warnings clean
Pre-existing failures (create_vector_index_ddl, spa191_*) confirmed not introduced by this PR
index.d.ts regenerated via napi build, SparrowDb → SparrowDB rename applied, HNSW docstring updated

Surfaced by: KMSmcp inter-session channel message #202.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added addToVectorIndex method to manually backfill vectors into an HNSW index.
Bug Fixes
- Vector indexes now properly update when setting vector properties via parameterized queries.
Documentation
- Updated API documentation to clarify HNSW index update behavior and new backfill method usage.

…data-loss surfaced by KMSmcp ch#202) Root cause: `execute_match_mutate_with_params` (the handler for `MATCH (n:L {id: \$id}) SET n.embedding = \$emb`) called `tx.set_property` for SET mutations but never wrote to the HNSW vector index. The MERGE write-path at db.rs:1396-1420 had the correct `idx.insert` + `idx.save` logic, but it was only reached from `execute_merge_with_params`, not from `execute_match_mutate_with_params`. The result: the property was stored on disk but the HNSW file size delta was 0 bytes — the node was invisible to `vectorSearch` after a `SET \$param`. No error was thrown. Fix (Option A): after `tx.commit()` in `execute_match_mutate_with_params`, iterate the mutations; for each `SET { value: Param(p) }` where `p` resolves to a vector value and `(label, prop)` has a registered HNSW index, call `idx.insert(node_id, &vec)` + `idx.save(...)` for all matched node IDs. The label is extracted from the first node pattern in the MATCH clause. This is symmetric with the MERGE write-path and requires no structural refactoring. New API: `addToVectorIndex(label, property, nodeId, vector: Float32Array)` on `SparrowDB` (Node binding) — explicit backfill API requested by KMSmcp. Resolves the user-facing string `id` to an internal u64 via a parameterized MATCH query, then writes into HNSW directly. Useful for populating the index on nodes that were created before the index existed, without DETACH DELETE + recreate. Tests added: - `crates/sparrowdb/tests/vector_index.rs`: `set_vector_param_populates_hnsw` (in-memory roundtrip) and `set_vector_param_hnsw_roundtrip_survives_reopen` (persistence across DB close/open). - `npm/sparrowdb/test/integration.test.js`: strengthened existing SET tests to assert `vectorSearch` returns the node (not just that the property is non-null); added `addToVectorIndex` happy-path and error-path tests. - `npm/sparrowdb/index.d.ts`: regenerated via napi build; `executeWithParams` docstring updated to document the HNSW behaviour; `addToVectorIndex` declaration added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codeant-ai · 2026-05-07T01:07:16Z

User does not have a PR Review subscription.

Go to Team management and add this email to the PR Review subscription.

coderabbitai · 2026-05-07T01:07:24Z

Warning

Rate limit exceeded

@ryaker has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 57 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2eb4f4d0-0f31-4981-86e8-54baed5c6811

📥 Commits

Reviewing files that changed from the base of the PR and between fffea36 and 9e946e8.

📒 Files selected for processing (5)

crates/sparrowdb-execution/src/engine/mutation.rs
crates/sparrowdb-node/src/lib.rs
crates/sparrowdb/src/db.rs
crates/sparrowdb/tests/vector_index.rs
npm/sparrowdb/test/integration.test.js

📝 Walkthrough

Walkthrough

This PR adds automatic HNSW vector index updates when using parameterized MATCH...SET mutations with vector properties, plus a new N-API backfill method addToVectorIndex that inserts nodes directly into existing vector indexes. Both paths include parameterized node lookup, dimension validation, index write locking, and on-disk persistence.

Changes

Vector Index Update on SET Mutations and Backfill API

Layer / File(s)	Summary
Type Declarations `npm/sparrowdb/index.d.ts`	Added `addToVectorIndex(label, property, nodeId, vector: Float32Array): void` method with JSDoc. Updated `executeWithParams` docs to note HNSW writes now occur on `SET n.prop = $vec`.
Core HNSW Write Logic `crates/sparrowdb/src/db.rs`	Enhanced `execute_match_mutate_with_params` to detect when a `SET` mutation assigns a vector parameter value, resolve the `(label, property)` HNSW index, insert the vector for all matched nodes, and persist the index to disk once per `(label, prop)`.
N-API Backfill Method `crates/sparrowdb-node/src/lib.rs`	Implemented `add_to_vector_index` that validates the vector index exists (RangeError), checks `Float32Array` length against expected dimensions (TypeError), looks up the user-facing `nodeId` via parameterized Cypher, inserts the vector under index write lock, and persists to disk.
Regression Tests `crates/sparrowdb/tests/vector_index.rs`	Added two tests: one verifying `SET n.embedding = $param` populates HNSW and `vectorSearch` finds the node, and one confirming the index survives database reopening.
Integration Tests & Assertions `npm/sparrowdb/test/integration.test.js`	Extended `executeWithParams` tests to assert HNSW updates on vector `SET` operations. Added `addToVectorIndex` positive test (node backfill is discoverable) and negative tests (RangeError on missing node or missing index).

Sequence Diagram

sequenceDiagram
    participant User
    participant NAPI as N-API Layer
    participant GraphDb
    participant HNSWIdx as HNSW Index
    participant Disk

    rect rgba(100, 150, 200, 0.5)
    Note over User,Disk: SET Mutation with Vector Parameter
    User->>GraphDb: execute_with_params("MATCH (n) SET n.emb=$vec", {vec: Float32Array})
    GraphDb->>GraphDb: Execute MATCH...SET mutations
    GraphDb->>HNSWIdx: Detect vector assignment in SET clause
    GraphDb->>HNSWIdx: Resolve (label, property) HNSW index
    GraphDb->>HNSWIdx: Insert vector for each matched node
    GraphDb->>Disk: Persist updated HNSW index
    GraphDb-->>User: QueryResult
    end

    rect rgba(150, 100, 200, 0.5)
    Note over User,Disk: Direct Backfill via addToVectorIndex
    User->>NAPI: addToVectorIndex(label, property, nodeId, Float32Array)
    NAPI->>GraphDb: Validate vector index exists (RangeError if missing)
    NAPI->>GraphDb: Validate vector dimensions match index (TypeError if mismatch)
    NAPI->>GraphDb: Execute parameterized Cypher: MATCH (n:label {id: $id}) RETURN id(n)
    GraphDb-->>NAPI: Internal node u64 id
    NAPI->>HNSWIdx: Acquire write lock
    NAPI->>HNSWIdx: Insert vector for resolved node id
    NAPI->>Disk: Persist updated HNSW index
    NAPI-->>User: void (or error)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

feat(node): expose vector + fulltext + hybrid-search APIs to Node.js binding (KMSmcp consumer) #400: The N-API add_to_vector_index method and core HNSW update wiring on SET mutations directly implement the vector API surface and behavior requested.

Possibly related PRs

ryaker/SparrowDB#374: Modified the MATCH...SET mutation execution paths; this PR extends those same paths to wire vector index updates.
ryaker/SparrowDB#398: Introduced HNSW vector index creation, search, and insertion on MERGE; this PR builds on that foundation by adding SET-based writes and the backfill API.
ryaker/SparrowDB#402: Added N-API vector search and shared error mapping helpers; this PR reuses those patterns for the new backfill method.

Suggested labels

size:L, feature, bug-fix

Poem

🐰 A rabbit hops through vectors fine,
Now SET mutations align—
Embeddings flow where HNSW trees grow,
Backfill brings data back in line! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main fix: SET with $vector_param now populates HNSW, addressing the silent data loss issue. It's specific, clear, and captures the core change.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/spa-set-vector-param-hnsw

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a new add_to_vector_index API to allow manual backfilling of vector data into the HNSW index and fixes a bug where SET operations on vector properties failed to update the index, leading to data loss. The feedback identifies a potential Cypher injection vulnerability in the new API due to direct string interpolation of the label and suggests a more robust label resolution strategy in the SET mutation logic to handle anonymous or complex match patterns correctly.

gemini-code-assist · 2026-05-07T01:08:39Z

+        // 3. Resolve the user-facing string id to the internal u64 node slot.
+        //    We do this with a simple MATCH+RETURN id(n) query so we don't
+        //    need to replicate the id-property lookup logic inline.
+        let cypher = format!("MATCH (n:{label} {{id: $id}}) RETURN id(n) AS nid");


Directly interpolating the label string into a Cypher query string poses a security risk (Cypher injection), as label originates from user input in the Node.js binding. While this is a local database, it is best practice to validate the label name to ensure it only contains alphanumeric characters and underscores before interpolation.

// 3. Resolve the user-facing string id to the internal u64 node slot. // We do this with a simple MATCH+RETURN id(n) query so we don't // need to replicate the id-property lookup logic inline. if !label.chars().all(|c| c.is_alphanumeric() || c == '_') { return Err(napi::Error::new( napi::Status::InvalidArg, format!("TypeError: invalid label name '{label}'"), )); } let cypher = format!("MATCH (n:{label} {{id: $id}}) RETURN id(n) AS nid");

gemini-code-assist · 2026-05-07T01:08:39Z

+            let label_opt: Option<String> = mm
+                .match_patterns
+                .first()
+                .and_then(|pp| pp.nodes.first())
+                .and_then(|np| np.labels.first())
+                .cloned();


The current logic derives the label name solely from the first node of the first MATCH pattern. This approach has two significant limitations:

Anonymous Matches: Queries like MATCH (n) SET n.embedding = $emb will have an empty labels list in the AST, causing the HNSW update to be skipped entirely. This means the 'silent data loss' bug persists for anonymous matches.

Multi-label/Complex Matches: In queries involving multiple labels or variables, the code might attempt to update the wrong index or skip valid ones because it assumes all matching_ids belong to the first label found in the AST.

Since NodeId encodes the label ID in its upper 32 bits, a more robust approach would be to resolve the label name for each node_id in matching_ids using the catalog, and then check for a corresponding vector index for that specific label.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/sparrowdb/src/db.rs (1)
1525-1526: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the HNSW update inside the same write critical section.

tx.commit()? completes before the index insert/save runs, so the property write and HNSW write are no longer atomic. Another writer can delete the node or drop/recreate the index in that gap, and a later idx.save(...) failure leaves the row committed but the vector index stale again. This should stay under the same writer-locked path as the primary mutation.

Also applies to: 1567-1574
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sparrowdb/src/db.rs` around lines 1525 - 1526, The HNSW index update
must happen before calling tx.commit() so the primary row mutation and
idx.save/insert occur atomically under the same writer lock: move the
idx.save(...) / index insert code to execute while the write transaction/lock is
still held (i.e., before tx.commit()), propagate any error from idx.save so the
transaction can roll back instead of committing a partial state, and only call
self.invalidate_catalog() after a successful commit; apply the same change for
the second block referenced (around idx.save at 1567-1574).

🧹 Nitpick comments (1)

npm/sparrowdb/test/integration.test.js (1)
345-399: ⚡ Quick win

Strengthen these vector assertions to avoid cross-test false positives.

All three tests share the same database and only assert hits.length > 0. Once an earlier case has inserted any vector, later cases can still pass even if the current SET/addToVectorIndex call never touched HNSW. Please either isolate the DB per test or assert that vectorSearch returns the specific node created in that case.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@npm/sparrowdb/test/integration.test.js` around lines 345 - 399, The tests
currently only assert hits.length > 0 which allows cross-test pollution; update
each test (the ones referencing db.executeWithParams, db.execute,
db.vectorSearch, and db.addToVectorIndex) to either use an isolated DB instance
per test or, more simply, assert that the vectorSearch results contain the
specific node id created in that test (e.g., for the first test ensure the hits
include id 'k1', the second includes 'k2', and the third includes 'k3') by
mapping hits to ids and checking inclusion rather than just length > 0.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sparrowdb-node/src/lib.rs`:
- Around line 387-459: The add_to_vector_index path mutates persisted index
state but currently does node lookup and then updates/saves the HNSW outside the
database write discipline; obtain the same database write lock used by other
durable mutations (call begin_write() to get a WriteGuard before the MATCH
lookup) so the node lookup, idx.insert(...) and idx.save(...) happen while
holding the WriteGuard, and drop/other concurrent writers cannot create TOCTOU
races (mirror the lock scope used by other durable write paths such as
drop_vector_index()).
- Around line 433-447: The code only checks for zero rows and then uses
result.rows[0], risking silent selection when multiple rows match; update the
match validation to require exactly one row: replace the if
result.rows.is_empty() block with a guard that returns an napi::Error (via
napi::Error::new or to_napi) when result.rows.len() != 1 and provide distinct
error messages for zero vs. multiple matches (e.g., "no node with id=... found"
and "multiple nodes with id=... found"); keep the subsequent internal_id
extraction logic as-is (the match on result.rows[0][0]) and ensure callers like
addToVectorIndex() rely on the unique internal_id rather than an ambiguous
result.

---

Outside diff comments:
In `@crates/sparrowdb/src/db.rs`:
- Around line 1525-1526: The HNSW index update must happen before calling
tx.commit() so the primary row mutation and idx.save/insert occur atomically
under the same writer lock: move the idx.save(...) / index insert code to
execute while the write transaction/lock is still held (i.e., before
tx.commit()), propagate any error from idx.save so the transaction can roll back
instead of committing a partial state, and only call self.invalidate_catalog()
after a successful commit; apply the same change for the second block referenced
(around idx.save at 1567-1574).

---

Nitpick comments:
In `@npm/sparrowdb/test/integration.test.js`:
- Around line 345-399: The tests currently only assert hits.length > 0 which
allows cross-test pollution; update each test (the ones referencing
db.executeWithParams, db.execute, db.vectorSearch, and db.addToVectorIndex) to
either use an isolated DB instance per test or, more simply, assert that the
vectorSearch results contain the specific node id created in that test (e.g.,
for the first test ensure the hits include id 'k1', the second includes 'k2',
and the third includes 'k3') by mapping hits to ids and checking inclusion
rather than just length > 0.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf7130d7-c7db-4b1d-969a-a359fffc79e8

📥 Commits

Reviewing files that changed from the base of the PR and between bd744d3 and fffea36.

📒 Files selected for processing (5)

crates/sparrowdb-node/src/lib.rs
crates/sparrowdb/src/db.rs
crates/sparrowdb/tests/vector_index.rs
npm/sparrowdb/index.d.ts
npm/sparrowdb/test/integration.test.js

github-actions · 2026-05-07T01:20:28Z

Benchmark Results

(no benchmarks found)

✅ No regressions detected

…rd, write lock, anonymous MATCH, label derivation - Issue 1 (HIGH SECURITY, sparrowdb-node/src/lib.rs): add validate_cypher_identifier() that rejects any label not matching [A-Za-z_][A-Za-z0-9_]* with TypeError before the label is interpolated into the Cypher MATCH format string - Issue 3 (MAJOR CORRECTNESS, sparrowdb-node/src/lib.rs): after rows.is_empty() check, require rows.len() == 1; >1 matches returns RangeError with count and hint - Issue 5 (MAJOR CONCURRENCY, sparrowdb-node/src/lib.rs): acquire WriteGuard via begin_write() before node MATCH and hold through idx.insert + idx.save, eliminating TOCTOU window between node lookup and HNSW mutation - Issue 2 (HIGH FUNCTIONAL, sparrowdb-execution/src/engine/mutation.rs): scan_match_mutate now handles anonymous MATCH (no label) by iterating all label_ids from catalog.list_labels() instead of bailing early on catalog.get_label("") returning None - Issue 2 (HIGH FUNCTIONAL, sparrowdb/src/db.rs): execute_match_mutate_with_params now passes runtime params to Engine via .with_params() so $param WHERE clauses evaluate correctly during anonymous MATCH scans - Issue 4 (MAJOR CORRECTNESS, sparrowdb/src/db.rs): HNSW write block now derives label from each matched NodeId (upper 32 bits = label_id) rather than AST first-pattern node, preventing cross-label pollution when multiple labels share a vector property - Regression tests: validate_cypher_identifier_rejects_injection_strings, add_to_vector_index_duplicate_id_detected_via_engine (sparrowdb-node); anonymous_match_set_vector_populates_hnsw, set_vector_hnsw_label_derived_from_node_id (sparrowdb/tests/vector_index.rs) Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-05-07T01:49:10Z

Benchmark Results

(no benchmarks found)

✅ No regressions detected

- crates/sparrowdb/src/db.rs: move HNSW idx.insert/idx.save block to before tx.commit() so the property write and vector index update are atomic under the same WriteGuard; if idx.save fails the transaction is not committed, keeping disk state consistent (was flagged as major atomicity gap by CodeRabbit review) - npm/sparrowdb/test/integration.test.js: strengthen all three vectorSearch assertions to resolve each test's node internal ID via id(n) and assert the specific node appears in hits, rather than just hits.length > 0 which allowed cross-test false positives (CodeRabbit nitpick) Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-05-07T02:07:09Z

Benchmark Results

(no benchmarks found)

✅ No regressions detected

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread crates/sparrowdb-node/src/lib.rs

Comment thread crates/sparrowdb-node/src/lib.rs

Comment thread crates/sparrowdb/src/db.rs Outdated

ryaker merged commit 82d85b7 into main May 7, 2026
11 checks passed

Conversation

ryaker commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproduction

Before/After

Test plan

Summary by CodeRabbit

Uh oh!

codeant-ai Bot commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 7, 2026

Benchmark Results

Uh oh!

github-actions Bot commented May 7, 2026

Benchmark Results

Uh oh!

github-actions Bot commented May 7, 2026

Benchmark Results

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ryaker commented May 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading