fix(node): SET with $vector_param now populates HNSW (closes silent-data-loss surfaced by KMSmcp ch#202)#410
Conversation
…data-loss surfaced by KMSmcp ch#202)
Root cause: `execute_match_mutate_with_params` (the handler for
`MATCH (n:L {id: \$id}) SET n.embedding = \$emb`) called `tx.set_property`
for SET mutations but never wrote to the HNSW vector index. The MERGE
write-path at db.rs:1396-1420 had the correct `idx.insert` + `idx.save`
logic, but it was only reached from `execute_merge_with_params`, not from
`execute_match_mutate_with_params`. The result: the property was stored on
disk but the HNSW file size delta was 0 bytes — the node was invisible to
`vectorSearch` after a `SET \$param`. No error was thrown.
Fix (Option A): after `tx.commit()` in `execute_match_mutate_with_params`,
iterate the mutations; for each `SET { value: Param(p) }` where `p` resolves
to a vector value and `(label, prop)` has a registered HNSW index, call
`idx.insert(node_id, &vec)` + `idx.save(...)` for all matched node IDs. The
label is extracted from the first node pattern in the MATCH clause. This is
symmetric with the MERGE write-path and requires no structural refactoring.
New API: `addToVectorIndex(label, property, nodeId, vector: Float32Array)` on
`SparrowDB` (Node binding) — explicit backfill API requested by KMSmcp.
Resolves the user-facing string `id` to an internal u64 via a parameterized
MATCH query, then writes into HNSW directly. Useful for populating the index
on nodes that were created before the index existed, without DETACH DELETE +
recreate.
Tests added:
- `crates/sparrowdb/tests/vector_index.rs`: `set_vector_param_populates_hnsw`
(in-memory roundtrip) and `set_vector_param_hnsw_roundtrip_survives_reopen`
(persistence across DB close/open).
- `npm/sparrowdb/test/integration.test.js`: strengthened existing SET tests to
assert `vectorSearch` returns the node (not just that the property is non-null);
added `addToVectorIndex` happy-path and error-path tests.
- `npm/sparrowdb/index.d.ts`: regenerated via napi build; `executeWithParams`
docstring updated to document the HNSW behaviour; `addToVectorIndex`
declaration added.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
User does not have a PR Review subscription. Go to Team management and add this email to the PR Review subscription. |
|
Warning Rate limit exceeded
To continue reviewing without waiting, purchase usage credits in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR adds automatic HNSW vector index updates when using parameterized ChangesVector Index Update on SET Mutations and Backfill API
Sequence DiagramsequenceDiagram
participant User
participant NAPI as N-API Layer
participant GraphDb
participant HNSWIdx as HNSW Index
participant Disk
rect rgba(100, 150, 200, 0.5)
Note over User,Disk: SET Mutation with Vector Parameter
User->>GraphDb: execute_with_params("MATCH (n) SET n.emb=$vec", {vec: Float32Array})
GraphDb->>GraphDb: Execute MATCH...SET mutations
GraphDb->>HNSWIdx: Detect vector assignment in SET clause
GraphDb->>HNSWIdx: Resolve (label, property) HNSW index
GraphDb->>HNSWIdx: Insert vector for each matched node
GraphDb->>Disk: Persist updated HNSW index
GraphDb-->>User: QueryResult
end
rect rgba(150, 100, 200, 0.5)
Note over User,Disk: Direct Backfill via addToVectorIndex
User->>NAPI: addToVectorIndex(label, property, nodeId, Float32Array)
NAPI->>GraphDb: Validate vector index exists (RangeError if missing)
NAPI->>GraphDb: Validate vector dimensions match index (TypeError if mismatch)
NAPI->>GraphDb: Execute parameterized Cypher: MATCH (n:label {id: $id}) RETURN id(n)
GraphDb-->>NAPI: Internal node u64 id
NAPI->>HNSWIdx: Acquire write lock
NAPI->>HNSWIdx: Insert vector for resolved node id
NAPI->>Disk: Persist updated HNSW index
NAPI-->>User: void (or error)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new add_to_vector_index API to allow manual backfilling of vector data into the HNSW index and fixes a bug where SET operations on vector properties failed to update the index, leading to data loss. The feedback identifies a potential Cypher injection vulnerability in the new API due to direct string interpolation of the label and suggests a more robust label resolution strategy in the SET mutation logic to handle anonymous or complex match patterns correctly.
| // 3. Resolve the user-facing string id to the internal u64 node slot. | ||
| // We do this with a simple MATCH+RETURN id(n) query so we don't | ||
| // need to replicate the id-property lookup logic inline. | ||
| let cypher = format!("MATCH (n:{label} {{id: $id}}) RETURN id(n) AS nid"); |
There was a problem hiding this comment.
Directly interpolating the label string into a Cypher query string poses a security risk (Cypher injection), as label originates from user input in the Node.js binding. While this is a local database, it is best practice to validate the label name to ensure it only contains alphanumeric characters and underscores before interpolation.
// 3. Resolve the user-facing string id to the internal u64 node slot.
// We do this with a simple MATCH+RETURN id(n) query so we don't
// need to replicate the id-property lookup logic inline.
if !label.chars().all(|c| c.is_alphanumeric() || c == '_') {
return Err(napi::Error::new(
napi::Status::InvalidArg,
format!("TypeError: invalid label name '{label}'"),
));
}
let cypher = format!("MATCH (n:{label} {{id: $id}}) RETURN id(n) AS nid");| let label_opt: Option<String> = mm | ||
| .match_patterns | ||
| .first() | ||
| .and_then(|pp| pp.nodes.first()) | ||
| .and_then(|np| np.labels.first()) | ||
| .cloned(); |
There was a problem hiding this comment.
The current logic derives the label name solely from the first node of the first MATCH pattern. This approach has two significant limitations:
- Anonymous Matches: Queries like
MATCH (n) SET n.embedding = $embwill have an emptylabelslist in the AST, causing the HNSW update to be skipped entirely. This means the 'silent data loss' bug persists for anonymous matches. - Multi-label/Complex Matches: In queries involving multiple labels or variables, the code might attempt to update the wrong index or skip valid ones because it assumes all
matching_idsbelong to the first label found in the AST.
Since NodeId encodes the label ID in its upper 32 bits, a more robust approach would be to resolve the label name for each node_id in matching_ids using the catalog, and then check for a corresponding vector index for that specific label.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/sparrowdb/src/db.rs (1)
1525-1526:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep the HNSW update inside the same write critical section.
tx.commit()?completes before the index insert/save runs, so the property write and HNSW write are no longer atomic. Another writer can delete the node or drop/recreate the index in that gap, and a lateridx.save(...)failure leaves the row committed but the vector index stale again. This should stay under the same writer-locked path as the primary mutation.Also applies to: 1567-1574
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sparrowdb/src/db.rs` around lines 1525 - 1526, The HNSW index update must happen before calling tx.commit() so the primary row mutation and idx.save/insert occur atomically under the same writer lock: move the idx.save(...) / index insert code to execute while the write transaction/lock is still held (i.e., before tx.commit()), propagate any error from idx.save so the transaction can roll back instead of committing a partial state, and only call self.invalidate_catalog() after a successful commit; apply the same change for the second block referenced (around idx.save at 1567-1574).
🧹 Nitpick comments (1)
npm/sparrowdb/test/integration.test.js (1)
345-399: ⚡ Quick winStrengthen these vector assertions to avoid cross-test false positives.
All three tests share the same database and only assert
hits.length > 0. Once an earlier case has inserted any vector, later cases can still pass even if the currentSET/addToVectorIndexcall never touched HNSW. Please either isolate the DB per test or assert thatvectorSearchreturns the specific node created in that case.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@npm/sparrowdb/test/integration.test.js` around lines 345 - 399, The tests currently only assert hits.length > 0 which allows cross-test pollution; update each test (the ones referencing db.executeWithParams, db.execute, db.vectorSearch, and db.addToVectorIndex) to either use an isolated DB instance per test or, more simply, assert that the vectorSearch results contain the specific node id created in that test (e.g., for the first test ensure the hits include id 'k1', the second includes 'k2', and the third includes 'k3') by mapping hits to ids and checking inclusion rather than just length > 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/sparrowdb-node/src/lib.rs`:
- Around line 387-459: The add_to_vector_index path mutates persisted index
state but currently does node lookup and then updates/saves the HNSW outside the
database write discipline; obtain the same database write lock used by other
durable mutations (call begin_write() to get a WriteGuard before the MATCH
lookup) so the node lookup, idx.insert(...) and idx.save(...) happen while
holding the WriteGuard, and drop/other concurrent writers cannot create TOCTOU
races (mirror the lock scope used by other durable write paths such as
drop_vector_index()).
- Around line 433-447: The code only checks for zero rows and then uses
result.rows[0], risking silent selection when multiple rows match; update the
match validation to require exactly one row: replace the if
result.rows.is_empty() block with a guard that returns an napi::Error (via
napi::Error::new or to_napi) when result.rows.len() != 1 and provide distinct
error messages for zero vs. multiple matches (e.g., "no node with id=... found"
and "multiple nodes with id=... found"); keep the subsequent internal_id
extraction logic as-is (the match on result.rows[0][0]) and ensure callers like
addToVectorIndex() rely on the unique internal_id rather than an ambiguous
result.
---
Outside diff comments:
In `@crates/sparrowdb/src/db.rs`:
- Around line 1525-1526: The HNSW index update must happen before calling
tx.commit() so the primary row mutation and idx.save/insert occur atomically
under the same writer lock: move the idx.save(...) / index insert code to
execute while the write transaction/lock is still held (i.e., before
tx.commit()), propagate any error from idx.save so the transaction can roll back
instead of committing a partial state, and only call self.invalidate_catalog()
after a successful commit; apply the same change for the second block referenced
(around idx.save at 1567-1574).
---
Nitpick comments:
In `@npm/sparrowdb/test/integration.test.js`:
- Around line 345-399: The tests currently only assert hits.length > 0 which
allows cross-test pollution; update each test (the ones referencing
db.executeWithParams, db.execute, db.vectorSearch, and db.addToVectorIndex) to
either use an isolated DB instance per test or, more simply, assert that the
vectorSearch results contain the specific node id created in that test (e.g.,
for the first test ensure the hits include id 'k1', the second includes 'k2',
and the third includes 'k3') by mapping hits to ids and checking inclusion
rather than just length > 0.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cf7130d7-c7db-4b1d-969a-a359fffc79e8
📒 Files selected for processing (5)
crates/sparrowdb-node/src/lib.rscrates/sparrowdb/src/db.rscrates/sparrowdb/tests/vector_index.rsnpm/sparrowdb/index.d.tsnpm/sparrowdb/test/integration.test.js
Benchmark Results |
…rd, write lock, anonymous MATCH, label derivation
- Issue 1 (HIGH SECURITY, sparrowdb-node/src/lib.rs): add validate_cypher_identifier()
that rejects any label not matching [A-Za-z_][A-Za-z0-9_]* with TypeError before
the label is interpolated into the Cypher MATCH format string
- Issue 3 (MAJOR CORRECTNESS, sparrowdb-node/src/lib.rs): after rows.is_empty() check,
require rows.len() == 1; >1 matches returns RangeError with count and hint
- Issue 5 (MAJOR CONCURRENCY, sparrowdb-node/src/lib.rs): acquire WriteGuard via
begin_write() before node MATCH and hold through idx.insert + idx.save, eliminating
TOCTOU window between node lookup and HNSW mutation
- Issue 2 (HIGH FUNCTIONAL, sparrowdb-execution/src/engine/mutation.rs): scan_match_mutate
now handles anonymous MATCH (no label) by iterating all label_ids from catalog.list_labels()
instead of bailing early on catalog.get_label("") returning None
- Issue 2 (HIGH FUNCTIONAL, sparrowdb/src/db.rs): execute_match_mutate_with_params now
passes runtime params to Engine via .with_params() so $param WHERE clauses evaluate
correctly during anonymous MATCH scans
- Issue 4 (MAJOR CORRECTNESS, sparrowdb/src/db.rs): HNSW write block now derives label
from each matched NodeId (upper 32 bits = label_id) rather than AST first-pattern node,
preventing cross-label pollution when multiple labels share a vector property
- Regression tests: validate_cypher_identifier_rejects_injection_strings,
add_to_vector_index_duplicate_id_detected_via_engine (sparrowdb-node);
anonymous_match_set_vector_populates_hnsw,
set_vector_hnsw_label_derived_from_node_id (sparrowdb/tests/vector_index.rs)
Co-Authored-By: Claude <noreply@anthropic.com>
Benchmark Results |
- crates/sparrowdb/src/db.rs: move HNSW idx.insert/idx.save block to before tx.commit() so the property write and vector index update are atomic under the same WriteGuard; if idx.save fails the transaction is not committed, keeping disk state consistent (was flagged as major atomicity gap by CodeRabbit review) - npm/sparrowdb/test/integration.test.js: strengthen all three vectorSearch assertions to resolve each test's node internal ID via id(n) and assert the specific node appears in hits, rather than just hits.length > 0 which allowed cross-test false positives (CodeRabbit nitpick) Co-Authored-By: Claude <noreply@anthropic.com>
Benchmark Results |
Summary
execute_match_mutate_with_params(theMATCH … SET n.embedding = $embhandler) stored the property viatx.set_propertybut never calledidx.inserton the HNSW vector index. The MERGE write-path had the correct logic atdb.rs:1396–1420; the MATCH+SET path was simply missing it. Result: HNSW file size delta = 0 bytes,vectorSearchreturns nothing, no error thrown — silent data loss.tx.commit()inexecute_match_mutate_with_params, iterate SET mutations; for anySET { value: Param(p) }wherepresolves to a vector and(label, prop)has a registered HNSW index, callidx.insert+idx.savefor all matched node IDs. Symmetric with the existing MERGE write-path. ~50 LOC, no structural refactoring needed.addToVectorIndex(label, property, nodeId, vector)— explicit HNSW backfill for nodes created before the index existed.Reproduction
Before/After
MATCH … SET n.embedding = $embvectorSearchafter SET[]addToVectorIndex(label, prop, id, vec)RangeErroron missing index/nodeTest plan
crates/sparrowdb/tests/vector_index.rs:set_vector_param_populates_hnsw(in-memory HNSW roundtrip) andset_vector_param_hnsw_roundtrip_survives_reopen(persistence after DB restart) — both passnpm/sparrowdb/test/integration.test.js: existing SET tests strengthened to assertvectorSearchreturns the node; 3 newaddToVectorIndextests (happy path, missing node, missing index) — all 34 JS tests passcargo fmt --allandcargo clippy -p sparrowdb -p sparrowdb-node -- -D warningscleancreate_vector_index_ddl,spa191_*) confirmed not introduced by this PRindex.d.tsregenerated vianapi build, SparrowDb → SparrowDB rename applied, HNSW docstring updatedSurfaced by: KMSmcp inter-session channel message #202.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
addToVectorIndexmethod to manually backfill vectors into an HNSW index.Bug Fixes
Documentation