Skip to content

Fix HNSW index crashes and incorrect query results#181

Open
grparry wants to merge 1 commit intoruvnet:mainfrom
grparry:fix/hnsw-index-crashes
Open

Fix HNSW index crashes and incorrect query results#181
grparry wants to merge 1 commit intoruvnet:mainfrom
grparry:fix/hnsw-index-crashes

Conversation

@grparry
Copy link

@grparry grparry commented Feb 17, 2026

Hi! I've been evaluating ruvector as a pgvector alternative for a project and ran into a few issues with the HNSW index access method on PostgreSQL 17. I traced them down and put together fixes — hope these are helpful.

Summary

Six bugs fixed in the HNSW access method (hnsw_am.rs), plus one shared fix in ivfflat_am.rs:

  1. SIGSEGV on repeated queriesbeginscan must allocate xs_orderbyvals and xs_orderbynulls arrays (like GiST and SP-GiST do). RelationGetIndexScan allocates the scan descriptor but not these arrays, so the executor was writing distance values to stale/uninitialized memory. First query in a backend often worked; second+ crashed at addresses like 0x1.

  2. Empty HNSW graphconnect_node_to_neighbors was a TODO stub. Implemented bidirectional connections (new node → neighbors, each neighbor → new node) with pruning when neighbor count exceeds M (upper layers) or M0 (layer 0).

  3. Wrong distance metrichnsw_build used HnswConfig::default() which hardcodes DistanceMetric::Euclidean, regardless of the operator class (e.g. ruvector_cosine_ops). Added metric_from_index() that reads the opclass support function name via index_getprocid + get_func_name.

  4. Wrong result orderingBinaryHeap::into_iter().take(k) returns k arbitrary items from the heap's backing array, not the k smallest. Removed .take(k) before the sort+truncate.

  5. xs_recheckorderby must be false — Setting it to true triggers PG17's IndexNextWithReorder distance comparison, which raises "index returned tuples in wrong order" on harmless floating-point differences between index-computed and operator-computed distances.

  6. Use-after-free in endscan — Added null check and null-out of scan->opaque to prevent double-free if endscan is called after a rescan that resets the state.

Testing

  • Built and tested on PostgreSQL 17.7 with pgrx 0.12.9
  • Verified with 5d and 8d vectors, dataset sizes 15 and 100
  • 20+ consecutive HNSW queries per session with no crashes
  • HNSW results match sequential scan ground truth across all test queries

Test plan

  • Build with cargo pgrx install --release
  • Create table with ruvector(N) column and HNSW index using ruvector_cosine_ops
  • Run SELECT ... ORDER BY embedding <=> query::ruvector LIMIT k multiple times in the same psql session
  • Compare HNSW results against sequential scan (SET enable_indexscan = off)
  • Verify no crashes on repeated queries

🤖 Generated with Claude Code

Six bugs fixed in the HNSW access method:

1. SIGSEGV on repeated queries: beginscan must allocate xs_orderbyvals
   and xs_orderbynulls arrays (like GiST/SP-GiST do). Without this,
   the executor writes distance values to stale palloc memory, causing
   segfaults on the second+ query in the same backend.

2. Empty HNSW graph: connect_node_to_neighbors was a no-op TODO stub.
   Implemented bidirectional connections (new→neighbors, neighbor→new)
   with pruning at M/M0 capacity limits.

3. Wrong distance metric: hnsw_build hardcoded DistanceMetric::Euclidean
   regardless of the operator class used (e.g. ruvector_cosine_ops).
   Added metric_from_index() to read the metric from the opclass
   support function via index_getprocid + get_func_name.

4. Wrong result ordering: BinaryHeap::into_iter().take(k) returns k
   arbitrary items, not the k closest. Removed .take(k) before sort.

5. xs_recheckorderby must be false: setting it to true triggers PG17's
   IndexNextWithReorder distance comparison, which errors on harmless
   floating-point differences between index-stored and heap vectors.

6. Use-after-free in endscan: added null check and null-out of
   scan->opaque to prevent double-free across rescans.

Also applied the same xs_orderbyvals fix to ivfflat_ambeginscan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant