Commit 0a9ffc1
fix(posting): defensive-copy reused key buffers in ReadPostingList + addKeyToBatch
The 21million live-load systest failed with `~genre @filter(uid(F))`
returning wrong films for some genres — corruption pattern was that
one genre's reverse list contained another genre's films verbatim,
while a third genre's reverse list was missing thousands of entries.
Different genres were affected on each load, but always: legacy
runMutation produced correct reverse counts; the per-predicate
pipeline produced corrupt ones.
Root cause: the pipeline's ProcessCount per-uid loop allocates one
key buffer per ProcessCount call and mutates its trailing 8 bytes
each iteration via `binary.BigEndian.PutUint64(dataKey[len-8:], uid)`.
Two distinct sites then captured the slice header rather than the
bytes:
1. ReadPostingList's `l.key = key` aliased the caller's buffer.
saveInCache stores a copyList of the freshly-read list, and
copyList sets `key: l.key` — also an alias. The cached list's
key field therefore points at a buffer that the pipeline keeps
mutating; by the time the async rollup path retrieves the
cached list and runs `kv.Key = alloc.Copy(l.key)` to build the
rolled-up KV, the bytes are whatever the LAST iteration left
behind. Rollup then writes a BitCompletePosting with
WithDiscard() to the wrong key, overwriting an unrelated
reverse list with this list's (rolled-up) contents.
2. ReadPostingList's defer `IncrRollup.addKeyToBatch(key, ...)`
appended the slice header verbatim to the rollup queue. Every
queued entry from one ProcessCount goroutine ended up pointing
at the same shared dataKey buffer; by the time the rollup
goroutine processed the batch the bytes had collapsed to the
final iteration's uid, redirecting many distinct rollup
targets to the same key.
Both sites fixed by taking ownership of the bytes (`append nil` /
explicit `make+copy`). Legacy runMutation hits ReadPostingList too
but allocates a fresh key per call, so it never aliased anything;
the bug is only visible when a caller deliberately reuses one
buffer across many uids the way ProcessCount does.
Verified end-to-end on the systest/21million/live load with
mutations-pipeline-threshold=1: a fresh load + sweep across all 764
Genre entities now reports `count(~genre) == count(forward edges)`
for every genre. Pre-fix the same sweep showed 4 mismatched genres
with thousands of stale or missing reverse entries; on a different
load it showed Documentary missing 23,325 of its 31,370 reverse
entries while Children's/Family had 1,435 stale Crime-Thriller
entries.
Unit tests in posting/ and worker/ still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent de90d97 commit 0a9ffc1
1 file changed
Lines changed: 22 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
137 | 137 | | |
138 | 138 | | |
139 | 139 | | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
140 | 151 | | |
141 | 152 | | |
142 | | - | |
| 153 | + | |
143 | 154 | | |
144 | 155 | | |
145 | 156 | | |
| |||
678 | 689 | | |
679 | 690 | | |
680 | 691 | | |
681 | | - | |
| 692 | + | |
| 693 | + | |
| 694 | + | |
| 695 | + | |
| 696 | + | |
| 697 | + | |
| 698 | + | |
| 699 | + | |
| 700 | + | |
| 701 | + | |
682 | 702 | | |
683 | 703 | | |
684 | 704 | | |
| |||
0 commit comments