refactor: Try to simplify StringColumn's CompactionOnDump, compact in place#181

Closed
zhanglei1949 wants to merge 4 commits into alibaba:main from zhanglei1949:zl/ref-str-col-comp

Conversation

@zhanglei1949
Member

@zhanglei1949 zhanglei1949 commented Apr 7, 2026

Another solution in #187
Fix #177

Contributor

@greptile-apps greptile-apps bot left a comment


@qodo-code-review

Review Summary by Qodo

Simplify StringColumn compaction to in-place algorithm

✨ Enhancement 🐞 Bug fix


Walkthroughs

Description
• Simplify StringColumn compaction by replacing streaming approach with in-place algorithm
• Move file_header.h include from column.h to id_indexer.h for better dependency management
• Remove unordered_set include and CompactionPlan struct, reducing complexity
• Implement O(size_ log size_) in-place compaction that reclaims stale bytes from append-only updates
Diagram
flowchart LR
  A["StringColumn dump()"] --> B["compact_in_place()"]
  B --> C["Sort entries by offset"]
  C --> D["Memmove bytes in-place"]
  D --> E["Update item offsets"]
  E --> F["Resize data_buffer_"]
  F --> G["Store new pos_"]


File Changes

1. include/neug/utils/id_indexer.h Dependencies +1/-0

Add file_header.h include

• Added include for neug/storages/container/file_header.h
• Moved file_header.h dependency from column.h to id_indexer.h

include/neug/utils/id_indexer.h


2. include/neug/utils/property/column.h ✨ Enhancement +61/-139

Refactor StringColumn compaction to in-place algorithm

• Removed #include <unordered_set> (no longer needed)
• Removed #include "neug/storages/container/file_header.h" (moved to id_indexer.h)
• Replaced streaming compaction approach with simpler in-place algorithm in dump() method
• Removed CompactionPlan struct and prepare_compaction_plan() method
• Removed stream_compact_and_dump() method with file I/O and MD5 logic
• Implemented new compact_in_place() method using memmove and offset remapping

include/neug/utils/property/column.h



@qodo-code-review

qodo-code-review bot commented Apr 7, 2026

Code Review by Qodo

🐞 Bugs (4)   📘 Rule violations (0)   📎 Requirement gaps (1)   🎨 UX Issues (0)
🐞 ≡ Correctness (3) ➹ Performance (1) ⭐ New (2)
📎 ⚙ Maintainability (1)



Action required

1. Compaction fast-exit wrong 🐞
Description
compact_in_place() can incorrectly skip compaction when sum(item.length) == pos_ even though stale
bytes exist, because the sum double-counts shared offsets (from resize(default)) and can numerically
cancel out unreferenced stale bytes (from append-only updates). This leaves pos_ inflated and stale
bytes persisted into dumps.
Code

include/neug/utils/property/column.h[R465-475]

+  void compact_in_place() {
+    if (size_ == 0)
+      return;
+    size_t total = 0;
+    for (size_t i = 0; i < size_; ++i) {
+      total += get_string_item(i).length;
+    }
+    if (total == pos_.load()) {
+      VLOG(10) << "StringColumn compact_in_place: no stale bytes to reclaim";
+      return;
+    }
Evidence
compact_in_place() uses total += get_string_item(i).length and early-returns when total == pos_,
but StringColumn can (a) share a single backing string across many slots via resize(default) and (b)
leave old bytes stale due to append-only set_value(). In such a state, total (which counts
duplicates) can equal pos_ (which includes stale bytes), so the function incorrectly concludes there
are “no stale bytes to reclaim” and skips compaction.

include/neug/utils/property/column.h[465-475]
include/neug/utils/property/column.h[316-340]
include/neug/utils/property/column.h[352-361]
include/neug/utils/property/column.h[286-296]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`compact_in_place()` uses `sum(lengths) == pos_` as a “no stale bytes” fast-exit, but that sum counts duplicate offsets multiple times (shared defaults), which can mask the presence of unreferenced stale bytes. This causes compaction to be skipped when it should run.

### Issue Context
StringColumn can create duplicate offsets via `resize(size, default)` (many slots share one {offset,length}) and create stale bytes via append-only `set_value()` updates. The fast-exit must reason about **unique referenced byte ranges**, not the sum of per-slot lengths.

### Fix Focus Areas
- include/neug/utils/property/column.h[465-522]

### Suggested direction
Replace the fast-exit with a check based on **unique offsets** (e.g., track `seen_offsets` and compute `unique_total += length` only on first occurrence of an offset). Only skip compaction when `unique_total == pos_` (and optionally validate monotonic/non-overlap invariants if desired).



2. Dump re-expands compacted data 🐞
Description
dump() compacts data_buffer_ down to write_ptr but then immediately calls resize(size_), which can
grow data_buffer_ back up to max(size_ * string_avg_size(), pos_). Since container Dump() writes
the full container payload size, dumps can include large unused (zero-filled) tails and negate
compaction savings.
Code

include/neug/utils/property/column.h[R286-296]

  void dump(const std::string& filename) override {
-    // Compact before dumping.  StringColumn uses an append-only strategy for
-    // updates, leaving stale copies in data_buffer_.  When there is reused
-    // data we stream the compacted bytes directly to the output file in a
-    // single forward pass, computing MD5 on-the-fly, which avoids:
-    //   1. A temporary buffer allocation (effective_size bytes).
-    //   2. The memcpy from temp_buf → data_buffer_.
-    //   3. The subsequent container Dump() copy.
-    // When there is nothing to compact we fall through and let the container
-    // handle the write as usual (e.g. reflink / copy_file_range via
-    // FileSharedMMap::Dump, or a single fwrite via MMapContainer::Dump).
    if (!items_buffer_ || !data_buffer_) {
      THROW_RUNTIME_ERROR("Buffers not initialized for dumping");
    }
+    compact_in_place();
    resize(size_);  // Resize the string column with avg size to shrink or
                    // expand data buffer
-    size_t pos_val;
-    if (size_ > 0) {
-      auto plan = prepare_compaction_plan();
-      if (plan.reused_size > 0) {
-        // Stream path: source (data_buffer_) and destination (snapshot file)
-        // are always different files, so there is no aliasing hazard.
-        pos_val = stream_compact_and_dump(plan, filename + ".data");
-        write_file(filename + ".pos", &pos_val, sizeof(pos_val), 1);
-        items_buffer_->Dump(filename + ".items");
-        items_buffer_->Close();
-        data_buffer_->Close();
-        return;
-      }
-    }
-    pos_val = pos_.load();
-    // No-compaction path: dump containers as-is.
+    size_t pos_val = pos_.load();
    write_file(filename + ".pos", &pos_val, sizeof(pos_val), 1);
    items_buffer_->Dump(filename + ".items");
    data_buffer_->Dump(filename + ".data");
Evidence
In dump(), compact_in_place() shrinks the underlying container to write_ptr and sets `pos_ = write_ptr`. Then `resize(size_)` can expand data_buffer_ to `max(size_ * avg_size, pos_)`, where avg_size is computed per-slot (duplicates included). MMapContainer::Dump() writes size_ bytes of payload (the full container allocation), so the expanded tail is persisted to the snapshot file even though pos_ indicates it is unused.

include/neug/utils/property/column.h[286-296]
include/neug/utils/property/column.h[303-313]
include/neug/utils/property/column.h[436-448]
include/neug/utils/property/column.h[517-521]
src/storages/container/mmap_container.cc[132-151]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
After compaction, `dump()` calls `resize(size_)`, which may expand `data_buffer_` beyond `pos_`. Because `IDataContainer::Dump()` persists the entire container payload size, snapshot `.data` files can become much larger than the actually-used string data (with large unused/zero regions).

### Issue Context
- `compact_in_place()` shrinks the container to `write_ptr` and sets `pos_ = write_ptr`.
- `resize(size_)` can expand `data_buffer_` to `max(size * avg_size, pos_)`.
- `MMapContainer::Dump()` writes the full payload `size_`.

### Fix Focus Areas
- include/neug/utils/property/column.h[286-313]
- include/neug/utils/property/column.h[303-313]
- include/neug/utils/property/column.h[465-522]
- src/storages/container/mmap_container.cc[132-151]

### Suggested direction
Ensure `.data` dumps only persist the used prefix (`pos_`) rather than the container’s reserved capacity. Options include:
- In `dump()`, avoid re-expanding `data_buffer_` after compaction (remove/move `resize(size_)`, or make it shrink-only for string columns during dump).
- Or explicitly `data_buffer_->Resize(pos_.load())` immediately before `data_buffer_->Dump(...)`.
- If reserved capacity must be preserved across reopen, persist capacity separately and re-reserve on open (e.g., via existing EnsureCapacity paths) rather than by dumping zero-filled tail bytes.



3. Empty offsets not reset 🐞
Description
compact_in_place() skips entries with length==0 without rewriting their offset, then shrinks
data_buffer_, so later get_view() can assert/crash or hit undefined behavior when computing
raw_data + item.offset for empty strings whose offset is now out of range.
Code

include/neug/utils/property/column.h[R469-473]

+    for (const auto& e : entries) {
+      if (e.length == 0) {
+        // Empty string: no bytes to move; slot keeps offset 0.
+        continue;
+      }
Evidence
Empty strings can have non-zero offsets (because set_value() stores offset = pos_.fetch_add(0)),
but the compaction loop continues for length==0 and never normalizes the slot. After
data_buffer_->Resize(write_ptr) shrinks the buffer, get_view() asserts `item.offset + item.length <= data_buffer_->GetDataSize()` and still computes `raw_data + item.offset`, which
becomes out-of-bounds for those empty-string slots.

include/neug/utils/property/column.h[346-366]
include/neug/utils/property/column.h[378-383]
include/neug/utils/property/column.h[469-472]
include/neug/utils/property/column.h[489-491]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`compact_in_place()` skips `length==0` entries and leaves their `string_item.offset` unchanged. When the function later shrinks `data_buffer_`, those offsets can become greater than the new buffer size, causing `get_view()` to assert/crash and also risking UB from pointer arithmetic.
### Issue Context
Empty strings are legal values and are created by `set_value()` with `copied_val.size()==0` (offset is still recorded).
### Fix Focus Areas
- include/neug/utils/property/column.h[469-486]
### Suggested fix
When `e.length == 0`, explicitly set the slot to `{0, 0}` (or otherwise guarantee `offset <= new_data_size`). Optionally add a defensive assertion that `write_ptr <= e.offset` for non-empty entries.



4. Compaction removes update capacity 🐞
Description
compact_in_place() shrinks data_buffer_ to exactly the compacted payload size
(Resize(write_ptr)), so after reopening from checkpoint there is no slack space and set_value()
updates (used by UpdateProperty via set_any with insert_safe=false) will throw "not enough
space" as soon as any non-empty update is attempted.
Code

include/neug/utils/property/column.h[R489-491]

+    size_t old_pos = pos_.load();
+    data_buffer_->Resize(write_ptr);
+    pos_.store(write_ptr);
Evidence
Compaction persists a .data file whose payload size equals write_ptr and sets pos_ to the same
value. On reopen, containers set GetDataSize() from the file size (size_ = file_size - header),
and the StringColumn open() path does not expand the data buffer. Since set_value() requires
pos_.load() + len GetDataSize() and does not resize, any non-empty update will fail when `pos_ ==
GetDataSize(). This is reachable because VertexTable::UpdateProperty calls set_any` without
insert_safe, which routes string updates through set_value() (not set_value_safe()).
Additionally, PropertyGraph::Open calls EnsureCapacity() based on vertex count; if the
checkpoint already has sufficient reserved *row* capacity, EnsureCapacity() does not call
table_->resize(), so it will not re-inflate the string data buffer capacity lost due to dump-time
shrink.

include/neug/utils/property/column.h[248-258]
include/neug/utils/property/column.h[346-366]
include/neug/utils/property/column.h[489-491]
src/storages/container/mmap_container.cc[38-67]
src/storages/graph/vertex_table.cc[175-185]
src/storages/graph/property_graph.cc[811-819]
src/storages/graph/vertex_table.cc[140-156]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`compact_in_place()` shrinks the persisted string data buffer to exactly the compacted payload size. After reopening from checkpoint, `GetDataSize()` equals this compact size, and `set_value()` rejects any non-empty update because `pos_ == GetDataSize()` and `set_value()` does not auto-resize. This breaks common update paths that call `set_any(..., insert_safe=false)`.
### Issue Context
- `VertexTable::UpdateProperty()` calls `set_any(vid, value)` without `insert_safe=true`.
- `PropertyGraph::Open()` calls `EnsureCapacity()` but it only resizes the table if the *row capacity* is insufficient; it does not address string data-buffer slack lost by dump compaction.
### Fix Focus Areas
- include/neug/utils/property/column.h[248-275]
- include/neug/utils/property/column.h[489-492]
- src/storages/graph/vertex_table.cc[140-156]
### Suggested fix options (pick one)
1. **Preferred (keeps compact snapshots):** After `init_pos()` in `open()/open_in_memory()/open_with_hugepages()`, proactively resize `data_buffer_` to restore working slack (e.g., at least `size_ * width_`, or another persisted/derived capacity), without changing `pos_`.
2. **Alternative (preserves slack in snapshot):** Do not shrink the container in `compact_in_place()` (remove/avoid `data_buffer_->Resize(write_ptr)`), only update `pos_` and offsets.
Either approach must ensure that after reopen, `pos_.load() + some_update_bytes <= data_buffer_->GetDataSize()` for typical non-`insert_safe` updates.




Remediation recommended

5. compact_in_place() logic in header 📎
Description
The string column compaction implementation (compact_in_place) is still defined in
include/neug/utils/property/column.h, keeping substantial compaction logic in the header. This
does not meet the refactor objective of moving string compaction logic out of column.h into more
appropriate compilation units/modules.
Code

include/neug/utils/property/column.h[R423-494]

+  // Compact data_buffer_ in-place, reclaiming stale bytes left by
+  // append-only updates (each set_value appends without freeing the old copy).
+  //
+  // Algorithm (O(size_ log size_) time, O(size_) metadata space):
+  //   0. Fast-exit: sum(item.length) == pos_ means every byte is referenced
+  //      exactly once (no stale bytes, no shared offsets) — nothing to do.
+  //   1. Collect {slot, old_offset, length} for every item and sort by offset.
+  //   2. Walk in offset order with write_ptr=0:
+  //        - First occurrence of an offset: memmove bytes down to write_ptr
+  //          (safe because write_ptr <= old_offset always holds after sorting),
+  //          record old→new mapping, advance write_ptr.
+  //        - Duplicate offset (slots sharing default value from resize()):
+  //          look up new offset via old→new map, update item only.
+  //   3. Resize data_buffer_ to write_ptr and store new pos_.
+  void compact_in_place() {
+    if (size_ == 0)
+      return;
+    size_t total = 0;
+    for (size_t i = 0; i < size_; ++i) {
+      total += get_string_item(i).length;
+    }
+    if (total == pos_.load()) {
+      VLOG(10) << "StringColumn compact_in_place: no stale bytes to reclaim";
+      return;
+    }
   struct Entry {
-      size_t index;     // position in items array
-      uint64_t offset;  // current byte offset in the data buffer
-      uint32_t length;  // byte length of the string
+      size_t index;
+      uint64_t offset;
+      uint32_t length;
   };
-    std::vector<Entry> entries;
-    size_t total_size = 0;   // sum of all entry lengths (includes duplicates)
-    size_t reused_size = 0;  // bytes shared by entries that point to the same
-                             // offset (i.e. slots updated with the same value)
-  };

-  // Scan items and build a compaction plan that records which offsets are
-  // shared across multiple item slots (those are stale copies from updates).
-  CompactionPlan prepare_compaction_plan() const {
-    CompactionPlan plan;
-    plan.entries.reserve(size_);
-    std::unordered_set<uint64_t> seen_offsets;
+    std::vector<Entry> entries;
+    entries.reserve(size_);
   for (size_t i = 0; i < size_; ++i) {
     const auto item = get_string_item(i);
-      plan.total_size += item.length;
-      plan.entries.push_back(
-          {i, item.offset, static_cast<uint32_t>(item.length)});
-      if (item.length > 0) {
-        if (seen_offsets.count(item.offset)) {
-          // This offset is already referenced by an earlier slot: the current
-          // slot was set via resize(default) and shares the same backing bytes.
-          plan.reused_size += item.length;
-        } else {
-          seen_offsets.insert(item.offset);
-        }
-      }
-    }
-    return plan;
-  }
-
-  // Remove stale string data produced by update operations (which append a new
-  // copy without reclaiming the old one) by streaming compacted bytes directly
-  // to data_filename.
-  //   * Iterates plan.entries in order; unique offsets are fwrite'd once.
-  //   * MD5 is accumulated in a single forward pass and written into the
-  //     FileHeader at position 0 via fseek after all data is written.
-  //   * item offsets in items_buffer_ are updated to match the new layout.
-  //   * pos_ is updated to effective_size for future appends.
-  // Returns the effective (compacted) data size.
-  size_t stream_compact_and_dump(const CompactionPlan& plan,
-                                 const std::string& data_filename) {
-    auto parent_dir = std::filesystem::path(data_filename).parent_path();
-    if (!parent_dir.empty()) {
-      std::filesystem::create_directories(parent_dir);
-    }
-    FILE* fout = fopen(data_filename.c_str(), "wb");
-    if (!fout) {
-      THROW_IO_EXCEPTION("Failed to open output for stream compaction: " +
-                         data_filename);
+      entries.push_back({i, item.offset, static_cast<uint32_t>(item.length)});
   }

-    // Write a placeholder header; will be overwritten after MD5 is finalised.
-    FileHeader header{};
-    if (fwrite(&header, sizeof(header), 1, fout) != 1) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to write placeholder header: " +
-                         data_filename);
-    }
+    std::sort(
+        entries.begin(), entries.end(),
+        [](const Entry& a, const Entry& b) { return a.offset < b.offset; });

-    const auto* raw_data =
-        reinterpret_cast<const char*>(data_buffer_->GetData());
-    size_t write_offset = 0;
-    std::unordered_map<uint64_t, uint64_t> old_offset_to_new;
-    MD5_CTX md5_ctx;
-    MD5_Init(&md5_ctx);
-
-    for (const auto& entry : plan.entries) {
-      if (entry.length > 0) {
-        auto it = old_offset_to_new.find(entry.offset);
-        if (it != old_offset_to_new.end()) {
-          // Duplicate offset: already written; just remap the item.
-          set_string_item(entry.index,
-                          {it->second, static_cast<uint32_t>(entry.length)});
-          continue;
-        }
-        old_offset_to_new.emplace(entry.offset, write_offset);
-        const char* src = raw_data + entry.offset;
-        if (fwrite(src, 1, entry.length, fout) != entry.length) {
-          fclose(fout);
-          THROW_IO_EXCEPTION("Failed to fwrite compacted data to: " +
-                             data_filename);
+    auto* raw = reinterpret_cast<char*>(data_buffer_->GetData());
+    size_t write_ptr = 0;
+    std::unordered_map<uint64_t, uint64_t> old_to_new;
+
+    for (const auto& e : entries) {
+      if (e.length == 0) {
+        // Empty string: no bytes to move; slot keeps offset 0.
+        continue;
+      }
+      auto it = old_to_new.find(e.offset);
+      if (it == old_to_new.end()) {
+        // First time we see this offset: slide bytes down if needed.
+        if (write_ptr != e.offset) {
+          memmove(raw + write_ptr, raw + e.offset, e.length);
       }
-        MD5_Update(&md5_ctx, src, entry.length);
+        old_to_new.emplace(e.offset, write_ptr);
+        set_string_item(e.index, {write_ptr, e.length});
+        write_ptr += e.length;
+      } else {
+        // Duplicate offset (shared default): remap to the new location.
+        set_string_item(e.index, {it->second, e.length});
     }
-      set_string_item(entry.index,
-                      {write_offset, static_cast<uint32_t>(entry.length)});
-      write_offset += entry.length;
-    }
-
-    // Seek back and stamp the real MD5 into the file header.
-    MD5_Final(header.data_md5, &md5_ctx);
-    if (fseek(fout, 0, SEEK_SET) != 0) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to seek to header in: " + data_filename);
-    }
-    if (fwrite(&header, sizeof(header), 1, fout) != 1) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to write final header to: " + data_filename);
-    }
-    if (fflush(fout) != 0) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to fflush: " + data_filename);
-    }
-    if (fclose(fout) != 0) {
-      THROW_IO_EXCEPTION("Failed to fclose: " + data_filename);
   }

-    size_t effective_size = plan.total_size - plan.reused_size;
-    pos_.store(effective_size);
-    VLOG(1) << "StringColumn stream compaction: " << plan.total_size << " -> "
-            << effective_size << " bytes saved to " << data_filename;
-    return effective_size;
+    size_t old_pos = pos_.load();
+    data_buffer_->Resize(write_ptr);
+    pos_.store(write_ptr);
+    VLOG(1) << "StringColumn compact_in_place: " << old_pos << " -> "
+            << write_ptr << " bytes";
 }
Evidence
PR Compliance ID 1 requires refactoring string column stream compaction to avoid keeping excessive
implementation in column.h (ideally moving logic into .cc/modules). The PR replaces prior
streaming compaction code with a new in-place compaction algorithm, but that algorithm remains
implemented directly in column.h as compact_in_place().

Refactor string column stream compaction to reduce code in column.h
include/neug/utils/property/column.h[423-494]
include/neug/utils/property/column.h[286-288]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`TypedColumn<std::string_view>::compact_in_place()` contains non-trivial compaction implementation directly in `include/neug/utils/property/column.h`, which conflicts with the compliance objective to reduce/move string compaction logic out of the header.
## Issue Context
`column.h` is a heavily included header; keeping sorting, memmove-based compaction, and offset mapping logic inline increases header complexity and rebuild costs. If template constraints prevent a straightforward move to a `.cc`, consider alternatives like explicit instantiation for the string specialization, or extracting the implementation into a dedicated module/header to keep `column.h` minimal.
## Fix Focus Areas
- include/neug/utils/property/column.h[423-494]
- include/neug/utils/property/column.h[286-288]




Previous review results

Review updated until commit 07390e4

Results up to commit 270f93c


🐞 Bugs (2)  
📘 Rule violations (0)  
📎 Requirement gaps (1)  
🎨 UX Issues (0)

🐞 ≡ Correctness (2)
📎 ⚙ Maintainability (1)

-        } else {
-          seen_offsets.insert(item.offset);
-        }
-      }
-    }
-    return plan;
-  }
-
-  // Remove stale string data produced by update operations (which append a new
-  // copy without reclaiming the old one) by streaming compacted bytes directly
-  // to data_filename.
-  //   * Iterates plan.entries in order; unique offsets are fwrite'd once.
-  //   * MD5 is accumulated in a single forward pass and written into the
-  //     FileHeader at position 0 via fseek after all data is written.
-  //   * item offsets in items_buffer_ are updated to match the new layout.
-  //   * pos_ is updated to effective_size for future appends.
-  // Returns the effective (compacted) data size.
-  size_t stream_compact_and_dump(const CompactionPlan& plan,
-                                 const std::string& data_filename) {
-    auto parent_dir = std::filesystem::path(data_filename).parent_path();
-    if (!parent_dir.empty()) {
-      std::filesystem::create_directories(parent_dir);
-    }
-    FILE* fout = fopen(data_filename.c_str(), "wb");
-    if (!fout) {
-      THROW_IO_EXCEPTION("Failed to open output for stream compaction: " +
-                         data_filename);
+      entries.push_back({i, item.offset, static_cast<uint32_t>(item.length)});
    }

-    // Write a placeholder header; will be overwritten after MD5 is finalised.
-    FileHeader header{};
-    if (fwrite(&header, sizeof(header), 1, fout) != 1) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to write placeholder header: " +
-                         data_filename);
-    }
+    std::sort(
+        entries.begin(), entries.end(),
+        [](const Entry& a, const Entry& b) { return a.offset < b.offset; });

-    const auto* raw_data =
-        reinterpret_cast<const char*>(data_buffer_->GetData());
-    size_t write_offset = 0;
-    std::unordered_map<uint64_t, uint64_t> old_offset_to_new;
-    MD5_CTX md5_ctx;
-    MD5_Init(&md5_ctx);
-
-    for (const auto& entry : plan.entries) {
-      if (entry.length > 0) {
-        auto it = old_offset_to_new.find(entry.offset);
-        if (it != old_offset_to_new.end()) {
-          // Duplicate offset: already written; just remap the item.
-          set_string_item(entry.index,
-                          {it->second, static_cast<uint32_t>(entry.length)});
-          continue;
-        }
-        old_offset_to_new.emplace(entry.offset, write_offset);
-        const char* src = raw_data + entry.offset;
-        if (fwrite(src, 1, entry.length, fout) != entry.length) {
-          fclose(fout);
-          THROW_IO_EXCEPTION("Failed to fwrite compacted data to: " +
-                             data_filename);
+    auto* raw = reinterpret_cast<char*>(data_buffer_->GetData());
+    size_t write_ptr = 0;
+    std::unordered_map<uint64_t, uint64_t> old_to_new;
+
+    for (const auto& e : entries) {
+      if (e.length == 0) {
+        // Empty string: no bytes to move; slot keeps offset 0.
+        continue;
+      }
+      auto it = old_to_new.find(e.offset);
+      if (it == old_to_new.end()) {
+        // First time we see this offset: slide bytes down if needed.
+        if (write_ptr != e.offset) {
+          memmove(raw + write_ptr, raw + e.offset, e.length);
        }
-        MD5_Update(&md5_ctx, src, entry.length);
+        old_to_new.emplace(e.offset, write_ptr);
+        set_string_item(e.index, {write_ptr, e.length});
+        write_ptr += e.length;
+      } else {
+        // Duplicate offset (shared default): remap to the new location.
+        set_string_item(e.index, {it->second, e.length});
      }
-      set_string_item(entry.index,
-                      {write_offset, static_cast<uint32_t>(entry.length)});
-      write_offset += entry.length;
-    }
-
-    // Seek back and stamp the real MD5 into the file header.
-    MD5_Final(header.data_md5, &md5_ctx);
-    if (fseek(fout, 0, SEEK_SET) != 0) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to seek to header in: " + data_filename);
-    }
-    if (fwrite(&header, sizeof(header), 1, fout) != 1) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to write final header to: " + data_filename);
-    }
-    if (fflush(fout) != 0) {
-      fclose(fout);
-      THROW_IO_EXCEPTION("Failed to fflush: " + data_filename);
-    }
-    if (fclose(fout) != 0) {
-      THROW_IO_EXCEPTION("Failed to fclose: " + data_filename);
    }

-    size_t effective_size = plan.total_size - plan.reused_size;
-    pos_.store(effective_size);
-    VLOG(1) << "StringColumn stream compaction: " << plan.total_size << " -> "
-            << effective_size << " bytes saved to " << data_filename;
-    return effective_size;
+    size_t old_pos = pos_.load();
+    data_buffer_->Resize(write_ptr);
+    pos_.store(write_ptr);
+    VLOG(1) << "StringColumn compact_in_place: " << old_pos << " -> "
+            << write_ptr << " bytes";
  }
Evidence
PR Compliance ID 1 requires refactoring string column stream compaction to avoid keeping excessive
implementation in column.h (ideally moving logic into .cc/modules). The PR replaces prior
streaming compaction code with a new in-place compaction algorithm, but that algorithm remains
implemented directly in column.h as compact_in_place().

Refactor string column stream compaction to reduce code in column.h
include/neug/utils/property/column.h[423-494]
include/neug/utils/property/column.h[286-288]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`TypedColumn<std::string_view>::compact_in_place()` contains non-trivial compaction implementation directly in `include/neug/utils/property/column.h`, which conflicts with the compliance objective to reduce/move string compaction logic out of the header.

## Issue Context
`column.h` is a heavily included header; keeping sorting, memmove-based compaction, and offset mapping logic inline increases header complexity and rebuild costs. If template constraints prevent a straightforward move to a `.cc`, consider alternatives like explicit instantiation for the string specialization, or extracting the implementation into a dedicated module/header to keep `column.h` minimal.

## Fix Focus Areas
- include/neug/utils/property/column.h[423-494]
- include/neug/utils/property/column.h[286-288]

zhanglei1949 force-pushed the zl/ref-str-col-comp branch from 270f93c to 624eb97 on April 7, 2026 09:38
@zhanglei1949
Member Author

/review

@qodo-code-review

qodo-code-review bot commented Apr 7, 2026

Persistent review updated to latest commit 07390e4

zhanglei1949 requested a review from Copilot on April 7, 2026 09:50
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR refactors StringColumn’s dump-time compaction by removing the previous streaming-to-file compaction path and replacing it with an in-place compaction of data_buffer_ before dumping, addressing #177.

Changes:

  • Replace the streaming compaction-on-dump implementation with an in-place compact_in_place() routine.
  • Simplify dump() to always compact, then dump containers normally.
  • Move file_header.h include usage away from column.h into id_indexer.h.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
include/neug/utils/property/column.h Removes streaming compaction dump path and introduces in-place compaction logic for string data.
include/neug/utils/id_indexer.h Adds file_header.h include (presumably to keep header definitions available where needed after refactor).


zhanglei1949 changed the title from "refactor: Try to simplify StringColumn's CompactionOnDump" to "refactor: Try to simplify StringColumn's CompactionOnDump, compact in place" on Apr 8, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.




Development

Successfully merging this pull request may close these issues.

Refactor string column stream compaction
