feat: add compression module with zstd/gzip/lzma support#165

Closed
cluster2600 wants to merge 1 commit into alibaba:main from cluster2600:feat/compression-implementation

Conversation

@cluster2600 (Contributor) commented Feb 24, 2026

Summary

  • Add Python compression module with zstd, gzip, and lzma backends
  • Add streaming compression API for large datasets
  • Enable RocksDB runtime compression codec detection (ZSTD → LZ4 → Snappy fallback)
  • Extend CollectionSchema with compression configuration
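
A minimal sketch of what a pluggable-codec API along these lines might look like (module layout and function names here are illustrative, not the PR's actual interface; gzip and lzma are in the standard library, while a zstd backend would need the third-party `zstandard` package and is omitted):

```python
import gzip
import lzma

# Illustrative codec registry. A "zstd" entry would require the
# third-party `zstandard` package, so only stdlib codecs appear here.
_CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def compress(data: bytes, codec: str = "gzip") -> bytes:
    """Compress raw bytes with the named codec."""
    comp, _ = _CODECS[codec]
    return comp(data)

def decompress(data: bytes, codec: str = "gzip") -> bytes:
    """Invert compress() for the same codec."""
    _, decomp = _CODECS[codec]
    return decomp(data)

payload = b"vector metadata " * 64
for name in _CODECS:
    assert decompress(compress(payload, name), name) == payload
```

A registry like this keeps backend selection a one-line lookup, so adding a codec does not touch callers.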

Changes

Python

  • python/zvec/compression.py — Core compression with pluggable backends
  • python/zvec/compression_integration.py — Collection-level compression integration
  • python/zvec/streaming.py — Streaming compression/decompression API
  • python/zvec/model/schema/collection_schema.py — Compression config in schema
  • python/zvec/__init__.py — Export compression module
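
A streaming API for large datasets typically wraps the stdlib's incremental compressor objects. The sketch below (function names are illustrative, not the PR's actual `streaming.py` interface) uses lzma's incremental API to process an iterable of chunks without holding the whole dataset in memory:

```python
import lzma
from typing import Iterable, Iterator

def compress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Lazily compress chunks, yielding compressed bytes as they become available."""
    comp = lzma.LZMACompressor()
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    # Flush emits whatever the compressor has buffered internally.
    yield comp.flush()

def decompress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Lazily invert compress_stream()."""
    decomp = lzma.LZMADecompressor()
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out

data = [b"x" * 4096 for _ in range(8)]
roundtrip = b"".join(decompress_stream(compress_stream(data)))
assert roundtrip == b"".join(data)
```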

C++

  • src/db/common/rocksdb_context.cc — Runtime detection of supported compression codecs using rocksdb::GetSupportedCompressions(), with fallback chain: ZSTD → LZ4 → Snappy → None
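
The selection rule behind that fallback chain can be illustrated in a few lines of Python (the actual change is C++ against rocksdb::GetSupportedCompressions(); the enum-style names below mirror RocksDB's CompressionType constants but the function itself is a sketch):

```python
# Preference order from the PR description: ZSTD → LZ4 → Snappy → None.
FALLBACK_CHAIN = ["kZSTD", "kLZ4Compression", "kSnappyCompression"]

def pick_compression(supported: set) -> str:
    """Return the first preferred codec the linked RocksDB build supports."""
    for codec in FALLBACK_CHAIN:
        if codec in supported:
            return codec
    return "kNoCompression"

# e.g. a build without ZSTD linked in falls back to LZ4:
assert pick_compression({"kLZ4Compression", "kSnappyCompression"}) == "kLZ4Compression"
# A build with no optional codecs degrades to no compression rather than crashing:
assert pick_compression(set()) == "kNoCompression"
```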

Tests

  • python/tests/test_compression.py
  • python/tests/test_compression_integration.py
  • python/tests/test_schema_compression.py
  • python/tests/test_streaming.py

Docs

  • docs/COMPRESSION.md — Compression usage and configuration guide

Context

Split from #157 to isolate compression feature from CI and GPU work.

Test plan

  • ruff lint and format pass
  • clang-format passes on rocksdb_context.cc
  • Compression tests pass locally
  • CI builds succeed (RocksDB compression fallback works on all platforms)

- Add compression module supporting zstd, gzip, and lzma codecs
- Add compression parameter to CollectionSchema for storage optimization
- Add compression integration module for end-to-end vector compression
- Add streaming compression API for large datasets
- Enable RocksDB compression with runtime codec detection (ZSTD → LZ4 → Snappy → none)
- Add comprehensive compression documentation and tests
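
A hedged sketch of how a compression parameter on CollectionSchema might be consumed (this is a dataclass stand-in; the real zvec schema class and its field names may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CollectionSchema:
    """Stand-in for the PR's schema extension; the real zvec API may differ."""
    name: str
    dim: int
    # None means "store vectors uncompressed"; otherwise a codec name.
    compression: Optional[str] = None

schema = CollectionSchema(name="docs", dim=768, compression="zstd")
assert schema.compression == "zstd"
```

Making the codec a schema-level setting (rather than per-call) keeps reads and writes symmetric: whatever opens the collection knows how to decode it.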

The RocksDB integration uses GetSupportedCompressions() to detect
available codecs at runtime, preventing crashes when ZSTD is not
linked into the binary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zhourrr (Collaborator) commented Feb 25, 2026

Thanks for putting together this compression proposal — we appreciate the effort and thought you've put into the design!

Our system is built around a C++ core, with Python (and other languages) serving as bindings to this vector search engine. Because of this, we generally aim to keep data-handling logic in the C++ engine, so all bindings benefit uniformly and behavior remains consistent. Are there particular use cases or constraints that make compression in Python necessary or advantageous?

I'm also curious about the expected benefit for vector data specifically. High-dimensional floating-point vectors are typically not very compressible with general-purpose algorithms. For storage efficiency we already support quantization, which trades some recall but is usually far more effective for vectors than byte-level compression.
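
The reviewer's point about compressibility is easy to check: packed float32 values with pseudo-random mantissas barely shrink under a general-purpose codec. This is a rough illustration, not a benchmark:

```python
import gzip
import random
import struct

random.seed(0)
# 10,000 pseudo-random float32 values, packed to raw little-endian bytes.
values = [random.random() for _ in range(10_000)]
raw = struct.pack(f"{len(values)}f", *values)

compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)
# High-entropy mantissa bits leave little redundancy for gzip to exploit,
# so the ratio stays close to 1.0.
assert ratio > 0.7
```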

Any additional context you can share about the problem you're trying to solve would be very helpful!

@cluster2600 (Contributor, Author) commented

Thanks for the thoughtful feedback @zhourrr! You raise valid points — keeping data-handling logic in the C++ core makes sense for consistency across all language bindings, and I agree that general-purpose compression has limited benefit for high-dimensional float vectors compared to the quantization you already support.

I'll close this one out. If compression at the C++ engine level ever becomes useful, happy to contribute there instead. Appreciate the review!

@cluster2600 cluster2600 deleted the feat/compression-implementation branch February 25, 2026 13:21


3 participants