From 69fcd005ef18d533bf3d7ef3179fcd5db903e246 Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Thu, 19 Feb 2026 11:53:40 -0500
Subject: [PATCH] Clarify disk usage during compaction

---
 docs/indexing/reindexing.mdx | 35 +++++++++++++++++++++++++----------
 docs/lance.mdx               | 14 ++++++++------
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/docs/indexing/reindexing.mdx b/docs/indexing/reindexing.mdx
index 25fe620..5f5b683 100644
--- a/docs/indexing/reindexing.mdx
+++ b/docs/indexing/reindexing.mdx
@@ -14,13 +14,16 @@ As data is being added and a reindex operation is running, LanceDB will combine
 Rather than dropping an existing index entirely and reindexing from scratch, LanceDB supports **incremental indexing**.
-## Incremental Indexing
+## Incremental Reindexing
-OSS
+You can manually trigger an incremental indexing operation on updated data
+using the `optimize()` method on a table.
-In LanceDB OSS, you can manually trigger an incremental indexing operation using the `optimize()`
-method on a table. This will perform compaction, pruning and updating of the index on the specified
-table.
+Table optimization performs three maintenance operations:
+
+1. **Compaction**: merges small fragments into larger ones to improve read performance
+2. **Pruning/Cleanup**: removes files from versions older than a retention window (7 days by default)
+3. **Index update**: adds newly-ingested data to existing indexes
@@ -36,11 +39,23 @@ LanceDB Cloud/Enterprise support incremental reindexing through an automated bac
 - While indexes are being rebuilt, queries use brute force methods on unindexed rows, which may temporarily increase latency. To avoid this, set `fast_search=True` to search only indexed data.
 - Use `index_stats()` to view the number of unindexed rows. This will be zero when indexes are fully up-to-date.
-
-
-**Performance and simplicity**
- The benefit of using LanceDB Cloud & Enterprise is that they automate the reindexing process and operate continuously in the background, minimizing the impact on latency under high loads. In OSS, you must manually manage the reindexing cadence based on your data growth and performance needs.
-
\ No newline at end of file
+
+## Disk utilization
+
+Compaction by itself does not immediately free disk space, and can temporarily increase it because new
+compacted files are written before old-version files are deleted. Disk space is reclaimed when old versions
+are pruned during cleanup. Set retention only as low as your rollback and time-travel requirements allow.
+
+If you need to reclaim space more aggressively in OSS, use a shorter retention window:
+
+ ```python Python icon=Python
+ from datetime import timedelta
+
+ table.optimize(cleanup_older_than=timedelta(days=1))
+ ```
+
+
+
diff --git a/docs/lance.mdx b/docs/lance.mdx
index 9381e47..0f5faa0 100644
--- a/docs/lance.mdx
+++ b/docs/lance.mdx
@@ -73,13 +75,15 @@ throughput (i.e., keep latencies down to a minimum). Compaction is the process o
 together to reduce the amount of metadata that needs to be managed, and to reduce the number of files that need to be opened while scanning the dataset.
-### Performance Optimization Through Compaction
+Running compaction on a Lance dataset will do the following:
-Compaction performs the following tasks in the background:
+- Remove deleted rows from fragments
+- Remove dropped columns from fragments
+- Merge small fragments into larger ones
-- Removes deleted rows from fragments
-- Removes dropped columns from fragments
-- Merges small fragments into larger ones
+Compaction focuses on read performance, not immediate disk reclamation. During compaction, Lance writes
+new compacted files while older files are still referenced by previous table versions. This means disk
+usage can increase temporarily until old versions are cleaned up.
 ### Data deletion and recovery
@@ -97,4 +99,4 @@ exists based on your backup policy.
 href="https://lance.org/quickstart"
 >
 Lance is a separate open source project. Check out its documentation to learn more.
-
\ No newline at end of file
+