57 changes: 57 additions & 0 deletions docs/iceberg-maintenance/optimization/configuration.mdx
@@ -0,0 +1,57 @@
---
title: Configuration
sidebar_position: 2
---

# Configuring Table Compaction in OLake

Each table in OLake can have its **own compaction schedule and advanced settings**.
Follow these steps to configure compaction for a specific table:


### 1. Click the Configure Button

Click the **Configure** button next to the table you want to optimize.
This opens a modal where you can schedule **Lite, Medium, and Full compactions**.

<!-- TODO BEFORE MERGE: Add UI screenshots -->


### 2. Set the Compaction Schedule

- Select a schedule from the **predefined dropdown options** or choose **Custom** to specify your own cron expression.
- The optimization will run automatically according to the schedule set for that table.
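
As a rough illustration of what a **Custom** schedule might look like, the sketch below checks that an expression has the five fields of a standard Unix cron expression (minute, hour, day of month, month, day of week). This guide does not state which cron dialect OLake accepts, so the five-field layout, the helper name `looks_like_cron`, and the sample schedules are all assumptions made for illustration.

```python
import re

# Assumption: standard 5-field Unix cron (minute hour day-of-month month day-of-week).
# The cron dialect OLake accepts is not specified in this guide.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]
TOKEN = re.compile(r"^(\*|\d+(-\d+)?)(/\d+)?$")


def looks_like_cron(expr: str) -> bool:
    """Return True if expr is a syntactically plausible 5-field cron expression."""
    fields = expr.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    for field, (low, high) in zip(fields, FIELD_RANGES):
        for token in field.split(","):
            match = TOKEN.match(token)
            if not match:
                return False
            base = match.group(1)
            if base != "*":
                # A base is either a single number or a "start-end" range.
                numbers = [int(n) for n in base.split("-")]
                if any(n < low or n > high for n in numbers):
                    return False
                if len(numbers) == 2 and numbers[0] > numbers[1]:
                    return False
            step = match.group(3)
            if step is not None and int(step[1:]) == 0:
                return False
    return True


# Hypothetical schedules: frequent Lite runs, daily Medium runs, weekly Full runs.
example_schedules = {
    "Lite": "0 * * * *",    # every hour, on the hour
    "Medium": "0 2 * * *",  # every day at 02:00
    "Full": "0 3 * * 0",    # every Sunday at 03:00
}
```

Running cheaper optimizations more often and the full rewrite least often is one plausible way to spread cost, not a recommendation taken from OLake.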

<!-- TODO BEFORE MERGE: Add UI screenshots -->

### 3. Advanced Settings: Target File Size

- Expand the **Advanced Settings** panel in the modal.
- Specify the **Target File Size** for the table. The setting affects each compaction type differently:
  - **Full Compaction:** Data files are rewritten to match this target size.
  - **Medium Compaction:** Smaller files are merged into files **closer to the target file size**, reducing fragmentation and improving query performance.

> **Tip:** Choose a target size based on your query patterns and table size. Larger files improve scan efficiency but may increase the cost of rewriting files.
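
To get a feel for the trade-off in the tip above, the following sketch estimates how many data files a table would hold after being rewritten to a given target size. It is a back-of-the-envelope calculation only: it ignores partitioning, compression differences, and delete files, and the 100 GiB table size and the helper name `estimate_file_count` are assumptions chosen for illustration.

```python
import math

MB = 1024 * 1024


def estimate_file_count(table_bytes: int, target_file_bytes: int) -> int:
    """Approximate number of data files after rewriting to the target size."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    return math.ceil(table_bytes / target_file_bytes)


table_size = 100 * 1024 * MB  # hypothetical 100 GiB table
for target_mb in (128, 256, 512):
    count = estimate_file_count(table_size, target_mb * MB)
    print(f"target {target_mb} MiB -> about {count} files")
```

For this hypothetical table, the targets give roughly 800, 400, and 200 files. Fewer files means less metadata to plan over, but each rewritten file is larger and more expensive to produce.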

<!-- TODO BEFORE MERGE: Add UI screenshots -->

### 4. Save the Configuration

- Click **Save**.
- A dialog box confirms that the configuration was successful.

<!-- TODO BEFORE MERGE: Add UI screenshots -->



### 5. Enable the Table for Optimization

- After saving, you will be redirected to the **Tables** page.
- Locate the table and **toggle the Enable switch** to activate scheduled optimization for that specific table.

> **Important:** The **Enable toggle must be switched on**. Even if a cron schedule is configured, the optimization will not execute unless the table is enabled.
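
The rule in the note above can be summarized as a simple condition. The sketch below is a hypothetical model of the per-table settings, not OLake's internal data structure; the class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableOptimizationConfig:
    # Hypothetical model of the per-table settings described in this guide.
    cron: Optional[str]  # schedule chosen in the Configure modal, if any
    enabled: bool        # state of the Enable toggle on the Tables page


def will_run_on_schedule(config: TableOptimizationConfig) -> bool:
    """A scheduled run happens only when a schedule exists AND the table is enabled."""
    return config.enabled and bool(config.cron)
```

A table with a saved schedule but a disabled toggle therefore evaluates to `False` here, which mirrors the behavior described above.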

<!-- TODO BEFORE MERGE: Add UI screenshots -->


63 changes: 63 additions & 0 deletions docs/iceberg-maintenance/optimization/overview.mdx
@@ -0,0 +1,63 @@
---
title: Overview
sidebar_position: 1
---


# What is Iceberg Optimization?

In Apache Iceberg, optimization refers to a set of table maintenance operations that improve query performance, storage efficiency, and interoperability across query engines.

As data is continuously written to an Iceberg table—especially through streaming pipelines or Change Data Capture (CDC)—tables can accumulate many small data files, fragmented layouts, and delete files.
Over time, this can increase metadata overhead and negatively impact query performance.

Optimization helps maintain efficient tables by performing operations such as:

- Compacting small data files to reduce file fragmentation and improve query performance

- Converting equality delete files to positional delete files, ensuring compatibility with query engines that do not support equality deletes

- Expiring older snapshots to remove unused metadata and reduce storage overhead

- Deleting orphan files that are no longer referenced by the table metadata

By periodically running optimization, Iceberg tables remain compact, performant, and easier to manage, while ensuring consistent query behavior across different analytics engines.
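
As an illustration of the last operation in the list above, orphan-file detection is conceptually a set difference between what exists in storage and what the table metadata references. The sketch below is a simplified model under stated assumptions, not OLake's or Iceberg's implementation: the function name, the in-memory inputs, and the three-day minimum age (a safety margin so that files from in-progress writes are not flagged) are all illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict, Set


def find_orphan_files(
    storage_files: Dict[str, datetime],
    referenced_files: Set[str],
    now: datetime,
    min_age: timedelta = timedelta(days=3),  # assumed safety margin
) -> Set[str]:
    """Return paths that exist in storage, are unreferenced, and are older than min_age."""
    cutoff = now - min_age
    return {
        path
        for path, modified_at in storage_files.items()
        if path not in referenced_files and modified_at < cutoff
    }
```

In this model, a recently written file that is not yet referenced by any metadata is left alone until it is older than `min_age`.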

# Types of Optimizations Supported in OLake

OLake provides three types of optimizations that can be performed on a table, depending on the level of optimization required.

1. **Lite** – Performs a lightweight optimization by converting **equality delete files** into **positional delete files**. Some query engines do not support reading equality delete files, so this conversion ensures better compatibility across query engines without rewriting the underlying data files.

2. **Medium** – Performs partial compaction by merging smaller **data files** and **delete files** into medium-sized files closer to the configured **target file size**. This helps reduce file fragmentation, lowers metadata overhead, and improves query performance while avoiding a full rewrite of the table.

3. **Full** – Performs the deepest level of optimization by rewriting data files so that they align with the configured **target file size**. This results in a complete **copy-on-write (COW)** rewrite of the table’s data files, producing the most optimal file layout. Full compaction is typically used when tables have accumulated significant fragmentation or when maximum query performance is required.
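
To make the Medium level more concrete, the sketch below groups small files into batches whose combined size stays at or below the target file size, using a first-fit-decreasing bin-packing heuristic. This is one common way to plan a partial compaction and is shown only as an assumption; the source does not describe how OLake selects or groups files, and `plan_medium_compaction` and `small_ratio` are hypothetical names.

```python
from typing import List


def plan_medium_compaction(
    file_sizes: List[int], target_size: int, small_ratio: float = 0.75
) -> List[List[int]]:
    """Group small files into bins whose total size stays at or below target_size."""
    # Assumption: a file counts as "small" if it is below small_ratio * target_size.
    threshold = target_size * small_ratio
    small = sorted((s for s in file_sizes if s < threshold), reverse=True)
    bins: List[List[int]] = []
    for size in small:
        # First-fit decreasing: place the file in the first bin with room.
        for group in bins:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            bins.append([size])
    # A bin with a single file gains nothing from being rewritten.
    return [group for group in bins if len(group) > 1]
```

Files that are already close to the target size are left untouched, which is what keeps this cheaper than a Full rewrite.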

### Choosing the Right Optimization Type

| Optimization Type | Output | What it Does | Cost Incurred | When to Use |
|-------------------|--------|--------------|------|-------------|
| **Lite** | Equality delete files are converted to positional delete files | Improves query engine compatibility without rewriting data files | **Low** | Use when your table contains equality delete files and your query engine does not support them. Suitable for lightweight maintenance with minimal compute usage. |
| **Medium** | Smaller data and delete files are merged into medium-sized files closer to the target file size | Reduces file fragmentation and metadata overhead by compacting smaller files | **Medium** | Use when tables accumulate many small data or delete files due to frequent syncs or CDC updates. Helps improve query performance without a full table rewrite. |
| **Full** | Data files are completely rewritten into files aligned with the target file size | Performs a full copy-on-write rewrite of the table to produce the most optimal file layout | **High** | Use when tables are heavily fragmented or when maximum query performance and optimal file layout are required. |
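
The table above can be read as a rough decision procedure. The helper below encodes one possible reading of it; the statistics, the thresholds (0.3 and 0.8), and the assumption that a heavier optimization takes precedence over a lighter one are all illustrative and do not come from OLake.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    # Hypothetical per-table statistics; none of these names come from OLake.
    equality_delete_files: int
    small_file_ratio: float  # share of data files well below the target size
    engine_reads_equality_deletes: bool


def suggest_optimization(stats: TableStats) -> str:
    """Map table statistics to "Full", "Medium", "Lite", or "None" (illustrative thresholds)."""
    if stats.small_file_ratio >= 0.8:
        return "Full"    # heavily fragmented: full rewrite
    if stats.small_file_ratio >= 0.3:
        return "Medium"  # many small files: partial compaction
    if stats.equality_delete_files > 0 and not stats.engine_reads_equality_deletes:
        return "Lite"    # only engine compatibility needs fixing
    return "None"
```

In practice the right thresholds depend on table size, query patterns, and how much compute you are willing to spend on each run.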

# Why is Optimization Required?


Optimization should be run periodically to maintain the performance and efficiency of tables as data volumes grow.
Running optimization is advisable in the following situations:

- **Large number of small files**
Frequent data ingestion can create many small files, which increases metadata overhead and slows down query planning.

- **Accumulation of delete files**
Tables that receive frequent updates or deletes may generate multiple delete files, which can impact query performance.

- **Query performance degradation**
If queries start taking longer to execute, optimizing the table can help reorganize files and improve scan efficiency.

- **Compatibility with query engines**
If your table contains equality delete files and your query engine does not support them, running Lite optimization can convert them to positional delete files.

- **Periodic table maintenance**
Running optimization at regular intervals helps maintain an efficient file layout and prevents fragmentation as the table grows.
51 changes: 51 additions & 0 deletions docs/iceberg-maintenance/overview.mdx
@@ -0,0 +1,51 @@
---
title: Overview
sidebar_position: 1
---

## What is Iceberg Maintenance?

In Apache Iceberg, table maintenance is the process of keeping tables efficient and performant as data is continuously added, updated, or deleted.
As an Iceberg table evolves, it can accumulate small data files, delete files, and additional metadata. Over time, this can make the table less efficient for querying and increase storage and compute overhead.

Maintenance operations help address this by keeping the table well-organized, compact, and optimized for query engines, ensuring consistent performance as data grows.
Maintenance is typically performed periodically through operations such as compaction and cleanup, so that tables remain reliable and efficient for large-scale analytical workloads.

## Why Is Table Maintenance Required?

Each write to an Iceberg table creates a new snapshot. Snapshots and their metadata accumulate until they are expired, so metadata keeps growing and data files that the current table state no longer needs remain in storage.
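
To illustrate how snapshot accumulation is usually kept in check, the sketch below selects snapshots for expiration by age while always retaining the most recent ones. It is a simplified model, not OLake's or Iceberg's implementation: the function name, the five-day retention window, and the `keep_last` parameter are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def snapshots_to_expire(
    snapshot_times: List[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=5),  # assumed retention window
    keep_last: int = 1,                      # assumed minimum number of snapshots to keep
) -> List[datetime]:
    """Return snapshots older than max_age, always retaining the newest keep_last."""
    ordered = sorted(snapshot_times, reverse=True)  # newest first
    candidates = ordered[keep_last:]                # never expire the newest ones
    cutoff = now - max_age
    return [ts for ts in candidates if ts < cutoff]
```

Keeping at least one snapshot regardless of age ensures the table always has a current state, even if nothing has been written for longer than the retention window.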

Streaming/CDC-style ingestion workloads also tend to generate many small files over time, which gradually makes reads slower and more expensive unless you compact and clean up regularly.

## Maintenance Capabilities in OLake

OLake provides built-in optimization capabilities to keep Iceberg tables efficient, well-structured, and performant as data continuously evolves.

It offers three levels of optimization (**Lite, Medium, and Full**), each designed for a different level of maintenance.

- **Lite Optimization** focuses on converting equality delete files into positional delete files to ensure compatibility across query engines with minimal overhead.

- **Medium Optimization** reduces fragmentation by merging smaller data and delete files into sizes closer to the configured target file size, improving query efficiency.

- **Full Optimization** performs a complete rewrite of data files to align with the target file size, resulting in the most optimal file layout.

Together, these capabilities enable users to effectively balance performance, storage efficiency, and compute cost based on their workloads.


## When is Table Maintenance Required?

In Apache Iceberg, table maintenance should be performed periodically to ensure consistent query performance and efficient storage as data evolves.

You should consider running maintenance in the following scenarios:

- **Frequent Data Ingestion or Updates**

- **Accumulation of Small Files**

- **Presence of Delete Files**

- **Degrading Query Performance**

- **Growing Table Size Over Time**

Regular maintenance ensures that Iceberg tables remain optimized, scalable, and performant for analytical workloads.
13 changes: 13 additions & 0 deletions sidebars.js
@@ -135,6 +135,19 @@ const docSidebar = {
],
},

// ICEBERG MAINTENANCE
sectionHeader("ICEBERG MAINTENANCE"),
{
type: 'category',
label: 'Optimization',
items: [
{
type: 'autogenerated',
dirName: 'iceberg-maintenance/optimization',
},
],
},

sectionHeader("UNDERSTANDING OLAKE"),
{
type: 'category',