381 changes: 381 additions & 0 deletions blog/2025-09-04-creating-job-olake-docker-cli.mdx
---
title: "From Postgres to Iceberg: Creating OLake Jobs with Docker CLI and UI"
description: "A friendly, step-by-step walkthrough to configure replication from Postgres to Apache Iceberg (Glue catalog) using the OLake UI or the Docker CLI."
slug: creating-job-olake-docker-cli
authors: [akshay]
tags: [docker,apache-iceberg,replication]
image: /img/blog/cover/pipeline-on-olake.png
---

# From Postgres to Iceberg: Creating OLake Jobs with Docker CLI and UI

Data replication is a core building block of modern data engineering. Whether you're keeping an analytics warehouse in sync with operational databases or feeding real-time pipelines for machine learning, you need tools that move data quickly and reliably.

There's no shortage of options: Fivetran, Airbyte, Debezium, and custom-built Flink or Spark pipelines are all widely used for replication. Each comes with trade-offs, though: infrastructure complexity, cost, or limited flexibility when you need to adapt replication to your own requirements.

That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control.

In this post, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that replicates from Postgres → Apache Iceberg (Glue catalog) with CDC, normalization, a row filter, and partitioning, plus a daily schedule if you use the UI.

## Two Setup Styles (pick what fits you)

### Option A — UI "Job-first" (guided, all-in-one)
Perfect if you want a clear wizard and visual guardrails.

### Option B — CLI (Docker)
Great if you prefer terminal, versioned JSON, or automation.

Both produce the **same result**. Choose the path that matches your workflow today.

## Option A — OLake UI (Guided)

We'll take the "job-first" approach. It's straightforward and keeps you in one flow.

### 1) Create a Job

From the left nav, go to **Jobs → Create Job**.
You'll land on a wizard that starts with the **source**.

![Job page](/img/docs/getting-started/create-your-first-job/job-create.png)

### 2) Configure the Source (Postgres)

Choose **Set up a new source** → select **Postgres** → keep the OLake version at the latest stable release.
Name it clearly, fill in the Postgres connection details, and hit **Test Connection**.

![Job source connector](/img/docs/getting-started/create-your-first-job/job-source-connector.png)

![Job source config](/img/docs/getting-started/create-your-first-job/job-source-config.png)

> 📝 **Planning for CDC?**
> Make sure a **replication slot** exists in Postgres.
> See: [Replication Slot Guide](/docs/connectors/postgres/setup/generic).

### 3) Configure the Destination (Iceberg + Glue)

Now we set where the data will land.
Pick **Apache Iceberg** as the destination, and **AWS Glue** as the catalog.

![Job dest connector](/img/docs/getting-started/create-your-first-job/job-dest-connector.png)

![Job dest catalog](/img/docs/getting-started/create-your-first-job/job-dest-catalog.png)

Provide the connection details and click **Test Connection**.

![Job dest config](/img/docs/getting-started/create-your-first-job/job-dest-config.png)

### 4) Configure Streams

This is where we dial in *what* to replicate and *how*.
For this walkthrough, we'll:

- Include stream `fivehundred`
- **Sync mode:** **Full Refresh + CDC**
- **Normalization:** **On**
- **Filter:** `dropoff_datetime >= "2010-01-01 00:00:00"`
- **Partitioning:** by **year** extracted from `dropoff_datetime`
- **Schedule:** every day at **12:00 AM**

![Job streams page](/img/docs/getting-started/create-your-first-job/job-streams.png)

Select the checkbox for `fivehundred`, then click the stream name to open stream settings.
Pick the sync mode and toggle **Normalization**.

![Select stream](/img/docs/getting-started/create-your-first-job/job-stream-select.png)

Let's make the destination query-friendly. Open **Partitioning** → choose `dropoff_datetime` → **year**.
Want more? Read the [Partitioning Guide](/docs/writers/parquet/partitioning).

![Stream partitioning](/img/docs/getting-started/create-your-first-job/job-stream-partition.png)

Add the **Data Filter** so we only move rows from 2010 onward.

![Stream filter](/img/docs/getting-started/create-your-first-job/job-data-filter.png)

Click **Next** to continue.

### 5) Schedule the Job

Give the job a clear name, set **Every Day @ 12:00 AM**, and hit **Create Job**.

![Job schedule](/img/docs/getting-started/create-your-first-job/job-schedule.png)

You're set! 🎉

![Job created](/img/docs/getting-started/create-your-first-job/job-creation-success.png)

Want results right away? Start a run immediately with **Jobs → (⋮) → Sync Now**.

![Sync now](/img/docs/getting-started/create-your-first-job/job-sync-now.png)

You'll see status badges on the right (**Running / Failed / Completed**).
For more details, open **Job Logs & History**.

- Running
![Job running](/img/docs/getting-started/create-your-first-job/job-running.png)

- Completed
![Job success](/img/docs/getting-started/create-your-first-job/job-success.png)

Finally, verify that data landed in S3/Iceberg as configured:

![S3 data](/img/docs/getting-started/create-your-first-job/job-data-s3.png)

### 6) Manage Your Job (from the Jobs page)

**Sync Now** — Trigger a run without waiting.

**Edit Streams** — Change which streams are included and tweak replication settings.
Use the stepper to jump between **Source** and **Destination**.

![Edit streams](/img/docs/getting-started/create-your-first-job/job-edit-streams-page.png)

> By default, source/destination editing is locked. Click **Edit** to unlock.

![Edit destination](/img/docs/getting-started/create-your-first-job/job-edit-destination.png)

> 🔄 **Need to change Partitioning / Filter / Normalization for an existing stream?**
> Unselect the stream → **Save** → reopen **Edit Streams** → re-add it with new settings.

**Pause Job** — Temporarily stop runs. You'll find paused jobs under **Inactive Jobs**, where you can **Resume** any time.

![Pause/Resume](/img/docs/getting-started/create-your-first-job/job-resume.png)

**Job Logs & History** — See all runs. Use **View Logs** for per-run details.

![Job logs list](/img/docs/getting-started/create-your-first-job/view-logs.png)

![Logs page](/img/docs/getting-started/create-your-first-job/logs-page.png)

**Job Settings** — Rename, change frequency, pause, or delete.
Deleting a job moves its source/destination to **inactive** (if not used elsewhere).

![Job settings](/img/docs/getting-started/create-your-first-job/job-settings.png)

## Option B — OLake CLI (Docker)

Prefer terminals, PR reviews, and repeatable runs? Let's do the same pipeline via Docker.

### Prerequisites

- **Docker** installed and running
- OLake images: **Docker Hub → `olakego/*`**

### How the CLI flow works

1. **Configure source & destination** (JSON files)
2. **Discover streams** → writes a `streams.json`
3. **Edit stream configuration** (normalization, filters, partitions, sync mode)
4. **Run the sync**
5. **Monitor with `stats.json`**

### What we'll build

- Source: **Postgres**
- Destination: **Apache Iceberg** (Glue catalog)
- Table: `fivehundred`
- **CDC** mode + **Normalization**
- Filter: `dropoff_datetime >= "2010-01-01 00:00:00"`
- Partition by **year** from `dropoff_datetime`

### 1) Create Config Files

We'll put everything under `/path/to/config/`.

**Source — `source.json`**

```json title="source.json"
{
  "host": "dz-stag.postgres.database.azure.com",
  "port": 5432,
  "database": "postgres",
  "username": "postgres",
  "password": "XXX",
  "jdbc_url_params": {},
  "ssl": { "mode": "require" },
  "update_method": {
    "replication_slot": "replication_slot",
    "initial_wait_time": 120
  },
  "default_mode": "cdc",
  "max_threads": 6
}
```
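
Before running `discover`, a small pre-flight check can catch an incomplete `source.json`. The sketch below is one way to do it in Python. The function names are made up, and the list of required fields is an assumption based on the example above rather than OLake's own validation rules. It only inspects the file; it does not verify that the slot actually exists in Postgres.

```python
import json
from pathlib import Path

# Assumed to be required, based on the example config above.
REQUIRED_FIELDS = ("host", "port", "database", "username", "password")


def check_source_config(config):
    """Return a list of human-readable problems found in a parsed source.json."""
    problems = [
        f"missing or empty field: {key}"
        for key in REQUIRED_FIELDS
        if not config.get(key)
    ]
    if config.get("default_mode") == "cdc":
        # CDC needs a replication slot name in the update_method section.
        slot = (config.get("update_method") or {}).get("replication_slot")
        if not slot:
            problems.append(
                "default_mode is cdc but update_method.replication_slot is not set"
            )
    return problems


def check_source_file(path):
    """Load source.json from disk and run the same checks."""
    return check_source_config(json.loads(Path(path).read_text()))
```

Calling `check_source_file("/path/to/config/source.json")` returns an empty list when nothing looks wrong.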

> 📝 If you plan to run CDC, ensure a Postgres **replication slot** exists.
> See: [Replication Slot Guide](/docs/connectors/postgres/setup/generic).

**Destination — `destination.json`**

```json title="destination.json"
{
  "type": "ICEBERG",
  "writer": {
    "iceberg_s3_path": "s3://vz-testing-olake/olake_cli_demo",
    "aws_region": "XXX",
    "aws_access_key": "XXX",
    "aws_secret_key": "XXX",
    "iceberg_db": "olake_cli_demo",
    "grpc_port": 50051,
    "sink_rpc_server_host": "localhost"
  }
}
```
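
If you would rather not keep AWS credentials in a file that might get committed, one option is to generate `destination.json` from environment variables just before a run. The sketch below assumes the standard AWS variable names (`AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`). The function names are hypothetical, and the JSON keys are copied from the example above.

```python
import json
import os
from pathlib import Path


def build_destination(s3_path, database, env=None):
    """Return a destination config dict with secrets taken from the environment."""
    env = os.environ if env is None else env
    return {
        "type": "ICEBERG",
        "writer": {
            "iceberg_s3_path": s3_path,
            "aws_region": env["AWS_REGION"],
            "aws_access_key": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
            "iceberg_db": database,
            # The two values below are copied unchanged from the example config.
            "grpc_port": 50051,
            "sink_rpc_server_host": "localhost",
        },
    }


def write_destination(config_dir, config):
    """Write destination.json into config_dir and return its path."""
    path = Path(config_dir) / "destination.json"
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path
```

A missing variable raises `KeyError`, so a run fails early instead of writing a config with empty credentials.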

### 2) Discover Streams

This pulls available tables and writes `streams.json`.

```bash
docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  discover \
  --config /mnt/config/source.json
```

*Start logs*
![Discover start](/img/docs/getting-started/create-your-first-job/cli-discover-logs-start.jpeg)

*Completion*
![Discover end](/img/docs/getting-started/create-your-first-job/cli-discover-logs-end.jpeg)

> ℹ️ Logs are also written to:
> `/path/to/config/logs/sync_[YYYY-MM-DD]_[HH-MM-SS]/olake.log`
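
To jump to the most recent log without browsing the directory, a helper along these lines may be handy. It assumes the directory layout shown above, where each run gets its own `sync_<date>_<time>` folder, so sorting the folder names alphabetically also sorts them by time. `latest_log` is a made-up name.

```python
from pathlib import Path


def latest_log(config_dir):
    """Return the newest olake.log under config_dir/logs, or None if there is none."""
    logs_root = Path(config_dir) / "logs"
    if not logs_root.is_dir():
        return None
    # Folder names embed YYYY-MM-DD_HH-MM-SS, so lexical order is time order.
    candidates = sorted(logs_root.glob("sync_*/olake.log"))
    return candidates[-1] if candidates else None
```

For example, `print(latest_log("/path/to/config"))` prints the path of the newest log file, or `None` before the first run.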

### 3) Edit `streams.json`

Select exactly what to move and how.

* **Select streams** → keep only `fivehundred` under `"selected_streams"`.
* **Normalization** → `"normalization": true`
* **Filter** → `"filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""`
* **Partitioning** → `"partition_regex": "/{dropoff_datetime, year}"`
* **Sync mode** → set the stream's `"sync_mode"` to `"cdc"`

**Minimal selection block**

```json title="streams.json (selection)"
{
  "selected_streams": {
    "public": [
      {
        "partition_regex": "/{dropoff_datetime, year}",
        "stream_name": "fivehundred",
        "normalization": true,
        "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
      }
    ]
  }
}
```
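
To make the `filter` and `partition_regex` values in the block above concrete, here is a small Python sketch of their effect on a few rows: keep rows from 2010 onward, then group them by the year of `dropoff_datetime`. It only illustrates the intended semantics. The sample rows and helper names are invented, dropping rows with a null timestamp is an assumption, and none of this reflects how OLake implements either feature.

```python
from collections import defaultdict
from datetime import datetime

CUTOFF = datetime(2010, 1, 1, 0, 0, 0)  # mirrors the filter literal


def passes_filter(row):
    """Keep rows whose dropoff_datetime is on or after the cutoff."""
    value = row["dropoff_datetime"]
    # Assumption: rows with a null timestamp do not match the filter.
    return value is not None and value >= CUTOFF


def partition_by_year(rows):
    """Group the rows that pass the filter by the year of dropoff_datetime."""
    partitions = defaultdict(list)
    for row in rows:
        if passes_filter(row):
            partitions[row["dropoff_datetime"].year].append(row)
    return dict(partitions)


# Hypothetical sample rows, not taken from the real `fivehundred` table.
sample_rows = [
    {"id": 1, "dropoff_datetime": datetime(2009, 12, 31, 23, 59, 59)},
    {"id": 2, "dropoff_datetime": datetime(2010, 1, 1, 0, 0, 0)},
    {"id": 3, "dropoff_datetime": datetime(2011, 6, 15, 8, 30, 0)},
    {"id": 4, "dropoff_datetime": None},
]

result = partition_by_year(sample_rows)
print(sorted(result))  # years that received at least one row
```

Row 1 falls before the cutoff and row 4 has no timestamp, so only rows 2 and 3 are kept, each in its own year partition.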

**Full stream entry (showing supported modes)**

```json title="streams.json (stream detail)"
{
  "streams": [
    {
      "stream": {
        "name": "fivehundred",
        "namespace": "public",
        "type_schema": {
          "properties": {
            "dropoff_datetime": { "type": ["timestamp", "null"] }
          }
        },
        "supported_sync_modes": [
          "strict_cdc",
          "full_refresh",
          "incremental",
          "cdc"
        ],
        "source_defined_primary_key": [],
        "available_cursor_fields": ["id", "pickup_datetime", "rate_code_id"],
        "sync_mode": "cdc"
      }
    }
  ]
}
```
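
If you would rather script these edits than make them by hand, something along these lines could work. It assumes the generated `streams.json` holds both the `selected_streams` and `streams` sections shown above in a single file. `configure_stream` and `update_streams_file` are hypothetical helpers, not part of OLake.

```python
import json
from pathlib import Path


def configure_stream(catalog, namespace, name, *, partition_regex, row_filter, sync_mode):
    """Select a single stream and apply its settings to the catalog dict in place."""
    # Replace the whole selection so only this stream is replicated.
    catalog["selected_streams"] = {
        namespace: [
            {
                "partition_regex": partition_regex,
                "stream_name": name,
                "normalization": True,
                "filter": row_filter,
            }
        ]
    }
    for entry in catalog.get("streams", []):
        stream = entry["stream"]
        if stream["name"] == name and stream["namespace"] == namespace:
            if sync_mode not in stream.get("supported_sync_modes", []):
                raise ValueError(f"{sync_mode!r} is not supported by {namespace}.{name}")
            stream["sync_mode"] = sync_mode
            return catalog
    raise KeyError(f"stream {namespace}.{name} not found in catalog")


def update_streams_file(path):
    """Apply this walkthrough's settings to a discovered streams.json."""
    path = Path(path)
    catalog = json.loads(path.read_text())
    configure_stream(
        catalog,
        "public",
        "fivehundred",
        partition_regex="/{dropoff_datetime, year}",
        row_filter='dropoff_datetime >= "2010-01-01 00:00:00"',
        sync_mode="cdc",
    )
    path.write_text(json.dumps(catalog, indent=2) + "\n")
```

Running `update_streams_file("/path/to/config/streams.json")` after `discover` rewrites the file in place, so keeping the file under version control makes the change easy to review.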

> 📚 Need a refresher on how modes differ?
> See: [Sync Modes](/docs/understanding/olake-terminologies/stream-properties#sync-modes).

### 4) Run the Sync

Kick off replication:

```bash
docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json
```

*Sync start*
![Sync start](/img/docs/getting-started/create-your-first-job/cli-sync-logs-start.jpeg)

*Sync completed*
![Sync completed](/img/docs/getting-started/create-your-first-job/cli-sync-logs-end.jpeg)

### 5) Monitor Progress with `stats.json`

A `stats.json` appears next to your configs:

```json title="stats.json"
{
  "Estimated Remaining Time": "0.00 s",
  "Memory": "367 mb",
  "Running Threads": 0,
  "Seconds Elapsed": "34.01",
  "Speed": "14.70 rps",
  "Synced Records": 500
}
```
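
For a quick look from a script, `stats.json` can be read like any other JSON file. This sketch assumes the field names shown in the sample above; `summarize_stats` is a made-up helper.

```python
import json
from pathlib import Path


def summarize_stats(path):
    """Return a one-line summary of an OLake stats.json file."""
    stats = json.loads(Path(path).read_text())
    return (
        f'{stats["Synced Records"]} records synced in {stats["Seconds Elapsed"]} s '
        f'at {stats["Speed"]}, {stats["Running Threads"]} threads running'
    )
```

With the sample file above, `summarize_stats("/path/to/config/stats.json")` returns `500 records synced in 34.01 s at 14.70 rps, 0 threads running`.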

Confirm the data in your destination (S3 / Iceberg):

![Data in Iceberg](/img/docs/getting-started/create-your-first-job/cli-s3-data.png)

### 6) About the `state.json` (Resumable & CDC-friendly)

When a sync starts, OLake writes a `state.json` that tracks progress and CDC offsets (e.g., Postgres LSN).
This lets you **resume without duplicates** and continue CDC seamlessly.

To resume / keep streaming:

```bash
docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json \
  --state /mnt/config/state.json
```
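
When the sync is launched from a script, the first run and later runs can share one code path by adding `--state` only when `state.json` already exists. The helpers below are hypothetical. They pass `source.json` as `--config` (as in the discover step) and the discovered `streams.json` as `--catalog`, which is an assumption about how the files map to the flags.

```python
import subprocess
from pathlib import Path

IMAGE = "olakego/source-postgres:latest"


def sync_command(config_dir):
    """Build the docker argv for a sync, adding --state only if state.json exists."""
    config_dir = Path(config_dir)
    argv = [
        "docker", "run", "--pull=always",
        "-v", f"{config_dir}:/mnt/config",
        IMAGE, "sync",
        "--config", "/mnt/config/source.json",
        "--catalog", "/mnt/config/streams.json",
        "--destination", "/mnt/config/destination.json",
    ]
    if (config_dir / "state.json").exists():
        # Resume from the saved progress and CDC offsets.
        argv += ["--state", "/mnt/config/state.json"]
    return argv


def run_sync(config_dir):
    """Run the sync and raise CalledProcessError if docker exits non-zero."""
    subprocess.run(sync_command(config_dir), check=True)
```

`run_sync("/path/to/config")` then behaves like the first command on a fresh directory and like the resume command once a state file is present.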

More details: [State File (Postgres)](/docs/connectors/postgres/config#statejson-configuration)

---

## Quick Q&A

**UI or CLI—how should I choose?**
If you're new to OLake or prefer a guided setup, start with **UI**.
If you're automating, versioning configs, or scripting in CI, use **CLI**.

**Why "Full Refresh + CDC"?**
You get a baseline snapshot *and* continuous changes—ideal for keeping downstream analytics fresh.

**Can I change partitioning later?**

* **UI**: unselect the stream → save → re-add with updated partitioning/filter/normalization.
* **CLI**: edit `streams.json` and re-run.

---
