8 changes: 8 additions & 0 deletions src/current/molt/molt-replicator.md
Original file line number Diff line number Diff line change
@@ -678,6 +678,14 @@ MOLT Replicator metrics are not enabled by default. Enable Replicator metrics by
--metricsAddr :30005
~~~

Metrics can also be written to snapshot files at regular intervals. Metrics snapshotting is disabled by default. When metrics are enabled, you can enable metrics snapshotting with the [`--metricsSnapshotPeriod`]({% link molt/replicator-flags.md %}#metrics-snapshot-period) flag. For example, the following flag enables metrics snapshotting every 15 seconds:

~~~
--metricsSnapshotPeriod 15s
~~~

Metrics snapshots enable access to metrics when the Prometheus server is unavailable, and they can be sent to [CockroachDB support]({% link {{ site.current_cloud_version }}/support-resources.md %}) to help quickly resolve an issue.

For guidelines on using and interpreting replication metrics, refer to [Replicator Metrics]({% link molt/replicator-metrics.md %}).

### Logging
5 changes: 5 additions & 0 deletions src/current/molt/replicator-flags.md
@@ -19,6 +19,7 @@ This page lists all available flags for the [MOLT Replicator commands]({% link m
| <a id="claim"></a> `--claim` | `make-jwt` | `BOOL` | If `true`, print a minimal JWT claim instead of signing. |
| <a id="collapse-mutations"></a> `--collapseMutations` | `start`, `pglogical`, `mylogical` | `BOOL` | Combine multiple mutations on the same primary key within each batch into a single mutation.<br><br>**Default:** `true` |
| <a id="data-dir"></a> `--dataDir` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Base directory for replicator data (for example, metrics snapshots).<br><br>**Default:** `"replicator-data"` |
| <a id="default-gtid-set"></a> `--defaultGTIDSet` | `mylogical` | `STRING` | **Required** the first time `replicator` is run. The default GTID set, in the format `source_uuid:min(interval_start)-max(interval_end)`, which provides a replication marker for streaming changes. |
| <a id="disable-authentication"></a> `--disableAuthentication` | `start` | `BOOL` | Disable authentication of incoming Replicator requests; not recommended for production. |
| <a id="discard"></a> `--discard` | `start` | `BOOL` | **Dangerous:** Discard all incoming HTTP requests; useful for changefeed throughput testing. Not intended for production. |
| <a id="discard-delay"></a> `--discardDelay` | `start` | `DURATION` | Adds additional delay in discard mode; useful for gauging the impact of changefeed round-trip time (RTT). |
@@ -38,6 +39,10 @@ This page lists all available flags for the [MOLT Replicator commands]({% link m
| <a id="log-format"></a> `--logFormat` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Choose log output format: `"fluent"`, `"text"`.<br><br>**Default:** `"text"` |
| <a id="max-retries"></a> `--maxRetries` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `INT` | Maximum number of times to retry a failed mutation on the target (for example, due to contention or a temporary unique constraint violation) before treating it as a hard failure.<br><br>**Default:** `10` |
| <a id="metrics-addr"></a> `--metricsAddr` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | A `:port` or `host:port` on which to serve metrics and diagnostics. The metrics endpoint is `http://{host}:{port}/_/varz`. |
| <a id="metrics-snapshot-compression"></a> `--metricsSnapshotCompression` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Compression for snapshot files: `"gzip"` or `"none"`.<br><br>**Default:** `"gzip"` |
| <a id="metrics-snapshot-period"></a> `--metricsSnapshotPeriod` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `DURATION` | How often to write a metrics snapshot to a file (for example, `15s`, `1m`). Set to `0` to disable.<br><br>**Default:** `0` |
| <a id="metrics-snapshot-retention-size"></a> `--metricsSnapshotRetentionSize` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Delete the oldest snapshots when the total size of the `metrics-snapshots` directory in the [`--dataDir`](#data-dir) exceeds this size (for example, `100MB`, `1GiB`). Either this flag or `--metricsSnapshotRetentionTime` (or both) must be enabled.<br><br>**Default:** `""` |
| <a id="metrics-snapshot-retention-time"></a> `--metricsSnapshotRetentionTime` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `DURATION` | Delete snapshots older than this duration (for example, `24h`, `168h`). Set to `0` to disable. Either this flag or `--metricsSnapshotRetentionSize` (or both) must be enabled.<br><br>**Default:** `168h` |
| <a id="ndjson-buffer-size"></a> `--ndjsonBufferSize` | `start` | `INT` | The maximum amount of data to buffer while reading a single line of `ndjson` input; increase when source cluster has large blob values.<br><br>**Default:** `65536` |
| <a id="oracle-application-users"></a> `--oracle-application-users` | `oraclelogminer` | `STRING` | List of Oracle usernames responsible for DML transactions in the PDB schema. Enables replication from the latest-possible starting point. Usernames are case-sensitive and must match the internal Oracle usernames (e.g., `PDB_USER`). |
| <a id="out"></a> `-o`, `--out` | `make-jwt` | `STRING` | A file to write the token to. |
200 changes: 200 additions & 0 deletions src/current/molt/replicator-metrics.md
@@ -307,6 +307,206 @@ For checkpoint terminology, refer to the [MOLT Replicator documentation]({% link

[Read more about userscript metrics]({% link molt/userscript-metrics.md %}).

## Metrics snapshots

When enabled, the metrics snapshotter periodically writes out a point-in-time snapshot of Replicator's Prometheus metrics to a file in the [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir). Metrics snapshots can help with debugging when direct access to the Prometheus server is not available, and you can [bundle snapshots and send them to CockroachDB support](#bundle-and-send-metrics-snapshots) to help resolve an issue. A metrics snapshot includes all of the metrics on this page.

Metrics snapshotting is disabled by default, and can be enabled with the [`--metricsSnapshotPeriod`]({% link molt/replicator-flags.md %}#metrics-snapshot-period) Replicator flag. [Replicator metrics must be enabled](#set-up-metrics) (with the [`--metricsAddr`]({% link molt/replicator-flags.md %}#metrics-addr) flag) in order for metrics snapshotting to work.

If snapshotting is enabled, the snapshot period must be at least 15 seconds; the recommended range is 15-60 seconds. Snapshot files are retained according to a retention policy based on [time]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-time), on the [total size]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-size) of the snapshot data subdirectory, or on both. At least one retention policy must be configured. Snapshot files can also be [compressed with gzip]({% link molt/replicator-flags.md %}#metrics-snapshot-compression).
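
As a rough sizing aid, the number of snapshot files kept under a time-based retention policy is approximately the retention time divided by the snapshot period. The following sketch works through that arithmetic for a 15-second period and the default 168-hour retention time. It assumes that one snapshot file is written per period; the variable names are illustrative, and the size of each file is not estimated here.

{% include_cached copy-clipboard.html %}
~~~shell
# Illustrative arithmetic only; the variable names are hypothetical.
period_seconds=15      # corresponds to --metricsSnapshotPeriod 15s
retention_hours=168    # corresponds to --metricsSnapshotRetentionTime 168h

# Approximate number of snapshot files on disk once retention reaches steady state,
# assuming one file is written per period.
retained_files=$(( retention_hours * 3600 / period_seconds ))
echo "approximately ${retained_files} snapshot files retained"
~~~

If that file count is larger than you want to keep, consider a longer period, a shorter retention time, or a size-based limit with `--metricsSnapshotRetentionSize`.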

Changing the snapshotter's configuration requires restarting the Replicator binary with different flags.

### Enable metrics snapshotting

#### Step 1. Run Replicator with the snapshot flags

The following is an example of a `replicator` command where snapshotting is configured:

<div class="filters filters-big clearfix">
<button class="filter-button" data-scope="postgres">PostgreSQL</button>
<button class="filter-button" data-scope="mysql">MySQL</button>
<button class="filter-button" data-scope="oracle">Oracle</button>
<button class="filter-button" data-scope="cockroachdb">CockroachDB</button>
</div>

<section class="filter-content" markdown="1" data-scope="postgres">
{% include_cached copy-clipboard.html %}
~~~shell
replicator pglogical \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--slotName molt_slot \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

<section class="filter-content" markdown="1" data-scope="mysql">
{% include_cached copy-clipboard.html %}
~~~shell
replicator mylogical \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--defaultGTIDSet '4c658ae6-e8ad-11ef-8449-0242ac140006:1-29' \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

<section class="filter-content" markdown="1" data-scope="oracle">
{% include_cached copy-clipboard.html %}
~~~shell
replicator oraclelogminer \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--scn 26685786 \
--backfillFromSCN 26685444 \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

<section class="filter-content" markdown="1" data-scope="cockroachdb">
{% include_cached copy-clipboard.html %}
~~~shell
replicator start \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
~~~
</section>

If successful, Replicator will start, and the console output will indicate that the snapshotter has started as well:

~~~
INFO [Feb 2 10:20:32] Replicator starting
...
INFO [Feb 2 10:20:32] metrics snapshotter started, writing to replicator-data/metrics-snapshots every 15s, retaining 168h0m0s
~~~

Upon interruption of Replicator, the snapshotter will be stopped:

~~~
INFO [Feb 2 10:26:45] Interrupted
INFO [Feb 2 10:26:45] metrics snapshotter stopped
INFO [Feb 2 10:26:45] Server shutdown complete
~~~

#### Step 2. Find the snapshot files in the data directory

You can find the snapshot files in the [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir):

{% include_cached copy-clipboard.html %}
~~~shell
cd replicator-data/metrics-snapshots && ls . | tail -n 5
~~~

~~~
snapshot-20260202T152405.737Z.txt.gz
snapshot-20260202T152420.736Z.txt.gz
snapshot-20260202T152435.736Z.txt.gz
snapshot-20260202T152450.735Z.txt.gz
snapshot-20260202T152505.735Z.txt.gz
~~~

Each decompressed file lists the metrics collected at that snapshot:

{% include_cached copy-clipboard.html %}
~~~shell
gzip -dc snapshot-20260202T152505.735Z.txt.gz | head -n 3
~~~

~~~
# HELP cdc_resolved_timestamp_buffer_size Current size of the resolved timestamp buffer channel which is yet to be processed by Pebble Stager
# TYPE cdc_resolved_timestamp_buffer_size gauge
cdc_resolved_timestamp_buffer_size 0.0 1.770045905735e+09
~~~
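
To follow a single metric over time, you can read the same sample line from each snapshot file. The following is a minimal sketch, not part of Replicator: `metric_history` is a hypothetical helper that assumes the file naming and text format shown above, and that label values (if any) contain no spaces.

{% include_cached copy-clipboard.html %}
~~~shell
# metric_history is a hypothetical helper, not part of Replicator.
# It prints "<snapshot file> <value>" for one metric from each compressed snapshot.
metric_history() {
  dir=$1
  metric=$2
  for f in "$dir"/snapshot-*.txt.gz; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when no files match
    # Keep sample lines whose name is the metric (with or without labels);
    # "# HELP" and "# TYPE" lines are skipped because their first field is "#".
    gzip -dc "$f" | awk -v m="$metric" -v name="$(basename "$f")" \
      '$1 == m || index($1, m "{") == 1 { print name, $2 }'
  done
}

metric_history replicator-data/metrics-snapshots cdc_resolved_timestamp_buffer_size
~~~

Because the snapshot file names begin with a UTC timestamp, the output is in chronological order.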

### Bundle and send metrics snapshots

The following steps require a Linux system with Bash.

#### Step 1. Download the export script

Download the [metrics snapshot export script](https://replicator.cockroachdb.com/export-metrics-snapshots.sh). Ensure that the current user can access and execute it.
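
One way to confirm that the downloaded script is in place and executable is sketched below. `prepare_script` is a hypothetical helper, and the path assumes that you saved the script in the current directory under its original file name.

{% include_cached copy-clipboard.html %}
~~~shell
# prepare_script is a hypothetical helper, not part of the export script.
# It fails if the file is missing; otherwise it makes the file executable
# for the current user and verifies the result.
prepare_script() {
  script=$1
  if [ ! -f "$script" ]; then
    echo "not found: $script" >&2
    return 1
  fi
  chmod u+x "$script" && [ -x "$script" ]
}

prepare_script ./export-metrics-snapshots.sh || echo "download the export script first"
~~~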

#### Step 2. Run a snapshot export

Run an export, specifying the `metrics-snapshots` directory within your [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir). You can also provide start and end timestamps to define a subset of metrics to bundle. Specify times in UTC, in the format `YYYYMMDDTHHMMSS`.

Running the script without timestamps bundles all of the data in the snapshot directory. For example:

{% include_cached copy-clipboard.html %}
~~~shell
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots
~~~

Running the script with one timestamp bundles all of the data in the snapshot directory beginning at that timestamp. For example:

{% include_cached copy-clipboard.html %}
~~~shell
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000
~~~

Running the script with two timestamps bundles all of the data in the snapshot directory between the two timestamps. For example:

{% include_cached copy-clipboard.html %}
~~~shell
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000 20260115T140000
~~~
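
The timestamps can be generated rather than typed by hand. The following sketch assumes GNU `date`, which is standard on Linux. It builds UTC timestamps in the `YYYYMMDDTHHMMSS` format for the last two hours and prints the resulting export command without running it. The two-hour window is an arbitrary example.

{% include_cached copy-clipboard.html %}
~~~shell
# Build UTC timestamps in the YYYYMMDDTHHMMSS format (GNU date syntax).
start=$(date -u -d '2 hours ago' +%Y%m%dT%H%M%S)
end=$(date -u +%Y%m%dT%H%M%S)

# Print the export command for review instead of executing it.
echo "./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots $start $end"
~~~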

The script writes a `.tar.gz` file to the directory from which you ran it, or to a path specified as an optional argument.
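
Before uploading, you may want to check what the bundle contains. The following is a minimal sketch that uses standard `tar` options. `list_bundle` is a hypothetical helper, and the name of the output file is not assumed here, so pass the path of the `.tar.gz` file that the script produced.

{% include_cached copy-clipboard.html %}
~~~shell
# list_bundle is a hypothetical helper, not part of the export script.
# It lists the files in a .tar.gz bundle without extracting them.
list_bundle() {
  bundle=$1
  # -t lists contents, -z reads through gzip, -f names the archive file.
  tar -tzf "$bundle"
}

# Usage (replace the path with the file that the export script produced):
#   list_bundle /path/to/bundle.tar.gz | head -n 5
~~~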

#### Step 3. Upload output file to a support ticket

Attach the bundled metrics snapshot file to a [support ticket]({% link {{ site.current_cloud_version }}/support-resources.md %}) so that the support team has the metrics relevant to your issue.

## See also

- [MOLT Replicator]({% link molt/molt-replicator.md %})
2 changes: 1 addition & 1 deletion src/current/molt/userscript-metrics.md
@@ -9,7 +9,7 @@ To improve observability and debugging in the field, [MOLT Replicator]({% link m

All userscript metrics include a `script_` prefix and are automatically labeled with the relevant schema or table for each configured handler (for example, `schema="target.public"`). If a userscript defines both schema-level and table-level handlers, separate label values will be created for each.

These metrics are part of the default [Replicator Prometheus metrics]({% link molt/replicator-metrics.md %}) set and can be visualized immediately using the provided [`replicator.json` Grafana dashboard file](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json).
These metrics are part of the default [Replicator Prometheus metrics]({% link molt/replicator-metrics.md %}) set and can be visualized immediately using the provided [`replicator.json` Grafana dashboard file](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json). They are also included in [Replicator metrics snapshots]({% link molt/replicator-metrics.md %}#metrics-snapshots).

Consider using these metrics to:
