diff --git a/src/current/molt/molt-replicator.md b/src/current/molt/molt-replicator.md index bc520b128c9..319ed1c2bcb 100644 --- a/src/current/molt/molt-replicator.md +++ b/src/current/molt/molt-replicator.md @@ -678,6 +678,14 @@ MOLT Replicator metrics are not enabled by default. Enable Replicator metrics by --metricsAddr :30005 ~~~ +Metrics can additionally be written to snapshot files at repeated intervals. Metrics snapshotting is disabled by default. If metrics have been enabled, metrics snapshotting can also be enabled with the [`--metricsSnapshotPeriod`]({% link molt/replicator-flags.md %}#metrics-snapshot-period) flag. For example, the following flag enables metrics snapshotting every 15 seconds: + +~~~ +--metricsSnapshotPeriod 15s +~~~ + +Metrics snapshots enable access to metrics when the Prometheus server is unavailable, and they can be sent to [CockroachDB support]({% link {{ site.current_cloud_version }}/support-resources.md %}) to help quickly resolve an issue. + For guidelines on using and interpreting replication metrics, refer to [Replicator Metrics]({% link molt/replicator-metrics.md %}). ### Logging diff --git a/src/current/molt/replicator-flags.md b/src/current/molt/replicator-flags.md index ec240ee4008..8ae5b08aec1 100644 --- a/src/current/molt/replicator-flags.md +++ b/src/current/molt/replicator-flags.md @@ -19,6 +19,7 @@ This page lists all available flags for the [MOLT Replicator commands]({% link m | `--claim` | `make-jwt` | `BOOL` | If `true`, print a minimal JWT claim instead of signing. | | `--collapseMutations` | `start`, `pglogical`, `mylogical` | `BOOL` | Combine multiple mutations on the same primary key within each batch into a single mutation.

**Default:** `true` | | `--defaultGTIDSet` | `mylogical` | `STRING` | **Required** the first time `replicator` is run. The default GTID set, in the format `source_uuid:min(interval_start)-max(interval_end)`, which provides a replication marker for streaming changes. | +| `--dataDir` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Base directory for replicator data (for example, metrics snapshots).

**Default:** `"replicator-data"` | | `--disableAuthentication` | `start` | `BOOL` | Disable authentication of incoming Replicator requests; not recommended for production. | | `--discard` | `start` | `BOOL` | **Dangerous:** Discard all incoming HTTP requests; useful for changefeed throughput testing. Not intended for production. | | `--discardDelay` | `start` | `DURATION` | Adds additional delay in discard mode; useful for gauging the impact of changefeed round-trip time (RTT). | @@ -38,6 +39,10 @@ This page lists all available flags for the [MOLT Replicator commands]({% link m | `--logFormat` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Choose log output format: `"fluent"`, `"text"`.

**Default:** `"text"` | | `--maxRetries` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `INT` | Maximum number of times to retry a failed mutation on the target (for example, due to contention or a temporary unique constraint violation) before treating it as a hard failure.

**Default:** `10` | | `--metricsAddr` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | A `:port` or `host:port` on which to serve metrics and diagnostics. The metrics endpoint is `http://{host}:{port}/_/varz`. | +| `--metricsSnapshotCompression` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Compression for snapshot files: `"gzip"` or `"none"`.

**Default:** `"gzip"` | +| `--metricsSnapshotPeriod` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `DURATION` | How often to write a metrics snapshot to a file (for example, `15s`, `1m`). Set to `0` to disable.

**Default:** `0` | +| `--metricsSnapshotRetentionSize` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `STRING` | Delete the oldest snapshots when the total size of the `metrics-snapshots` directory in the [`--dataDir`](#data-dir) exceeds this value (for example, `100MB`, `1GiB`). Either this flag or `--metricsSnapshotRetentionTime` (or both) must be enabled.

**Default:** `""` | +| `--metricsSnapshotRetentionTime` | `start`, `pglogical`, `mylogical`, `oraclelogminer` | `DURATION` | Delete snapshots older than this duration (for example, `24h`, `168h`). Set to `0` to disable. Either this flag or `--metricsSnapshotRetentionSize` (or both) must be enabled.

**Default:** `168h` | | `--ndjsonBufferSize` | `start` | `INT` | The maximum amount of data to buffer while reading a single line of `ndjson` input; increase when source cluster has large blob values.

**Default:** `65536` | | `--oracle-application-users` | `oraclelogminer` | `STRING` | List of Oracle usernames responsible for DML transactions in the PDB schema. Enables replication from the latest-possible starting point. Usernames are case-sensitive and must match the internal Oracle usernames (e.g., `PDB_USER`). | | `-o`, `--out` | `make-jwt` | `STRING` | A file to write the token to. | diff --git a/src/current/molt/replicator-metrics.md b/src/current/molt/replicator-metrics.md index 8111f8ce683..de0343902c7 100644 --- a/src/current/molt/replicator-metrics.md +++ b/src/current/molt/replicator-metrics.md @@ -307,6 +307,206 @@ For checkpoint terminology, refer to the [MOLT Replicator documentation]({% link [Read more about userscript metrics]({% link molt/userscript-metrics.md %}). +## Metrics snapshots + +When enabled, the metrics snapshotter periodically writes out a point-in-time snapshot of Replicator's Prometheus metrics to a file in the [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir). Metrics snapshots can help with debugging when direct access to the Prometheus server is not available, and you can [bundle snapshots and send them to CockroachDB support](#bundle-and-send-metrics-snapshots) to help resolve an issue. A metrics snapshot includes all of the metrics on this page. + +Metrics snapshotting is disabled by default, and can be enabled with the [`--metricsSnapshotPeriod`]({% link molt/replicator-flags.md %}#metrics-snapshot-period) Replicator flag. [Replicator metrics must be enabled](#set-up-metrics) (with the [`--metricsAddr`]({% link molt/replicator-flags.md %}#metrics-addr) flag) in order for metrics snapshotting to work. + +If snapshotting is enabled, the snapshot period must be at least 15 seconds. The recommended range for the snapshot period is 15-60 seconds. 
The retention policy for metrics snapshot files can be determined by [time]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-time) and by the [total size]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-size) of the snapshot data subdirectory. At least one retention policy must be configured. Snapshots can also be [compressed to a gzip file]({% link molt/replicator-flags.md %}#metrics-snapshot-compression). + +Changing the snapshotter's configuration requires restarting the Replicator binary with different flags. + +### Enable metrics snapshotting + +#### Step 1. Run Replicator with the snapshot flags + +The following is an example of a `replicator` command where snapshotting is configured: + +
+ + + + +
+ +
+{% include_cached copy-clipboard.html %} +~~~shell +replicator pglogical \ +--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \ +--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \ +--slotName molt_slot \ +--bindAddr 0.0.0.0:30004 \ +--stagingSchema _replicator \ +--stagingCreateSchema \ +--disableAuthentication \ +--tlsSelfSigned \ +--stageMode crdb \ +--bestEffortWindow 1s \ +--flushSize 1000 \ +--metricsAddr :30005 \ +--metricsSnapshotPeriod 15s \ +--metricsSnapshotCompression gzip \ +--metricsSnapshotRetentionTime 168h \ +-v +~~~ +
+ +
+{% include_cached copy-clipboard.html %} +~~~shell +replicator mylogical \ +--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \ +--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \ +--defaultGTIDSet '4c658ae6-e8ad-11ef-8449-0242ac140006:1-29' \ +--bindAddr 0.0.0.0:30004 \ +--stagingSchema _replicator \ +--stagingCreateSchema \ +--disableAuthentication \ +--tlsSelfSigned \ +--stageMode crdb \ +--bestEffortWindow 1s \ +--flushSize 1000 \ +--metricsAddr :30005 \ +--metricsSnapshotPeriod 15s \ +--metricsSnapshotCompression gzip \ +--metricsSnapshotRetentionTime 168h \ +-v +~~~ +
+ +
+{% include_cached copy-clipboard.html %} +~~~shell +replicator oraclelogminer \ +--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \ +--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \ +--scn 26685786 \ +--backfillFromSCN 26685444 \ +--bindAddr 0.0.0.0:30004 \ +--stagingSchema _replicator \ +--stagingCreateSchema \ +--disableAuthentication \ +--tlsSelfSigned \ +--stageMode crdb \ +--bestEffortWindow 1s \ +--flushSize 1000 \ +--metricsAddr :30005 \ +--metricsSnapshotPeriod 15s \ +--metricsSnapshotCompression gzip \ +--metricsSnapshotRetentionTime 168h \ +-v +~~~ +
+ +
+{% include_cached copy-clipboard.html %} +~~~shell +replicator start \ +--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \ +--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \ +--bindAddr 0.0.0.0:30004 \ +--stagingSchema _replicator \ +--stagingCreateSchema \ +--disableAuthentication \ +--tlsSelfSigned \ +--stageMode crdb \ +--bestEffortWindow 1s \ +--flushSize 1000 \ +--metricsAddr :30005 \ +--metricsSnapshotPeriod 15s \ +--metricsSnapshotCompression gzip \ +--metricsSnapshotRetentionTime 168h \ +-v +~~~ +
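+
+The commands above configure only time-based retention. As a minimal sketch, size-based retention could be combined with it by adding the [`--metricsSnapshotRetentionSize`]({% link molt/replicator-flags.md %}#metrics-snapshot-retention-size) flag to any of those commands. The `1GiB` threshold is an illustrative assumption taken from the flag's documented example values, not a recommended setting:
+
+~~~shell
+# Illustrative fragment only: append to one of the replicator commands above.
+# 168h is the documented default retention time; 1GiB is an assumed example size.
+--metricsSnapshotRetentionTime 168h \
+--metricsSnapshotRetentionSize 1GiB
+~~~
+
+Per the flag descriptions, the first flag deletes snapshots older than the given duration, and the second deletes the oldest snapshots once the `metrics-snapshots` directory exceeds the given size.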
+ +If successful, Replicator will start, and the console output will indicate that the snapshotter has started as well: + +~~~ +INFO [Feb 2 10:20:32] Replicator starting +... +INFO [Feb 2 10:20:32] metrics snapshotter started, writing to replicator-data/metrics-snapshots every 15s, retaining 168h0m0s +~~~ + +Upon interruption of Replicator, the snapshotter will be stopped: + +~~~ +INFO [Feb 2 10:26:45] Interrupted +INFO [Feb 2 10:26:45] metrics snapshotter stopped +INFO [Feb 2 10:26:45] Server shutdown complete +~~~ + +#### Step 2. Find the snapshot files in the data directory + +You can find the snapshot files in the [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir): + +{% include_cached copy-clipboard.html %} +~~~shell +cd replicator-data/metrics-snapshots && ls . | tail -n 5 +~~~ + +~~~ +snapshot-20260202T152405.737Z.txt.gz +snapshot-20260202T152420.736Z.txt.gz +snapshot-20260202T152435.736Z.txt.gz +snapshot-20260202T152450.735Z.txt.gz +snapshot-20260202T152505.735Z.txt.gz +~~~ + +The uncompressed files list the metrics collected at that snapshot: + +{% include_cached copy-clipboard.html %} +~~~shell +gzcat snapshot-20260202T152505.735Z.txt.gz | head -n 3 +~~~ + +~~~ +# HELP cdc_resolved_timestamp_buffer_size Current size of the resolved timestamp buffer channel which is yet to be processed by Pebble Stager +# TYPE cdc_resolved_timestamp_buffer_size gauge +cdc_resolved_timestamp_buffer_size 0.0 1.770045905735e+09 +~~~ + +### Bundle and send metrics snapshots + +The following requires a Linux system that supports bash. + +#### Step 1. Download the export script + +Download the [metrics snapshot export script](https://replicator.cockroachdb.com/export-metrics-snapshots.sh). Ensure it's accessible and can be run by the current user. + +#### Step 2. Run a snapshot export + +Run an export, indicating the `metrics-snapshots` directory within your [Replicator data directory]({% link molt/replicator-flags.md %}#data-dir). 
You can also provide start and end timestamps to define a subset of metrics to bundle. Times are in UTC and use the format `YYYYMMDDTHHMMSS`. + +Running the script without timestamps bundles all of the data in the snapshot directory. For example: + +{% include_cached copy-clipboard.html %} +~~~shell +./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots +~~~ + +Running the script with one timestamp bundles all of the data in the snapshot directory beginning at that timestamp. For example: + +{% include_cached copy-clipboard.html %} +~~~shell +./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000 +~~~ + +Running the script with two timestamps bundles all of the data in the snapshot directory between the two timestamps. For example: + +{% include_cached copy-clipboard.html %} +~~~shell +./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000 20260115T140000 +~~~ + +The output is a `.tar.gz` file written to the directory from which you ran the script (or to a path specified as an optional argument). + +#### Step 3. Upload output file to a support ticket + +Attach this bundled metrics snapshot file to a [support ticket]({% link {{ site.current_cloud_version }}/support-resources.md %}) to give the support team metrics information that's relevant to your issue. + ## See also - [MOLT Replicator]({% link molt/molt-replicator.md %}) diff --git a/src/current/molt/userscript-metrics.md b/src/current/molt/userscript-metrics.md index dad4a1eebf4..6fa11c4d52c 100644 --- a/src/current/molt/userscript-metrics.md +++ b/src/current/molt/userscript-metrics.md @@ -9,7 +9,7 @@ To improve observability and debugging in the field, [MOLT Replicator]({% link m All userscript metrics include a `script_` prefix and are automatically labeled with the relevant schema or table for each configured handler (for example, `schema="target.public"`). 
If a userscript defines both schema-level and table-level handlers, separate label values will be created for each. -These metrics are part of the default [Replicator Prometheus metrics]({% link molt/replicator-metrics.md %}) set and can be visualized immediately using the provided [`replicator.json` Grafana dashboard file](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json). +These metrics are part of the default [Replicator Prometheus metrics]({% link molt/replicator-metrics.md %}) set and can be visualized immediately using the provided [`replicator.json` Grafana dashboard file](https://replicator.cockroachdb.com/replicator_grafana_dashboard.json). They are also included in [Replicator metrics snapshots]({% link molt/replicator-metrics.md %}#metrics-snapshots). Consider using these metrics to: