Skip to content

feat: add node telemetry#4281

Draft
sveitser wants to merge 6 commits into
mainfrom
ma/espresso-node-telemetry
Draft

feat: add node telemetry#4281
sveitser wants to merge 6 commits into
mainfrom
ma/espresso-node-telemetry

Conversation

@sveitser
Copy link
Copy Markdown
Collaborator

@sveitser sveitser commented May 10, 2026

Our side of the telemetry code is in https://github.com/EspressoSystems/espresso-node-telemetry

Extract the operator-side telemetry library from crates/espresso/node into a new shared lib crate so external consumers (the auth proxy + mock-node in espresso-node-telemetry) can depend on it directly instead of duplicating the protocol layer.

The new crate hosts:

  • Vendored Prometheus remote-write protos (proto/{remote,types}.proto pinned to prometheus/prometheus v2.55.1) + prost-build codegen. Replaces the unmaintained prometheus_remote_write crate.
  • build_write_request + encode_to_snappy: registry snapshot to snappy-compressed remote-write protobuf, labels sorted per spec 1.0.
  • BLS-BN254 JWT mint/verify (was the sibling crates/token in the telemetry repo; now embedded here as a private module).
  • TelemetryOptions, TelemetryHandle, init: clap-derived options, lifecycle wrapper, OTel HTTP exporter setup. Stays subscriber-agnostic by exposing a generic tracing::Layer.
  • push_task: dedicated tokio thread that scrapes prometheus::Registry, encodes, and POSTs on the configured interval.

crates/espresso/node consumes the new crate; src/telemetry.rs and src/telemetry/{remote_write,push_task}.rs are deleted. The lifecycle tests move to tests/lifecycle.rs in the new crate.

scripts/update-prometheus-protos.sh fetches the upstream protos from a configurable tag, strips the Go-only gogoproto annotations, and bumps the pinned-tag references in the headers + proto/README.md. Idempotent.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the espresso-telemetry crate, which provides in-process telemetry for Espresso nodes by exporting logs via OTLP/HTTP and metrics via Prometheus remote-write, authenticated with BLS-signed JWTs. The review feedback identifies several opportunities to improve the robustness and idiomaticity of the implementation. Key suggestions include using Url::join for safer endpoint construction, avoiding blocking std::thread::sleep within async contexts, and replacing brittle string-based error classification in the retry logic. Additionally, the feedback highlights potential deadlocks in the shutdown sequence and the risk of panics if the system clock is adjusted before the Unix epoch.

});
match join {
Ok(j) => {
if j.join().is_err() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The shutdown logic spawns a thread to call provider.shutdown() to avoid deadlocks in current-thread runtimes, but then immediately joins that thread. If provider.shutdown() actually deadlocks, this shutdown call will still block indefinitely. Consider using a timeout for the join or allowing the shutdown thread to detach if the goal is a best-effort graceful exit.

Comment on lines +282 to +298
let endpoint = opts
.endpoint
.as_ref()
.map(|u| u.as_str().to_owned())
.unwrap_or_else(|| DEFAULT_OTLP_ENDPOINT.to_owned());

// Both signals share this base URL. Reject non-http(s) once here so the
// metrics push can't silently no-op later; the logs path used to do this
// check, the metrics path used to skip it.
let parsed =
Url::parse(&endpoint).with_context(|| format!("invalid telemetry endpoint: {endpoint}"))?;
let scheme = parsed.scheme();
if scheme != "http" && scheme != "https" {
anyhow::bail!(
"telemetry endpoint must use http or https scheme, got {scheme:?}: {endpoint}"
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is redundant URL parsing and string manipulation here. opts.endpoint is already an Option<Url>, so you can work with the Url type directly instead of converting it to a String and parsing it again. Furthermore, using Url::join is more idiomatic and robust than manual string concatenation and trimming for constructing the OTLP and metrics endpoints.

interval: Duration,
mut shutdown: oneshot::Receiver<()>,
) {
let url = format!("{}/api/v1/write", endpoint.as_str().trim_end_matches('/'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since endpoint is already a Url, you can use endpoint.join("api/v1/write") instead of manual string formatting and trimming. This is more idiomatic and handles path segments correctly.

if attempt >= MAX_ATTEMPTS {
return result;
}
std::thread::sleep(backoff_for(attempt));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using std::thread::sleep inside an async fn is generally discouraged as it blocks the entire executor thread, which can lead to performance degradation or deadlocks if this exporter is ever used within a shared async runtime (like a Tokio worker pool). While the current OTel SDK BatchLogProcessor uses a dedicated thread, this implementation makes the component less reusable and more fragile. Consider using tokio::time::sleep if a runtime is available, or documenting this blocking behavior more prominently.

match err {
OTelSdkError::Timeout(_) => true,
OTelSdkError::AlreadyShutdown => false,
OTelSdkError::InternalFailure(msg) => !msg.contains("Status Code: 4"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The retry logic relies on string matching (msg.contains("Status Code: 4")) to identify non-retryable client errors. This is brittle because the error message format is an internal detail of the opentelemetry-otlp crate and is not guaranteed to be stable across versions. If the format changes, the exporter might incorrectly retry permanent failures or stop retrying transient ones.

Comment thread crates/espresso/telemetry/src/token.rs Outdated
fn now_unix_secs() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.expect("system clock before unix epoch")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using .expect() on duration_since(UNIX_EPOCH) will cause a panic if the system clock is ever adjusted to a time before the Unix epoch. It is safer to handle this case gracefully (e.g., by returning an error or defaulting to 0), similar to how now_unix_millis is implemented in lib.rs.

sveitser added 2 commits May 13, 2026 17:20
Extract the operator-side telemetry library from crates/espresso/node
into a new shared lib crate so external consumers (the auth proxy +
mock-node in espresso-node-telemetry) can depend on it directly instead
of duplicating the protocol layer.

The new crate hosts:
- Vendored Prometheus remote-write protos (proto/{remote,types}.proto
  pinned to prometheus/prometheus v2.55.1) + prost-build codegen.
  Replaces the unmaintained prometheus_remote_write crate.
- build_write_request + encode_to_snappy: registry snapshot to
  snappy-compressed remote-write protobuf, labels sorted per spec 1.0.
- BLS-BN254 JWT mint/verify (was the sibling crates/token in the
  telemetry repo; now embedded here as a private module).
- TelemetryOptions, TelemetryHandle, init: clap-derived options,
  lifecycle wrapper, OTel HTTP exporter setup. Stays subscriber-agnostic
  by exposing a generic tracing::Layer<S>.
- push_task: dedicated tokio thread that scrapes prometheus::Registry,
  encodes, and POSTs on the configured interval.

crates/espresso/node consumes the new crate; src/telemetry.rs and
src/telemetry/{remote_write,push_task}.rs are deleted. The lifecycle
tests move to tests/lifecycle.rs in the new crate.

scripts/update-prometheus-protos.sh fetches the upstream protos from a
configurable tag, strips the Go-only gogoproto annotations, and bumps
the pinned-tag references in the headers + proto/README.md. Idempotent.
- Ships logs via OTLP/HTTP (gzipped protobuf) and metrics via Prometheus
  remote-write (snappy protobuf) to an aggregator behind an auth proxy
- Authenticates with a BLS-BN254 JWT minted from the staking key, sent
  as `Authorization: Bearer <jwt>`
- Logs use the OTel SDK BatchLogProcessor on a dedicated std::thread;
  metrics push runs on its own single-threaded tokio runtime so a slow
  proxy can never starve consensus
- Bounded retry on log export (4 attempts, 250ms/1s/4s backoff) covers
  typical proxy/Vector restarts; 4xx and AlreadyShutdown skipped
- Default OTel log filter is `warn` to keep operator bandwidth modest
- Validates endpoint scheme at startup; init failures return Err and
  let the node continue without telemetry
- Sets `service.instance.id` from `node_name` so log records carry
  operator identity in the aggregator
- Shutdown offloads `SdkLoggerProvider::shutdown` to a fresh OS thread
  to dodge the SDK's documented current-thread-runtime deadlock
@sveitser sveitser force-pushed the ma/espresso-node-telemetry branch from e5cc46a to 1328806 Compare May 13, 2026 15:22
@sveitser sveitser changed the title feat: add crates/espresso/telemetry; drop prometheus_remote_write feat: add node telemetry May 13, 2026
sveitser added 4 commits May 19, 2026 09:59
- Bound logger-provider shutdown with a 10s timeout via mpsc::sync_channel
  so a deadlocked BatchLogProcessor::shutdown can't hang process exit;
  detach the thread on timeout
- Correct attribution of the deadlock warning in the shutdown rustdoc:
  the BatchLogProcessor docs are the canonical source, link added
- token::now_unix_secs returns anyhow::Result instead of panicking on
  pre-UNIX_EPOCH clocks, matching now_unix_millis in lib.rs; new
  TokenVerifyError::SystemClock variant for the verify path
- Bump opentelemetry, opentelemetry-appender-tracing, opentelemetry-otlp,
  opentelemetry_sdk workspace deps from 0.31 to 0.32
  Mirrors the OTel `service.name` / `service.instance.id` resource
  attributes on the logs side so the aggregator can partition metrics
  and logs under the same `<service>/<instance>/` layout.

  Labels are applied at push time and skip any name the metric already
  carries, so registry-provided const labels still win.
- replace `eprintln!` with `tracing::error!` for telemetry init failures
  (deferred until after the subscriber is installed)
- drop speculative "node startup will fail next" branch; the keyset error
  surfaces through the main flow
- point proto README to `scripts/update-prometheus-protos.sh` instead of
  the manual recipe; remove reference to a "previously-used" crate
@github-actions
Copy link
Copy Markdown
Contributor

Nextest failures (1) in this run

Test Attempts Time (s) Main history
hotshot-testing::test_vid2_upgrade::test_vid2_upgrade::testtypes_::memoryimpl_::test_vid2_upgrade 1 20.62 passing

See the step summary for flaky tests and slowest tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant