feat: add node telemetry#4281
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the espresso-telemetry crate, which provides in-process telemetry for Espresso nodes by exporting logs via OTLP/HTTP and metrics via Prometheus remote-write, authenticated with BLS-signed JWTs. The review feedback identifies several opportunities to improve the robustness and idiomaticity of the implementation. Key suggestions include using Url::join for safer endpoint construction, avoiding blocking std::thread::sleep within async contexts, and replacing brittle string-based error classification in the retry logic. Additionally, the feedback highlights potential deadlocks in the shutdown sequence and the risk of panics if the system clock is adjusted before the Unix epoch.
| }); | ||
| match join { | ||
| Ok(j) => { | ||
| if j.join().is_err() { |
There was a problem hiding this comment.
The shutdown logic spawns a thread to call provider.shutdown() to avoid deadlocks in current-thread runtimes, but then immediately joins that thread. If provider.shutdown() actually deadlocks, this shutdown call will still block indefinitely. Consider using a timeout for the join or allowing the shutdown thread to detach if the goal is a best-effort graceful exit.
| let endpoint = opts | ||
| .endpoint | ||
| .as_ref() | ||
| .map(|u| u.as_str().to_owned()) | ||
| .unwrap_or_else(|| DEFAULT_OTLP_ENDPOINT.to_owned()); | ||
|
|
||
| // Both signals share this base URL. Reject non-http(s) once here so the | ||
| // metrics push can't silently no-op later; the logs path used to do this | ||
| // check, the metrics path used to skip it. | ||
| let parsed = | ||
| Url::parse(&endpoint).with_context(|| format!("invalid telemetry endpoint: {endpoint}"))?; | ||
| let scheme = parsed.scheme(); | ||
| if scheme != "http" && scheme != "https" { | ||
| anyhow::bail!( | ||
| "telemetry endpoint must use http or https scheme, got {scheme:?}: {endpoint}" | ||
| ); | ||
| } |
There was a problem hiding this comment.
There is redundant URL parsing and string manipulation here. opts.endpoint is already an Option<Url>, so you can work with the Url type directly instead of converting it to a String and parsing it again. Furthermore, using Url::join is more idiomatic and robust than manual string concatenation and trimming for constructing the OTLP and metrics endpoints.
| interval: Duration, | ||
| mut shutdown: oneshot::Receiver<()>, | ||
| ) { | ||
| let url = format!("{}/api/v1/write", endpoint.as_str().trim_end_matches('/')); |
| if attempt >= MAX_ATTEMPTS { | ||
| return result; | ||
| } | ||
| std::thread::sleep(backoff_for(attempt)); |
There was a problem hiding this comment.
Using std::thread::sleep inside an async fn is generally discouraged as it blocks the entire executor thread, which can lead to performance degradation or deadlocks if this exporter is ever used within a shared async runtime (like a Tokio worker pool). While the current OTel SDK BatchLogProcessor uses a dedicated thread, this implementation makes the component less reusable and more fragile. Consider using tokio::time::sleep if a runtime is available, or documenting this blocking behavior more prominently.
| match err { | ||
| OTelSdkError::Timeout(_) => true, | ||
| OTelSdkError::AlreadyShutdown => false, | ||
| OTelSdkError::InternalFailure(msg) => !msg.contains("Status Code: 4"), |
There was a problem hiding this comment.
The retry logic relies on string matching (msg.contains("Status Code: 4")) to identify non-retryable client errors. This is brittle because the error message format is an internal detail of the opentelemetry-otlp crate and is not guaranteed to be stable across versions. If the format changes, the exporter might incorrectly retry permanent failures or stop retrying transient ones.
| fn now_unix_secs() -> u64 { | ||
| SystemTime::now() | ||
| .duration_since(UNIX_EPOCH) | ||
| .expect("system clock before unix epoch") |
There was a problem hiding this comment.
Extract the operator-side telemetry library from crates/espresso/node
into a new shared lib crate so external consumers (the auth proxy +
mock-node in espresso-node-telemetry) can depend on it directly instead
of duplicating the protocol layer.
The new crate hosts:
- Vendored Prometheus remote-write protos (proto/{remote,types}.proto
pinned to prometheus/prometheus v2.55.1) + prost-build codegen.
Replaces the unmaintained prometheus_remote_write crate.
- build_write_request + encode_to_snappy: registry snapshot to
snappy-compressed remote-write protobuf, labels sorted per spec 1.0.
- BLS-BN254 JWT mint/verify (was the sibling crates/token in the
telemetry repo; now embedded here as a private module).
- TelemetryOptions, TelemetryHandle, init: clap-derived options,
lifecycle wrapper, OTel HTTP exporter setup. Stays subscriber-agnostic
by exposing a generic tracing::Layer<S>.
- push_task: dedicated tokio thread that scrapes prometheus::Registry,
encodes, and POSTs on the configured interval.
crates/espresso/node consumes the new crate; src/telemetry.rs and
src/telemetry/{remote_write,push_task}.rs are deleted. The lifecycle
tests move to tests/lifecycle.rs in the new crate.
scripts/update-prometheus-protos.sh fetches the upstream protos from a
configurable tag, strips the Go-only gogoproto annotations, and bumps
the pinned-tag references in the headers + proto/README.md. Idempotent.
- Ships logs via OTLP/HTTP (gzipped protobuf) and metrics via Prometheus remote-write (snappy protobuf) to an aggregator behind an auth proxy - Authenticates with a BLS-BN254 JWT minted from the staking key, sent as `Authorization: Bearer <jwt>` - Logs use the OTel SDK BatchLogProcessor on a dedicated std::thread; metrics push runs on its own single-threaded tokio runtime so a slow proxy can never starve consensus - Bounded retry on log export (4 attempts, 250ms/1s/4s backoff) covers typical proxy/Vector restarts; 4xx and AlreadyShutdown skipped - Default OTel log filter is `warn` to keep operator bandwidth modest - Validates endpoint scheme at startup; init failures return Err and let the node continue without telemetry - Sets `service.instance.id` from `node_name` so log records carry operator identity in the aggregator - Shutdown offloads `SdkLoggerProvider::shutdown` to a fresh OS thread to dodge the SDK's documented current-thread-runtime deadlock
e5cc46a to
1328806
Compare
- Bound logger-provider shutdown with a 10s timeout via mpsc::sync_channel so a deadlocked BatchLogProcessor::shutdown can't hang process exit; detach the thread on timeout - Correct attribution of the deadlock warning in the shutdown rustdoc: the BatchLogProcessor docs are the canonical source, link added - token::now_unix_secs returns anyhow::Result instead of panicking on pre-UNIX_EPOCH clocks, matching now_unix_millis in lib.rs; new TokenVerifyError::SystemClock variant for the verify path - Bump opentelemetry, opentelemetry-appender-tracing, opentelemetry-otlp, opentelemetry_sdk workspace deps from 0.31 to 0.32
Mirrors the OTel `service.name` / `service.instance.id` resource attributes on the logs side so the aggregator can partition metrics and logs under the same `<service>/<instance>/` layout. Labels are applied at push time and skip any name the metric already carries, so registry-provided const labels still win.
- replace `eprintln!` with `tracing::error!` for telemetry init failures (deferred until after the subscriber is installed) - drop speculative "node startup will fail next" branch; the keyset error surfaces through the main flow - point proto README to `scripts/update-prometheus-protos.sh` instead of the manual recipe; remove reference to a "previously-used" crate
Nextest failures (1) in this run
See the step summary for flaky tests and slowest tests. |
Our side of the telemetry code is in https://github.com/EspressoSystems/espresso-node-telemetry
Extract the operator-side telemetry library from crates/espresso/node into a new shared lib crate so external consumers (the auth proxy + mock-node in espresso-node-telemetry) can depend on it directly instead of duplicating the protocol layer.
The new crate hosts:
.crates/espresso/node consumes the new crate; src/telemetry.rs and src/telemetry/{remote_write,push_task}.rs are deleted. The lifecycle tests move to tests/lifecycle.rs in the new crate.
scripts/update-prometheus-protos.sh fetches the upstream protos from a configurable tag, strips the Go-only gogoproto annotations, and bumps the pinned-tag references in the headers + proto/README.md. Idempotent.