Add OpenTelemetry local observability#72
Open
tony wants to merge 92 commits into
Open
Conversation
why: The MCP middleware wrapped each request and tool in a pass-through span and used bespoke attributes, missing the standard SpanKind.SERVER, mcp.method.name, and gen_ai.tool.name conventions and never linking an inbound traceparent. what: - Rename the request/tool roots to mcp.server.request / mcp.server.tool and delete the mcp.operation / mcp.call_next pass-through spans. - Nest FastMCP's server_span inside each root for the native MCP/GenAI conventions; the agentgrep root keeps the redacted attributes and audit log, so raw arguments never reach the SERVER span. - Extract an inbound traceparent from the request metadata and inherit it so the caller's trace links to agentgrep's. - Update the taxonomy, acceptance allow-list, and MCP tests; pin fastmcp<4 (it does not auto-emit server spans); document the shape.
why: Logs now carry agentgrep_* and the trace/span ids as OTLP structured metadata instead of a JSON body, so the acceptance log check's "| json" parser stage no longer matches the exported records. what: - Drop the "| json" stage; filter run-scoped logs on structured metadata and read trace/span ids from the per-entry metadata map. - Replace the JSON-parser-error and unstructured-body rejections with a missing-trace-link and a missing-VCS-label rejection; update the tests and the acceptance-evidence doc. note: the broader trace and metric acceptance still use the existing walkers, which pass their unit tests; folding them into declarative TraceQL/PromQL is a measured follow-up that needs live LGTM confirmation rather than a blind rewrite behind the offline gate.
why: Under AGENTGREP_OTEL the conftest wraps every test in an agentgrep.pytest.test root, leaving _CURRENT_SPAN set for the test body. Tests that swap to the in-memory backend and assert their own span is a root (parent_id is None) or that sql_span is a no-op without an active span then saw the ambient pytest.test span as parent and failed, so the suite was not green under the develop-with-OTel-on workflow. what: - configure_backend (test-only; no production callers) now also clears _CURRENT_SPAN, so each test starts from a clean root exactly as the uninstrumented suite does. - Add a parametrized test for the clear; the whole suite now passes under AGENTGREP_OTEL=test (was 5 failures), and OTel-off is unchanged.
why: FastMCP 3.x auto-emits its own SpanKind.SERVER spans (tools/list,
tools/call {name}) with mcp.method.name/gen_ai.tool.name/mcp.session.id
whenever a global OTel TracerProvider is present — the same provider
agentgrep installs. Composing server_span on top produced a redundant
second SERVER layer per request (verified live: 11 spans, with both my
request {method} and FastMCP's {method}).
what:
- Delete _native_server_span and its two call sites; FastMCP's native
SERVER span now nests directly under the agentgrep mcp.server.request /
mcp.server.tool roots, which keep trace ownership, redaction, metrics,
the audit log, and inbound-traceparent inheritance.
- Correct the cost-model paragraph that wrongly claimed fastmcp does not
auto-emit. Verified live: one SERVER span per request/tool, 8 spans.
why: The trace->profile span pivot tagged the thread that creates a span, but agentgrep's CPU runs on executor / to_thread / Textual worker threads, so the busy thread carried no span_id and the old child-span tagging was on the wrong thread. what: - Install a worker-thread tagger (gated on profiling) that runs on the worker via the context-propagation helpers, reads the propagated native span id, and sets it as the Pyroscope span_id thread-tag for the callable's duration. Keep _telemetry.py SDK-free via an installed hook. - Remove the wrong-thread _pyroscope_span_scope from start_span. - Cover the seam across all three helpers, the SDK tagger, and id-format parity. note: root-level profile linking works via pyroscope.profile.id; live checks show grafana/otel-lgtm's Pyroscope does not surface the high-cardinality span_id tag as a label (the upstream processor's root tag is hidden too), so the per-span pivot needs a Pyroscope ingestion change, not an agentgrep one.
why: The MCP middleware extracts an inbound W3C traceparent and opens mcp.server.request with inherit_otel_context, but nothing proved a caller's traceparent links the caller's trace to agentgrep's request. what: - Add a parametrized SDK-backed test that drives the real middleware seam (_inbound_otel_context + _attach_otel_context + span(inherit_otel_context)) with a faked request_ctx meta: a present traceparent links to the caller's trace; absent, meta-less, malformed, and no-request-ctx all sever cleanly without raising. note: live cross-process linking additionally needs the caller to inject a traceparent; fastmcp's stock client never calls inject_trace_context, so the server-side inheritance is ready but only fires for callers that inject one.
why: Logs now carry trace_id as OTLP structured metadata. The Loki derived field already resolves a clickable log->trace link (verified live on Grafana 13.0.1; byte-identical to upstream docker-otel-lgtm), so no config change is needed — but nothing guarded it against a datasource-uid rename, and the O5 declarative-acceptance question was still open. what: - Parametrize a datasource cross-link coherence test: the Loki log->trace derived field and the three Tempo trace-out links each resolve to the uid of their named datasource (catches uid drift); replaces the narrower traces-to-metrics/profiles test. - Record why _orphan_trace_spans stays a walker: TraceQL cannot test a span's parentID against the trace's own span-id set, so O5's declarative rewrite is a no-go for the load-bearing orphan policy.
why: An adversarial review found the new profiler-tagger and inbound- traceparent tests failed under AGENTGREP_OTEL=1 — the develop-with-OTel-on loop. The conftest wraps each item in an agentgrep.pytest.test span; clearing the facade span in configure_backend (F1) did not detach the native OTel context that these tests read, so the ambient span leaked in. Two tagger tests also under-proved their claims. what: - Add a _clean_native_otel_context helper and isolate the three native- context tests (worker tagger, inbound traceparent) with it; monkeypatch the session-wide profiler tagger to None in the no-tagger test. - Cover the worker tagger's pyroscope-failure degrade path. - Compare the worker span id against PyroscopeSpanProcessor's real join key instead of OpenTelemetry's own formatter. The telemetry and script suites now pass under both AGENTGREP_OTEL unset and =1.
why: Live Pyroscope evidence is decisive: the bundled grafana/otel-lgtm Pyroscope indexes span_name but drops the high-cardinality span_id, so span_id is never a queryable label. The worker-thread tagger therefore set a Pyroscope thread-tag that ingestion discards -- every span_id selector (root and child) returns zero samples. It was dead instrumentation threaded through three context-propagation helpers for a join the backend cannot serve. Root-level trace->profile linking already works via the PyroscopeSpanProcessor root tag, pyroscope.profile.id, and span_name. what: - Remove the worker tagger plumbing from _telemetry (_PROFILER_THREAD_TAGGER, _profiler_thread_tag) and restore the plain context.run call sites in executor_submit, to_thread, and wrap_callable_context. - Remove _worker_profiler_thread_tag and _current_native_span_id from the OTel backend plus the build/shutdown wiring and the now unused functools import; keep the PyroscopeSpanProcessor root pivot. - Drop the five worker-tagger tests; keep the executor context propagation test. - Rewrite the cost-model profile-pivot paragraph: root pivot works, per-span span_id filtering is a Pyroscope ingestion limitation.
why: The inbound-context inheritance is correct and unit-tested, but in practice it fires only when a caller propagates a W3C traceparent in request metadata. Stock MCP clients, including agentgrep's own CLI, do not inject one, so the cross-service link is dormant in everyday use. Recording that scope keeps the cost model honest about which pivots resolve without a propagating upstream tracer. what: - Note in the cost model that traceparent inheritance fires only for a propagating caller and is otherwise covered by the unit test. - Add a concise note to _inbound_otel_context documenting that it returns None unless the caller propagated a traceparent.
why: The MCP server lifecycle except block sat outside the
`with _telemetry.span("agentgrep.mcp.server.lifecycle")`, so by the
time it ran the lifecycle span had closed and _CURRENT_SPAN had reset
to the parent agentgrep.mcp.server root. The error annotations
(agentgrep_outcome=error, agentgrep_error_type, agentgrep_duration_ms)
therefore landed on the parent root, not the lifecycle span, on any
exception out of run(). The success path set them inside the with.
what:
- Extract the lifecycle block into _run_server_lifecycle(run), opening
the lifecycle span and keeping the ok and error annotation/log paths
inside the span context.
- Drive it from main() with build_mcp_server().run().
- Test (parametrized ok/error) that the outcome lands on the lifecycle
span and never on the server root.
why: span() wrapped backend.start_span in an outer try/finally whose finally always called finish_span -- even if start_span failed to enter, recording a phantom finished span. finish_span only reads the passed span state, so it belongs inside the span context. what: - Move finish_span into the inner finally (inside backend.start_span), dropping the outer try/finally. A failed start now records nothing. - Test (parametrized) that a backend whose start_span entry raises records no finished span, while the normal path records one.
why: The MCP middleware imported `from opentelemetry import context` directly to attach an inbound traceparent, the one OTel import left in a normal application path. AGENTS.md keeps SDK imports behind the lazy agentgrep._telemetry_otel backend. what: - Add attach_otel_context to _telemetry_otel (holds the otel import) and a facade _telemetry.attach_otel_context that no-ops without an inbound context and otherwise delegates to the lazy backend. - Drop _attach_otel_context and the now-unused cabc import from the middleware; call the facade at the on_request site (behavior is unchanged: a real attach iff inbound is present). - Point the inbound-context test at the facade; cover the no-op path.
why: AgentgrepTelemetryMiddleware and on_request described their spans as app-level "roots", but they open them with span(), so they are children of the agentgrep.mcp.server root. The wording misstated the span hierarchy. what: - Reword both docstrings to "app-level span" without "roots".
why: APP_ROOT_SPAN_NAMES and the acceptance APPROVED_ROOTS already canonicalize agentgrep.mcp.server, but the AGENTS.md illustrative root-span list omitted it, so the doc lagged the code. what: - Add agentgrep.mcp.server to the approved root-span examples.
why: Under concurrent workload the 0.5s git timeout in the VCS resource-attribute probe fired, so `git branch --show-current` returned nothing and the `git name-rev` fallback resolved an arbitrary ref at HEAD — often a local-only `backup/*` branch that is not the checked-out branch and not on the remote. Telemetry source attribution then split between the real branch and an unpushable one, breaking source links. what: - Raise the one-time git probe timeout from 0.5s to 2.0s so the cheap branch lookup survives load and the deterministic path is used. - Skip throwaway local backup refs in the name-rev fallback so VCS attribution never points telemetry at an unpushed `backup/*` branch.
why: The metric->trace pivot relies on exemplars (trace_id + span_id on metric points recorded inside a sampled span). The MeterProvider relied on the SDK default exemplar filter; pinning it keeps the link working regardless of SDK-version default changes or an OTEL_METRICS_EXEMPLAR_FILTER env override. what: - Set MeterProvider(exemplar_filter=TraceBasedExemplarFilter()).
why: The LGTM coverage audit found three gaps. `agentgrep ui` (and the `--ui` overlay) reported as the `agentgrep-cli` service, burying TUI traces, metrics, and profiles under one-shot CLI runs. `find` and the TUI search had spans and logs but no work metrics, so they were invisible on metric dashboards while `search` and `grep` were not. what: - Report the interactive explorer as the `agentgrep-tui` service via a cheap argv check (`ui` subcommand or `--ui` overlay). - Emit `agentgrep.find.sources` / `agentgrep.find.results` on both the CLI engine path (iter_find_events) and the MCP path (run_find_query). - Emit `agentgrep.tui.search.results` for interactive searches.
why: scripts/lgtm/up.sh and the acceptance script tracked :latest, which silently pulled a months-old build. Pinning a known release keeps the dev and CI stacks reproducible. The container config label is bumped so an existing container is recreated (docker run) rather than restarted (docker start) — recreation also re-stages the single-file bind mounts cleanly under Rancher Desktop / WSL, where docker start reuses a stale empty mount folder. what: - Pin grafana/otel-lgtm:0.28.0 (override via AGENTGREP_LGTM_IMAGE) in scripts/lgtm/up.sh and scripts/otel_acceptance.py. - Bump the up.sh container config label to force recreation on change. note: 0.28.0 runs Prometheus 3.11.3 with --web.enable-otlp-receiver and --enable-feature=exemplar-storage, so it ingests the OTLP exemplars the app emits (trace-based filter) and the metric->trace pivot works in Grafana. A host process already listening on :9090 (e.g. a Debian-packaged system Prometheus) shadows the container's published port under Rancher Desktop / WSL, so confirm exemplars in Grafana or via docker exec, not a bare host curl to :9090.
why: The OTel signals agentgrep exports to the local LGTM stack had no curated way to view them. A provisioned dashboard suite turns the raw metric surface into a usable DX and agentic-debugging view that is always present when the stack starts. what: - Add scripts/lgtm/generate_dashboards.py: a deterministic generator that renders six dashboards (Overview RED, Search Engine, Find & Grep, MCP Server, Surfaces & Versions, Agentic Debug Session) from the generic span instruments plus the domain histograms and counters. Latency panels enable exemplars for the metric->trace pivot. - Add the grafana-dashboards-agentgrep.yaml provider and mount it plus the generated folder in up.sh, alongside a startup regeneration step so a fresh checkout always has the boards. Bump the container config label to force recreation with the new mounts. - Suppress E501 for the generator: its panels embed PromQL as f-strings that read better unwrapped, matching the scripts/benchmark.py precedent.
why: agentgrep's exported log bodies are plain text — the structured attributes ride in Loki structured metadata, not the line. The agentic dashboard's logs panel piped the stream through `| json`, which errors on the non-JSON body. The bare stream selector still surfaces every attribute and keeps the trace_id derived-field link to Tempo. what: - Render the agentic Telemetry logs panel with a bare stream selector.
why: The benchmark surface ran as a PEP 723 script in an isolated uv env
where the agentgrep dist is not installed, so
importlib.metadata.version("agentgrep") raised and service.version fell
back to "0+unknown" — a source-attribution gap that violates the
package-version-only rule, while cli/tui/profile-engine (installed dist)
reported the real version.
what:
- package_version() takes an optional repo_root and falls back to the
static [project].version in pyproject.toml (via tomllib) before
"0+unknown"; setup() threads repo_root through.
- Add a unit test covering the dist-missing fallback and the repo_root
None / missing-file degradation to "0+unknown".
why: shutdown() called pyroscope.shutdown() unconditionally. In passive local mode (a git checkout with AGENTGREP_OTEL unset) the backend never calls pyroscope.configure(), so the teardown found no agent to drop and logged "Pyroscope Agent shutdown failed" to stderr via the SDK last-resort handler — user-visible noise on every --help and other early-exit path. what: - Guard the Pyroscope teardown with the existing profiles_started flag, mirroring the start-side guard so teardown matches startup. - Add a unit test exercising the real shutdown path (profiles_started False makes no pyroscope.shutdown call; True does).
why: The trace-to-profile pivot resolves only to a service, never to a single span, and that is worth recording so it is not mistaken for a configuration gap to chase. what: - Document that pyroscope-otel tags only root spans with pyroscope.profile.id and that span id is a per-sample profile attribute, not a queryable Pyroscope series label, so Grafana pivots at service granularity by design.
why: At the 100 Hz default a sub-second one-shot CLI run yields only a handful of CPU samples — too coarse to read — so short-lived profiles were effectively empty while only long-lived TUI/benchmark runs profiled usefully. what: - Raise the Pyroscope sample rate to 997 Hz (prime near 1 kHz) in the explicit profile modes, so a short run captures tens of samples and resolves real call graphs. gil_only stays on; native GIL-free CPU is still excluded by design.
…lity why: One-shot CLI/benchmark/profile-engine processes each export a fresh cumulative metric stream that plateaus at its in-process total and then dies, so rate()/increase() read 0 and instant queries drop the series — only the long-lived TUI showed live throughput. The series labelset is stable across processes, so per-process deltas can accumulate on one stream in the long-lived collector. what: - Export DELTA temporality for counters and histograms in live mode (OTLPMetricExporter preferred_temporality), leaving passive-local, debug, and debug-console cumulative. - Add a repo-owned collector config (scripts/lgtm/otelcol-config.yaml) with a deltatocumulative processor on the metrics pipeline, mounted by up.sh and the acceptance stack; bump both config labels so the container recreates. Exemplars survive the delta->cumulative conversion. - Bring scripts/otel_acceptance.py in line (mount + constant) and correct its stale exemplar note; fix the acceptance test's image assertion.
why: The grep-invert metrics attached the per-call candidate/emitted counts and the duration as metric LABELS. Those are unbounded per-call values, so every grep -v invocation minted a unique single-sample series and rate()/increase() could never aggregate — leaving the Find & Grep dashboard's grep duration, selectivity, and funnel panels empty. what: - Split a low-cardinality metric_attributes set (surface, operation, command, scope, output_mode, invert flags, outcome) from the full span/log attributes, and record the three grep metrics with the low-cardinality set only. The per-call values still ride the span and the structured log.
why: The Overview error-rate stat went blank instead of 0% when there were no errors (an empty rate() numerator yields no series), and the Search board's Scope template variable queried a non-existent `scope` label instead of `agentgrep_scope`, so it never offered values. what: - Wrap the error-rate numerator in `or vector(0)` so the stat renders 0% at zero errors. - Add a label_name override to the template-variable helper and point the Scope variable at the real agentgrep_scope label. - Regenerate the overview and search dashboards.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a passive OpenTelemetry observability layer for local development while keeping packaged use quiet and best-effort. The implementation keeps SDK and exporter imports behind the lazy telemetry backend, adds optional OTel/Pyroscope dependencies, and instruments the CLI, MCP, TUI, engine, pytest, logs, metrics, and live LGTM smoke paths.
The branch also adds project-level SQLite shortcut tracing. The upstream SQLite DB-API instrumentor sees explicit cursor execution, but agentgrep reads SQLite stores through connection shortcut methods. Those methods now emit child spans under existing app roots without recording bound values, prompt text, file contents, or local database paths.
Validation
just otel-acceptance --timeout 300confirmed traces, metrics, logs, and profiles in LGTM, including bothagentgrep.cli.invocationandagentgrep.otel.smokemulti-span roots plusagentgrep.sqlite.executeandagentgrep.sqlite.executemanyspans.rm -rf docs/_build; uv run ruff check . --fix --show-fixes; uv run ruff format .; uv run ty check; uv run py.test --reruns 0 -vvv; just build-docs;completed successfully.