Skip to content

Add OpenTelemetry local observability#72

Open
tony wants to merge 92 commits into
masterfrom
otel-bootstrap
Open

Add OpenTelemetry local observability#72
tony wants to merge 92 commits into
masterfrom
otel-bootstrap

Conversation

@tony

@tony tony commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

Adds a passive OpenTelemetry observability layer for local development while keeping packaged use quiet and best-effort. The implementation keeps SDK and exporter imports behind the lazy telemetry backend, adds optional OTel/Pyroscope dependencies, and instruments the CLI, MCP, TUI, engine, pytest, logs, metrics, and live LGTM smoke paths.

The branch also adds project-level SQLite shortcut tracing. The upstream SQLite DB-API instrumentor sees explicit cursor execution, but agentgrep reads SQLite stores through connection shortcut methods. Those methods now emit child spans under existing app roots without recording bound values, prompt text, file contents, or local database paths.

Validation

  • just otel-acceptance --timeout 300 confirmed traces, metrics, logs, and profiles in LGTM, including both agentgrep.cli.invocation and agentgrep.otel.smoke multi-span roots plus agentgrep.sqlite.execute and agentgrep.sqlite.executemany spans.
  • rm -rf docs/_build; uv run ruff check . --fix --show-fixes; uv run ruff format .; uv run ty check; uv run py.test --reruns 0 -vvv; just build-docs; completed successfully.
  • Commit message line audit confirmed the checked commit messages are at or below the requested 80-character line budget.

@tony tony force-pushed the otel-bootstrap branch from c3f2c48 to 6b662c4 Compare June 20, 2026 15:27
@tony tony force-pushed the otel-bootstrap branch from dca4d4c to 575bed3 Compare June 20, 2026 20:01
@tony tony force-pushed the otel-bootstrap branch from fbe0378 to f52d1b1 Compare June 21, 2026 10:01
tony added 28 commits June 27, 2026 12:43
why: The MCP middleware wrapped each request and tool in a pass-through
span and used bespoke attributes, missing the standard SpanKind.SERVER,
mcp.method.name, and gen_ai.tool.name conventions and never linking an
inbound traceparent.

what:
- Rename the request/tool roots to mcp.server.request / mcp.server.tool
  and delete the mcp.operation / mcp.call_next pass-through spans.
- Nest FastMCP's server_span inside each root for the native MCP/GenAI
  conventions; the agentgrep root keeps the redacted attributes and audit
  log, so raw arguments never reach the SERVER span.
- Extract an inbound traceparent from the request metadata and inherit it
  so the caller's trace links to agentgrep's.
- Update the taxonomy, acceptance allow-list, and MCP tests; pin
  fastmcp<4 (it does not auto-emit server spans); document the shape.
why: Logs now carry agentgrep_* and the trace/span ids as OTLP structured
metadata instead of a JSON body, so the acceptance log check's
"| json" parser stage no longer matches the exported records.

what:
- Drop the "| json" stage; filter run-scoped logs on structured metadata
  and read trace/span ids from the per-entry metadata map.
- Replace the JSON-parser-error and unstructured-body rejections with a
  missing-trace-link and a missing-VCS-label rejection; update the tests
  and the acceptance-evidence doc.

note: the broader trace and metric acceptance still use the existing
walkers, which pass their unit tests; folding them into declarative
TraceQL/PromQL is a measured follow-up that needs live LGTM confirmation
rather than a blind rewrite behind the offline gate.
why: Under AGENTGREP_OTEL the conftest wraps every test in an
agentgrep.pytest.test root, leaving _CURRENT_SPAN set for the test body.
Tests that swap to the in-memory backend and assert their own span is a
root (parent_id is None) or that sql_span is a no-op without an active
span then saw the ambient pytest.test span as parent and failed, so the
suite was not green under the develop-with-OTel-on workflow.

what:
- configure_backend (test-only; no production callers) now also clears
  _CURRENT_SPAN, so each test starts from a clean root exactly as the
  uninstrumented suite does.
- Add a parametrized test for the clear; the whole suite now passes under
  AGENTGREP_OTEL=test (was 5 failures), and OTel-off is unchanged.
why: FastMCP 3.x auto-emits its own SpanKind.SERVER spans (tools/list,
tools/call {name}) with mcp.method.name/gen_ai.tool.name/mcp.session.id
whenever a global OTel TracerProvider is present — the same provider
agentgrep installs. Composing server_span on top produced a redundant
second SERVER layer per request (verified live: 11 spans, with both my
request {method} and FastMCP's {method}).

what:
- Delete _native_server_span and its two call sites; FastMCP's native
  SERVER span now nests directly under the agentgrep mcp.server.request /
  mcp.server.tool roots, which keep trace ownership, redaction, metrics,
  the audit log, and inbound-traceparent inheritance.
- Correct the cost-model paragraph that wrongly claimed fastmcp does not
  auto-emit. Verified live: one SERVER span per request/tool, 8 spans.
why: The trace->profile span pivot tagged the thread that creates a span,
but agentgrep's CPU runs on executor / to_thread / Textual worker
threads, so the busy thread carried no span_id and the old child-span
tagging was on the wrong thread.

what:
- Install a worker-thread tagger (gated on profiling) that runs on the
  worker via the context-propagation helpers, reads the propagated native
  span id, and sets it as the Pyroscope span_id thread-tag for the
  callable's duration. Keep _telemetry.py SDK-free via an installed hook.
- Remove the wrong-thread _pyroscope_span_scope from start_span.
- Cover the seam across all three helpers, the SDK tagger, and id-format
  parity.

note: root-level profile linking works via pyroscope.profile.id; live
checks show grafana/otel-lgtm's Pyroscope does not surface the
high-cardinality span_id tag as a label (the upstream processor's root
tag is hidden too), so the per-span pivot needs a Pyroscope ingestion
change, not an agentgrep one.
why: The MCP middleware extracts an inbound W3C traceparent and opens
mcp.server.request with inherit_otel_context, but nothing proved a
caller's traceparent links the caller's trace to agentgrep's request.

what:
- Add a parametrized SDK-backed test that drives the real middleware seam
  (_inbound_otel_context + _attach_otel_context + span(inherit_otel_context))
  with a faked request_ctx meta: a present traceparent links to the
  caller's trace; absent, meta-less, malformed, and no-request-ctx all
  sever cleanly without raising.

note: live cross-process linking additionally needs the caller to inject
a traceparent; fastmcp's stock client never calls inject_trace_context,
so the server-side inheritance is ready but only fires for callers that
inject one.
why: Logs now carry trace_id as OTLP structured metadata. The Loki
derived field already resolves a clickable log->trace link (verified live
on Grafana 13.0.1; byte-identical to upstream docker-otel-lgtm), so no
config change is needed — but nothing guarded it against a datasource-uid
rename, and the O5 declarative-acceptance question was still open.

what:
- Parametrize a datasource cross-link coherence test: the Loki log->trace
  derived field and the three Tempo trace-out links each resolve to the
  uid of their named datasource (catches uid drift); replaces the narrower
  traces-to-metrics/profiles test.
- Record why _orphan_trace_spans stays a walker: TraceQL cannot test a
  span's parentID against the trace's own span-id set, so O5's declarative
  rewrite is a no-go for the load-bearing orphan policy.
why: An adversarial review found the new profiler-tagger and inbound-
traceparent tests failed under AGENTGREP_OTEL=1 — the develop-with-OTel-on
loop. The conftest wraps each item in an agentgrep.pytest.test span;
clearing the facade span in configure_backend (F1) did not detach the
native OTel context that these tests read, so the ambient span leaked in.
Two tagger tests also under-proved their claims.

what:
- Add a _clean_native_otel_context helper and isolate the three native-
  context tests (worker tagger, inbound traceparent) with it; monkeypatch
  the session-wide profiler tagger to None in the no-tagger test.
- Cover the worker tagger's pyroscope-failure degrade path.
- Compare the worker span id against PyroscopeSpanProcessor's real join
  key instead of OpenTelemetry's own formatter.

The telemetry and script suites now pass under both AGENTGREP_OTEL unset
and =1.
why: Live Pyroscope evidence is decisive: the bundled
grafana/otel-lgtm Pyroscope indexes span_name but drops the
high-cardinality span_id, so span_id is never a queryable label.
The worker-thread tagger therefore set a Pyroscope thread-tag that
ingestion discards -- every span_id selector (root and child)
returns zero samples. It was dead instrumentation threaded through
three context-propagation helpers for a join the backend cannot
serve. Root-level trace->profile linking already works via the
PyroscopeSpanProcessor root tag, pyroscope.profile.id, and span_name.

what:
- Remove the worker tagger plumbing from _telemetry
  (_PROFILER_THREAD_TAGGER, _profiler_thread_tag) and restore the
  plain context.run call sites in executor_submit, to_thread, and
  wrap_callable_context.
- Remove _worker_profiler_thread_tag and _current_native_span_id
  from the OTel backend plus the build/shutdown wiring and the now
  unused functools import; keep the PyroscopeSpanProcessor root pivot.
- Drop the five worker-tagger tests; keep the executor context
  propagation test.
- Rewrite the cost-model profile-pivot paragraph: root pivot works,
  per-span span_id filtering is a Pyroscope ingestion limitation.
why: The inbound-context inheritance is correct and unit-tested, but
in practice it fires only when a caller propagates a W3C traceparent
in request metadata. Stock MCP clients, including agentgrep's own
CLI, do not inject one, so the cross-service link is dormant in
everyday use. Recording that scope keeps the cost model honest about
which pivots resolve without a propagating upstream tracer.

what:
- Note in the cost model that traceparent inheritance fires only for
  a propagating caller and is otherwise covered by the unit test.
- Add a concise note to _inbound_otel_context documenting that it
  returns None unless the caller propagated a traceparent.
why: The MCP server lifecycle except block sat outside the
`with _telemetry.span("agentgrep.mcp.server.lifecycle")`, so by the
time it ran the lifecycle span had closed and _CURRENT_SPAN had reset
to the parent agentgrep.mcp.server root. The error annotations
(agentgrep_outcome=error, agentgrep_error_type, agentgrep_duration_ms)
therefore landed on the parent root, not the lifecycle span, on any
exception out of run(). The success path set them inside the with.

what:
- Extract the lifecycle block into _run_server_lifecycle(run), opening
  the lifecycle span and keeping the ok and error annotation/log paths
  inside the span context.
- Drive it from main() with build_mcp_server().run().
- Test (parametrized ok/error) that the outcome lands on the lifecycle
  span and never on the server root.
why: span() wrapped backend.start_span in an outer try/finally whose
finally always called finish_span -- even if start_span failed to
enter, recording a phantom finished span. finish_span only reads the
passed span state, so it belongs inside the span context.

what:
- Move finish_span into the inner finally (inside backend.start_span),
  dropping the outer try/finally. A failed start now records nothing.
- Test (parametrized) that a backend whose start_span entry raises
  records no finished span, while the normal path records one.
why: The MCP middleware imported `from opentelemetry import context`
directly to attach an inbound traceparent, the one OTel import left in
a normal application path. AGENTS.md keeps SDK imports behind the lazy
agentgrep._telemetry_otel backend.

what:
- Add attach_otel_context to _telemetry_otel (holds the otel import)
  and a facade _telemetry.attach_otel_context that no-ops without an
  inbound context and otherwise delegates to the lazy backend.
- Drop _attach_otel_context and the now-unused cabc import from the
  middleware; call the facade at the on_request site (behavior is
  unchanged: a real attach iff inbound is present).
- Point the inbound-context test at the facade; cover the no-op path.
why: AgentgrepTelemetryMiddleware and on_request described their spans
as app-level "roots", but they open them with span(), so they are
children of the agentgrep.mcp.server root. The wording misstated the
span hierarchy.

what:
- Reword both docstrings to "app-level span" without "roots".
why: APP_ROOT_SPAN_NAMES and the acceptance APPROVED_ROOTS already
canonicalize agentgrep.mcp.server, but the AGENTS.md illustrative
root-span list omitted it, so the doc lagged the code.

what:
- Add agentgrep.mcp.server to the approved root-span examples.
why: Under concurrent workload the 0.5s git timeout in the VCS
resource-attribute probe fired, so `git branch --show-current` returned
nothing and the `git name-rev` fallback resolved an arbitrary ref at
HEAD — often a local-only `backup/*` branch that is not the checked-out
branch and not on the remote. Telemetry source attribution then split
between the real branch and an unpushable one, breaking source links.

what:
- Raise the one-time git probe timeout from 0.5s to 2.0s so the cheap
  branch lookup survives load and the deterministic path is used.
- Skip throwaway local backup refs in the name-rev fallback so VCS
  attribution never points telemetry at an unpushed `backup/*` branch.
why: The metric->trace pivot relies on exemplars (trace_id + span_id on
metric points recorded inside a sampled span). The MeterProvider relied
on the SDK default exemplar filter; pinning it keeps the link working
regardless of SDK-version default changes or an OTEL_METRICS_EXEMPLAR_FILTER
env override.

what:
- Set MeterProvider(exemplar_filter=TraceBasedExemplarFilter()).
why: The LGTM coverage audit found three gaps. `agentgrep ui` (and the
`--ui` overlay) reported as the `agentgrep-cli` service, burying TUI
traces, metrics, and profiles under one-shot CLI runs. `find` and the TUI
search had spans and logs but no work metrics, so they were invisible on
metric dashboards while `search` and `grep` were not.

what:
- Report the interactive explorer as the `agentgrep-tui` service via a
  cheap argv check (`ui` subcommand or `--ui` overlay).
- Emit `agentgrep.find.sources` / `agentgrep.find.results` on both the
  CLI engine path (iter_find_events) and the MCP path (run_find_query).
- Emit `agentgrep.tui.search.results` for interactive searches.
why: scripts/lgtm/up.sh and the acceptance script tracked :latest, which
silently pulled a months-old build. Pinning a known release keeps the dev
and CI stacks reproducible. The container config label is bumped so an
existing container is recreated (docker run) rather than restarted (docker
start) — recreation also re-stages the single-file bind mounts cleanly under
Rancher Desktop / WSL, where docker start reuses a stale empty mount folder.

what:
- Pin grafana/otel-lgtm:0.28.0 (override via AGENTGREP_LGTM_IMAGE) in
  scripts/lgtm/up.sh and scripts/otel_acceptance.py.
- Bump the up.sh container config label to force recreation on change.

note: 0.28.0 runs Prometheus 3.11.3 with --web.enable-otlp-receiver and
--enable-feature=exemplar-storage, so it ingests the OTLP exemplars the app
emits (trace-based filter) and the metric->trace pivot works in Grafana. A
host process already listening on :9090 (e.g. a Debian-packaged system
Prometheus) shadows the container's published port under Rancher Desktop /
WSL, so confirm exemplars in Grafana or via docker exec, not a bare host
curl to :9090.
why: The OTel signals agentgrep exports to the local LGTM stack had no
curated way to view them. A provisioned dashboard suite turns the raw
metric surface into a usable DX and agentic-debugging view that is always
present when the stack starts.

what:
- Add scripts/lgtm/generate_dashboards.py: a deterministic generator that
  renders six dashboards (Overview RED, Search Engine, Find & Grep, MCP
  Server, Surfaces & Versions, Agentic Debug Session) from the generic
  span instruments plus the domain histograms and counters. Latency panels
  enable exemplars for the metric->trace pivot.
- Add the grafana-dashboards-agentgrep.yaml provider and mount it plus the
  generated folder in up.sh, alongside a startup regeneration step so a
  fresh checkout always has the boards. Bump the container config label to
  force recreation with the new mounts.
- Suppress E501 for the generator: its panels embed PromQL as f-strings
  that read better unwrapped, matching the scripts/benchmark.py precedent.
why: agentgrep's exported log bodies are plain text — the structured
attributes ride in Loki structured metadata, not the line. The agentic
dashboard's logs panel piped the stream through `| json`, which errors on
the non-JSON body. The bare stream selector still surfaces every attribute
and keeps the trace_id derived-field link to Tempo.

what:
- Render the agentic Telemetry logs panel with a bare stream selector.
why: The benchmark surface ran as a PEP 723 script in an isolated uv env
where the agentgrep dist is not installed, so
importlib.metadata.version("agentgrep") raised and service.version fell
back to "0+unknown" — a source-attribution gap that violates the
package-version-only rule, while cli/tui/profile-engine (installed dist)
reported the real version.

what:
- package_version() takes an optional repo_root and falls back to the
  static [project].version in pyproject.toml (via tomllib) before
  "0+unknown"; setup() threads repo_root through.
- Add a unit test covering the dist-missing fallback and the repo_root
  None / missing-file degradation to "0+unknown".
why: shutdown() called pyroscope.shutdown() unconditionally. In passive
local mode (a git checkout with AGENTGREP_OTEL unset) the backend never
calls pyroscope.configure(), so the teardown found no agent to drop and
logged "Pyroscope Agent shutdown failed" to stderr via the SDK
last-resort handler — user-visible noise on every --help and other
early-exit path.

what:
- Guard the Pyroscope teardown with the existing profiles_started flag,
  mirroring the start-side guard so teardown matches startup.
- Add a unit test exercising the real shutdown path (profiles_started
  False makes no pyroscope.shutdown call; True does).
why: The trace-to-profile pivot resolves only to a service, never to a
single span, and that is worth recording so it is not mistaken for a
configuration gap to chase.

what:
- Document that pyroscope-otel tags only root spans with
  pyroscope.profile.id and that span id is a per-sample profile
  attribute, not a queryable Pyroscope series label, so Grafana pivots at
  service granularity by design.
why: At the 100 Hz default a sub-second one-shot CLI run yields only a
handful of CPU samples — too coarse to read — so short-lived profiles
were effectively empty while only long-lived TUI/benchmark runs profiled
usefully.

what:
- Raise the Pyroscope sample rate to 997 Hz (prime near 1 kHz) in the
  explicit profile modes, so a short run captures tens of samples and
  resolves real call graphs. gil_only stays on; native GIL-free CPU is
  still excluded by design.
…lity

why: One-shot CLI/benchmark/profile-engine processes each export a fresh
cumulative metric stream that plateaus at its in-process total and then
dies, so rate()/increase() read 0 and instant queries drop the series —
only the long-lived TUI showed live throughput. The series labelset is
stable across processes, so per-process deltas can accumulate on one
stream in the long-lived collector.

what:
- Export DELTA temporality for counters and histograms in live mode
  (OTLPMetricExporter preferred_temporality), leaving passive-local,
  debug, and debug-console cumulative.
- Add a repo-owned collector config (scripts/lgtm/otelcol-config.yaml)
  with a deltatocumulative processor on the metrics pipeline, mounted by
  up.sh and the acceptance stack; bump both config labels so the
  container recreates. Exemplars survive the delta->cumulative conversion.
- Bring scripts/otel_acceptance.py in line (mount + constant) and correct
  its stale exemplar note; fix the acceptance test's image assertion.
why: The grep-invert metrics attached the per-call candidate/emitted
counts and the duration as metric LABELS. Those are unbounded per-call
values, so every grep -v invocation minted a unique single-sample series
and rate()/increase() could never aggregate — leaving the Find & Grep
dashboard's grep duration, selectivity, and funnel panels empty.

what:
- Split a low-cardinality metric_attributes set (surface, operation,
  command, scope, output_mode, invert flags, outcome) from the full
  span/log attributes, and record the three grep metrics with the
  low-cardinality set only. The per-call values still ride the span and
  the structured log.
why: The Overview error-rate stat went blank instead of 0% when there
were no errors (an empty rate() numerator yields no series), and the
Search board's Scope template variable queried a non-existent `scope`
label instead of `agentgrep_scope`, so it never offered values.

what:
- Wrap the error-rate numerator in `or vector(0)` so the stat renders 0%
  at zero errors.
- Add a label_name override to the template-variable helper and point the
  Scope variable at the real agentgrep_scope label.
- Regenerate the overview and search dashboards.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant