
PR -> feat: Unified Telemetry Layer for Non-LangGraph Trace Pipelines (M2) Body#64

Draft
mjehanzaib999 wants to merge 62 commits into AgentOpt:experimental from mjehanzaib999:m2-unified-telemetry

Conversation

@mjehanzaib999

Summary

This PR implements the "Generic Unified Telemetry" layer (Milestone 2), enabling OTEL span emission for non-LangGraph Trace pipelines while preserving all existing LangGraph instrumentation behavior.

After M1, only LangGraph pipelines could emit OTEL spans. This PR extends telemetry coverage so that any Trace pipeline using @trace.bundle or call_llm can produce OTEL-compatible spans when a TelemetrySession is active — with zero changes to existing code when no session is active.

What's new

  • Session activation via contextvars — TelemetrySession supports `with` context-manager usage and activate() for global discovery by Trace hooks
  • OTEL spans around @trace.bundle ops — controlled by BundleSpanConfig (enable/disable, suppress default ops, capture inputs)
  • MessageNode-to-span binding — MessageNodeTelemetryConfig binds message.id to the current span for stable node identity in TGJ conversion
  • call_llm provider span — emits a child OTEL span with trace.temporal_ignore=true when a session is active (visible for monitoring, excluded from output node selection)
  • Session activation in the LangGraph root span — InstrumentedGraph._root_invocation_span now calls session.activate() so Trace-level hooks discover the session automatically
  • Optional MLflow autologging — opto.features.mlflow.autolog() enables mlflow.trace wrapping on bundle ops; safe no-op when MLflow is not installed
  • Export naming alignment — export_run_bundle() now writes otlp.json / tgj.json (aligned with repo demos), with backward-compatible aliases (otlp_trace.json / trace_graph.json)
  • Manifest + node records — manifest.json and message_nodes.jsonl are included in the export bundle for debugging
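The activation and discovery contract described above can be sketched in a few lines of stdlib Python. This is a hypothetical re-implementation for illustration only, not the actual code in opto/trace/io/telemetry_session.py; only the names TelemetrySession, current(), and activate() are taken from the PR text.

```python
import contextvars

# Hypothetical sketch of the discovery pattern, not the real opto code.
_active_session = contextvars.ContextVar("active_session", default=None)

class TelemetrySession:
    def __init__(self):
        self._token_stack = []

    @classmethod
    def current(cls):
        # Hooks in @trace.bundle / call_llm check this; None means no-op.
        return _active_session.get()

    def __enter__(self):
        self._token_stack.append(_active_session.set(self))
        return self

    def __exit__(self, *exc):
        _active_session.reset(self._token_stack.pop())

    # activate()/deactivate() share the token stack with the context
    # manager, so the two activation styles compose safely.
    def activate(self):
        self._token_stack.append(_active_session.set(self))

    def deactivate(self):
        _active_session.reset(self._token_stack.pop())

assert TelemetrySession.current() is None
with TelemetrySession() as s:
    assert TelemetrySession.current() is s
assert TelemetrySession.current() is None
```

Guarding every hook on `TelemetrySession.current() is None` is what gives the "zero changes when no session is active" property claimed above.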

Files changed (9 files, +664 / -71)

File Change
opto/trace/settings.py New — global MLflow autologging toggle
opto/features/mlflow/__init__.py New — MLflow integration package
opto/features/mlflow/autolog.py New — autolog() / disable_autolog()
opto/trace/__init__.py Expose settings and mlflow in public API
opto/trace/bundle.py Optional OTEL span in sync_forward/async_forward; MLflow mlflow.trace wrapping
opto/trace/io/telemetry_session.py Major expansion: activation, BundleSpanConfig, MessageNodeTelemetryConfig, span helpers, MLflow helpers, export alignment
opto/trace/io/instrumentation.py Wrap root span with session.activate()
opto/trace/nodes.py Hook MessageNode.__init__ to call on_message_node_created()
opto/trace/operators.py call_llm emits temporal-ignore provider span

Non-breaking guarantees

  • No session active → identical behavior — all hooks are guarded by TelemetrySession.current() is None checks
  • postprocess_output signature unchanged — preserves compatibility with existing callers
  • preprocess_inputs preserved — data extraction inside trace_nodes context is untouched
  • MLflow is optional — all imports are guarded; code works without MLflow installed

Test plan

  • opto.trace import works without errors
  • TelemetrySession + BundleSpanConfig + MessageNodeTelemetryConfig import correctly
  • Bundle ops without a session produce identical results to M1 (no regression)
  • Bundle ops with active session emit OTEL spans with trace.bundle.* and inputs.* attributes
  • TelemetrySession.current() returns None outside context, active session inside
  • export_run_bundle() produces otlp.json, tgj.json, manifest.json + legacy aliases
  • autolog(silent=True) gracefully disables when MLflow is not installed
  • Run M1 notebook end-to-end to confirm no regressions
  • Run M2 demo notebook (generic_unified_telemetry_demo.ipynb)
  • pytest suite passes in clean environment

doxav and others added 30 commits February 12, 2026 15:01
…tion do not lose initial node to optimize (TODO: trainer might have a better solution)
- Add T1 technical plan for LangGraph OTEL Instrumentation API
- Add architecture & strategy doc (unified OTEL instrumentation design)
- Add M0 README with before/after boilerplate reduction comparison
- Add feedback analysis and API strategy comparison (Trace-first, dual semconv)
- Add prototype_api_validation.py with real LangGraph StateGraph + OpenRouter/StubLLM
- Add Jupyter notebook (prototype_api_validation.ipynb) for Colab-ready demo
- Add example trace output JSON files (notebook_trace_output, optimization_traces)
- Add .env.example for OpenRouter configuration
- Replace hardcoded API key with 3-tier auto-lookup (Colab Secrets → env → .env)
- Save all trace outputs to RUN_FOLDER (Google Drive on Colab, local fallback)
- Add run_summary.json export with scores and history
- Update configuration docs with key setup priority table
- Fix Colab badge URL with actual repo/branch path
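The 3-tier key lookup can be sketched as follows. This is a hypothetical helper; the actual notebook cell may differ in details, and the key name OPENROUTER_API_KEY is assumed from the OpenRouter configuration context.

```python
import os

def find_openrouter_key(dotenv_path=".env"):
    """Resolve the API key: Colab Secrets -> env var -> .env file."""
    # 1. Colab Secrets (only importable inside Google Colab)
    try:
        from google.colab import userdata  # type: ignore
        key = userdata.get("OPENROUTER_API_KEY")
        if key:
            return key
    except Exception:
        pass
    # 2. Environment variable
    key = os.environ.get("OPENROUTER_API_KEY")
    if key:
        return key
    # 3. .env file in the working directory (simple KEY=VALUE lines)
    try:
        with open(dotenv_path) as f:
            for line in f:
                name, _, value = line.strip().partition("=")
                if name == "OPENROUTER_API_KEY" and value:
                    return value
    except FileNotFoundError:
        pass
    return None
```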
Deliver Milestone 1 — drop-in OTEL instrumentation and end-to-end
optimization for any LangGraph agent via two function calls.

New modules (opto/trace/io/):
- instrumentation.py: instrument_graph() + InstrumentedGraph wrapper
- optimization.py: optimize_graph() loop + EvalResult/EvalFn contracts
- telemetry_session.py: TelemetrySession (TracerProvider + flush/export)
- bindings.py: Binding dataclass + apply_updates() + make_dict_binding()
- otel_semconv.py: emit_reward(), emit_trace(), record_genai_chat()

Modified modules:
- langgraph_otel_runtime.py: TracingLLM dual semconv (param.* parent +
  gen_ai.* child spans with trace.temporal_ignore)
- __init__.py: export all new M1 public APIs

Tests (63 passing, StubLLM-only, CI-safe):
- Unit tests for bindings, semconv, session, instrumentation, optimization
- E2E integration test (test_e2e_m1_pipeline.py): real LangGraph with
  StubLLM proving full pipeline instrument → invoke → OTLP → TGJ →
  optimizer → apply_updates → re-invoke with updated template

Notebook + docs:
- 01_m1_instrument_and_optimize.ipynb: dual-mode (StubLLM + live
  OpenRouter), Colab badge, executed outputs, <=3 item dataset,
  temperature=0, max_tokens=256 budget guard
- docs/m1_README.md: architecture, API reference, data flow, semantic
  conventions, acceptance criteria status
- requirements.txt: pinned dependencies for uv/pip environments
A. Live mode error handling:
 - A1: TracingLLM raises LLMCallError on HTTP errors/empty content instead of passing error strings as assistant content
 - A2: Notebook only prints [OK] when provider call actually succeeds with non-empty content
 - A3: gen_ai.provider.name correctly set to "openrouter" (not "openai") when using OpenRouter
 - A4: optimize_graph forces score=0 on invocation failure, bypassing eval_fn

B. TelemetrySession API correctness + redaction:
 - B5: flush_otlp(clear=False) properly peeks at spans without clearing the exporter
 - B6: span_attribute_filter now applied during flush_otlp; supports drop (return {}), redact, and truncate

C. TGJ/ingest correctness and optimizer safety:
 - C7: _deduplicate_param_nodes() strips numeric suffixes to collapse duplicate ParameterNodes
 - C8: _select_output_node() excludes child LLM spans, selects the true sink (synthesizer)

D. OTEL topology and temporal chaining:
 - D9: Root invocation span wraps graph.invoke(), producing a single trace ID per invocation
 - D10: Temporal chaining uses trace.temporal_ignore attribute instead of OTEL parent presence

E. optimize_graph semantics + trace-linked reward:
 - E11: best_parameters is a real snapshot captured at the best-scoring iteration
 - E12: eval.score attached to root invocation span before flush, linking reward to trace

F. Non-saturating scoring for Stub mode:
 - F13: StubLLM and eval_fn are structure-aware; stub optimization demonstrates score improvement

Files changed:
 - langgraph_otel_runtime.py: LLMCallError, _validate_content, flush_otlp(clear=)
 - telemetry_session.py: flush_otlp delegation, _apply_attribute_filter
 - otel_adapter.py: root span exclusion, trace.temporal_ignore chaining
 - instrumentation.py: _root_invocation_span context manager, root span on invoke/stream
 - optimization.py: _deduplicate_param_nodes, _select_output_node, _snapshot_parameters, eval-in-trace
 - __init__.py: export LLMCallError
 - test_optimization.py: updated for best_parameters field
 - 01_m1_instrument_and_optimize.ipynb: all fixes reflected in notebook
 - test_client_feedback_fixes.py: 20 new tests covering all 13 issues
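Item D10's attribute-based selection, which C7/C8's _select_output_node relies on, can be sketched over plain span dicts. This is a simplified illustration under assumed span shapes, not the actual implementation.

```python
def select_output_span(spans):
    """Skip provider child spans flagged trace.temporal_ignore and
    return the last-ending remaining span, a simple proxy for the
    pipeline's true sink (e.g. the synthesizer)."""
    eligible = [s for s in spans
                if not s["attributes"].get("trace.temporal_ignore")]
    return max(eligible, key=lambda s: s["end_time"], default=None)

spans = [
    {"name": "planner", "end_time": 1, "attributes": {}},
    {"name": "llm.chat.completion", "end_time": 3,
     "attributes": {"trace.temporal_ignore": True}},
    {"name": "synthesizer", "end_time": 2, "attributes": {}},
]
assert select_output_span(spans)["name"] == "synthesizer"
```

Keying the exclusion on an explicit attribute, rather than on span names like "openai", is what makes the selection provider-agnostic.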
… code

Make the instrumentation layer fully generic and provider-agnostic:

- TracingLLM: default provider_name "openai" → "llm",
  default llm_span_name "openai.chat.completion" → "llm.chat.completion"
- init_otel_runtime: default service_name "trace-langgraph-demo" → "trace-otel-runtime"
- DEFAULT_EVAL_METRIC_KEYS: remove example-specific "plan_quality",
  add generic "score"
- instrument_graph: add llm_span_name, input_key, output_key parameters
  so callers explicitly configure provider/schema specifics
- InstrumentedGraph: add input_key field; invoke()/stream() use it
  instead of hardcoded "query" for the root span hint
- optimize_graph: add output_key parameter; _make_state uses
  graph.input_key instead of hardcoded "query"; error fallback
  no longer assumes result["answer"]
- _select_output_node: replace hardcoded "openai"/"chat.completion"
  name checks with trace.temporal_ignore attribute from info.otel
- otel_adapter: propagate temporal_ignore flag into TGJ info dict
- tgj_ingest: preserve info.otel metadata through conversion and
  onto MessageNode objects

Tests and notebook updated to explicitly pass example-specific values
(provider_name, llm_span_name, output_key) rather than relying on defaults.

All 88 tests pass.
…st iteration

Previously, best_updates was overwritten on every iteration where updates
were applied, regardless of whether that iteration achieved the best score.
This caused best_updates to always contain the last applied updates rather
than the updates that produced the best-performing parameters.

Introduce last_applied_updates to track the most recently applied updates
separately, and snapshot it at the start of each iteration as
applied_updates_for_this_iter. best_updates is now only assigned inside
the best-score guard (avg_score > best_score), ensuring it accurately
reflects the updates that led to best_parameters.

Addresses PR feedback item doxav#1: optimize_graph() best_updates tracking.
optimize_graph() previously ignored the graph's configured output_key
unless the caller explicitly passed output_key=..., causing incorrect
eval payload shape. Now auto-inherits graph.output_key when the parameter
is not provided, and logs a debug note when an explicit override disagrees
with the graph's configuration.

Addresses PR feedback item doxav#2: output_key fallback in optimize_graph.
enable_code_optimization was accepted by instrument_graph() but never
used — TracingLLM.emit_code_param always remained None. Now constructs
a _emit_code_param callback when the flag is True that emits source code,
SHA-256 hash, truncation metadata, and trainable marker as param.__code_*
span attributes. Source is capped at 10K chars with truncation flag.

Addresses PR feedback item doxav#3: enable_code_optimization no-op.
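A minimal sketch of the attribute payload such a callback might emit. The exact param.__code_* key names are assumptions inferred from the commit text, not the verified schema.

```python
import hashlib

MAX_CODE_CHARS = 10_000  # truncation cap mentioned in the commit

def code_param_attributes(source: str, trainable: bool = True):
    """Build span attributes for a trainable code parameter (sketch).

    Key names under param.__code_* are assumed, not verified.
    """
    truncated = len(source) > MAX_CODE_CHARS
    return {
        "param.__code_source": source[:MAX_CODE_CHARS],
        "param.__code_sha256": hashlib.sha256(source.encode()).hexdigest(),
        "param.__code_truncated": truncated,
        "param.__code_trainable": trainable,
    }

attrs = code_param_attributes("def example(x):\n    return x + 1\n")
```

Hashing the full (untruncated) source lets a consumer detect that a truncated payload corresponds to a known function version.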
(4A) otel_adapter: after temporal hierarchy resolution, null out
effective_psid when it still references a skipped root invocation span,
preventing dangling parent edges in the TGJ graph.

(4B) langgraph_otel_runtime: capture child LLM span ref and propagate
error/error.type attributes to it on LLMCallError and unexpected
exceptions, so OTEL UIs correctly flag the LLM call as failed.

Addresses PR feedback item doxav#4.
…race validation

Notebook trace validation used "openai" in name to detect child spans,
which silently matched nothing after the generic refactoring. Now uses
trace.temporal_ignore attribute for provider-agnostic detection and
asserts the set is non-empty. Also adds root invocation span assertion
to enforce the D9 single-trace-ID invariant.

Addresses PR feedback item doxav#6.
…into m1-for-upstream

…e spans

Library (langgraph_otel_runtime.py):
- Restructure child LLM span error handling: catch errors inside the
  child span context manager so attributes are set before the span ends
- Add error.message attribute (truncated to 500 chars) on both parent
  and child spans for LLMCallError and unexpected exceptions

Notebook (01_m1_instrument_and_optimize.ipynb):
- Rewrite graph to 6-node architecture aligned with reference demo:
  planner → executor → web_researcher/wikidata_researcher → synthesizer → evaluator
- Use Command routing from langgraph.types for dynamic node dispatch
- Switch to DEMO_QUERIES (French Revolution / Tesla / CRISPR)
- Add 3 trainable templates (planner, executor, synthesizer) with output_key=final_answer
- Rewrite StubLLM to produce JSON plans, routing JSON, and topic-aware
  answers; respond to prompt template changes for non-saturating scoring
- Rewrite stub_eval_fn: base 0.2 + plan richness + answer length, cap 0.95
- Fix live section: provider_name="openrouter", trace invariant checks,
  only print [OK] on actual success
- Fix ParameterNode deduplication in TGJ inspection (id-based dedup)
- Update Colab Drive paths to OpenTrace_runs/M1/{OPENTRACE_REF}
- Add optimization table output (iteration → avg_score → best_score)

Verified: 41 tests pass, notebook runs end-to-end, baseline=0.75 → best=0.95
mjehanzaib999 and others added 30 commits February 21, 2026 00:10
- Replace rate-limited meta-llama/llama-3.3-70b-instruct:free with
  qwen/qwen3-next-80b-a3b-instruct:free (instruction-tuned, no thinking traces)
- Use eval_fn=None in Section 9 live optimization so optimize_graph()
  uses the library's _default_eval_fn which reads eval.score from the
  evaluator span in the OTLP trace
- Fix Cell 30 header to say 'openai client' instead of 'Trace LiteLLM'
apply_updates() now normalizes ParameterNode object keys to strings
via _normalize_key(), so OptoPrimeV2 updates are no longer silently
skipped. ingest_tgj() gains a param_cache to reuse stable
ParameterNode instances across multi-query iterations. The backward
pass now iterates all output nodes, and stale OTLP spans are flushed
at the start of optimize_graph().

- bindings.py: accept Dict[Any, Any], return applied dict
- tgj_ingest.py: add param_cache kwarg for ParameterNode reuse
- optimization.py: flush stale spans, use param_cache, fix backward
  loop, use applied dict from apply_updates()
- notebook: enable INFO logging in live optimization cell
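The key-normalization idea can be sketched as follows. Both _normalize_key and apply_updates here are hypothetical simplifications, with a stand-in FakeNode playing the role of ParameterNode.

```python
def _normalize_key(key):
    """Normalize a parameter key to a plain string (sketch).

    Optimizer updates may be keyed by ParameterNode-like objects while
    bindings are keyed by name strings; mixing the two silently drops
    updates, which is the bug this addresses.
    """
    if isinstance(key, str):
        return key
    name = getattr(key, "name", None)  # ParameterNode-like objects
    return name if isinstance(name, str) else str(key)

def apply_updates(params, updates):
    """Apply updates whose keys may be strings or node objects."""
    applied = {}
    for key, value in updates.items():
        k = _normalize_key(key)
        if k in params:
            params[k] = value
            applied[k] = value
    return applied  # return what was actually applied, as in the commit

class FakeNode:
    def __init__(self, name):
        self.name = name

params = {"planner_template": "v1"}
applied = apply_updates(params, {FakeNode("planner_template"): "v2"})
assert params["planner_template"] == "v2"
```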
The GraphPropagator asserts that user_feedback is identical when
aggregating across multiple backward passes. Running zero_feedback →
backward → step per query (matching the BBEH notebook pattern) avoids
this and lets each query contribute updates independently.
…optimizer steps

Replace the per-query backward/step loop with Trace's canonical minibatch
pattern: batchify all output nodes into a single batched target and all
per-query feedback into a single batched feedback string, then call
backward() and step() once. This avoids the GraphPropagator assertion
("user feedback should be the same for all children") while ensuring all
queries' graph paths contribute to the optimization gradient.

The batchify import is lazy-loaded via _ensure_trace_imports() to avoid
pulling in numpy and the trainer package at module level.
Implement TelemetrySession activation via contextvars so @trace.bundle
ops and MessageNode creation can emit OTEL spans outside LangGraph.

- Add BundleSpanConfig and MessageNodeTelemetryConfig to control span
  emission and node-to-span binding (message.id)
- Add bundle_span() context manager and on_message_node_created() hook
  in TelemetrySession for non-LangGraph OTEL visibility
- Wrap sync_forward/async_forward in optional OTEL span when session active
- Emit temporal-ignore child span in call_llm for provider monitoring
- Activate session inside InstrumentedGraph root span so Trace hooks
  discover it automatically
- Add opto.features.mlflow with autolog/disable_autolog (safe no-op
  when MLflow not installed)
- Add opto.trace.settings for global MLflow toggle
- Align export naming to otlp.json/tgj.json with legacy aliases
- Add manifest.json and message_nodes.jsonl to export bundle
Covers all M2 features: TelemetrySession activation, bundle span
emission, default-op silencing, MessageNode binding, call_llm
temporal-ignore spans, export bundle naming, MLflow autolog API,
M1 non-breaking compatibility, and end-to-end non-LangGraph pipeline.
Includes live OpenRouter sections (auto-skipped if no API key).
…ooks and remove stale files

Move 02_m2_unified_telemetry.ipynb into examples/notebooks/ for
consistency with the M1 notebook location. Remove leftover files
from the repo root: M1 notebook copy, OVERVIEW.md, and PR diff files.

…ng works

postprocess_output (which creates the MessageNode) was called after the
bundle span had closed, so on_message_node_created could never find an
active span to attach message.id to. Move it inside the span_cm block
for both sync_forward and async_forward.
The install cell only cloned on first run but never pulled updates
when the repo folder already existed, causing stale code to persist
across runtime restarts. Added git fetch + pull to guarantee the clone stays current.
- Updated M2 notebook install cell to add repo root to sys.path
  when running locally, eliminating the need for pip install
- Added git fetch + pull to Colab install cell so restarts pick up
  latest commits instead of using stale cloned code
- Removed debug probe from MessageNode binding cell
- Relaxed setup.py python_requires from >=3.13 to >=3.12
Added Sections 8a-8c that install MLflow and validate real integration
paths: autolog enabling, bundle wrapping via mlflow.trace(), artifact
logging via TelemetrySession, and log_metric/log_param recording.
… compatibility

mlflow.trace() wrapping accesses fn.__name__ on the decorated callable.
FunModule (the object returned by @Bundle) did not expose this attribute,
causing an AttributeError when executing bundle-decorated functions inside
an active MLflow run. Forward the original function's __name__ and
__qualname__ onto the FunModule instance.
…flow.trace()

- Add Section 8d to M2 notebook: launches MLflow UI inline on Colab
  (port 5000) for visual inspection of experiments, runs, artifacts,
  and metrics. Falls back to terminal instructions when running locally.
- Expose __name__ and __qualname__ on FunModule so mlflow.trace()
  can resolve the function name without AttributeError.
- Update notebook summary tables (header + footer) to include Section 8d.
…n 8d)

Renders an embedded iframe and a direct "Open in new tab" link using
Colab's proxyPort API so users can visually inspect MLflow experiments,
runs, artifacts, and metrics logged by the preceding test cells.
Also exposes __name__/__qualname__ on FunModule to fix AttributeError
when mlflow.trace() wraps @bundle-decorated functions.
Replace proxyPort-based link (blocked by Colab pop-up blocker) with
subprocess.Popen + serve_kernel_port_as_iframe for reliable inline
rendering of the MLflow UI in notebook output.
The call was passing unsupported kwargs (operation, output_messages,
response, temperature, max_tokens) which silently raised TypeError
under the bare except, leaving gen_ai.input.messages and
gen_ai.output.messages unset. Use the actual signature parameters
(provider, model, input_messages, output_text) so the semconv
attributes are recorded on the LLM child span.
Replace single _token with _token_stack list so that nesting
with session: on the same TelemetrySession instance correctly
restores the context variable on each exit instead of leaking
the active session.
Allow activating a TelemetrySession without indenting all pipeline
code under a with-block. Useful in notebooks and long scripts.
Both methods share the _token_stack so they compose safely with
context-manager activation and nested calls.
… cells

Add Section 8e validating MessageNodeTelemetryConfig(mode="span")
which creates dedicated spans when no active span exists.
Add Section 8f validating the full OTLP -> TGJ -> ingest_tgj()
round-trip that underpins the optimization data path.
Update header and summary tables accordingly.
…_signature__

MLflow's capture_function_input_args uses inspect.signature(func) to bind
args. FunModule inherited Module.__call__(self, *args, **kwargs), so
inspect returned the wrong signature and bind failed or produced bad data.
Set __signature__ = inspect.signature(fun) so MLflow sees the real
parameter names (x, y) and can capture inputs correctly.
Remove the previous warning suppression and note from the notebook.
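The fix can be illustrated with a minimal FunModule-like wrapper (a hypothetical sketch; the real FunModule does far more): forwarding __name__, __qualname__, and __signature__ lets tools that introspect the callable, such as mlflow.trace(), see the wrapped function's real identity and parameters.

```python
import inspect

class FunModule:
    """Minimal stand-in for a @bundle wrapper (sketch)."""
    def __init__(self, fun):
        self.fun = fun
        # Forward identity metadata so fn.__name__ and
        # inspect.signature(fn) resolve to the real function.
        self.__name__ = fun.__name__
        self.__qualname__ = fun.__qualname__
        self.__signature__ = inspect.signature(fun)

    def __call__(self, *args, **kwargs):
        return self.fun(*args, **kwargs)

def add(x, y):
    return x + y

wrapped = FunModule(add)
assert wrapped.__name__ == "add"
# inspect.signature() honors __signature__, so the real parameter
# names (x, y) are visible instead of (*args, **kwargs).
assert list(inspect.signature(wrapped).parameters) == ["x", "y"]
assert wrapped(2, 3) == 5
```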