feat(tracing): add per-SDK tracing adapters for 9 frameworks#171

Open
itsarbit wants to merge 64 commits into main from feat/sdk-tracing-adapters

Conversation

@itsarbit

Contributor

Summary

Adds 8 per-SDK tracing adapters under arksim/tracing/integrations/ so users of LangChain/LangGraph, CrewAI, Claude Agent SDK, Google ADK, LiveKit Agents, Strands Agents, LlamaIndex, and Smolagents get zero-config tool-call capture through the Python connector. All 13 existing integration examples updated; 2 new examples added (LiveKit, Strands).

What ships

  • BaseTracingAdapter (contextvars-based, no registry) under arksim/tracing/integrations/_base.py
  • PendingToolCalls split-event correlation helper (used by LangChain and LlamaIndex)
  • parse_tool_arguments shared argument normalization helper
  • 8 adapter modules covering 9 frameworks (LangChain adapter covers LangGraph)
  • 10 new ToolCallSource enum variants (8 adapters + Dify + Rasa)
  • 8 new pip extras in pyproject.toml
  • detect-secrets pre-commit hook + initial baseline
  • Multi-extra resolver CI check
  • 95+ unit tests + cross-adapter contract test

Non-goals (explicit)

  • OpenAI Agents SDK adapter changes (already shipped in feat(tracing): add OTel trace receiver for tool call capture #114; canonical path arksim.tracing.openai)
  • Restructuring arksim/tracing/openai.py (file stays in place)
  • OTel + W3C wrapper instrumentation track (post-launch follow-up)
  • Platform-bridge work
  • Response-body tool-call evaluation on raw Chat Completions (decided no per wiki/decisions/no-response-parsed-tool-call-eval.md)

Test plan

  • Unit suite: 915 passing, 4 skipped (was 820 on origin/main, +95 across the rollout)
  • Per-adapter tests cover happy path, error path, source field, no-context, missing fields, multiple calls, split-event correlation (where applicable)
  • Cross-adapter contract test parameterized across 8 adapters
  • ruff check + ruff format --check clean
  • Multi-extra resolver: pip install 'arksim[langchain,crewai,llamaindex,claude-agent,google-adk,livekit-agents,strands,smolagents]' resolves cleanly
  • Pre-commit clean across the full 37-commit diff (120 files, +5525 / -4755)

Manual verification needed before merge

  • Each of the 9 per-SDK adapter examples (LangChain, LangGraph, CrewAI, Claude Agent SDK, Google ADK, LiveKit, Strands, LlamaIndex, Smolagents) runs end-to-end against the respective SDK with a real API key, producing tool_calls in simulation.json.
  • Native-OTel examples (AutoGen, Pydantic AI, Mastra) run E2E. Mastra requires npm install first.
  • Doesn't-fit examples: Dify needs a Dify Agent app with lookup_order + book_table tools configured server-side. Rasa needs a running Rasa server with the included rasa_project. Vercel AI SDK needs npm install && npm start first.

Follow-ups (intentionally NOT in this PR)

  • LangGraph migration from create_react_agent to langchain.agents.create_agent (v1 deprecation)
  • CrewAI ToolCall.id synthesis (currently empty)
  • Claude Agent SDK error path via PostToolUseFailure event
  • README / quickstart retool leading with examples/customer-service/ as canonical demo
  • Ecosystem scan: Agno, Haystack, LangGraph Platform, OpenAI Swarm handoff coverage
  • parse_openai honoring message.tool_calls (would let agent servers skip OTel for simple cases)
  • Mastra deprecation: migrate to Mastra's proprietary AI Tracing when stable

itsarbit added 30 commits May 15, 2026 06:54
Bumps 5 lower bounds to track current stable SDK majors:
- langchain-core: 0.3.0 -> 1.0.0
- llama-index-core: 0.10.0 -> 0.14.0
- google-adk: 0.5.0 -> 1.0.0
- strands-agents: 0.1.0 -> 1.0.0
- claude-agent-sdk: pin to >=0.1.0,<0.3.0 (still pre-1.0)

Pins Yelp/detect-secrets v1.5.0 and seeds a baseline of known-OK
matches so the upcoming adapter test fixtures (which contain
synthetic API tokens) cannot accidentally leak real credentials.

Baseline entries are placeholder strings in README, quickstart docs,
and existing test fixtures. All hashed; no plaintext secrets in the
baseline file. Excludes common dep-lock files and node_modules from
scanning.
…class name

Lifts the JSON-or-fallback tool-argument parsing out of the LangChain
adapter into arksim/tracing/integrations/_args.parse_tool_arguments so
the remaining 7 SDK adapters consume the same helper rather than each
copying subtly different fallback logic.

Two adapter-template changes apply alongside:
- Error string now includes the exception class name
  (e.g. "ValueError: nope" rather than "nope") so downstream evaluators
  can bucket failures by type without scraping the message.
- New test asserts that on_tool_start/on_tool_end/on_tool_error stay
  sync; if a refactor accidentally makes one async, LangChain's
  iscoroutinefunction dispatch routes around our override.
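
A minimal sketch of what a shared JSON-or-fallback helper and the class-name error string could look like. The real parse_tool_arguments signature and fallback shape are not shown in this PR text, so the "input" wrapper key and the format_error name below are assumptions made for illustration only.

```python
import json
from typing import Any


def parse_tool_arguments(raw: Any) -> dict[str, Any]:
    """Normalize SDK-supplied tool arguments into a dict (assumed behavior)."""
    if raw is None:
        return {}
    if isinstance(raw, dict):
        return dict(raw)  # shallow copy so callers cannot mutate SDK state
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Not JSON: keep the raw text instead of dropping it.
            return {"input": raw}
        # JSON scalars and lists are wrapped so the return type stays a dict.
        return parsed if isinstance(parsed, dict) else {"input": parsed}
    return {"input": raw}


def format_error(exc: BaseException) -> str:
    """Hypothetical helper: prefix the message with the exception class name."""
    return f"{type(exc).__name__}: {exc}"
```

With this shape, format_error(ValueError("nope")) yields "ValueError: nope", which matches the bucketing-by-type goal described above.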
Subscribes to ToolUsageFinishedEvent and ToolUsageErrorEvent on the
crewai event bus (crewai 1.6+ moved these into crewai.events). Each
emission becomes one ToolCall with source=ToolCallSource.CREWAI, with
arguments normalized via the shared parse_tool_arguments helper and
result captured from the event's output field.

Registers a callback on AfterToolCallEvent via the strands-agents
HookProvider protocol. Each event becomes one ToolCall with
source=ToolCallSource.STRANDS, with arguments pulled from
tool_use['input'], the tool name from tool_use['name'], and
success/error branching driven by event.exception (which is a
dedicated field separate from event.result in strands-agents 1.33+).

Subscribes to FunctionToolsExecutedEvent on AgentSession and emits
one ToolCall per parallel function call/output pair. Verified against
livekit-agents 1.5.9.
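
As a rough illustration of the one-record-per-pair idea, the sketch below walks parallel lists of calls and outputs. FakeCall, FakeOutput, and their field names are stand-ins invented for this example, not the livekit-agents types, and the plain dict stands in for ToolCall.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeCall:
    """Stand-in for an SDK function-call item (hypothetical fields)."""
    call_id: str
    name: str
    arguments: str


@dataclass
class FakeOutput:
    """Stand-in for an SDK function-call output (hypothetical fields)."""
    call_id: str
    output: Any
    is_error: bool = False


def pair_calls(calls: list[FakeCall], outputs: list[Optional[FakeOutput]]) -> list[dict]:
    """Emit one record per call/output pair, in the order the SDK reported them."""
    records = []
    for call, out in zip(calls, outputs):
        record = {"id": call.call_id, "name": call.name,
                  "arguments": call.arguments, "result": None, "error": None}
        if out is not None:
            # Coerce to str so a structured payload cannot fail validation later.
            text = str(out.output)
            record["error" if out.is_error else "result"] = text
        records.append(record)
    return records
```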
Consumes ToolCall and ToolCallResult events from AgentWorkflow's
stream and correlates start/result pairs by tool_id. Verified
against llama-index-core 0.14.15. The instrumentation dispatcher
does not receive these events; observer attaches via the workflow
stream (handler.stream_events() or consume_stream).

Parameterized test asserting every SDK tracing adapter produces a single
ToolCall tagged with its corresponding ToolCallSource when fed a synthetic
tool-use event. One factory per adapter (LangChain, CrewAI, Claude Agent SDK,
Google ADK, LiveKit, Strands, LlamaIndex, smolagents) keeps the contract
discoverable from a single file.
…oughts

Dify has no SDK callback for tool calls; the Python wrapper now parses
agent_thoughts out of the Chat API blocking-mode response and returns
an AgentResponse carrying ToolCall instances tagged source=dify. The
scenarios switch to the standard order_status_lookup and
dinner_reservation pair, and the README documents the matching Dify
Agent app + tools setup required to exercise the capture path.

Adds ToolCallSource.DIFY and ToolCallSource.RASA enum variants so the
two HTTP-wrapper examples that capture outside the per-SDK adapter
path can tag their tool calls with their own provenance.
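
A hedged sketch of the agent_thoughts parsing described above. The response layout used here (an "agent_thoughts" list whose items carry "tool", "tool_input", and "observation") is an assumption for illustration; check it against the actual Dify response before relying on it. Plain dicts stand in for ToolCall.

```python
import json
from typing import Any


def tool_calls_from_dify(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn agent_thoughts entries into tool-call records tagged source=dify."""
    calls = []
    for thought in response.get("agent_thoughts") or []:
        name = thought.get("tool")
        if not name:
            continue  # a thought without a tool is treated as plain reasoning
        raw_input = thought.get("tool_input")
        if isinstance(raw_input, str):
            try:
                arguments = json.loads(raw_input)
            except json.JSONDecodeError:
                arguments = {"input": raw_input}  # keep unparseable text
        else:
            arguments = raw_input or {}
        calls.append({
            "name": name,
            "arguments": arguments,
            "result": thought.get("observation"),
            "source": "dify",
        })
    return calls
```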
itsarbit requested a review from a team as a code owner May 20, 2026 12:51
itsarbit added 27 commits May 20, 2026 06:12
Wrap receiver.submit_tool_calls in try/except so a misbehaving receiver
cannot propagate back into the SDK callback that invoked the adapter.
Add a debug log on the no-routing-context drop branch so the missing-ids
case is diagnosable. Document the str() coercion contract for subclasses
on ToolCall.result/error in the class docstring.
…tion

If LiveKit ever emits a structured error or result object instead of the
documented string, ToolCall(result=...) / ToolCall(error=...) would raise
ValidationError inside the event-loop callback. Wrap fn_output.output in
str() for symmetry with the other adapters.
… READMEs

The autogen, pydantic-ai, mastra, and vercel-ai-sdk examples rely on
arksim's OTLP trace receiver but did not tell first-time readers to
install 'arksim[otel]'. A reader following the docs literally would hit
'arksim: command not found' on the first run. Add the install step as
step 1 in the Setup section and renumber the following steps.

LangChain dispatches sync BaseCallbackHandler methods on a thread pool
when tools block, so the pending map is reachable from multiple threads.
Without a lock, dict iteration during _sweep_stale can race with mutation
from another thread and raise RuntimeError. Lock scope covers only
in-memory dict operations, never I/O.
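
A small sketch of a lock-guarded pending map with a TTL sweep, in the spirit of the change above. PendingCalls, its method names, and the injectable clock are illustrative assumptions, not the PendingToolCalls API.

```python
import threading
import time
from typing import Any, Callable, Optional


class PendingCalls:
    """Correlate start/end events by key; forget starts older than the TTL."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._pending: dict[str, tuple[float, dict[str, Any]]] = {}

    def start(self, key: str, payload: dict[str, Any]) -> None:
        now = self._clock()
        with self._lock:  # only dict operations happen under the lock
            self._sweep_stale(now)
            self._pending[key] = (now, payload)

    def finish(self, key: str) -> Optional[dict[str, Any]]:
        with self._lock:
            entry = self._pending.pop(key, None)
        return entry[1] if entry else None

    def _sweep_stale(self, now: float) -> None:
        # Caller holds the lock. Collect keys first so the dict is never
        # mutated while it is being iterated.
        stale = [k for k, (started, _) in self._pending.items()
                 if now - started > self._ttl]
        for k in stale:
            del self._pending[k]
```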
ToolOutput.content is typed Any upstream; a custom ToolOutput that
returns a non-string would raise Pydantic ValidationError when
constructing ToolCall.result / ToolCall.error. Wrap with str() to match
the livekit and langchain adapters.

Bring the openai-agents-sdk example in line with the other SDK adapter
examples (langchain, pydantic-ai, strands, livekit, etc.):

- Add tools.py with lookup_order and book_table wrapped in @function_tool.
- Wire ArksimTracingProcessor in custom_agent.py at module load so
  FunctionSpanData lands as ToolCall on the active turn.
- Enable trace_receiver in config.yaml for in-process span capture.
- Replace generic Q&A scenarios with order_status_lookup and
  dinner_reservation, matching the lookup_order/book_table tool surface.
- Rewrite README to follow the 5-section template (Setup, Run, How it
  works, Expected output, Files) used by the rest of the rollout.

CrewAI's event bus dispatches handlers via ThreadPoolExecutor.submit
after contextvars.copy_context(). The adapter's routing context survives
across the thread boundary because of that copy. Record the dependency
inline so a future CrewAI change to the dispatch path is easy to spot.
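
The dependency can be reproduced with the standard library alone. The sketch below is not CrewAI code; routing and handler are made-up names that show a ContextVar value reaching a pool thread only because the submit goes through a copied context.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing variable standing in for the adapter's context.
routing = contextvars.ContextVar("routing", default=None)


def handler():
    """Runs on a worker thread and reports which routing value it can see."""
    return routing.get()


def dispatch_with_copy(pool: ThreadPoolExecutor):
    # Snapshot the caller's context and run the handler inside that snapshot.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, handler).result()


def dispatch_plain(pool: ThreadPoolExecutor):
    # No snapshot: the handler sees whatever context the worker thread has.
    return pool.submit(handler).result()


routing.set("conversation-42")
with ThreadPoolExecutor(max_workers=1) as pool:
    with_copy = dispatch_with_copy(pool)   # "conversation-42"
    without_copy = dispatch_plain(pool)    # depends on the interpreter's thread-context rules
```

If a dispatcher ever switched from the first form to the second, the adapter would likely hit its no-routing-context drop branch instead of emitting a ToolCall.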
The langchain/crewai/livekit/etc. floors aren't arbitrary. Each one
corresponds to a specific SDK version whose callback or hook surface was
verified by the adapter's module docstring. Lowering a floor requires
re-verifying the adapter against the older shape, so the connection
shouldn't have to be reverse-engineered.
…tract

The contract test now covers three rails for every SDK adapter: happy
path (existing), no-routing-context drop, and error-event population on
ToolCall.error. SDKs that don't expose a distinct error event (Claude
Agent SDK, Google ADK, Smolagents) are skipped explicitly with a reason
so the gap is visible.

The example agents pin gpt-5.1 in both config.yaml (used by the
simulated user and evaluator) and custom_agent.py (used by the agent
under test). arksim's config loader does not interpolate env vars
into the top-level model field, so the README cannot rely on
${OPENAI_MODEL} expansion.

Add a single-sentence README note in each affected example pointing
to OPENAI_MODEL plus an inline edit path. Read OPENAI_MODEL with a
gpt-5.1 fallback in the six custom_agent.py files that hardcoded the
model id (langchain, langgraph, crewai, llamaindex, livekit, strands)
so the env-var path actually works for the agent side.

The pip package is livekit-agents; naming the arksim extra livekit
collides with the broader LiveKit SDK family. Match the package name
the way the other extras do (langchain-core via langchain, etc.).

- Rename [project.optional-dependencies] key livekit -> livekit-agents
  in pyproject.toml. Dep list unchanged.
- Update the multi-extra resolver step in .github/workflows/ci.yml.
- Update examples/integrations/livekit/README.md and the
  custom_agent.py docstring to install 'arksim[livekit-agents]'.

The error message in arksim/tracing/integrations/livekit.py still
references 'arksim[livekit]'; that update is owned by the tracing
adapter agent.

The previous step 3 told users to 'attach two tools matching the
scenarios' without orienting them to the Dify Studio screens that
matter. A new reader had to discover the Agent-app builder and the
tool plugin docs on their own.

Link the canonical Dify docs (Build an Agent application, Build a
tool plugin) directly and call out the Tools panel as the
attach-and-authorize step. Keep the tool signatures and return
shapes inline so the agent's deterministic output stays clear.

Every test_*_adapter.py in tests/unit/integrations/ defined its own
_clean_context autouse fixture and _only_call helper. Move both to a
shared conftest.py: _clean_context stays an autouse fixture (now applied
once), and _only_call becomes a fixture each test takes as a parameter.
Drift risk drops to zero when the trace-context API changes.

A domain reader will ask: when the per-SDK adapter and the OTel path
disagree for the same agent, which is canonical? The Mastra and Vercel
AI SDK examples emit gen_ai.tool.* spans while LangChain, CrewAI,
Strands, etc. use SDK-native callbacks; the two paths can produce
different field sets for the same call.

Add an Info callout to the Automatic Capture section explaining that
the per-SDK adapter is canonical when both paths are available
(because it captures fields OTel semconv doesn't standardize, like
LangChain run_id split events), and the OTel path is the fallback
when the SDK has no Python-side adapter.
…mment

Two readability cleanups: complete the orphan ANN401 noqa on the
LangChain on_tool_end kwargs to match the surrounding three sites, and
explain why PendingToolCalls defaults to a 60s TTL so future changes
treat the value as load-bearing rather than arbitrary.

- claude-agent-sdk: change 'this example runs against claude-sonnet-4-6'
  to 'uses claude-sonnet-4-6 by default; change in config.yaml if your
  account uses a different model name'. The original phrasing implied
  the model was hardcoded and unchangeable.
- livekit: clarify that LiveKit Cloud proxies the LLM call to OpenAI
  but the inference cost is billed against OPENAI_API_KEY, not LiveKit
  credits. The previous claim was ambiguous about which budget pays.
- strands: call out that the quotes around 'strands-agents[openai]' are
  required on zsh, where the brackets glob otherwise.
…sdk, mastra, vercel

The Mastra and Vercel AI SDK example READMEs already documented an
OPENAI_MODEL environment variable for overriding the agent's model
without editing config.yaml, but neither their agent_server.ts files
nor the AutoGen and OpenAI Agents SDK custom_agent.py files actually
read the variable. Wire it through in all four agents so the
documented override works, and tighten the Mastra/Vercel README
sentences to point at the correct default model name and to clarify
that the simulator's own model is set by the top-level model field in
config.yaml.
… field

The smolagents adapter used a truthiness check on the observations
field, which coerced empty strings (and any other falsy non-None
value) to None on the emitted ToolCall.result. That contradicts the
BaseTracingAdapter contract that distinguishes 'no result captured'
(None) from 'result captured but empty' (''). Switch to an explicit
'is not None' check and add a str() wrap so non-string observation
payloads are still serialized safely. Cover the empty-string round
trip with a new unit test.
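
The difference between the two checks is easy to see in isolation. The function names below are hypothetical and only contrast the old truthiness test with the explicit None test plus str() wrap.

```python
from typing import Any, Optional


def result_truthy(observations: Any) -> Optional[str]:
    """Old behavior: any falsy value ('' or 0) collapses to None."""
    return str(observations) if observations else None


def result_explicit(observations: Any) -> Optional[str]:
    """New behavior: only None means 'no result captured'."""
    if observations is None:
        return None
    return str(observations)
```

An empty observation survives the explicit check as '' but is lost by the truthiness check, which is the round trip the new unit test is described as covering.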
…ations/

The OpenAI Agents SDK tracing adapter lives at arksim.tracing.openai
because it predates the arksim.tracing.integrations subpackage that
hosts the eight newer SDK adapters. Add a short note to the module
docstring so readers and future contributors understand the
asymmetry: the existing path is preserved for import stability, and
new SDK adapters should land under arksim.tracing.integrations.<sdk>.
…AI_MODEL

Capture two user-visible changes in the [Unreleased] section: the
rename of the livekit pip extra to livekit-agents (which forces a
reinstall for anyone with arksim[livekit] in their environment) and
the OPENAI_MODEL environment variable that now overrides the agent
model across the integration examples without requiring config.yaml
edits.
…l fixture

These two adapter test files still hand-rolled the receiver
inspection (assert call_count == 1; args, _ = call_args; ...)
instead of using the only_call fixture that the six other adapter
test files adopted when fixtures were hoisted into conftest. Wire
single-submit tests through only_call and keep the multi-submit
tests (source field set across two events, batch events with
parallel calls) inline since only_call only covers the single-call
case.

Both example agents use OpenAI under the hood. Their READMEs already
documented the OPENAI_MODEL override, but the code hard-coded gpt-4o.
Add a module-level _MODEL constant that reads OPENAI_MODEL with a
gpt-5.1 default and use it as the model id.
…agents-sdk READMEs

Both example agents already read OPENAI_MODEL from the environment, but
their READMEs didn't tell users that. Match the one-liner the other
OpenAI-based example READMEs carry.

The autogen Python agent and the mastra and vercel-ai-sdk TypeScript
agents defaulted to gpt-4o; the openai-agents-sdk agent and the rest
defaulted to gpt-5.1. Align all OpenAI-based example agents to gpt-5.1
so users see one default model across the suite. Override-note prose
in the affected READMEs is updated to match.
…GELOG

After wiring OPENAI_MODEL into smolagents and pydantic-ai, the Changed
bullet that listed them as 'unaffected' is no longer accurate. Move
both to the OpenAI-honoring list (12 OpenAI-based + 4 non-OpenAI = 16
total examples).