feat(tracing): add per-SDK tracing adapters for 9 frameworks by itsarbit · Pull Request #171 · arklexai/arksim

itsarbit · 2026-05-20T12:51:44Z

Summary

Adds 8 per-SDK tracing adapters under arksim/tracing/integrations/ so users of LangChain/LangGraph, CrewAI, Claude Agent SDK, Google ADK, LiveKit Agents, Strands Agents, LlamaIndex, and Smolagents get zero-config tool-call capture through the Python connector. All 13 existing integration examples updated; 2 new examples added (LiveKit, Strands).

What ships

BaseTracingAdapter (contextvars-based, no registry) under arksim/tracing/integrations/_base.py
PendingToolCalls split-event correlation helper (used by LangChain and LlamaIndex)
parse_tool_arguments shared argument normalization helper
8 adapter modules covering 9 frameworks (LangChain adapter covers LangGraph)
10 new ToolCallSource enum variants (8 adapters + Dify + Rasa)
8 new pip extras in pyproject.toml
detect-secrets pre-commit hook + initial baseline
Multi-extra resolver CI check
95+ unit tests + cross-adapter contract test

Non-goals (explicit)

OpenAI Agents SDK adapter changes (already shipped in feat(tracing): add OTel trace receiver for tool call capture #114; canonical path arksim.tracing.openai)
Restructuring arksim/tracing/openai.py (file stays in place)
OTel + W3C wrapper instrumentation track (post-launch follow-up)
Platform-bridge work
Response-body tool-call evaluation on raw Chat Completions (decided no per wiki/decisions/no-response-parsed-tool-call-eval.md)

Test plan

Unit suite: 915 passing, 4 skipped (was 820 on origin/main, +95 across the rollout)
Per-adapter tests cover happy path, error path, source field, no-context, missing fields, multiple calls, split-event correlation (where applicable)
Cross-adapter contract test parameterized across 8 adapters
ruff check + ruff format --check clean
Multi-extra resolver: pip install 'arksim[langchain,crewai,llamaindex,claude-agent,google-adk,livekit,strands,smolagents]' resolves cleanly
Pre-commit clean across the full 37-commit diff (120 files, +5525 / -4755)

Manual verification needed before merge

Each of the 9 per-SDK adapter examples (LangChain, LangGraph, CrewAI, Claude Agent SDK, Google ADK, LiveKit, Strands, LlamaIndex, Smolagents) runs end-to-end against the respective SDK with a real API key, producing tool_calls in simulation.json.
Native-OTel examples (AutoGen, Pydantic AI, Mastra) run E2E. Mastra requires npm install first.
Doesn't-fit examples: Dify needs a Dify Agent app with lookup_order + book_table tools configured server-side. Rasa needs a running Rasa server with the included rasa_project. Vercel AI SDK needs npm install && npm start first.

Follow-ups (intentionally NOT in this PR)

LangGraph migration from create_react_agent to langchain.agents.create_agent (v1 deprecation)
CrewAI ToolCall.id synthesis (currently empty)
Claude Agent SDK error path via PostToolUseFailure event
README / quickstart retool leading with examples/customer-service/ as canonical demo
Ecosystem scan: Agno, Haystack, LangGraph Platform, OpenAI Swarm handoff coverage
parse_openai honoring message.tool_calls (would let agent servers skip OTel for simple cases)
Mastra deprecation: migrate to Mastra's proprietary AI Tracing when stable

Bumps 5 lower bounds to track current stable SDK majors:

- langchain-core: 0.3.0 -> 1.0.0
- llama-index-core: 0.10.0 -> 0.14.0
- google-adk: 0.5.0 -> 1.0.0
- strands-agents: 0.1.0 -> 1.0.0
- claude-agent-sdk: pin to >=0.1.0,<0.3.0 (still pre-1.0)

Pins Yelp/detect-secrets v1.5.0 and seeds a baseline of known-OK matches so the upcoming adapter test fixtures (which contain synthetic API tokens) cannot accidentally leak real credentials. Baseline entries are placeholder strings in README, quickstart docs, and existing test fixtures. All hashed; no plaintext secrets in the baseline file. Excludes common dep-lock files and node_modules from scanning.

…class name Lifts the JSON-or-fallback tool-argument parsing out of the LangChain adapter into arksim/tracing/integrations/_args.parse_tool_arguments so the remaining 7 SDK adapters consume the same helper rather than each copying subtly different fallback logic.

Two adapter-template changes apply alongside:

- Error string now includes the exception class name (e.g. "ValueError: nope" rather than "nope") so downstream evaluators can bucket failures by type without scraping the message.
- New test asserts that on_tool_start/on_tool_end/on_tool_error stay sync; if a refactor accidentally makes one async, LangChain's iscoroutinefunction dispatch routes around our override.
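A minimal sketch of what a shared argument-normalization helper and the two template changes could look like. The fallback rules (wrapping non-dict input under an "input" key), the format_error helper, and the SketchHandler class are assumptions for illustration; the actual arksim helper may normalize differently.

```python
import inspect
import json
from typing import Any


def parse_tool_arguments(raw: Any) -> dict:
    """Normalize tool arguments to a dict (hypothetical fallback rules)."""
    if raw is None:
        return {}
    if isinstance(raw, dict):
        return dict(raw)
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Not JSON: keep the raw text under an assumed "input" key.
            return {"input": raw}
        # JSON scalars and lists are wrapped so the return type stays a dict.
        return parsed if isinstance(parsed, dict) else {"input": parsed}
    return {"input": raw}


def format_error(exc: BaseException) -> str:
    """Prefix the message with the exception class name."""
    return f"{type(exc).__name__}: {exc}"


class SketchHandler:
    """Hypothetical stand-in for a callback handler with sync hooks."""

    def on_tool_start(self, *args: Any, **kwargs: Any) -> None: ...
    def on_tool_end(self, *args: Any, **kwargs: Any) -> None: ...
    def on_tool_error(self, *args: Any, **kwargs: Any) -> None: ...


def hooks_are_sync(handler_cls: type) -> bool:
    """Return True if none of the three tool hooks is a coroutine function."""
    names = ("on_tool_start", "on_tool_end", "on_tool_error")
    return not any(
        inspect.iscoroutinefunction(getattr(handler_cls, n)) for n in names
    )
```

A guard like hooks_are_sync is cheap to keep in a unit test, since the failure it protects against (a hook silently becoming async) would not raise at import time.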

Subscribes to ToolUsageFinishedEvent and ToolUsageErrorEvent on the crewai event bus (crewai 1.6+ moved these into crewai.events). Each emission becomes one ToolCall with source=ToolCallSource.CREWAI, with arguments normalized via the shared parse_tool_arguments helper and result captured from the event's output field.

Registers a callback on AfterToolCallEvent via the strands-agents HookProvider protocol. Each event becomes one ToolCall with source=ToolCallSource.STRANDS, with arguments pulled from tool_use['input'], the tool name from tool_use['name'], and success/error branching driven by event.exception (which is a dedicated field separate from event.result in strands-agents 1.33+).
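The branching described above can be sketched roughly as follows. The ToolCall dataclass is a hypothetical stand-in for arksim's model, and the event is simulated with SimpleNamespace rather than a real strands-agents object; only the tool_use, result, and exception field names come from the description above.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class ToolCall:
    """Hypothetical stand-in for arksim's ToolCall model."""

    name: str
    arguments: dict = field(default_factory=dict)
    result: Optional[str] = None
    error: Optional[str] = None
    source: str = "strands"


def to_tool_call(event: Any) -> ToolCall:
    """Map an after-tool-call style event to a single ToolCall."""
    tool_use = event.tool_use
    call = ToolCall(
        name=tool_use["name"],
        arguments=dict(tool_use.get("input") or {}),
    )
    exc = getattr(event, "exception", None)
    if exc is not None:
        # The exception field, not the result, decides the error branch.
        call.error = f"{type(exc).__name__}: {exc}"
    elif event.result is not None:
        call.result = str(event.result)
    return call


ok_event = SimpleNamespace(
    tool_use={"name": "lookup_order", "input": {"order_id": "A1"}},
    result="shipped",
    exception=None,
)
failed_event = SimpleNamespace(
    tool_use={"name": "book_table", "input": {"party": 4}},
    result=None,
    exception=RuntimeError("no tables"),
)
```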

Subscribes to FunctionToolsExecutedEvent on AgentSession and emits one ToolCall per parallel function call/output pair. Verified against livekit-agents 1.5.9.

Consumes ToolCall and ToolCallResult events from AgentWorkflow's stream and correlates start/result pairs by tool_id. Verified against llama-index-core 0.14.15. The instrumentation dispatcher does not receive these events; observer attaches via the workflow stream (handler.stream_events() or consume_stream).

Parameterized test asserting every SDK tracing adapter produces a single ToolCall tagged with its corresponding ToolCallSource when fed a synthetic tool-use event. One factory per adapter (LangChain, CrewAI, Claude Agent SDK, Google ADK, LiveKit, Strands, LlamaIndex, smolagents) keeps the contract discoverable from a single file.


…oughts Dify has no SDK callback for tool calls; the Python wrapper now parses agent_thoughts out of the Chat API blocking-mode response and returns an AgentResponse carrying ToolCall instances tagged source=dify. The scenarios switch to the standard order_status_lookup and dinner_reservation pair, and the README documents the matching Dify Agent app + tools setup required to exercise the capture path. Adds ToolCallSource.DIFY and ToolCallSource.RASA enum variants so the two HTTP-wrapper examples that capture outside the per-SDK adapter path can tag their tool calls with their own provenance.

Wrap receiver.submit_tool_calls in try/except so a misbehaving receiver cannot propagate back into the SDK callback that invoked the adapter. Add a debug log on the no-routing-context drop branch so the missing-ids case is diagnosable. Document the str() coercion contract for subclasses on ToolCall.result/error in the class docstring.

…tion If LiveKit ever emits a structured error or result object instead of the documented string, ToolCall(result=...) / ToolCall(error=...) would raise ValidationError inside the event-loop callback. Wrap fn_output.output in str() for symmetry with the other adapters.
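A small sketch of the pairing-plus-coercion idea, using SimpleNamespace stand-ins rather than real livekit-agents objects. Only fn_output.output is taken from the description above; the function_calls, function_call_outputs, and is_error field names are assumptions for illustration.

```python
from types import SimpleNamespace
from typing import Any


def pair_tool_calls(event: Any) -> list:
    """Emit one record per function call/output pair, coercing output to str."""
    records = []
    for fn_call, fn_output in zip(event.function_calls, event.function_call_outputs):
        record = {
            "name": fn_call.name,
            "arguments": fn_call.arguments,
            "result": None,
            "error": None,
        }
        if fn_output is not None:
            # str() keeps a structured output from failing string validation.
            text = str(fn_output.output)
            if getattr(fn_output, "is_error", False):
                record["error"] = text
            else:
                record["result"] = text
        records.append(record)
    return records


event = SimpleNamespace(
    function_calls=[
        SimpleNamespace(name="lookup_order", arguments='{"id": 1}'),
        SimpleNamespace(name="book_table", arguments="{}"),
    ],
    function_call_outputs=[
        SimpleNamespace(output={"status": "shipped"}, is_error=False),
        SimpleNamespace(output="no tables", is_error=True),
    ],
)
```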

… READMEs The autogen, pydantic-ai, mastra, and vercel-ai-sdk examples rely on arksim's OTLP trace receiver but did not tell first-time readers to install 'arksim[otel]'. A reader following the docs literally would hit 'arksim: command not found' on the first run. Add the install step as step 1 in the Setup section and renumber the following steps.

LangChain dispatches sync BaseCallbackHandler methods on a thread pool when tools block, so the pending map is reachable from multiple threads. Without a lock, dict iteration during _sweep_stale can race with mutation from another thread and raise RuntimeError. Lock scope covers only in-memory dict operations, never I/O.
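The locking discipline described above can be sketched like this. The method names (start, finish) and the stored payload shape are hypothetical; the 60-second default mirrors the TTL mentioned elsewhere in this PR, and the real PendingToolCalls helper may differ.

```python
import threading
import time
from typing import Optional


class PendingToolCalls:
    """Sketch of a thread-safe start/end correlation map with a TTL sweep."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._pending = {}  # key -> (start timestamp, payload)

    def start(self, key: str, payload: dict, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        with self._lock:
            self._sweep_stale(now)
            self._pending[key] = (now, payload)

    def finish(self, key: str) -> Optional[dict]:
        # The lock covers only the dict operation, never I/O.
        with self._lock:
            entry = self._pending.pop(key, None)
        return entry[1] if entry is not None else None

    def _sweep_stale(self, now: float) -> None:
        # Caller must hold the lock. Collect keys first so the dict is not
        # mutated while it is being iterated.
        stale = [k for k, (ts, _) in self._pending.items() if now - ts > self._ttl]
        for k in stale:
            del self._pending[k]
```

Passing now explicitly is only there to make the sweep testable without sleeping.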

ToolOutput.content is typed Any upstream; a custom ToolOutput that returns a non-string would raise Pydantic ValidationError when constructing ToolCall.result / ToolCall.error. Wrap with str() to match the livekit and langchain adapters.

Bring the openai-agents-sdk example in line with the other SDK adapter examples (langchain, pydantic-ai, strands, livekit, etc.):

- Add tools.py with lookup_order and book_table wrapped in @function_tool.
- Wire ArksimTracingProcessor in custom_agent.py at module load so FunctionSpanData lands as ToolCall on the active turn.
- Enable trace_receiver in config.yaml for in-process span capture.
- Replace generic Q&A scenarios with order_status_lookup and dinner_reservation, matching the lookup_order/book_table tool surface.
- Rewrite README to follow the 5-section template (Setup, Run, How it works, Expected output, Files) used by the rest of the rollout.

CrewAI's event bus dispatches handlers via ThreadPoolExecutor.submit after contextvars.copy_context(). The adapter's routing context survives across the thread boundary because of that copy. Record the dependency inline so a future CrewAI change to the dispatch path is easy to spot.
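The dependency can be seen in isolation with the standard library alone. This sketch does not use CrewAI; the dispatch function is a hypothetical stand-in that mimics the copy_context-then-submit pattern described above.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing variable, standing in for the adapter's routing context.
turn_id = contextvars.ContextVar("turn_id", default=None)


def handler():
    """Runs on a worker thread and reads the routing variable."""
    return turn_id.get()


def dispatch(executor, fn):
    """Copy the caller's context and run fn inside it on the pool."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn)


turn_id.set("turn-42")
with ThreadPoolExecutor(max_workers=1) as pool:
    # With the copied context, the worker sees the caller's value.
    with_copy = dispatch(pool, handler).result()
    # Without the copy, a worker on a standard CPython build typically
    # sees only the default, because threads do not inherit the context.
    without_copy = pool.submit(handler).result()
```

If a dispatch path ever stopped copying the context, the adapter would fall into its no-routing-context branch rather than fail loudly, which is why the dependency is worth recording.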

The langchain/crewai/livekit/etc. floors aren't arbitrary. Each one corresponds to a specific SDK version whose callback or hook surface the adapter was verified against, as recorded in the adapter's module docstring. Lowering a floor requires re-verifying the adapter against the older shape, so the connection shouldn't have to be reverse-engineered.

…tract The contract test now covers three rails for every SDK adapter: happy path (existing), no-routing-context drop, and error-event population on ToolCall.error. SDKs that don't expose a distinct error event (Claude Agent SDK, Google ADK, Smolagents) are skipped explicitly with a reason so the gap is visible.

The example agents pin gpt-5.1 in both config.yaml (used by the simulated user and evaluator) and custom_agent.py (used by the agent under test). arksim's config loader does not interpolate env vars into the top-level model field, so the README cannot rely on ${OPENAI_MODEL} expansion. Add a single-sentence README note in each affected example pointing to OPENAI_MODEL plus an inline edit path. Read OPENAI_MODEL with a gpt-5.1 fallback in the six custom_agent.py files that hardcoded the model id (langchain, langgraph, crewai, llamaindex, livekit, strands) so the env-var path actually works for the agent side.

The pip package is livekit-agents; naming the arksim extra livekit collides with the broader LiveKit SDK family. Match the package name the way the other extras do (langchain-core via langchain, etc.).

- Rename [project.optional-dependencies] key livekit -> livekit-agents in pyproject.toml. Dep list unchanged.
- Update the multi-extra resolver step in .github/workflows/ci.yml.
- Update examples/integrations/livekit/README.md and the custom_agent.py docstring to install 'arksim[livekit-agents]'.

The error message in arksim/tracing/integrations/livekit.py still references 'arksim[livekit]'; that update is owned by the tracing adapter agent.

The previous step 3 told users to 'attach two tools matching the scenarios' without orienting them to the Dify Studio screens that matter. A new reader had to discover the Agent-app builder and the tool plugin docs on their own. Link the canonical Dify docs (Build an Agent application, Build a tool plugin) directly and call out the Tools panel as the attach-and-authorize step. Keep the tool signatures and return shapes inline so the agent's deterministic output stays clear.

Every test_*_adapter.py in tests/unit/integrations/ defined its own _clean_context autouse fixture and _only_call helper. Move both to a shared conftest.py: _clean_context stays an autouse fixture (now applied once), and _only_call becomes a fixture each test takes as a parameter. Drift risk drops to zero when the trace-context API changes.

A domain reader will ask: when the per-SDK adapter and the OTel path disagree for the same agent, which is canonical? The Mastra and Vercel AI SDK examples emit gen_ai.tool.* spans while LangChain, CrewAI, Strands, etc. use SDK-native callbacks; the two paths can produce different field sets for the same call. Add an Info callout to the Automatic Capture section explaining that the per-SDK adapter is canonical when both paths are available (because it captures fields OTel semconv doesn't standardize, like LangChain run_id split events), and the OTel path is the fallback when the SDK has no Python-side adapter.

…mment Two readability cleanups: complete the orphan ANN401 noqa on the LangChain on_tool_end kwargs to match the surrounding three sites, and explain why PendingToolCalls defaults to a 60s TTL so future changes treat the value as load-bearing rather than arbitrary.

- claude-agent-sdk: change 'this example runs against claude-sonnet-4-6' to 'uses claude-sonnet-4-6 by default; change in config.yaml if your account uses a different model name'. The original phrasing implied the model was hardcoded and unchangeable.
- livekit: clarify that LiveKit Cloud proxies the LLM call to OpenAI but the inference cost is billed against OPENAI_API_KEY, not LiveKit credits. The previous claim was ambiguous about which budget pays.
- strands: call out that the quotes around 'strands-agents[openai]' are required on zsh, where the brackets glob otherwise.

…sdk, mastra, vercel The Mastra and Vercel AI SDK example READMEs already documented an OPENAI_MODEL environment variable for overriding the agent's model without editing config.yaml, but neither agent_server.ts actually read the variable, and neither did the AutoGen or OpenAI Agents SDK custom_agent.py files. Wire it through in all four agents so the documented override works, and tighten the Mastra/Vercel README sentences to point at the correct default model name and to clarify that the simulator's own model is set by the top-level model field in config.yaml.

… field The smolagents adapter used a truthiness check on the observations field, which coerced empty strings (and any other falsy non-None value) to None on the emitted ToolCall.result. That contradicts the BaseTracingAdapter contract that distinguishes 'no result captured' (None) from 'result captured but empty' (''). Switch to an explicit 'is not None' check and add a str() wrap so non-string observation payloads are still serialized safely. Cover the empty-string round trip with a new unit test.
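The difference between the two checks is easy to demonstrate. Both functions below are illustrative reconstructions of the before and after behavior, not the adapter's actual code.

```python
from typing import Any, Optional


def result_truthy(observations: Any) -> Optional[str]:
    """Old-style check: "" (and 0, [], ...) collapse to None."""
    return str(observations) if observations else None


def result_explicit(observations: Any) -> Optional[str]:
    """New-style check: only None means 'no result captured'."""
    return str(observations) if observations is not None else None
```

With the explicit check, an empty string round-trips as '' and a non-string payload such as 0 is serialized to '0' instead of being dropped.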

…ations/ The OpenAI Agents SDK tracing adapter lives at arksim.tracing.openai because it predates the arksim.tracing.integrations subpackage that hosts the eight newer SDK adapters. Add a short note to the module docstring so readers and future contributors understand the asymmetry: the existing path is preserved for import stability, and new SDK adapters should land under arksim.tracing.integrations.<sdk>.

…AI_MODEL Capture two user-visible changes in the [Unreleased] section: the rename of the livekit pip extra to livekit-agents (which forces a reinstall for anyone with arksim[livekit] in their environment) and the OPENAI_MODEL environment variable that now overrides the agent model across the integration examples without requiring config.yaml edits.

…l fixture These two adapter test files still hand-rolled the receiver inspection (assert call_count == 1; args, _ = call_args; ...) instead of using the only_call fixture that the six other adapter test files adopted when fixtures were hoisted into conftest. Wire single-submit tests through only_call and keep the multi-submit tests (source field set across two events, batch events with parallel calls) inline since only_call only covers the single-call case.

Both example agents use OpenAI under the hood. Their READMEs already documented the OPENAI_MODEL override, but the code hard-coded gpt-4o. Add a module-level _MODEL constant that reads OPENAI_MODEL with a gpt-5.1 default and use it as the model id.

…agents-sdk READMEs Both example agents already read OPENAI_MODEL from the environment, but their READMEs didn't tell users that. Match the one-liner the other OpenAI-based example READMEs carry.

The autogen Python agent and the mastra and vercel-ai-sdk TypeScript agents defaulted to gpt-4o; the openai-agents-sdk agent and the rest defaulted to gpt-5.1. Align all OpenAI-based example agents to gpt-5.1 so users see one default model across the suite. Override-note prose in the affected READMEs is updated to match.

…GELOG After wiring OPENAI_MODEL into smolagents and pydantic-ai, the Changed bullet that listed them as 'unaffected' is no longer accurate. Move both to the OpenAI-honoring list (12 OpenAI-based + 4 non-OpenAI = 16 total examples).

itsarbit added 30 commits May 15, 2026 06:54

feat(tracing): add 8 ToolCallSource variants for SDK adapter rollout

aed610b

build(deps): add optional extras for 8 SDK tracing adapters

f4c2ada

build(ci): anchor detect-secrets exclude regex to match nested paths

b2d3801

feat(tracing): scaffold integrations package for SDK adapters

123f148

feat(tracing): add BaseTracingAdapter with contextvars-based submission

cf71739

feat(tracing): add PendingToolCalls split-event correlation helper

cb46adb

feat(tracing): add ArksimLangChainHandler for LangChain + LangGraph

3f5def2

feat(tracing): add ArksimClaudeHooks for Claude Agent SDK

65c8a10

feat(tracing): add ArksimADKPlugin for Google ADK

3625b9c

feat(tracing): add ArksimLiveKitHandler for LiveKit Agents

5429ab0

Subscribes to FunctionToolsExecutedEvent on AgentSession and emits one ToolCall per parallel function call/output pair. Verified against livekit-agents 1.5.9.

feat(tracing): add ArksimSmolagentsCallback for Smolagents

fe99ce8

docs(examples): wire langchain example to ArksimLangChainHandler with mock tools

c54bd96

docs(examples): wire langgraph example to ArksimLangChainHandler with mock tools

65074be

docs(examples): wire crewai example to ArksimCrewEventListener with mock tools

4faac5a

docs(examples): wire claude-agent-sdk example to ArksimClaudeHooks with mock tools

fd3c94f

docs(examples): wire google-adk example to ArksimADKPlugin with mock tools

179b24e

docs(examples): wire llamaindex example to ArksimLlamaIndexObserver with mock tools

d46a3ba

docs(examples): wire smolagents example to ArksimSmolagentsCallback with mock tools

b9370d6

docs(examples): use explicit tool names for llamaindex FunctionTool

b069957

docs(examples): wire autogen example to fire mock tools via OTel

691e1a7

docs(examples): wire pydantic-ai example to fire mock tools via OTel

6e7089b

docs(examples): wire mastra example to fire mock tools via HTTP wrapper

fef081b

itsarbit added 2 commits May 19, 2026 16:17

ci: add multi-extra resolver check for SDK adapter extras

044b5f4

docs: changelog entry for SDK adapter rollout

92bdb2a

itsarbit requested a review from a team as a code owner May 20, 2026 12:51

itsarbit added 27 commits May 20, 2026 06:12

fix(tracing): update livekit ImportError to match renamed extra

5557d32

docs: scope OPENAI_MODEL changelog bullet to OpenAI-based examples

327e9b7

docs(examples): add OPENAI_MODEL override note to autogen and openai-agents-sdk READMEs

5b42a61

Both example agents already read OPENAI_MODEL from the environment, but their READMEs didn't tell users that. Match the one-liner the other OpenAI-based example READMEs carry.

itsarbit wants to merge 64 commits into main from feat/sdk-tracing-adapters
