feat(tracing): add per-SDK tracing adapters for 9 frameworks #171
Open
itsarbit wants to merge 64 commits into
Bumps 5 lower bounds to track current stable SDK majors:
- langchain-core: 0.3.0 -> 1.0.0
- llama-index-core: 0.10.0 -> 0.14.0
- google-adk: 0.5.0 -> 1.0.0
- strands-agents: 0.1.0 -> 1.0.0
- claude-agent-sdk: pin to >=0.1.0,<0.3.0 (still pre-1.0)
Pins Yelp/detect-secrets v1.5.0 and seeds a baseline of known-OK matches so the upcoming adapter test fixtures (which contain synthetic API tokens) cannot accidentally leak real credentials. Baseline entries are placeholder strings in README, quickstart docs, and existing test fixtures. All hashed; no plaintext secrets in the baseline file. Excludes common dep-lock files and node_modules from scanning.
…class name
Lifts the JSON-or-fallback tool-argument parsing out of the LangChain adapter into arksim/tracing/integrations/_args.parse_tool_arguments so the remaining 7 SDK adapters consume the same helper rather than each copying subtly different fallback logic. Two adapter-template changes apply alongside:
- Error string now includes the exception class name (e.g. "ValueError: nope" rather than "nope") so downstream evaluators can bucket failures by type without scraping the message.
- New test asserts that on_tool_start/on_tool_end/on_tool_error stay sync; if a refactor accidentally makes one async, LangChain's iscoroutinefunction dispatch routes around our override.
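For reviewers who want a concrete picture, the JSON-or-fallback behavior and the new error-string format could look roughly like the sketch below. The function names, the `{"raw": ...}` fallback shape, and the handling of `None` are assumptions made for illustration; they are not taken from the actual `_args` module.

```python
import json
from typing import Any


def parse_tool_arguments_sketch(raw: Any) -> dict:
    """Hypothetical sketch: normalize whatever an SDK hands us into a dict."""
    if raw is None:
        return {}
    if isinstance(raw, dict):
        return dict(raw)
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Not JSON at all: keep the original text under an assumed "raw" key.
            return {"raw": raw}
        # Valid JSON that is not an object (e.g. a list) also falls back.
        return parsed if isinstance(parsed, dict) else {"raw": raw}
    # Any other type is stringified rather than rejected.
    return {"raw": str(raw)}


def format_error_sketch(exc: BaseException) -> str:
    """Prefix the message with the exception class name."""
    return f"{type(exc).__name__}: {exc}"
```

With this shape, `format_error_sketch(ValueError("nope"))` yields `"ValueError: nope"`, which is the format the commit message describes.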
Subscribes to ToolUsageFinishedEvent and ToolUsageErrorEvent on the crewai event bus (crewai 1.6+ moved these into crewai.events). Each emission becomes one ToolCall with source=ToolCallSource.CREWAI, with arguments normalized via the shared parse_tool_arguments helper and result captured from the event's output field.
Registers a callback on AfterToolCallEvent via the strands-agents HookProvider protocol. Each event becomes one ToolCall with source=ToolCallSource.STRANDS, with arguments pulled from tool_use['input'], the tool name from tool_use['name'], and success/error branching driven by event.exception (which is a dedicated field separate from event.result in strands-agents 1.33+).
Subscribes to FunctionToolsExecutedEvent on AgentSession and emits one ToolCall per parallel function call/output pair. Verified against livekit-agents 1.5.9.
Consumes ToolCall and ToolCallResult events from AgentWorkflow's stream and correlates start/result pairs by tool_id. Verified against llama-index-core 0.14.15. The instrumentation dispatcher does not receive these events; observer attaches via the workflow stream (handler.stream_events() or consume_stream).
Parameterized test asserting every SDK tracing adapter produces a single ToolCall tagged with its corresponding ToolCallSource when fed a synthetic tool-use event. One factory per adapter (LangChain, CrewAI, Claude Agent SDK, Google ADK, LiveKit, Strands, LlamaIndex, smolagents) keeps the contract discoverable from a single file.
…oughts Dify has no SDK callback for tool calls; the Python wrapper now parses agent_thoughts out of the Chat API blocking-mode response and returns an AgentResponse carrying ToolCall instances tagged source=dify. The scenarios switch to the standard order_status_lookup and dinner_reservation pair, and the README documents the matching Dify Agent app + tools setup required to exercise the capture path. Adds ToolCallSource.DIFY and ToolCallSource.RASA enum variants so the two HTTP-wrapper examples that capture outside the per-SDK adapter path can tag their tool calls with their own provenance.
Wrap receiver.submit_tool_calls in try/except so a misbehaving receiver cannot propagate back into the SDK callback that invoked the adapter. Add a debug log on the no-routing-context drop branch so the missing-ids case is diagnosable. Document the str() coercion contract for subclasses on ToolCall.result/error in the class docstring.
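A minimal sketch of the guard described above, assuming the receiver exposes a `submit_tool_calls(tool_calls)` method that takes a single sequence. The helper name `safe_submit`, the logger name, and the two toy receivers are hypothetical and exist only to make the example runnable.

```python
import logging
from typing import Any, Sequence

logger = logging.getLogger("arksim_adapter_sketch")  # hypothetical logger name


def safe_submit(receiver: Any, tool_calls: Sequence[Any]) -> bool:
    """Forward tool calls without letting receiver errors reach the SDK callback."""
    try:
        receiver.submit_tool_calls(tool_calls)
    except Exception:
        # Log with traceback, then swallow: the SDK callback that invoked the
        # adapter should never see a receiver failure.
        logger.exception("receiver failed; dropped %d tool call(s)", len(tool_calls))
        return False
    return True


class BrokenReceiver:
    """Toy receiver that always fails."""

    def submit_tool_calls(self, tool_calls: Sequence[Any]) -> None:
        raise RuntimeError("receiver is down")


class ListReceiver:
    """Toy receiver that records what it was given."""

    def __init__(self) -> None:
        self.seen: list = []

    def submit_tool_calls(self, tool_calls: Sequence[Any]) -> None:
        self.seen.extend(tool_calls)
```

Returning a boolean is a choice made for the sketch so callers and tests can observe the drop; the real adapter may simply return nothing.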
…tion If LiveKit ever emits a structured error or result object instead of the documented string, ToolCall(result=...) / ToolCall(error=...) would raise ValidationError inside the event-loop callback. Wrap fn_output.output in str() for symmetry with the other adapters.
… READMEs The autogen, pydantic-ai, mastra, and vercel-ai-sdk examples rely on arksim's OTLP trace receiver but did not tell first-time readers to install 'arksim[otel]'. A reader following the docs literally would hit 'arksim: command not found' on the first run. Add the install step as step 1 in the Setup section and renumber the following steps.
LangChain dispatches sync BaseCallbackHandler methods on a thread pool when tools block, so the pending map is reachable from multiple threads. Without a lock, dict iteration during _sweep_stale can race with mutation from another thread and raise RuntimeError. Lock scope covers only in-memory dict operations, never I/O.
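The locking pattern can be sketched as follows. The class and method names here are hypothetical stand-ins, not the real PendingToolCalls API; the 60-second default mirrors the TTL mentioned elsewhere in this PR. The point of the sketch is that the lock covers only dict operations, and the stale-key scan and deletion happen under the same lock so no other thread can mutate the dict mid-iteration.

```python
import threading
import time
from typing import Any, Optional


class PendingCallsSketch:
    """Hypothetical pending-call map shared between callback threads."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._pending: dict = {}  # run_id -> (start_time, payload)

    def start(self, run_id: str, payload: Any, now: Optional[float] = None) -> None:
        ts = time.monotonic() if now is None else now
        with self._lock:
            self._pending[run_id] = (ts, payload)

    def finish(self, run_id: str) -> Optional[Any]:
        # Pop under the lock; anything slow (I/O, submit) happens after release.
        with self._lock:
            entry = self._pending.pop(run_id, None)
        return None if entry is None else entry[1]

    def sweep_stale(self, now: Optional[float] = None) -> int:
        ts = time.monotonic() if now is None else now
        with self._lock:
            stale = [k for k, (t, _) in self._pending.items() if ts - t > self._ttl]
            for key in stale:
                del self._pending[key]
        return len(stale)
```

The optional `now` parameter is only there to make the sweep deterministic in tests.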
ToolOutput.content is typed Any upstream; a custom ToolOutput that returns a non-string would raise Pydantic ValidationError when constructing ToolCall.result / ToolCall.error. Wrap with str() to match the livekit and langchain adapters.
Bring the openai-agents-sdk example in line with the other SDK adapter examples (langchain, pydantic-ai, strands, livekit, etc.):
- Add tools.py with lookup_order and book_table wrapped in @function_tool.
- Wire ArksimTracingProcessor in custom_agent.py at module load so FunctionSpanData lands as ToolCall on the active turn.
- Enable trace_receiver in config.yaml for in-process span capture.
- Replace generic Q&A scenarios with order_status_lookup and dinner_reservation, matching the lookup_order/book_table tool surface.
- Rewrite README to follow the 5-section template (Setup, Run, How it works, Expected output, Files) used by the rest of the rollout.

CrewAI's event bus dispatches handlers via ThreadPoolExecutor.submit after contextvars.copy_context(). The adapter's routing context survives across the thread boundary because of that copy. Record the dependency inline so a future CrewAI change to the dispatch path is easy to spot.
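The dependency on `contextvars.copy_context()` can be demonstrated with the standard library alone. The sketch below does not use CrewAI; `routing_id`, `handler`, and the two dispatch helpers are hypothetical names that imitate a dispatcher which copies the caller's context before submitting to a thread pool, next to one that does not.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for the adapter's routing context.
routing_id = contextvars.ContextVar("routing_id", default=None)


def handler():
    """Runs on a worker thread and reads whatever context it was given."""
    return routing_id.get()


def dispatch_with_copy(pool: ThreadPoolExecutor) -> Future:
    # Snapshot the caller's context and run the handler inside that snapshot.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, handler)


def dispatch_without_copy(pool: ThreadPoolExecutor) -> Future:
    # The worker thread uses its own context, not the caller's.
    return pool.submit(handler)


routing_id.set("turn-42")
with ThreadPoolExecutor(max_workers=1) as pool:
    copied = dispatch_with_copy(pool).result()
    bare = dispatch_without_copy(pool).result()

# `copied` carries the caller's value. On standard CPython builds `bare` is
# usually None, because pool threads do not inherit the caller's context.
```

If a future dispatcher dropped the copy step, the adapter would see the `bare` behavior and fall into its no-routing-context drop branch.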
The langchain/crewai/livekit/etc. floors aren't arbitrary. Each one corresponds to a specific SDK version whose callback or hook surface was verified by the adapter's module docstring. Lowering a floor requires re-verifying the adapter against the older shape, so the connection shouldn't have to be reverse-engineered.
…tract The contract test now covers three rails for every SDK adapter: happy path (existing), no-routing-context drop, and error-event population on ToolCall.error. SDKs that don't expose a distinct error event (Claude Agent SDK, Google ADK, Smolagents) are skipped explicitly with a reason so the gap is visible.
The example agents pin gpt-5.1 in both config.yaml (used by the
simulated user and evaluator) and custom_agent.py (used by the agent
under test). arksim's config loader does not interpolate env vars
into the top-level model field, so the README cannot rely on
${OPENAI_MODEL} expansion.
Add a single-sentence README note in each affected example pointing
to OPENAI_MODEL plus an inline edit path. Read OPENAI_MODEL with a
gpt-5.1 fallback in the six custom_agent.py files that hardcoded the
model id (langchain, langgraph, crewai, llamaindex, livekit, strands)
so the env-var path actually works for the agent side.
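The agent-side change amounts to an environment lookup with a default. A minimal sketch, where `resolve_model` and `DEFAULT_MODEL` are hypothetical names and the optional `env` parameter exists only so the function can be exercised without touching the real environment:

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "gpt-5.1"  # fallback named in the commit message


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_MODEL if set, otherwise the default model id."""
    source = os.environ if env is None else env
    return source.get("OPENAI_MODEL", DEFAULT_MODEL)
```

Note that this only covers the agent under test; per the message above, the top-level model field in config.yaml is not interpolated from the environment.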
The pip package is livekit-agents; naming the arksim extra livekit collides with the broader LiveKit SDK family. Match the package name the way the other extras do (langchain-core via langchain, etc.).
- Rename [project.optional-dependencies] key livekit -> livekit-agents in pyproject.toml. Dep list unchanged.
- Update the multi-extra resolver step in .github/workflows/ci.yml.
- Update examples/integrations/livekit/README.md and the custom_agent.py docstring to install 'arksim[livekit-agents]'.
The error message in arksim/tracing/integrations/livekit.py still references 'arksim[livekit]'; that update is owned by the tracing adapter agent.
The previous step 3 told users to 'attach two tools matching the scenarios' without orienting them to the Dify Studio screens that matter. A new reader had to discover the Agent-app builder and the tool plugin docs on their own. Link the canonical Dify docs (Build an Agent application, Build a tool plugin) directly and call out the Tools panel as the attach-and-authorize step. Keep the tool signatures and return shapes inline so the agent's deterministic output stays clear.
Every test_*_adapter.py in tests/unit/integrations/ defined its own _clean_context autouse fixture and _only_call helper. Move both to a shared conftest.py: _clean_context stays an autouse fixture (now applied once), and _only_call becomes a fixture each test takes as a parameter. Drift risk drops to zero when the trace-context API changes.
A domain reader will ask: when the per-SDK adapter and the OTel path disagree for the same agent, which is canonical? The Mastra and Vercel AI SDK examples emit gen_ai.tool.* spans while LangChain, CrewAI, Strands, etc. use SDK-native callbacks; the two paths can produce different field sets for the same call. Add an Info callout to the Automatic Capture section explaining that the per-SDK adapter is canonical when both paths are available (because it captures fields OTel semconv doesn't standardize, like LangChain run_id split events), and the OTel path is the fallback when the SDK has no Python-side adapter.
…mment Two readability cleanups: complete the orphan ANN401 noqa on the LangChain on_tool_end kwargs to match the surrounding three sites, and explain why PendingToolCalls defaults to a 60s TTL so future changes treat the value as load-bearing rather than arbitrary.
- claude-agent-sdk: change 'this example runs against claude-sonnet-4-6' to 'uses claude-sonnet-4-6 by default; change in config.yaml if your account uses a different model name'. The original phrasing implied the model was hardcoded and unchangeable.
- livekit: clarify that LiveKit Cloud proxies the LLM call to OpenAI but the inference cost is billed against OPENAI_API_KEY, not LiveKit credits. The previous claim was ambiguous about which budget pays.
- strands: call out that the quotes around 'strands-agents[openai]' are required on zsh, where the brackets glob otherwise.
…sdk, mastra, vercel The Mastra and Vercel AI SDK example READMEs already documented an OPENAI_MODEL environment variable for overriding the agent's model without editing config.yaml, but neither agent_server.ts (and neither of the AutoGen or OpenAI Agents SDK custom_agent.py files) actually read the variable. Wire it through in all four agents so the documented override works, and tighten the Mastra/Vercel README sentences to point at the correct default model name and to clarify that the simulator's own model is set by the top-level model field in config.yaml.
… field
The smolagents adapter used a truthiness check on the observations
field, which coerced empty strings (and any other falsy non-None
value) to None on the emitted ToolCall.result. That contradicts the
BaseTracingAdapter contract that distinguishes 'no result captured'
(None) from 'result captured but empty' (''). Switch to an explicit
'is not None' check and add a str() wrap so non-string observation
payloads are still serialized safely. Cover the empty-string round
trip with a new unit test.
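The difference between the two checks is easiest to see side by side. In this sketch both function names are hypothetical, and `observations` stands in for the smolagents field named above:

```python
from typing import Any, Optional


def result_via_truthiness(observations: Any) -> Optional[str]:
    """Old behavior: any falsy value, including '', collapses to None."""
    return observations if observations else None


def result_via_none_check(observations: Any) -> Optional[str]:
    """New behavior: only None means 'no result captured'."""
    return str(observations) if observations is not None else None
```

With the explicit check, an empty string round-trips as `''` and a non-string payload such as `0` is serialized to `'0'` instead of being dropped.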
…ations/ The OpenAI Agents SDK tracing adapter lives at arksim.tracing.openai because it predates the arksim.tracing.integrations subpackage that hosts the eight newer SDK adapters. Add a short note to the module docstring so readers and future contributors understand the asymmetry: the existing path is preserved for import stability, and new SDK adapters should land under arksim.tracing.integrations.<sdk>.
…AI_MODEL Capture two user-visible changes in the [Unreleased] section: the rename of the livekit pip extra to livekit-agents (which forces a reinstall for anyone with arksim[livekit] in their environment) and the OPENAI_MODEL environment variable that now overrides the agent model across the integration examples without requiring config.yaml edits.
…l fixture These two adapter test files still hand-rolled the receiver inspection (assert call_count == 1; args, _ = call_args; ...) instead of using the only_call fixture that the six other adapter test files adopted when fixtures were hoisted into conftest. Wire single-submit tests through only_call and keep the multi-submit tests (source field set across two events, batch events with parallel calls) inline since only_call only covers the single-call case.
Both example agents use OpenAI under the hood. Their READMEs already documented the OPENAI_MODEL override, but the code hard-coded gpt-4o. Add a module-level _MODEL constant that reads OPENAI_MODEL with a gpt-5.1 default and use it as the model id.
…agents-sdk READMEs Both example agents already read OPENAI_MODEL from the environment, but their READMEs didn't tell users that. Match the one-liner the other OpenAI-based example READMEs carry.
The autogen Python agent and the mastra and vercel-ai-sdk TypeScript agents defaulted to gpt-4o; the openai-agents-sdk agent and the rest defaulted to gpt-5.1. Align all OpenAI-based example agents to gpt-5.1 so users see one default model across the suite. Override-note prose in the affected READMEs is updated to match.
…GELOG After wiring OPENAI_MODEL into smolagents and pydantic-ai, the Changed bullet that listed them as 'unaffected' is no longer accurate. Move both to the OpenAI-honoring list (12 OpenAI-based + 4 non-OpenAI = 16 total examples).
Summary

Adds 8 per-SDK tracing adapters under arksim/tracing/integrations/ so users of LangChain/LangGraph, CrewAI, Claude Agent SDK, Google ADK, LiveKit Agents, Strands Agents, LlamaIndex, and Smolagents get zero-config tool-call capture through the Python connector. All 13 existing integration examples updated; 2 new examples added (LiveKit, Strands).

What ships

- BaseTracingAdapter (contextvars-based, no registry) under arksim/tracing/integrations/_base.py
- PendingToolCalls split-event correlation helper (used by LangChain and LlamaIndex)
- parse_tool_arguments shared argument normalization helper
- ToolCallSource enum variants (8 adapters + Dify + Rasa)
- … pyproject.toml
- detect-secrets pre-commit hook + initial baseline

Non-goals (explicit)

- … arksim.tracing.openai)
- … arksim/tracing/openai.py (file stays in place)

Test plan

- ruff check + ruff format --check clean
- pip install 'arksim[langchain,crewai,llamaindex,claude-agent,google-adk,livekit,strands,smolagents]' resolves cleanly

Manual verification needed before merge

- … tool_calls in simulation.json.
- … npm install first.
- … lookup_order + book_table tools configured server-side. Rasa needs a running Rasa server with the included rasa_project. Vercel AI SDK needs npm install && npm start first.

Follow-ups (intentionally NOT in this PR)

- … create_react_agent to langchain.agents.create_agent (v1 deprecation)
- … PostToolUseFailure event
- … examples/customer-service/ as canonical demo
- … parse_openai honoring message.tool_calls (would let agent servers skip OTel for simple cases)