From 82915a5c9808c3542048e5f4d13e4fa073da89d7 Mon Sep 17 00:00:00 2001
From: Rhyannon Joy Rodriguez
Date: Fri, 8 May 2026 16:13:18 -0700
Subject: [PATCH] [add] platforms (Issue #27)

---
 README.md                       |  2 +-
 SPEC.md                         | 20 +----------
 site/config/_default/menus.toml |  5 +++
 site/content/platforms.md       | 63 +++++++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+), 20 deletions(-)
 create mode 100644 site/content/platforms.md

diff --git a/README.md b/README.md
index f0c012e..a3254c5 100644
--- a/README.md
+++ b/README.md
@@ -98,7 +98,7 @@ This spec is open for community review. We welcome:
 - **Proposed changes**: Submit a pull request (open an issue first for
   significant changes)
 - **Platform data**: If you know a platform's truncation limits, contribute to
-  the [Known Platform Limits](SPEC.md#known-platform-limits) table
+  [the Platforms tables](./site/content/platforms.md)
 - **Real-world results**: If you've evaluated your docs against this spec, we'd
   love to hear what you found
 
diff --git a/SPEC.md b/SPEC.md
index cc85418..3c01912 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -1503,25 +1503,7 @@ becomes available.
 
 ### Known Platform Limits
 
-| Platform | Truncation Limit | Source | Confidence | Notes |
-| ---------- | ----------------- | -------- | ------------ | ------- |
-| Claude Code | ~100,000 chars | [Reverse engineering](https://giuseppegurgone.com/claude-webfetch) | High | Trusted sites serving `text/markdown` under 100K chars bypass summarization model entirely. Content over this threshold goes through a summarization model that may lose information. |
-| MCP Fetch (reference server) | 5,000 chars (default) | [Official docs](https://pypi.org/project/mcp-server-fetch/) | High | Default `max_length` is 5,000 chars. Configurable up to 1,000,000. Supports chunked reading via `start_index`. |
-| Claude API (web_fetch tool) | ~20,700 chars - default, unset | [empirical testing](https://rhyannonjoy.github.io/agent-ecosystem-testing/) | Medium | Optional `max_content_tokens` parameter can cap content length, but no default truncation limit is documented. Distinct implementation from Claude Code client-side tool. Default truncation ~20,700 chars when unset - ended mid-word. `max_content_tokens` is approximate — setting 5,000 returned 17,186 chars. Truncation occurs mid-token. CSS stripped effectively unlike Claude Code. HTML boilerplate 81–97.5% before first heading; Markdown reduces content 77%. JS-rendered pages return static shell only. |
-| Google Gemini (URL context) | Unknown | [empirical testing](https://rhyannonjoy.github.io/agent-ecosystem-testing/) | Medium | Docs state a 34 MB max fetch size per URL, but this is a retrieval ceiling, not a processing limit. How much content actually reaches the model after fetching is undocumented. 20 URL hard limit per request, `400 INVALID_ARGUMENT` if exceeded, zero tokens consumed. Truncation boundary unknown — retrieved content is injected into context without a testable field; `tool_use_prompt_token_count` is the only available size proxy, <1% variance across runs. PDF failed consistently despite being a documented supported type; YouTube succeeded despite being documented as unsupported. `url_context_metadata` order is non-deterministic. Tested on `gemini-2.5-flash` only — behavior may vary across supported models. |
-| OpenAI (web search) | Unknown | [empirical testing](https://rhyannonjoy.github.io/agent-ecosystem-testing/) | Medium | 128K token context window for web search. `search_context_size` parameter (low/medium/high) controls context amount but no per-page truncation limit is surfaced; when the tool invokes, any truncation of retrieved source content occurs before the model generates a response and isn't observable via the APIs. Consistent latency lever in Chat Completions API track, high ~1.5–1.7× slower, inconsistent in Responses API track. Source count stable at 12 regardless of context size. Tool invocation conditional and deterministic: static facts and trivial math don't invoke the tool. Domain filtering documented but non-functional via Python SDK — allow-list worked once on `web_search_preview`, never on `web_search`; block-list never succeeded across 6 runs, 2 tool types, 2 models. `search_queries_issued` appends training-era year strings despite running in 2026. Tested on `gpt-4o` + `gpt-4o-mini-search-preview` - behavior may vary across supported models. |
-| Cursor | Method-dependent | [empirical testing](https://rhyannonjoy.github.io/agent-ecosystem-testing/) | High | No documented truncation limit, behavior varies between backend methods `WebFetch MCP` ~28KB, `urllib` ~72KB, other routes 240KB+; `Auto` agent routing opaque; Cursor autonomously selects fetch mechanism. On timeout, falls back to `curl` (unfiltered HTML, 16MB+ observed). Requests `text/markdown` via `Accept` header. No token limit detected (tested 6.68M tokens). Perfect reproducibility for same URL; high variance for small files across sessions. |
-| GitHub Copilot | No fixed ceiling detected | [empirical testing](https://rhyannonjoy.github.io/agent-ecosystem-testing/) | Medium | No documented web fetch or truncation details; tool selection is non-deterministic and not controllable by prompt. `fetch_webpage` identified through logs only; performs relevance-ranked semantic excerpts with `...` elision markers in HTML-to-Markdown transformation with chunk-based reassembly; output order doesn't always reflect page reading order. No size limit detected across 55 runs; `curl` substitution delivers full retrieval, raw bytes in server format with no transformation layer. `Auto` model routing dispatches across multiple models with no documented routing logic. Tested on `Claude Haiku 4.5`, `Claude Sonnet 4.6`, `GPT-5.3-Codex`, `GPT-5.4`, `Grok Code Fast 1`, `Raptor mini (Preview)`. |
-| Windsurf Cascade | No fixed ceiling detected at retrieval stage, but agent-dependent write ceiling | [empirical testing](https://rhyannonjoy.github.io/agent-ecosystem-testing/) | High | Two-stage pipeline `read_url_content` returns chunk index with summaries, metadata, requires sequential `view_content_chunk` calls. Full retrieval agent, doc size dependent. Full retrieval doesn't guarantee full content delivery. Agents often retrieve fully ~<14 chunks, spotty ~35, sparse sampling 50+. Includes per-chunk trucation, some chunk summaries include byte-count loss notices. CSS-heavy, SPAs often retrieve ~20-35% expected rendered size. `@web` syntax redundant with URL. Read-write asymmetry: agents that self-report full retrieval frequently fail to reproduce semantically-meaningful content with `curl` HTML/JS shells, false completions, cross-agent file reuse. |
-
-**Thank you to contributors!**
-
-- Claude API (web_fetch tool) limitations contributed by [Rhyannon Rodriguez](https://rhyannonjoy.github.io/agent-ecosystem-testing/)
-- Cursor limitations contributed by [Rhyannon Rodriguez](https://rhyannonjoy.github.io/agent-ecosystem-testing/)
-- GitHub Copilot limitations contributed by [Rhyannon Rodriguez](https://rhyannonjoy.github.io/agent-ecosystem-testing/)
-- Google Gemini (URL context) limitations contributed by [Rhyannon Rodriguez](https://rhyannonjoy.github.io/agent-ecosystem-testing/)
-- OpenAI (web search) limitations contributed by [Rhyannon Rodriguez](https://rhyannonjoy.github.io/agent-ecosystem-testing/)
-- Windsurf Cascade limitations contributed by [Rhyannon Rodriguez](https://rhyannonjoy.github.io/agent-ecosystem-testing/)
+Compare platform architecture and truncation limits in [Platforms](./site/content/platforms.md).
 
 ### What This Means for Threshold Selection
 
diff --git a/site/config/_default/menus.toml b/site/config/_default/menus.toml
index 6a87b6b..1e41ba9 100644
--- a/site/config/_default/menus.toml
+++ b/site/config/_default/menus.toml
@@ -3,6 +3,11 @@ name = "Spec"
 pageRef = "spec"
 weight = 10
 
+[[main]]
+name = "Platforms"
+pageRef = "platforms"
+weight = 15
+
 [[main]]
 name = "GitHub"
 url = "https://github.com/agent-ecosystem/agent-docs-spec"
diff --git a/site/content/platforms.md b/site/content/platforms.md
new file mode 100644
index 0000000..51c27fc
--- /dev/null
+++ b/site/content/platforms.md
@@ -0,0 +1,63 @@
+---
+title: "Platforms"
+description: "Agent platform comparisons for retrieval, truncation, and summarization layers."
+---
+
+| **Section** | **Description** |
+| ----------- | ------------------ |
+| [Retrieval](#retrieval) | How and when an agent fetches content |
+| [Truncation](#truncation) | What gets lost and whether agents report it |
+| [Summarization](#summarization) | What happens to content between retrieval and generation |
+
+## Retrieval
+
+The web fetch gap isn't in retrieval, but in what follows: how agents attend to various content types
+during generation, whether that's context window handling, chunking losses, or summarization. Platform
+links lead to each tool's official documentation.
+
+| **Platform** | **Prompt Syntax** | **Invocation Pattern** | **Retrieval Behavior** |
+| ---------- | ------------ | -------------------- | ----------------------- |
+| [Claude API web fetch](https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-fetch-tool) | Enable tool to augment Claude's context with URL | _Mid-generation deterministic_: tool requires enablement in API request, includes URL validation and results cache, may or may not provide live web content | _Visibility high_: only platform where response body includes raw tool result; no JavaScript execution, CSS-heavy pages and/or SPAs often return little to no prose |
+| [Claude Code](https://code.claude.com/docs/en/overview) | `WebFetch` invoked automatically with prompt URL | _Mid-generation deterministic_: returns cached result if available, otherwise fetches live | _Visibility high_: Markdown result returned directly if trusted, text/markdown, <100k; otherwise a smaller LLM extracts relevant content before passing to Claude; no JavaScript execution |
+| [Cursor](https://cursor.com/docs) | _No web fetch behavior publicly documented_, `@Web` context attachment redundant, agents don't correct misuse | _Mid-generation nondeterministic_: `Auto` default setting autonomous LLM and fetch method selection per request | _Visibility low_: fetch method not explicitly named, no JavaScript execution, CSS-heavy pages and/or SPAs often return little to no prose; prefers Markdown, content negotiation documented with `Accept: text/markdown`; sends full browser fingerprint: `User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36` with Chrome client hints `Sec-Ch-Ua`, `Sec-Fetch-*` |
+| [Gemini API URL context](https://ai.google.dev/gemini-api/docs/url-context) | Enable tool to augment Gemini's context with URL, request requires `url_context` with full, unnested URLs | _Pre-generation injection deterministic_: two-step process, fetches from internal cache, if unsuccessful, then live fetch; documentation includes parsing limitations | _Visibility low_: retrieved content injected into context without a testable field, retrieval orchestration and generation process opaque; `url_context_metadata` order _nondeterministic_, authoritative signal `url_retrieval_status`, `tool_use_prompt_token_count` only size proxy |
+| [GitHub Copilot](https://code.visualstudio.com/docs/copilot/overview) | _No web fetch behavior publicly documented_, prompt with URL | _Mid-generation nondeterministic_: `Auto` default setting autonomous LLM and fetch method selection per request | _Visibility medium_: tools intermittently named via error, `fetch_webpage` returns relevance-ranked excerpts with elision markers, occasional nonlinear, inaccurate reassembly, `curl` byte-perfect retrieval, but no prose; content negotiation tool-dependent, presents as a browser, but overclaims `User-Agent`: `Mozilla/5.0`, `AppleWebKit` `Accept`: full HTML, `curl/8.7.1` no preference, `Accept`: `*/*` |
+| [MCP Fetch (reference server)](https://pypi.org/project/mcp-server-fetch/) | `url` required; `max_length`, `start_index`, `raw` optional | _Mid-generation deterministic_: fetch invoked automatically with URL | _Visibility high_: returns extracted contents as Markdown; supports chunked reading via `start_index`, allowing LLM to page through content until it finds what's needed; no JavaScript execution |
+| [OpenAI web search](https://developers.openai.com/api/docs/guides/tools-web-search) | Chat Completions API augments `GPT`'s search with URL, Responses API for `web_search` | _Mid-generation nondeterministic: integration and agent-dependent_: static facts and trivial math don't invoke the tool; Chat Completions search implicit, Responses `web_search_preview` conditional, control cached/indexed or live content `external_web_access` | _Visibility low_: Responses `response.output`'s `web_search_call` names tools, but search context not equal to LLM context window; no JavaScript execution; `search_context_size: low/medium/high` controls context amount, Chat Completions latency lever consistent, but Responses inconsistent |
+| [Windsurf Cascade](https://docs.windsurf.com/windsurf/cascade/web-search) | Web and docs search partially documented, `@web` directive redundant with URL, agents don't correct misuse | _Mid-generation deterministic_: autonomous two-stage pipeline designed to emulate human browsing and skimming, documentation acknowledges not all pages parseable | _Visibility medium_: `read_url_content` returns chunk index with summaries, metadata and requires sequential `view_content_chunk` calls; `curl` substitution for CSS-heavy pages, SPAs return ~20–35% of expected size, little or no prose; agents used `@web`'s `web_search` as verification once every ~60 turns; presentation transparent about using crawler-scraper, but underdelivers, `User-Agent`: [Colly](https://github.com/gocolly/colly) |
+
+## Truncation
+
+Pipelines are lossy by design in an attempt to balance token cost, speed, and access to fresh content.
+Agents intermittently acknowledge architectural constraints, misattribute truncation causes, or
+self-report completeness when content is incomplete or unusable. Platform links lead to empirical
+testing analysis and/or tool documentation.
+
+
+| **Platform** | **Truncation Limit** | **Observations** |
+| ---------- | ----------------- | ------- |
+| [Claude API web fetch](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/anthropic-claude-api-web-fetch-tool/claude-interpreted-vs-raw) | ~20,700 chars and/or ~100 KB of rendered content _default unset_ | `max_content_tokens` approximate, setting 5,000 returned 17,186 chars, truncation occurs mid-token. Default limit identified in raw track, self-report attributed missing content to JavaScript rendering, masking character limit. |
+| [Claude Code](https://giuseppegurgone.com/claude-webfetch) | ~100,000 chars | Trusted sites serving `text/markdown` under 100K chars bypass summarization, while content over 100K chars is passed to a summarization LLM. |
+| [Cursor](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/anysphere-cursor/cursor-interpreted-vs-raw) | 28 KB–240 KB+ _method-dependent_, _nondeterministic filtering_ | `WebFetch MCP` ~28 KB, `urllib` ~72 KB, unknown path 245 KB+, `curl` no ceiling detected; appears to apply structure-aware content filtering, navigation and CSS stripped, but content selection heuristic presents as complete, so agents don't report truncation. |
+| [Gemini API URL context](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/google-gemini-url-context-tool/gemini-interpreted-vs-raw) | _No fixed ceiling or silent dropping detected_, 20 URLs hard limit per request | API-layer rejection returns `400` and doesn't consume tokens; retrieval-layer failure completes the request, but records `URL_RETRIEVAL_STATUS_ERROR`. Format support inconsistent with documentation: PDF fails, YouTube succeeds, JSON nondeterministic; Google Docs fail consistently. |
+| [GitHub Copilot](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/microsoft-github-copilot/copilot-interpreted-vs-raw) | _No fixed ceiling detected_, _nondeterministic excerpting_, tested 6.68M tokens | Pipeline with `fetch_webpage` discards whole sections or more granularly before generation, `curl` delivers all raw bytes but unreadable, chat rendering cutoff visible in output, not persisted as requested, but agents don't reliably report these results as truncation. |
+| [MCP Fetch (reference server)](https://pypi.org/project/mcp-server-fetch/) | Default 5,000 chars | Default `max_length` is 5,000 chars, but configurable up to 1,000,000; uniquely user-controlled truncation. |
+| [OpenAI web search](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/open-ai-web-search-tool/chatgpt-interpreted-vs-raw) | _No fixed ceiling or silent dropping detected_ | Raw source count stable at 12 regardless of `search_context_size` setting. Query construction not temporally aware, internal queries append training-era date strings despite running in 2026. Documented domain filtering limits not functional in Python SDK. |
+| [Windsurf Cascade](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/cognition-windsurf-cascade/cascade-interpreted-explicit-vs-raw) | _No fixed ceiling detected at retrieval stage_, _nondeterministic agent-dependent write ceiling_ | Full retrieval agent and doc-size-dependent. Agents often retrieve fully under ~14 chunks, spotty at ~35, sparse sampling at 50+. Chunk index summary population not guaranteed, those present often include byte-count loss notices. Unique read-write asymmetry. Agents often self-report full retrieval, but fail to prove it with a write task or report truncation. |
+
+## Summarization
+
+Processing layer observability varies by implementation. Platforms often offer user-configured subagents
+while turn-by-turn chat interactions abstract any default orchestrator-subagent relationships away.
+Observable outputs from default settings primarily inform the conclusions below.
+
+| **Platform** | **Processing Layer** | **Inference** |
+| ---------- | -------------------- | ------------ |
+| [Claude API web fetch](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/anthropic-claude-api-web-fetch-tool/methodology) | _Dynamic filtering optional_, `web_fetch_20260209` | Server-side tool called directly with inspectable tool result in response. [Dynamic filtering](https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-fetch-tool) available with certain LLMs in which Claude writes, executes code to filter before content reaches the context window, but it's not default behavior. |
+| [Claude Code](https://giuseppegurgone.com/claude-webfetch) | _Summarization threshold-triggered_ | Content under ~100K chars from trusted text/markdown sources reaches the context window directly without intermediate processing, but content exceeding this threshold goes through a summarization LLM that may lose information. |
+| [Cursor](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/anysphere-cursor/methodology) | _Inferred via filtering, undocumented for web fetch_ | Codebase research, terminal commands, and browser automation requests trigger [built-in subagents](https://cursor.com/docs/subagents) `explore`, `bash`, and `browser`. Test prompts likely invoked `explore` and `bash` alongside web fetch. Backend routing and structure-aware content filtering suggest a pre-generation processing layer, not a passive, linear pipeline. |
+| [Gemini API URL context](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/google-gemini-url-context-tool/methodology) | _API layer pipeline, undocumented_ | Pre-generation injection suggests processing occurs before LLM invocation. No transformation layer between retrieval and generation; LLM receives content directly and any summarization occurs as part of generation, not as an intermediate pipeline stage. |
+| [GitHub Copilot](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/microsoft-github-copilot/methodology) | _Inferred via relevance-ranking, undocumented for web fetch_ | Reassembled excerpts, outputs that don't note discarded content, browser masquerading, and tool substitution patterns suggest an orchestrator-subagent relationship and not a linear, passive pipeline. Agent loop descriptions vary by implementation. [VS Code-Copilot docs](https://code.visualstudio.com/docs/copilot/agents/subagents) describe subagent delegation as _main agent-initiated_ for complex tasks with further config available, but [Copilot SDK docs](https://docs.github.com/en/copilot/how-tos/copilot-sdk/use-copilot-sdk/custom-agents) only describe subagents as configurable, and not default architecture. |
+| MCP Fetch (reference server) | _None_ hard truncation at `max_length` | Passive, linear pipeline without a processing layer. |
+| [OpenAI web search](./open-ai-web-search-tool/methodology.md) | _Differs by API surface, undocumented_ | Chat Completions autonomously retrieves, but Responses' LLM actively manages search in the chain of thought with `open_page` and `find_in_page`, suggesting a processing layer, but not explicitly documented or named in either API responses. |
+| [Windsurf Cascade](https://rhyannonjoy.github.io/agent-ecosystem-testing/docs/cognition-windsurf-cascade/methodology) | _Inferred via chunking, undocumented for web and docs search_ | Codebase research triggers [built-in subagent Fast Context](https://docs.windsurf.com/context-awareness/fast-context). Test prompts likely invoked Fast Context alongside web search. Chunk analysis, tool substitution, terminal execution, and workspace referencing suggest an extensive processing layer, not a passive, linear pipeline. |
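
A note on the MCP Fetch rows in this patch: the tables describe chunked reading via `start_index` with a default `max_length` of 5,000 chars. A minimal sketch of that paging pattern follows, assuming the page content is already available as a local string. `read_window` and `read_all` are hypothetical helpers written for illustration; they are not part of `mcp-server-fetch`, and no network call is made.

```python
from typing import Iterator


def read_window(content: str, start_index: int = 0, max_length: int = 5000) -> str:
    """Return at most max_length characters beginning at start_index.

    Hypothetical stand-in for a single fetch call. The parameter names
    mirror the `start_index` and `max_length` options named in the tables;
    the slicing itself is an assumption about how a window is cut.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return content[start_index:start_index + max_length]


def read_all(content: str, max_length: int = 5000) -> Iterator[str]:
    """Page through content one window at a time until nothing is left."""
    start_index = 0
    while start_index < len(content):
        window = read_window(content, start_index, max_length)
        yield window
        # Advance by what was actually returned so the last, shorter
        # window ends the loop.
        start_index += len(window)


if __name__ == "__main__":
    page = "x" * 12_000  # stand-in for extracted Markdown
    print([len(w) for w in read_all(page)])  # [5000, 5000, 2000]
```

With the default window, a 12,000-char page takes three reads, so a caller that stops after the first read sees under half of the page.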