From 2f40315c2997753286d772065475b61ca1d911a2 Mon Sep 17 00:00:00 2001
From: David Hurley
Date: Sat, 2 May 2026 18:35:45 +0000
Subject: [PATCH] Add Plasmate integration
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Plasmate is an open-source (Apache 2.0) browser engine for AI agents that
produces the Semantic Object Model (SOM), a flat, typed JSON document
representing a web page in a form optimized for LLM consumption.

This integration adds PlasmateWebFetcher and PlasmateSOMConverter
components for Haystack 2.0 RAG pipelines as a drop-in alternative to
LinkContentFetcher and HTMLToDocument, with ~17x average token reduction
across the public WebTaskBench benchmark.

Resubmitted here after deepset-ai/haystack#11056 was closed on Apr 13,
with @anakin87 suggesting this venue.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 integrations/plasmate.md | 146 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 integrations/plasmate.md

diff --git a/integrations/plasmate.md b/integrations/plasmate.md
new file mode 100644
index 0000000..f21099b
--- /dev/null
+++ b/integrations/plasmate.md
@@ -0,0 +1,146 @@
+---
+layout: integration
+name: Plasmate
+description: Browser engine for AI agents. Fetches web pages as a Semantic Object Model (SOM), a typed JSON format optimized for LLM consumption, with ~17× fewer tokens than raw HTML on average and peaks above 100×.
+authors:
+  - name: Plasmate Labs
+    socials:
+      github: plasmate-labs
+pypi: https://pypi.org/project/haystack-plasmate
+repo: https://github.com/plasmate-labs/haystack-plasmate
+type: Data Ingestion
+report_issue: https://github.com/plasmate-labs/haystack-plasmate/issues
+logo: /logos/plasmate.png
+version: Haystack 2.0
+toc: true
+---
+
+### Table of Contents
+
+- [Overview](#overview)
+- [Installation](#installation)
+- [Components](#components)
+  - [PlasmateWebFetcher](#plasmatewebfetcher)
+  - [PlasmateSOMConverter](#plasmatesomconverter)
+- [RAG pipeline example](#rag-pipeline-example)
+- [License](#license)
+
+## Overview
+
+[Plasmate](https://plasmate.app) is an open-source (Apache 2.0) browser engine designed from the ground up for AI agents. Instead of rendering pixels, Plasmate produces the Semantic Object Model (SOM), a flat, typed JSON document representing a web page in a form optimized for LLM consumption.
+
+Across 38 measured production sites, Plasmate achieves an average **~17× token reduction** versus raw HTML, with peaks above **100×** on large SaaS marketing pages. The reproducible benchmark is published at [webtaskbench.com](https://webtaskbench.com), and the SOM/1.0 format is specified at [somspec.org](https://somspec.org/spec).
+
+This integration exposes Plasmate to Haystack 2.0 RAG pipelines as a drop-in alternative to `LinkContentFetcher` and `HTMLToDocument`. The main benefit is a lower token cost per page in any pipeline that fetches web content for downstream LLM consumption.
+
+## Installation
+
+```bash
+pip install haystack-plasmate
+```
+
+You also need the Plasmate engine available on your `PATH`. The engine ships as a single binary:
+
+```bash
+# Python install of Plasmate itself
+pip install plasmate
+
+# Or via the project's release binaries
+# https://github.com/plasmate-labs/plasmate/releases
+```
+
+## Components
+
+### PlasmateWebFetcher
+
+Fetches web pages and converts them to Haystack `Document` objects with SOM content.
+
+```python
+from haystack_plasmate import PlasmateWebFetcher
+
+# Basic usage: fetches each URL and returns a Haystack Document
+fetcher = PlasmateWebFetcher()
+result = fetcher.run(urls=["https://example.com"])
+docs = result["documents"]
+
+print(docs[0].content)                    # Concise SOM text representation
+print(docs[0].meta["url"])                # https://example.com
+print(docs[0].meta["title"])              # Page title
+print(docs[0].meta["som_tokens"])         # ~hundreds
+print(docs[0].meta["html_tokens"])        # ~tens of thousands
+print(docs[0].meta["compression_ratio"])  # e.g. 47.3
+
+# With custom headers (e.g. for authenticated pages)
+fetcher = PlasmateWebFetcher(
+    headers={"Authorization": "Bearer token123"},
+    timeout=60,
+)
+
+# Text-only mode: extracts readable text without SOM structure
+fetcher = PlasmateWebFetcher(text_only=True)
+```
+
+### PlasmateSOMConverter
+
+Converts raw HTML strings to SOM `Document` objects without making HTTP requests. Useful when the HTML is already in hand (from a database, a different fetcher, or in-process rendering).
+
+```python
+from haystack_plasmate import PlasmateSOMConverter
+
+converter = PlasmateSOMConverter()
+
+# Convert a single HTML string
+result = converter.run(html="<h1>Hello</h1>")
+doc = result["documents"][0]
+
+# Convert multiple HTML sources, attaching metadata to each
+result = converter.run(sources=[
+    {"html": "...", "meta": {"source": "page1.html"}},
+    {"html": "...", "meta": {"source": "page2.html"}},
+])
+```
+
+## RAG pipeline example
+
+The indexing side of a web-aware RAG pipeline: it fetches documentation pages with Plasmate, embeds them, and writes them to a document store. The `PromptBuilder`, `OpenAITextEmbedder`, and `OpenAIGenerator` imports belong to the question-answering side, which is not shown here:
+
+```python
+from haystack import Pipeline
+from haystack.components.builders import PromptBuilder
+from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
+from haystack.components.generators import OpenAIGenerator
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack_plasmate import PlasmateWebFetcher
+
+document_store = InMemoryDocumentStore()
+
+# Indexing pipeline: fetch with Plasmate, embed, write
+indexing = Pipeline()
+indexing.add_component("fetcher", PlasmateWebFetcher())
+indexing.add_component("embedder", OpenAIDocumentEmbedder())
+indexing.add_component("writer", DocumentWriter(document_store=document_store))
+indexing.connect("fetcher.documents", "embedder.documents")
+indexing.connect("embedder.documents", "writer.documents")
+
+indexing.run({
+    "fetcher": {
+        "urls": [
+            "https://docs.haystack.deepset.ai/docs/intro",
+            "https://docs.haystack.deepset.ai/docs/components_overview",
+        ],
+    },
+})
+```
+
+The same indexing pipeline using `LinkContentFetcher` would consume **roughly an order of magnitude more tokens per URL** during embedding and downstream generation, because Plasmate strips layout scaffolding, design-system runtime, analytics initialisation, and other non-content tokens before the document reaches the embedder.
+
+## SOM directives: making your own pages Plasmate-friendly
+
+If you publish content that you would like AI agents to read efficiently, advertise a SOM endpoint via [SOM Directives in robots.txt](https://somspec.org/directives). The five-line opt-in tells any compatible agent to fetch a structured representation of your pages instead of the full HTML rendering. Verify that your site is SOM-ready at [somready.com](https://somready.com).
+
+## License
+
+`haystack-plasmate` is open source under the Apache 2.0 License. The Plasmate engine itself is also Apache 2.0.
+
+The SOM/1.0 specification is hosted under the W3C Web Content for AI Community Group at [somspec.org](https://somspec.org).