An API for textual link summaries, powered by Chrome and optional third-party AI

chatmud/summarizer

Summarizer

This is the link preview API used by ChatMUD. Given a URL, it returns structured metadata including title, description, images, author, and optionally an AI-generated summary.

What it does

  • Extracts Open Graph, JSON-LD, and standard meta tags from web pages
  • Handles YouTube and Spotify URLs via their oEmbed APIs
  • Renders JavaScript-heavy sites (Twitter, TikTok, Instagram, Reddit, etc.) with a headless browser
  • Detects paywalls, attempts to bypass them, and falls back to Wayback Machine archives if that fails
  • Identifies non-HTML content (PDFs, images, documents) and returns file metadata
  • Generates short summaries using an OpenAI-compatible API
  • Caches results in Redis

Quick start

cp .env.example .env
# Edit .env to set REDIS_PASSWORD
docker compose up -d

The API runs on port 8005.

Endpoints

GET /health

Returns service status. Checks Redis connectivity.

{"status": "healthy", "message": "All services operational"}

GET /preview

Generates a link preview.

Query parameters:

  • url (required) — The URL to preview. Max 2048 characters.
  • force_refresh (optional) — Bypass cache and fetch fresh data. Default: false
  • summarizer (optional) — Include an AI-generated summary. Default: false

Example:

GET /preview?url=https://example.com/article&summarizer=true

Response:

{
  "status": "success",
  "url": "https://example.com/article",
  "title": "Article Title",
  "description": "The article description from meta tags.",
  "image": "https://example.com/image.jpg",
  "favicon": "https://example.com/favicon.ico",
  "author": "Jane Smith",
  "keywords": ["news", "tech"],
  "language": "en",
  "metadata": {
    "opengraph": {"og:type": "article", "og:title": "Article Title"},
    "json_ld": {"@type": "Article", "headline": "Article Title"},
    "oembed": null,
    "spotify": null
  },
  "summary": "Two to three sentence summary of the content."
}

The summary field only appears when summarizer=true.
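
As a minimal client sketch for the endpoint above: the helper names and the base URL (the quick-start port on localhost) are assumptions for illustration, not part of the project. The target URL has to be percent-encoded when passed as a query parameter, which `urlencode` handles.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the service is running locally on the quick-start port.
BASE_URL = "http://localhost:8005"


def build_preview_url(target, summarizer=False, force_refresh=False, base_url=BASE_URL):
    """Build a GET /preview URL with percent-encoded query parameters."""
    if len(target) > 2048:
        raise ValueError("url exceeds the documented 2048-character limit")
    params = {"url": target}
    # Both flags default to false on the server, so only send them when set.
    if summarizer:
        params["summarizer"] = "true"
    if force_refresh:
        params["force_refresh"] = "true"
    return f"{base_url}/preview?{urllib.parse.urlencode(params)}"


def fetch_preview(target, **options):
    """Call the API and return the decoded JSON body (hypothetical helper)."""
    request_url = build_preview_url(target, **options)
    with urllib.request.urlopen(request_url, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_preview("https://example.com/article", summarizer=True)` against a running instance should return a dictionary shaped like the response above; read `summary` with `.get("summary")`, since the field is absent unless requested.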

Configuration

Set these in .env:

Variable               Default    Description
REDIS_HOST             localhost  Redis hostname
REDIS_PORT             6379       Redis port
REDIS_DB               0          Redis database number
REDIS_PASSWORD         (none)     Redis password
CACHE_TTL              300        Cache lifetime in seconds
RATE_LIMIT_PER_MINUTE  20         Requests per minute per IP
OPENAI_API_KEY         (none)     API key for summarization
OPENAI_BASE_URL        (none)     OpenAI-compatible endpoint
GPT_MODEL              (none)     Model name for summaries
DEFAULT_TIMEOUT        30         Request timeout in seconds
Configure OPENAI_API_KEY, OPENAI_BASE_URL, and GPT_MODEL to enable the summarization feature. Any OpenAI-compatible API will work.
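
For reference, the variables and defaults in the table could be read along these lines. This is a sketch only: the `Settings` class and both function names are hypothetical, and treating summarization as enabled only when all three OpenAI variables are set is an assumption about how the service interprets them.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    redis_host: str
    redis_port: int
    redis_db: int
    redis_password: Optional[str]
    cache_ttl: int
    rate_limit_per_minute: int
    openai_api_key: Optional[str]
    openai_base_url: Optional[str]
    gpt_model: Optional[str]
    default_timeout: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_db=int(env.get("REDIS_DB", "0")),
        redis_password=env.get("REDIS_PASSWORD") or None,
        cache_ttl=int(env.get("CACHE_TTL", "300")),
        rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "20")),
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        openai_base_url=env.get("OPENAI_BASE_URL") or None,
        gpt_model=env.get("GPT_MODEL") or None,
        default_timeout=int(env.get("DEFAULT_TIMEOUT", "30")),
    )


def summarization_enabled(settings):
    """Assumption: summaries need the key, the endpoint, and the model name."""
    return all([settings.openai_api_key, settings.openai_base_url, settings.gpt_model])
```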

URL handling

Standard websites — Fetches HTML via HTTP client, extracts metadata with BeautifulSoup and extruct.
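
To illustrate the kind of extraction involved, here is a sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup and extruct, so it runs without extra dependencies. The class and function names are hypothetical, and preferring Open Graph values over standard tags is an assumption, not a documented rule of this service.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the <title> text and <meta> name/property pairs."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses property="og:..."; standard tags use name="...".
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content is not None:
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def extract_preview(html):
    """Return a small preview dict, preferring Open Graph values (an assumption)."""
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta
    return {
        "title": meta.get("og:title") or (parser.title or "").strip() or None,
        "description": meta.get("og:description") or meta.get("description"),
        "image": meta.get("og:image"),
        "author": meta.get("author"),
    }
```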

JavaScript-required sites — Vimeo, TikTok, Twitter/X, Instagram, Facebook, LinkedIn, Reddit, Medium, and Substack get rendered with Playwright's Chromium browser.
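
One way to make the routing decision is a host check like the sketch below. The domain set mirrors the list above, but the matching rule (exact host or any subdomain) and the function name are assumptions rather than the project's actual logic.

```python
from urllib.parse import urlsplit

# Domains taken from the list above; twitter.com and x.com both cover Twitter/X.
BROWSER_DOMAINS = {
    "vimeo.com", "tiktok.com", "twitter.com", "x.com", "instagram.com",
    "facebook.com", "linkedin.com", "reddit.com", "medium.com", "substack.com",
}


def needs_browser(url):
    """Return True if the URL's host is one of the listed domains or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in BROWSER_DOMAINS)
```

Matching on a leading dot keeps lookalike hosts such as `notreddit.com` out, while still catching `www.reddit.com` and per-author Substack subdomains.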

YouTube — Calls the YouTube oEmbed API directly. Falls back to browser rendering if that fails.

Spotify — Calls the Spotify oEmbed API for tracks, albums, playlists, artists, shows, and episodes. Extracts additional metadata (artists, duration, release date) from the embed HTML.
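
Both providers expose public oEmbed endpoints that take the page URL as a `url` query parameter. The sketch below only builds the request URL; the host table and function name are illustrative assumptions, and the parsing of Spotify's embed HTML described above is not shown.

```python
from urllib.parse import urlencode, urlsplit

# Host (without a leading "www.") -> (oEmbed endpoint, extra query parameters).
OEMBED_ENDPOINTS = {
    "youtube.com": ("https://www.youtube.com/oembed", {"format": "json"}),
    "youtu.be": ("https://www.youtube.com/oembed", {"format": "json"}),
    "open.spotify.com": ("https://open.spotify.com/oembed", {}),
}


def oembed_request_url(url):
    """Return the oEmbed request URL for a supported link, or None otherwise."""
    host = (urlsplit(url).hostname or "").lower().removeprefix("www.")
    entry = OEMBED_ENDPOINTS.get(host)
    if entry is None:
        return None
    endpoint, extra = entry
    return f"{endpoint}?{urlencode({'url': url, **extra})}"
```

A caller would fetch the returned URL and, for YouTube, fall back to browser rendering when the request fails, as described above.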

Non-HTML content — PDFs, images, and other files return the filename, MIME type, file size, and last-modified date.
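
These four fields can typically be derived from the URL path and the response headers alone. The following is a sketch under that assumption; the function name and output keys are hypothetical and may not match the API's actual field names.

```python
import mimetypes
import posixpath
from email.utils import parsedate_to_datetime
from urllib.parse import unquote, urlsplit

HTML_TYPES = {"text/html", "application/xhtml+xml"}


def file_metadata(url, headers):
    """Derive filename, MIME type, size, and last-modified date from a URL and its headers."""
    headers = {key.lower(): value for key, value in headers.items()}
    filename = unquote(posixpath.basename(urlsplit(url).path)) or None

    # Drop parameters such as "; charset=utf-8" from the Content-Type value.
    mime_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if not mime_type and filename:
        # Fall back to guessing from the file extension.
        mime_type = mimetypes.guess_type(filename)[0] or ""

    size = headers.get("content-length", "")
    last_modified = None
    if headers.get("last-modified"):
        try:
            last_modified = parsedate_to_datetime(headers["last-modified"]).isoformat()
        except (TypeError, ValueError):
            last_modified = None  # Unparseable date header.

    return {
        "filename": filename,
        "mime_type": mime_type or None,
        "size_bytes": int(size) if size.isdigit() else None,
        "last_modified": last_modified,
        "is_html": mime_type in HTML_TYPES,
    }
```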

Paywalled content — Detects common paywall patterns, attempts to remove overlay elements, and falls back to Wayback Machine if content remains inaccessible.
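
The fallback step could look like the sketch below, which assumes the Wayback Machine availability API at `archive.org/wayback/available`. How this service actually queries the archive is not stated here, and the function names are hypothetical.

```python
from urllib.parse import urlencode

WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def wayback_lookup_url(url):
    """Build the availability query for the given page URL."""
    return f"{WAYBACK_AVAILABILITY}?{urlencode({'url': url})}"


def closest_snapshot(payload):
    """Return the archived snapshot URL from a decoded availability response, or None."""
    closest = (payload.get("archived_snapshots") or {}).get("closest") or {}
    # Only accept snapshots that are available and were archived with HTTP 200.
    if closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None
```

A caller would fetch `wayback_lookup_url(page_url)`, decode the JSON body, and run the usual metadata extraction on the snapshot URL when `closest_snapshot` returns one.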

Rate limiting

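Requests are limited per IP address. Cached responses cost 0.2 against the limit; fresh fetches cost 1.0. With the default RATE_LIMIT_PER_MINUTE of 20, that allows up to 20 fresh fetches or 100 cached responses per minute. When the limit is exceeded, the API returns: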

{
  "status": "failure",
  "message": "Rate limit exceeded. Please try again later.",
  "url": "https://example.com"
}
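
The weighted accounting can be sketched as a sliding-window limiter. This version keeps state in memory and uses a 60-second window; both are assumptions, since where the service stores its counters and how it defines the window are not described here. The class and function names are hypothetical.

```python
import time
from collections import defaultdict, deque

CACHED_COST = 0.2
FRESH_COST = 1.0


class WeightedRateLimiter:
    """Sliding-window limiter where cached hits cost less than fresh fetches."""

    def __init__(self, limit_per_minute=20.0, window=60.0, clock=time.monotonic):
        self.limit = limit_per_minute
        self.window = window
        self.clock = clock
        self._events = defaultdict(deque)  # ip -> deque of (timestamp, cost)

    def allow(self, ip, cached):
        cost = CACHED_COST if cached else FRESH_COST
        now = self.clock()
        events = self._events[ip]
        # Drop events that have aged out of the window.
        while events and now - events[0][0] >= self.window:
            events.popleft()
        used = sum(event_cost for _, event_cost in events)
        # Small tolerance so repeated 0.2 costs are not rejected by float rounding.
        if used + cost > self.limit + 1e-9:
            return False
        events.append((now, cost))
        return True


def rate_limit_response(url):
    """Build the failure body shown above."""
    return {
        "status": "failure",
        "message": "Rate limit exceeded. Please try again later.",
        "url": url,
    }
```

Because the cost depends on whether the response is cached, a caller would check the cache first and then call `allow(ip, cached=...)` before doing any expensive work.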

Security

  • Only http:// and https:// schemes allowed
  • Localhost and private IP ranges blocked
  • Security headers set on all responses (HSTS, CSP, X-Frame-Options, etc.)
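
The first two rules amount to a scheme check plus an address check on whatever the hostname resolves to. Below is a minimal sketch using the standard `ipaddress` module; the function names and the exact set of blocked address classes are assumptions, not the project's implementation.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip_text):
    """Return True for addresses outside loopback, private, and other special ranges."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def validate_url(url, resolve=socket.getaddrinfo):
    """Accept only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host or host.lower() == "localhost":
        return False
    try:
        # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        results = resolve(host, None)
    except socket.gaierror:
        return False
    return bool(results) and all(is_public_address(item[4][0]) for item in results)
```

A check like this only covers the initial URL. Redirect targets would need the same validation, and resolving once here while the HTTP client resolves again later leaves room for DNS rebinding, so a production version would pin the resolved address.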

Running without Docker

Requires Python 3.11+ and a running Redis instance.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Install browser
uv run playwright install chromium

# Run the server
uv run uvicorn api.main:app --host 0.0.0.0 --port 8005

License

MIT
