AgentVision — Eyes for AI Agents 👁️

Problem: AI coding agents are blind — they write a UI, chart, SVG or PDF and never see the result, shipping breakage they can't perceive.
Result: AgentVision gives them eyes — render → see → report → fix — catching overflow, low contrast, clipped/truncated text (incl. SVG labels), broken images and typos.
So your agent self-corrects before it claims done.

AgentVision is a provider-agnostic framework that closes the visual feedback loop for AI coding agents:

render → perceive → report → (agent fixes) → re-render → diff

It is not human-reviewed visual regression (Percy/Applitools/Argos) and not browser automation (browser-use/Playwright). It is a machine-graded visual critique loop an agent consumes to self-correct before claiming done — with a verdict (pass/warn/fail) and actionable, coordinate-grounded issues.

The 60-second pitch

pip install "agentvision[render]"
playwright install chromium     # see `agentvision doctor` if Chromium won't launch
agentvision demo                # no API key required

agentvision demo renders a deliberately broken page, prints a FAIL report (overflow + low-contrast + a 404 image — all DOM/CV-grounded, no LLM key needed), then loops against the fixed version and prints "what changed: 3 issues resolved → PASS." That command is the product.
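The "what changed" line is, at heart, a set difference between two reports. A minimal sketch of that idea follows; the `Issue` shape and `diff_issues` helper are hypothetical illustrations, not AgentVision's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Issue:
    # Hypothetical issue shape, for illustration only.
    kind: str    # e.g. "overflow", "contrast", "broken-image"
    target: str  # an element identifier such as a CSS selector


def diff_issues(before, after):
    """Split issues into resolved, introduced and remaining sets."""
    before_set, after_set = set(before), set(after)
    return {
        "resolved": sorted(before_set - after_set),
        "introduced": sorted(after_set - before_set),
        "remaining": sorted(before_set & after_set),
    }


broken_page = [
    Issue("overflow", "h1.hero"),
    Issue("contrast", "p.caption"),
    Issue("broken-image", "img.logo"),
]
fixed_page = []

change = diff_issues(broken_page, fixed_page)
print(f"what changed: {len(change['resolved'])} issues resolved")
```

Tracking "introduced" separately matters in a fix loop: a change that resolves one issue but creates another should not read as progress.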

What makes it trustworthy

Findings are grounded in sources we can actually trust:

DOM geometry (getBoundingClientRect + scroll offset) — precise element boxes.
Computed-style contrast (getComputedStyle) — real WCAG ratios, with a confidence flag (it degrades honestly over gradients/images/pseudo-elements rather than lying).
OCR word boxes (Tesseract) — precise text locations.
Console / network / 4xx capture — the #1 "looks fine in code, broken live" cause.
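For reference, the WCAG contrast ratio mentioned above is a fixed formula over relative luminance. The sketch below implements the WCAG 2.x definition for two opaque sRGB colors; it illustrates the math only and is not AgentVision's code, and it ignores exactly the cases (gradients, images, transparency) where the confidence flag would drop.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Return the WCAG contrast ratio, from 1.0 (identical) to 21.0 (black/white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(fg, bg):
    # WCAG AA requires at least 4.5:1 for normal-size text.
    return contrast_ratio(fg, bg) >= 4.5


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # 21.0
print(passes_aa_normal_text((0x77, 0x77, 0x77), (255, 255, 255)))  # False
```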

A vision LLM (Claude/OpenAI/Gemini) adds semantic critique on top. Its pixel boxes are treated as advisory (bbox_precise: false), never marketed as pixel-accurate.

Full-coverage vision. On a large artifact the model gets a downscaled overview plus full-resolution tiles covering it, so fine detail and small text aren't lost to downscaling. It's pixel-based and source-agnostic — the same coverage applies to HTML, a flat image, or a PDF page, not just elements the DOM enumerates.
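One way to picture the overview-plus-tiles scheme is as a grid of overlapping crop boxes alongside a single downscale factor. This is a sketch under assumed parameters: the tile size, overlap, and `max_side` values are illustrative, not AgentVision's defaults.

```python
def overview_scale(width, height, max_side=1568):
    """Scale factor for the downscaled overview (never upscales)."""
    return min(1.0, max_side / max(width, height))


def tile_boxes(width, height, tile=1024, overlap=64):
    """Return (left, top, right, bottom) boxes that together cover the image.

    Neighbouring tiles share `overlap` pixels so text on a seam is
    fully visible in at least one tile.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    step = tile - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((left, top, right, bottom))
            if right >= width:
                break
            left += step
        if bottom >= height:
            break
        top += step
    return boxes


boxes = tile_boxes(2000, 1500)
print(len(boxes), "tiles, overview scale", round(overview_scale(2000, 1500), 3))
```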

Documents, decks & confidential inputs

Point any command at a PDF or Office/OpenDocument file (.docx/.pptx/.xlsx/.odt/…) and it's rasterized per page and graded like a screenshot. PowerPoint decks also get an offline slide inspector — agentvision check deck.pptx runs key-free and no-egress, flagging unreadable text (low / dark-on-dark contrast), clipped/truncated text, off-slide shapes and overlapping boxes, each tagged [slide N].
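Off-slide and overlap checks of this kind reduce to rectangle geometry, which is why they can run offline. The sketch below shows the idea with a hypothetical `Box` type and message format; only the `[slide N]` tag comes from the description above.

```python
from itertools import combinations
from typing import NamedTuple


class Box(NamedTuple):
    # Hypothetical shape record; units are arbitrary but consistent.
    name: str
    left: float
    top: float
    width: float
    height: float


def is_off_slide(box, slide_w, slide_h):
    """True if any part of the box lies outside the slide bounds."""
    return (box.left < 0 or box.top < 0
            or box.left + box.width > slide_w
            or box.top + box.height > slide_h)


def overlaps(a, b):
    """True if the two boxes share interior area (touching edges do not count)."""
    return (a.left < b.left + b.width and b.left < a.left + a.width
            and a.top < b.top + b.height and b.top < a.top + a.height)


def inspect_slide(boxes, slide_w, slide_h, slide_no):
    findings = [f"[slide {slide_no}] off-slide shape: {b.name}"
                for b in boxes if is_off_slide(b, slide_w, slide_h)]
    findings += [f"[slide {slide_no}] overlapping boxes: {a.name} / {b.name}"
                 for a, b in combinations(boxes, 2) if overlaps(a, b)]
    return findings


shapes = [
    Box("title", 40, 30, 600, 80),
    Box("body", 40, 90, 600, 300),
    Box("logo", 900, 500, 120, 80),
]
for line in inspect_slide(shapes, slide_w=960, slide_h=540, slide_no=3):
    print(line)
```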

Processing something confidential? Add --no-cache to any source command to render in a throwaway temp dir that's wiped on exit — nothing touches ~/.cache/agentvision:

agentvision check confidential-deck.pptx --no-cache --backend local   # nothing cached, nothing leaves the box
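The throwaway-directory behaviour can be pictured with the standard library's `tempfile.TemporaryDirectory`, which deletes its contents when the block exits. This is an illustration of the pattern only; `render_ephemeral` and the file name are hypothetical, and how AgentVision implements `--no-cache` internally is not shown here.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def render_ephemeral(payload: bytes) -> Tuple[int, Path]:
    """Write intermediate output to a temp dir that is wiped on exit."""
    with tempfile.TemporaryDirectory(prefix="ephemeral-render-") as tmp:
        work_dir = Path(tmp)
        page = work_dir / "page-1.png"  # stand-in for a rasterized page
        page.write_bytes(payload)
        size = page.stat().st_size      # use the file while it exists
    # Leaving the `with` block removed the directory and everything in it.
    return size, work_dir


size, work_dir = render_ephemeral(b"fake raster bytes")
print(size, work_dir.exists())
```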

See docs/security.md for the full model (path confinement, file_root, the ephemeral cache, and the renderer trust boundary).

Match the intent, not just avoid defects

A typo-free, well-laid-out artifact can still be the wrong thing — an infographic that shows the wrong stages, a page missing the panel you asked for, a generated image that ignored half the prompt. Give AgentVision the intent and it grades the render against it, so PASS means "matches what I set out to build," not merely "defect-free":

# Does the render match the thought? (text claims grade deterministically via OCR)
agentvision conform ./infographic.png \
  --brief "launch infographic for AgentVision" \
  --expect 'must: title reads "AgentVision"' \
  --expect 'should: shows 4 stages left to right'

For AI-generated artifacts the fix is a better prompt, not code, so the generative loop (generate → see → grade vs intent → refine prompt → regenerate) runs until the render matches the intent or --max-iter is reached. The image generator is a hook you supply; AgentVision never bundles an image-gen dependency:

agentvision generate --generator mypkg.gen:make_image \
  --brief "minimalist infographic, dark background, no typos" --max-iter 4 -o final.png
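A `package.module:callable` string like the one passed to --generator is a common way to name a hook. The sketch below shows how such a spec is typically resolved with `importlib`; it is an assumption about the general pattern, not a description of AgentVision's loader, and the hook's required signature is not specified here. A standard-library callable stands in for a real generator.

```python
import importlib
from typing import Callable


def resolve_hook(spec: str) -> Callable:
    """Resolve 'package.module:attr' (or 'package.module:obj.attr') to a callable."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'package.module:callable', got {spec!r}")
    target = importlib.import_module(module_name)
    for part in attr_path.split("."):
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return target


# Demonstrated with a stdlib function so the example runs anywhere.
dumps = resolve_hook("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```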

See docs/conformance.md. Express intent in three ways: a free-text brief (the eyes extract the checklist), an explicit checklist (--expect, deterministic), or a reference image (--reference). Each claim carries a priority prefix: must:, should: or nice:.
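The prefix syntax is simple enough to parse by hand when building an --expect checklist programmatically. A minimal sketch, assuming the prefix is case-insensitive and separated by a colon; the `Claim` type and `parse_claim` helper are hypothetical.

```python
import re
from dataclasses import dataclass

_CLAIM = re.compile(r"^\s*(must|should|nice)\s*:\s*(.+?)\s*$", re.IGNORECASE)


@dataclass(frozen=True)
class Claim:
    priority: str  # "must", "should" or "nice"
    text: str      # the statement to verify against the render


def parse_claim(raw: str) -> Claim:
    match = _CLAIM.match(raw)
    if match is None:
        raise ValueError(f"claim needs a must:/should:/nice: prefix: {raw!r}")
    return Claim(priority=match.group(1).lower(), text=match.group(2))


claims = [
    parse_claim('must: title reads "AgentVision"'),
    parse_claim("should: shows 4 stages left to right"),
]
for claim in claims:
    print(claim.priority, "->", claim.text)
```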

Eyes → brain: the handoff

In anatomy the eyes are only the afferent half — the retina perceives, the optic nerve carries the signal to the brain, the brain decides, the hand acts, the eyes look again. AgentVision is that afferent pathway for an agent: it perceives and hands a clean signal back to the brain (whatever does your reasoning/planning/memory) — it deliberately doesn't decide for you. Any perception call distills to a Handoff:

agentvision analyze ./page.html --handoff

{ "perceived": "fail", "next_action": "revise", "matches_intent": false,
  "todo": ["[overflow] hero text overflows on the right",
           "[intent/must] a \"Checkout\" button is visible"],
  "open_questions": ["Verify: uses the brand's dark theme"] }

next_action (done / revise / review) drives the brain's loop; todo is the work-list; open_questions is what perception couldn't confirm (never dropped). Available as report.to_handoff(), the MCP perceive_handoff tool, POST /handoff, and a handoff.json per loop iteration — provider- and brain-agnostic. See docs/handoff.md.
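On the consuming side, a brain only needs to branch on `next_action`. The sketch below parses the example Handoff shown above and turns it into a plan; `plan_next_step` and its return shape are hypothetical, and treating `review` as "needs a second look" is an assumption, since the exact semantics live in docs/handoff.md.

```python
import json

RAW_HANDOFF = r'''
{ "perceived": "fail", "next_action": "revise", "matches_intent": false,
  "todo": ["[overflow] hero text overflows on the right",
           "[intent/must] a \"Checkout\" button is visible"],
  "open_questions": ["Verify: uses the brand's dark theme"] }
'''


def plan_next_step(handoff: dict) -> dict:
    """Map a Handoff onto what the agent should do next."""
    action = handoff["next_action"]
    if action not in {"done", "revise", "review"}:
        raise ValueError(f"unknown next_action: {action!r}")
    return {
        "stop": action == "done",
        # Only a "revise" verdict hands back a work-list to act on.
        "work_list": list(handoff.get("todo", [])) if action == "revise" else [],
        # Assumption: "review" means someone or something else should look.
        "needs_second_look": action == "review",
        # Open questions are always carried forward, never dropped.
        "unconfirmed": list(handoff.get("open_questions", [])),
    }


plan = plan_next_step(json.loads(RAW_HANDOFF))
for item in plan["work_list"]:
    print("TODO", item)
```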

Eyes & Brain — AgentVision × Verel

AgentVision is the eyes. It pairs with Verel, the brain — an agent framework where nothing is "done" until a grader returns a verdict. The eyes perceive and grade intent; the brain decides with attestation and compounds only verified work into memory; then the eyes look again.

They ship and version independently (pip install agentvision, pip install verel) yet work in sync: AgentVision plugs into Verel as its verel.senses perception organ — mapped onto a unified verdict bus (vision alongside tests, lint and types), with intent conformance recorded in the brain's memory each iteration. Since 0.9.0 both speak one language: the Report/Handoff types come from the shared agentsensory contract, so a graded Report drops onto that bus with no per-organ translation. AgentVision stays brain-agnostic; Verel is the reference brain. See docs/handoff.md.

Many faces, one core

| Surface | Who it's for |
| --- | --- |
| Library (`import agentvision`) | Python apps, custom harnesses |
| CLI (`agentvision …`) | Any agent that can run a shell command; CI |
| Claude Code Skill | Claude agents; auto-invokes the loop before claiming done |
| MCP server (`agentvision-mcp`) | Cursor, Claude, any MCP-capable host |
| REST service (`agentvision-serve`) | Non-MCP / networked / CI agents |
| Integration recipes | Cursor rules, Aider, generic "agent contract" |

⚠️ "Provider-agnostic" describes the API surface, not behavior. The framework can't force a non-Claude agent into the loop — it gives every agent the means. The Claude Code Skill is the one surface that makes an agent use it proactively; MCP is the first-class cross-host path; the recipes cover the rest.

Many agents, one set of eyes

One agent with eyes self-corrects. A swarm of agents sharing one set of eyes is the real prize: dozens of workers each rendering UIs, charts, decks or PDFs, every output graded against the same contract before it counts as done. Run the eyes as a horizontally scaled service (agentvision serve) or embed the library per worker; the single-shot endpoints (analyze/check/conform) are stateless and scale with zero coordination. The one piece of state to mind is the loop session, which is kept in-process: when the service runs behind multiple workers, keep loops client-side or sticky-route them. And because every worker returns the same agentsensory Report/Handoff, a coordinator (or a brain like Verel) can aggregate all the verdicts on one bus, with vision graded alongside tests, lint and types.
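Because the single-shot calls are stateless, a coordinator can fan them out concurrently and keep only the worst verdict. The sketch below uses a stub `grade` coroutine in place of a real call to the service or library; the function names, the report dictionary shape, and the concurrency limit are all hypothetical.

```python
import asyncio

# Higher number means a worse outcome.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


async def grade(artifact: str) -> dict:
    """Stub for one stateless grading call (e.g. an HTTP request per artifact)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    verdict = "fail" if "broken" in artifact else "pass"
    return {"artifact": artifact, "verdict": verdict}


async def grade_all(artifacts, max_concurrency=8):
    """Grade every artifact concurrently, bounded by a semaphore."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(artifact):
        async with gate:
            return await grade(artifact)

    return await asyncio.gather(*(guarded(a) for a in artifacts))


def worst_verdict(reports):
    return max((r["verdict"] for r in reports),
               key=SEVERITY.__getitem__, default="pass")


reports = asyncio.run(grade_all(["chart.svg", "broken_deck.pptx", "invoice.pdf"]))
print(worst_verdict(reports))
```

No shared state is needed between workers here, which is the property that lets the stateless endpoints scale without coordination; loop sessions are the exception noted above.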

See Swarms & scaling for the topologies, the stateless/stateful split, and a fan-out example.

Vision backends

Pluggable and selectable via --backend / AGENTVISION_VISION_BACKEND:

anthropic (default model claude-haiku-4-5, upgradable to Sonnet/Opus)
openai, gemini
local — CV/OCR heuristics only, no API key, no egress (great for CI / air-gapped)

Install

pip install "agentvision[all]"          # everything
pip install "agentvision[render]"       # just rendering + the no-key local loop
pip install "agentvision[render,anthropic]"  # + Claude analysis

System dependencies (Chromium, Tesseract, poppler) and a doctor that checks them:

agentvision doctor          # attempts a real Chromium launch; lists every missing lib
agentvision doctor --fix    # installs the Chromium browser binary

On a bare RHEL/CentOS box, playwright install-deps does not work (apt-only). See docs/quickstart.md for the dnf line, or use the bundled Dockerfile which bakes the deps in.

Usage

# Analyze a file/URL/HTML string and print a structured report
agentvision analyze ./index.html --backend local --json

# Run the self-correcting loop
agentvision loop ./dashboard.html --max-iter 3

# Responsive contact sheet across breakpoints
agentvision sheet ./index.html --breakpoints 375,768,1280,1920

# Visual regression against a named baseline
agentvision baseline ./index.html --name home
agentvision regress  ./index.html --name home

Live pages, SPAs & dashboards (polling, websockets, canvas/WebGL):

# localhost dev server, wait for the data to render, freeze animation, machine output
agentvision analyze http://localhost:5173 --allow-local \
  --wait-for "#dashboard" --settle-ms 800 --quiet

Streaming / video / over-time behavior — watch, don't just glance:

# Is the video actually playing? Did loading finish? Are captions on?
agentvision watch https://app.example.com/player --frames 6 --interval-ms 500 \
  --expect 'must: the video is playing'

watch reads deterministic <video> state (currentTime/readyState/captions) + pixel liveness/stall/black-frame detection, then adds a time-aware vision pass. See docs/use-cases/streaming.md.
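Pixel liveness, stall and black-frame checks can be approximated with two statistics per frame: mean brightness and mean absolute difference from the previous frame. The sketch below works on frames given as flat lists of 0-255 grayscale values; the thresholds and function names are illustrative assumptions, not AgentVision's implementation.

```python
def mean_brightness(frame):
    return sum(frame) / len(frame)


def mean_abs_diff(a, b):
    """Average per-pixel change between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify_frames(frames, black_below=8.0, motion_above=1.0):
    """Flag black frames and decide whether the sequence is live or stalled."""
    black = [i for i, f in enumerate(frames) if mean_brightness(f) < black_below]
    diffs = [mean_abs_diff(frames[i], frames[i + 1])
             for i in range(len(frames) - 1)]
    live = any(d >= motion_above for d in diffs)
    return {
        "black_frames": black,
        "live": live,
        # Stalled: at least two frames were sampled and nothing moved.
        "stalled": bool(diffs) and not live,
    }


playing = [[10, 10, 10, 10], [40, 40, 40, 40], [90, 90, 90, 90]]
frozen = [[50, 50, 50, 50]] * 3
print(classify_frames(playing))
print(classify_frames(frozen))
```

Pixel statistics alone cannot distinguish a paused video from a static scene, which is one reason to combine them with the deterministic `<video>` state described above.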

--nav-wait defaults to load (polling pages never go idle); --freeze (default on) pauses animations + requestAnimationFrame so canvas/WebGL pages capture without hanging; --quiet prints only JSON (logs to stderr, exit codes 0 pass/warn · 2 fail · 3 error).
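A harness that shells out to the CLI can branch on those exit codes directly. The sketch below wraps `agentvision analyze … --backend local --quiet` with `subprocess`; the command and flags are taken from the examples above, while `interpret_exit_code` and `run_analyze` are hypothetical helper names. Nothing is executed at import time, so the CLI only needs to be installed when `run_analyze` is actually called.

```python
import subprocess

# Exit codes as documented: 0 pass/warn, 2 fail, 3 error.
EXIT_MEANING = {0: "pass/warn", 2: "fail", 3: "error"}


def interpret_exit_code(returncode: int) -> str:
    return EXIT_MEANING.get(returncode, "unknown")


def run_analyze(source: str):
    """Run the CLI once and return (outcome, raw JSON text from stdout)."""
    proc = subprocess.run(
        ["agentvision", "analyze", source, "--backend", "local", "--quiet"],
        capture_output=True,
        text=True,
        check=False,  # a FAIL verdict is an expected outcome, not an exception
    )
    # With --quiet, stdout carries only JSON; logs go to stderr.
    return interpret_exit_code(proc.returncode), proc.stdout


print(interpret_exit_code(2))  # fail
```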

Library:

import asyncio

from agentvision import load_settings
from agentvision.core.loop import LoopSession


async def main():
    # "local" uses CV/OCR heuristics only: no API key, no network egress.
    settings = load_settings(vision_backend="local")
    session = LoopSession("examples/broken_layout.html", settings=settings)
    # Run a single loop iteration and read the graded report.
    result = await session.iterate()
    print(result.report.verdict, [i.message for i in result.report.issues])


asyncio.run(main())

Drop it into your workflow & your agents

# CI gate (GitHub Action): fails the build on a visual FAIL verdict
- uses: amitpatole/agent-vision@v0.10.0
  with: { source: dist/index.html, command: check, args: --full-page }

CI / pre-commit / Makefile — shell out; exit codes 0 pass/warn · 2 fail · 3 error, --quiet for JSON-only output. Reusable GitHub Action + pre-commit hook included.
Your agents — drop integrations/agent-contract.md into the system prompt, use the Claude Code Skill, or the MCP tools (Cursor/Claude/any host).

Full guide: docs/integrations.md.

Documentation

📖 Full docs site: amitpatole.github.io/agent-vision

Quickstart · The Loop · Conformance · Handoff (eyes→brain) · Streaming / temporal · Backends · Adapters · Integrations · Vision

What we do not claim (honesty)

Pixel-accurate vision-model bounding boxes (they're advisory).
WCAG verdicts on rasterized non-HTML (heuristic only).
Bit-reproducible screenshots / deterministic LLM reports.
Uniform provider-agnostic behavior (only the API surface is uniform).
