Problem: AI coding agents are blind β they write a UI, chart, SVG or PDF and never see the result, shipping breakage they can't perceive.
Result: AgentVision gives them eyes β render β see β report β fix β catching overflow, low contrast, clipped/truncated text (incl. SVG labels), broken images and typos.
So your agent self-corrects before it claims done.
AgentVision is a provider-agnostic framework that closes the visual feedback loop for AI coding agents:
render β perceive β report β (agent fixes) β re-render β diff
It is not human-reviewed visual regression (Percy/Applitools/Argos) and not browser
automation (browser-use/Playwright). It is a machine-graded visual critique loop an agent
consumes to self-correct before claiming done β with a verdict (pass/warn/fail) and
actionable, coordinate-grounded issues.
pip install "agentvision[render]"
playwright install chromium # see `agentvision doctor` if Chromium won't launch
agentvision demo # no API key requiredagentvision demo renders a deliberately broken page, prints a FAIL report (overflow +
low-contrast + a 404 image β all DOM/CV-grounded, no LLM key needed), then loops against the
fixed version and prints "what changed: 3 issues resolved β PASS." That command is the
product.
Findings are grounded in sources we can actually trust:
- DOM geometry (
getBoundingClientRect+ scroll offset) β precise element boxes. - Computed-style contrast (
getComputedStyle) β real WCAG ratios, with aconfidenceflag (it degrades honestly over gradients/images/pseudo-elements rather than lying). - OCR word boxes (Tesseract) β precise text locations.
- Console / network / 4xx capture β the #1 "looks fine in code, broken live" cause.
A vision LLM (Claude/OpenAI/Gemini) adds semantic critique on top. Its pixel boxes are
treated as advisory (bbox_precise: false), never marketed as pixel-accurate.
Full-coverage vision. On a large artifact the model gets a downscaled overview plus full-resolution tiles covering it, so fine detail and small text aren't lost to downscaling. It's pixel-based and source-agnostic β the same coverage applies to HTML, a flat image, or a PDF page, not just elements the DOM enumerates.
Point any command at a PDF or Office/OpenDocument file (.docx/.pptx/.xlsx/.odt/β¦) and it's
rasterized per page and graded like a screenshot. PowerPoint decks also get an offline slide
inspector β agentvision check deck.pptx runs key-free and no-egress, flagging unreadable
text (low / dark-on-dark contrast), clipped/truncated text, off-slide shapes and overlapping
boxes, each tagged [slide N].
Processing something confidential? Add --no-cache to any source command to render in a
throwaway temp dir that's wiped on exit β nothing touches ~/.cache/agentvision:
agentvision check confidential-deck.pptx --no-cache --backend local # nothing cached, nothing leaves the boxSee docs/security.md for the full model (path confinement, file_root, the
ephemeral cache, and the renderer trust boundary).
A typo-free, well-laid-out artifact can still be the wrong thing β an infographic that shows the wrong stages, a page missing the panel you asked for, a generated image that ignored half the prompt. Give AgentVision the intent and it grades the render against it, so PASS means "matches what I set out to build," not merely "defect-free":
# Does the render match the thought? (text claims grade deterministically via OCR)
agentvision conform ./infographic.png \
--brief "launch infographic for AgentVision" \
--expect 'must: title reads "AgentVision"' \
--expect 'should: shows 4 stages left to right'For AI-generated artifacts the fix is a better prompt, not code β so the generative loop generate β see β grade vs intent β refine prompt β regenerate runs until it matches. The image generator is a hook you supply; AgentVision never bundles an image-gen dependency:
agentvision generate --generator mypkg.gen:make_image \
--brief "minimalist infographic, dark background, no typos" --max-iter 4 -o final.pngSee docs/conformance.md. Express intent three ways β a free-text
brief (eyes extract the checklist), an explicit checklist (--expect, deterministic),
or a reference image (--reference). Claims are must: / should: / nice:.
In anatomy the eyes are only the afferent half β the retina perceives, the optic nerve
carries the signal to the brain, the brain decides, the hand acts, the eyes look again.
AgentVision is that afferent pathway for an agent: it perceives and hands a clean signal back
to the brain (whatever does your reasoning/planning/memory) β it deliberately doesn't
decide for you. Any perception call distills to a Handoff:
agentvision analyze ./page.html --handoffnext_action (done / revise / review) drives the brain's loop; todo is the work-list;
open_questions is what perception couldn't confirm (never dropped). Available as
report.to_handoff(), the MCP perceive_handoff tool, POST /handoff, and a handoff.json
per loop iteration β provider- and brain-agnostic. See docs/handoff.md.
AgentVision is the eyes. It pairs with Verel, the brain β an agent framework where nothing is "done" until a grader returns a verdict. The eyes perceive and grade intent; the brain decides with attestation and compounds only verified work into memory; then the eyes look again.
They ship and version independently (pip install agentvision, pip install verel) yet work
in sync: AgentVision plugs into Verel as its verel.senses perception organ β mapped onto a
unified verdict bus (vision alongside tests, lint and types), with intent conformance
recorded in the brain's memory each iteration. Since 0.9.0 both speak one language: the
Report/Handoff types come from the shared agentsensory
contract, so a graded Report drops onto that bus with no per-organ translation. AgentVision
stays brain-agnostic; Verel is the reference brain. See docs/handoff.md.
| Surface | Who it's for |
|---|---|
Library (import agentvision) |
Python apps, custom harnesses |
CLI (agentvision β¦) |
Any agent that can run a shell command; CI |
| Claude Code Skill | Claude agents β auto-invokes the loop before claiming done |
MCP server (agentvision-mcp) |
Cursor, Claude, any MCP-capable host |
REST service (agentvision-serve) |
Non-MCP / networked / CI agents |
| Integration recipes | Cursor rules, Aider, generic "agent contract" |
β οΈ "Provider-agnostic" describes the API surface, not behavior. The framework can't force a non-Claude agent into the loop β it gives every agent the means. The Claude Code Skill is the one surface that makes an agent use it proactively; MCP is the first-class cross-host path; the recipes cover the rest.
One agent with eyes self-corrects. A swarm of agents sharing one set of eyes is the real
prize β dozens of workers each rendering UIs, charts, decks or PDFs, every output graded against
the same contract before it counts as done. Run the eyes as a horizontally-scaled service
(agentvision serve) or embed the library per worker; the single-shot endpoints
(analyze/check/conform) are stateless and scale with zero coordination. The one piece of
state to mind is the loop session β kept in-process, so behind multiple workers keep loops
client-side or sticky-route them. And because every worker returns the same agentsensory
Report/Handoff, a coordinator (or a brain like Verel)
aggregates all the verdicts on one bus β vision graded alongside tests, lint and types.
See Swarms & scaling for the topologies, the stateless/stateful split, and a fan-out example.
Pluggable and selectable via --backend / AGENTVISION_VISION_BACKEND:
anthropic(default modelclaude-haiku-4-5, upgradable to Sonnet/Opus)openai,geminilocalβ CV/OCR heuristics only, no API key, no egress (great for CI / air-gapped)
pip install "agentvision[all]" # everything
pip install "agentvision[render]" # just rendering + the no-key local loop
pip install "agentvision[render,anthropic]" # + Claude analysisSystem dependencies (Chromium, Tesseract, poppler) and a doctor that checks them:
agentvision doctor # attempts a real Chromium launch; lists every missing lib
agentvision doctor --fix # installs the Chromium browser binaryOn a bare RHEL/CentOS box, playwright install-deps does not work (apt-only). See
docs/quickstart.md for the dnf line, or use the bundled
Dockerfile which bakes the deps in.
# Analyze a file/URL/HTML string and print a structured report
agentvision analyze ./index.html --backend local --json
# Run the self-correcting loop
agentvision loop ./dashboard.html --max-iter 3
# Responsive contact sheet across breakpoints
agentvision sheet ./index.html --breakpoints 375,768,1280,1920
# Visual regression against a named baseline
agentvision baseline ./index.html --name home
agentvision regress ./index.html --name homeLive pages, SPAs & dashboards (polling, websockets, canvas/WebGL):
# localhost dev server, wait for the data to render, freeze animation, machine output
agentvision analyze http://localhost:5173 --allow-local \
--wait-for "#dashboard" --settle-ms 800 --quietStreaming / video / over-time behavior β watch, don't just glance:
# Is the video actually playing? Did loading finish? Are captions on?
agentvision watch https://app.example.com/player --frames 6 --interval-ms 500 \
--expect 'must: the video is playing'watch reads deterministic <video> state (currentTime/readyState/captions) + pixel
liveness/stall/black-frame detection, then adds a time-aware vision pass. See
docs/use-cases/streaming.md.
--nav-wait defaults to load (polling pages never go idle); --freeze (default on) pauses
animations + requestAnimationFrame so canvas/WebGL pages capture without hanging; --quiet
prints only JSON (logs to stderr, exit codes 0 pass/warn Β· 2 fail Β· 3 error).
Library:
import asyncio
from agentvision import load_settings
from agentvision.core.loop import LoopSession
async def main():
settings = load_settings(vision_backend="local")
session = LoopSession("examples/broken_layout.html", settings=settings)
result = await session.iterate()
print(result.report.verdict, [i.message for i in result.report.issues])
asyncio.run(main())# CI gate (GitHub Action): fails the build on a visual FAIL verdict
- uses: amitpatole/agent-vision@v0.10.0
with: { source: dist/index.html, command: check, args: --full-page }- CI / pre-commit / Makefile β shell out; exit codes
0 pass/warn Β· 2 fail Β· 3 error,--quietfor JSON-only output. Reusable GitHub Action + pre-commit hook included. - Your agents β drop
integrations/agent-contract.mdinto the system prompt, use the Claude Code Skill, or the MCP tools (Cursor/Claude/any host).
Full guide: docs/integrations.md.
π Full docs site: amitpatole.github.io/agent-vision
- Quickstart Β· The Loop Β· Conformance Β· Handoff (eyesβbrain) Β· Streaming / temporal Β· Backends Β· Adapters Β· Integrations Β· Vision
- Pixel-accurate vision-model bounding boxes (they're advisory).
- WCAG verdicts on rasterized non-HTML (heuristic only).
- Bit-reproducible screenshots / deterministic LLM reports.
- Uniform provider-agnostic behavior (only the API surface is uniform).
MIT Β© Amit Patole


{ "perceived": "fail", "next_action": "revise", "matches_intent": false, "todo": ["[overflow] hero text overflows on the right", "[intent/must] a \"Checkout\" button is visible"], "open_questions": ["Verify: uses the brand's dark theme"] }