Awesome Visual Agent

A curated research index of visual agents: systems that perceive, ground, plan, act, create, and evaluate in visual environments.

Visual agents occupy the intersection of multimodal perception, grounded reasoning, tool use, interaction, and control. This repository curates papers, benchmarks, datasets, runtimes, and engineering resources for systems that close the loop between visual observation and purposeful action.

The list is selective rather than exhaustive. It prioritizes work that introduces a clear agent loop, action space, evaluation protocol, data engine, safety finding, or reusable implementation artifact. It excludes generic multimodal models and one-shot visual-generation systems that lack an agentic mechanism.
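To make "agent loop" and "action space" concrete, here is a minimal sketch of the observe-decide-act cycle that the listed systems share in some form. It is not taken from any paper in this index. `Action`, `ToyScreen`, `policy`, and `run_episode` are hypothetical names, and the observation is symbolic rather than pixel-based so that the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Action:
    """One element of a hypothetical discrete action space."""
    kind: str          # "click" or "stop" in this sketch
    target: str = ""   # name of the grounded element, if any


class Environment(Protocol):
    def observe(self) -> dict: ...
    def step(self, action: Action) -> None: ...
    def succeeded(self) -> bool: ...


class ToyScreen:
    """Stand-in environment: a 'screen' of named buttons with one goal button."""

    def __init__(self, buttons: list[str], goal: str) -> None:
        self.buttons = buttons
        self.goal = goal
        self.clicked: list[str] = []

    def observe(self) -> dict:
        # A real visual agent would receive pixels; this observation is symbolic.
        return {"visible": list(self.buttons), "clicked": list(self.clicked)}

    def step(self, action: Action) -> None:
        if action.kind == "click" and action.target in self.buttons:
            self.clicked.append(action.target)

    def succeeded(self) -> bool:
        # Execution-based check: success is read from environment state.
        return self.goal in self.clicked


def policy(observation: dict, instruction: str) -> Action:
    """Toy policy: ground the instruction to an unclicked visible element, else stop."""
    for name in observation["visible"]:
        if name.lower() in instruction.lower() and name not in observation["clicked"]:
            return Action("click", name)
    return Action("stop")


def run_episode(env: Environment, instruction: str, max_steps: int = 5) -> tuple[bool, list[Action]]:
    """Observe, choose an action, act, and repeat until 'stop' or the step budget ends."""
    trajectory: list[Action] = []
    for _ in range(max_steps):
        action = policy(env.observe(), instruction)
        trajectory.append(action)
        if action.kind == "stop":
            break
        env.step(action)
    return env.succeeded(), trajectory
```

For example, `run_episode(ToyScreen(["Save", "Cancel"], goal="Save"), "Press the save button")` clicks `Save` and then stops. Real systems replace the string-matching policy with a model and the symbolic observation with screenshots or camera frames, but the loop structure and the execution-based success check carry over.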

Selection Boundary

Included areas:

GUI, web, desktop, and mobile agents that perceive screens and produce executable actions.
Visual grounding work that is clearly tied to downstream agent control.
Embodied vision-language-action systems for robot manipulation, navigation, and physical-world interaction.
Agentic visual reasoning and generation systems with search, planning, memory, tools, critique, or iterative refinement.
Benchmarks, data engines, simulators, safety suites, and toolchains that support visual-agent construction and evaluation.

Excluded by default:

Broad multimodal foundation models with no visual-agent evaluation.
Generic OCR, captioning, visual question answering, or layout parsing without an action or agent setting.
Image, video, or 3D generators that are only prompt-in/artifact-out.
Unverified arXiv IDs, placeholder-looking entries, product rumors, and duplicate rows.
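Some of the exclusion rules above can be partly checked mechanically. The following is a sketch of such a check, not the repository's actual tooling. `Entry`, `review`, and the `has_agent_setting` flag are hypothetical, and the sample rows in the usage note are made-up placeholders. The regular expression only checks that an identifier follows the new-style arXiv format (`YYMM.NNNN` or `YYMM.NNNNN`, with an optional version suffix). It cannot confirm that the paper exists, so a well-formed ID is still unverified.

```python
import re
from dataclasses import dataclass

# New-style arXiv identifier: two-digit year, month 01-12, a dot,
# four or five digits, and an optional version suffix such as "v2".
ARXIV_ID = re.compile(r"^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?$")


@dataclass
class Entry:
    work: str
    arxiv_id: str
    has_agent_setting: bool  # hypothetical flag set by a human curator


def normalize(title: str) -> str:
    """Collapse case, spacing, and punctuation so near-identical titles collide."""
    return re.sub(r"[^a-z0-9]+", "", title.lower())


def review(entries: list[Entry]) -> list[tuple[str, list[str]]]:
    """Return (work, problems) pairs in input order; an empty list means no flags."""
    seen: set[str] = set()
    report: list[tuple[str, list[str]]] = []
    for entry in entries:
        problems: list[str] = []
        if not ARXIV_ID.match(entry.arxiv_id):
            problems.append("malformed arXiv ID")
        key = normalize(entry.work)
        if key in seen:
            problems.append("duplicate row")
        seen.add(key)
        if not entry.has_agent_setting:
            problems.append("no action or agent setting")
        report.append((entry.work, problems))
    return report
```

Calling `review` on placeholder rows such as `Entry("Example GUI Agent", "2601.01234", True)` followed by `Entry("example-gui agent", "2601.01234v2", True)` flags the second row as a duplicate. Judgment calls such as "product rumors" or "placeholder-looking entries" still need a human reviewer.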

Track	Research question	Representative works
Screen grounding	Can the model localize text, widgets, controls, and regions well enough to act?	Set-of-Mark, SeeClick, OmniParser, UGround, ScreenSpot-Pro, GUI-Eyes, SafeGround, PAGER
Computer use	Can the agent complete tasks in real websites, desktops, or phones over multiple steps?	WebArena, WebLINX, AppAgent, Mobile-Agent, OSWorld, AndroidWorld, Agent S, UI-TARS, WebGym, OpenComputer
Embodied VLA	Can visual observations and language be converted into safe physical actions?	PerAct, VIMA, RT-1, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, VLA-REPLICA, Pre-VLA
Agentic reasoning and creation	Can the system plan, search, critique, edit, or generate visual artifacts through a loop?	VISPROG, ViperGPT, DiffusionAgent, GenArtist, DeepEyes, Visual Agentic RFT, Agent Banana, VisionCreator, GEMS, GenEvolve
General visual agents	Can multimodal agents build reusable visual skills, use visual tools, reason over video/charts/visualizations, and coordinate across visual contexts?	Orion, Kimi K2.5, MMSkills, VTC-Bench, VisualToolAgent, Visual Agentic Memory, HierVA, DV-World
Reliability and safety	Can we measure brittleness, privacy risk, prompt injection, unsafe actions, and deployment readiness?	VPI-Bench, OpenAgentSafety, OS-BLIND, HazardArena, UI-CUBE, GUIDE, CORA, WARD
Infrastructure	Which tools and environments support reproducible training, deployment, and evaluation?	BrowserGym, AgentLab, Stagehand, Playwright MCP, Agent S, Cua, OpenCUA, ScaleCUA, WebGym, C-World, LeRobot

Work	Date	Contribution / Relevance
GUI-Eyes	2026-01	Active visual perception for GUI grounding with learned crop/zoom tool use.
ShowUI-Aloha	2026-01	Converts human screen recordings into structured GUI-agent supervision.
OS-Symphony	2026-01	Holistic framework for robust computer-using agents.
Kimi K2.5	2026-02	Open-source multimodal model focused on visual agentic intelligence.
Agent Banana	2026-02	Agentic image editing with planning and tool execution rather than one-shot editing.
SAGE	2026-02	Agentic 3D scene generation for embodied-AI policy training.
ActionEngine	2026-02	Uses state-machine memory to make GUI agents more programmatic and recoverable.
OmniStream	2026-03	Streaming visual-agent representation for perception, reconstruction, and action.
VTC-Bench	2026-03	Evaluates compositional visual tool chaining in agentic multimodal models.
CUA-Suite	2026-03	Large human-annotated video demonstrations for computer-use agents.
GEMS	2026-03	Multimodal generation loop with memory, skills, and iterative agent refinement.
SciVisAgentBench	2026-03-31	Benchmark for scientific data analysis and visualization agents.
SASAV	2026-04-03	Self-directed agent for scientific analysis and visualization workflows.
UI-Copilot	2026-04	Long-horizon GUI automation with tool-integrated policy optimization.
UI-Zoomer	2026-04	Uncertainty-driven zoom-in for hard GUI grounding cases.
CANVAS	2026-04-15	Agentic storyboarding for continuity-aware long-form visual narratives.
Progressive Online Video Understanding	2026-04-20	Streaming visual-agent setting where answers trigger when enough evidence appears.
Beyond Pixels	2026-04-22	Interactive grounding for visualization agents beyond static pixel reading.
AI-Gram	2026-04-23	Deployed AI-native social network where visual agents create and respond to visual content.
DynamicGUIBench	2026-04	Evaluates GUI agents in high-dynamic interfaces rather than static screenshots.
DV-World	2026-04-28	Real-world benchmark for data-visualization agents with grounding and intent alignment.
UI-Verse	2026-05	Studies interface design heuristics that improve computer-use-agent reliability.
HierVA	2026-05-05	Hierarchical visual agent for chart reasoning across image-text contexts.
Securing Computer-Use Agents	2026-05	Connects CUA architecture, lifecycle, permission scope, and runtime reliability.
Don't Click That	2026-05	Deception-aware web-agent benchmark and defense for misleading interface elements.
VLAs-as-Tools	2026-05-13	Long-horizon embodied-agent strategy that delegates bounded physical subtasks to specialized VLA tools.
MMSkills	2026-05-13	Multimodal skill packages for reusable procedural knowledge in general visual agents.
Video2GUI	2026-05	Synthesizes GUI interaction trajectories from instructional videos.
SaaS-Bench	2026-05-15	Real-world SaaS workflow benchmark for long-horizon computer-use agents.
ScreenSearch	2026-05-15	Ambiguity-aware OS exploration for building large desktop GUI state graphs.
ShopGym	2026-05-15	Realistic, controllable e-commerce simulation and benchmark for web agents.
Visual Agentic Memory	2026-05-15	Online indexing, hierarchical memory, and agentic retrieval for long video understanding.
SE-GA	2026-05-16	Memory-augmented self-evolution framework for long-horizon GUI agents.
DocOS	2026-05-18	Benchmark for GUI agents that proactively retrieve documentation and ground it into actions.
MementoGUI	2026-05-18	Learned multimodal memory controller for long-horizon GUI-agent trajectories.
AQuaUI	2026-05-19	Adaptive quadtree visual-token reduction for high-resolution GUI-agent screenshots.
CutVerse	2026-05-19	GUI-agent benchmark for professional media post-production editing workflows.
OpenComputer	2026-05-19	Verifier-grounded software worlds and auditable rewards for computer-use agents.
VLA-REPLICA	2026-05-20	Low-cost, reproducible real-world benchmark for VLA model evaluation.
Agent JIT Compilation	2026-05-20	Compiles web-agent plans into lower-latency executable schedules.
GenEvolve	2026-05-20	Self-evolving image-generation agent using tool-orchestrated visual experience distillation.
Pre-VLA	2026-05-21	Runtime verification for risky VLA actions and world-model rollouts before execution.
Spatial Memory for Out-of-Vision Manipulation	2026-05-21	Adds persistent spatial memory to VLA policies when targets leave the camera view.
Generation Navigator	2026-05-18	State-aware multi-turn text-to-image agent trained with trajectory-level RL.
SimGym	2026-05-19	Traffic-grounded VLM browser agents for e-commerce A/B-test simulation.
GUI Agents for Continual Game Generation	2026-05-27	Uses GUI playtesting agents to evaluate and iteratively improve playable browser-game generation.
ProgVLA	2026-05-27	Compact progress-aware VLA policy for long-horizon robot manipulation.
MIRAGE	2026-05-27	Context-aware prompt-injection pipeline for mobile GUI agents through user-generated content.
Mag-VLA	2026-05-27	Bimanual magnetically actuated microrobot manipulation with a VLA policy.
MaskClaw	2026-05-27	Edge-side personalized privacy arbitration and skill evolution for screenshot-based GUI agents.
GenClaw	2026-05-28	Code-driven agentic image generation with reasoning, executable sketches, and generative refinement.
Qwen-VLA	2026-05-29	Unified VLA modeling across embodied tasks, environments, and robot embodiments.
Gaze2Act	2026-05-28	Gaze-conditioned VLA policies for interactive real-robot manipulation.
DeMaVLA	2026-05-29	VLA foundation model for real-world deformable-object manipulation.
BraveGuard	2026-05-31	Self-evolving safety defense trained from open-world threats and realistic computer-use trajectories.
PiL-World	2026-06-04	Chunk-wise world model for closed-loop VLA policy-in-the-loop evaluation.
GUI-AC	2026-06-09	Continual-learning method for GUI agents using adaptive advantage and dynamic clipping.
MemVenom	2026-06-09	Triggered poisoning attack against multimodal memories in long-horizon web agents.
Workflow-GYM	2026-06-10	Long-horizon benchmark for professional GUI workflows across specialized software domains.
HiViG	2026-06-10	History-aware visually grounded critic for pre-execution CUA action evaluation.
Learning What to Say to Your VLA	2026-06-10	Test-time language steering for frozen VLA policies with conformal harmlessness control.
VLGA	2026-06-10	Vision-language-geometry-action model for geometry-grounded autonomous driving.
Orchestra-o1	2026-06-10	Omnimodal agent orchestration framework with modality-aware decomposition, sub-agent specialization, and parallel execution.
CAPED	2026-06-10	Context-aware privacy exposure defense for screenshot-based mobile GUI agents.
PERIA	2026-06-11	Tool-augmented visual agent for spatial reasoning through perception and interaction tools.
InterleaveThinker	2026-06-11	Multi-agent planner-critic pipeline for interleaved text-image generation.
ReactVLA	2026-06-12	Low-latency reactive VLA framework for closed-loop robot manipulation.
Naive Visual Memory is Not Enough	2026-06-12	Failure-mode study of experiential and visual memory in GUI agents.
LabVLA	2026-06-12	Grounds VLA models in scientific laboratory protocol execution.
OSGuard	2026-06-13	Safety benchmark for computer-use agents that distinguishes task success from unsafe shortcuts.
MyPCBench	2026-06-15	Benchmark for personally intelligent computer-use agents over user-specific digital contexts.
LabOSBench	2026-06-15	Computer-use-agent benchmark for scientific instrument control interfaces.
ACE-Ego-0	2026-06-16	Unifies egocentric human video and robotic trajectories for VLA pretraining.
ProCUA-SFT	2026-06-16	Technical report on supervised fine-tuning data and recipes for desktop computer-use agents.
WeaveLA	2026-06-16	Event-driven latent memory weaving for repetitive long-horizon robot manipulation.
GeneralVLA-2	2026-06-16	Geometry-aware reconstruction and governed memory for robot planning.
MuseVLA	2026-06-16	Adaptive multimodal sensing VLA that invokes non-RGB sensors as task tools.
Qwen-RobotManip	2026-06-16	Qwen-VL-based robotic manipulation foundation model scaled with aligned heterogeneous data.
PearlVLA	2026-06-16	Progressive embodied action-plan refinement in latent space for efficient VLA deliberation.
PreAct	2026-06-16	Compiles successful computer-use trajectories into screen-checked state-machine programs.
ThinkingVLA	2026-06-16	Interleaves visual forecasting and language reasoning for long-horizon robotic manipulation.
Uncertainty Quantification for Flow-Based VLAs	2026-06-16	Uses velocity-field disagreement for failure detection and active fine-tuning of flow-based VLAs.
WireCraft	2026-06-16	Industrial deformable-linear-object manipulation benchmark with VLA baselines.

Work	Year	Links	Contribution / Relevance
A Comprehensive Survey of Agents for Computer Use	2025	paper	Broad map of computer-use-agent domains, agent loops, and evaluation bottlenecks.
GUI Agents: A Survey	2024	paper	Practical survey of GUI-agent architectures, datasets, benchmarks, and failure modes.
A Survey on (M)LLM-Based GUI Agents	2025	paper	Focused entry point for planning, grounding, memory, and GUI-agent evaluation.
Towards Trustworthy GUI Agents	2025	paper	Reliability and safety framing for deployment-facing GUI agents.
Large Multimodal Agents: A Survey	2024	paper	Contextual background on LLM-driven multimodal agent components.
A Survey on Vision-Language-Action Models for Embodied AI	2024	paper	Early VLA survey covering embodied perception, planning, and action.
Vision-Language-Action in Robotics	2026	paper	Data-centric survey of VLA datasets, benchmarks, and data engines.
Vision-Language-Action Safety	2026	paper	Focused taxonomy of threats, evaluations, and defenses for VLA systems.
Safety in Embodied AI	2026	paper	Wider safety survey across perception, planning, action, and interaction.
Visual Generation in the New Era	2026	paper	Conceptual lens for when visual generation becomes agentic world modeling.
Securing Computer-Use Agents	2026	paper	Deployment-grounded view of CUA reliability across architecture, lifecycle, permissions, and oversight.
GUI Agents with Reinforcement Learning	2026	paper	RL-centered survey of GUI-agent rewards, data efficiency, continual learning, and deployment risks.
Agentic World Modeling	2026	paper	Taxonomy for predictive world models across physical, digital, social, and scientific agents.
World Action Models	2026	paper	Defines embodied models that jointly predict future states and actions rather than actions alone.

Work	Year	Links	Contribution / Relevance
CogAgent	2023	paper, code	Early high-resolution VLM built explicitly for GUI understanding and navigation.
Set-of-Mark Prompting	2023	paper, code	Simple visual marking strategy that became a practical grounding primitive for LMM agents.
SeeClick	2024	paper, code	Shows that GUI grounding is a core bottleneck for visual GUI agents.
ScreenAI	2024	paper	Strong foundation for screen, document, infographic, and layout-heavy visual understanding.
Ferret-UI	2024	paper, code	Region-aware mobile UI understanding with explicit grounding.
OmniParser	2024	paper, code	Practical screenshot-to-interactable-region parser for pure-vision GUI agents.
UGround	2024	paper, code	Strong pure-vision grounding baseline without relying on accessibility trees.
OS-ATLAS	2024	paper	Foundation action model for generalist GUI agents.
ShowUI	2024	paper, code	Unifies screenshot-conditioned GUI perception and action modeling.
Aguvis	2024	paper	Pure-vision GUI agent direction with autonomous interface interaction.
UI-E2I-Synth	2025	paper	Synthetic instruction pipeline for scaling GUI grounding supervision.
ScreenSpot-Pro	2025	paper	Hard high-resolution grounding benchmark for professional computer-use screens.
GUI-G1	2025	paper, code	Careful analysis of RL pitfalls in GUI grounding.
Enhancing Visual Grounding via Self-Evolutionary RL	2025	paper	Data-efficient RL recipe for high-resolution GUI grounding.
GUI-Actor	2025	paper	Coordinate-free grounding with an action head and verifier.
Phi-Ground	2025	paper	Strong empirical report on training compact GUI grounding models.
Test-Time RL for GUI Grounding	2025	paper	Test-time adaptation using region consistency.
Explicit Position-to-Coordinate Mapping	2025	paper	Addresses coordinate generation as a concrete grounding bottleneck.
GUI-Eyes	2026	paper	Learns when and how to call visual tools such as crop and zoom.
SafeGround	2026	paper	Calibrates GUI-grounding uncertainty before risky or irreversible actions.
UI-Zoomer	2026	paper, code	Uses uncertainty to decide where to zoom for GUI grounding.
AutoFocus	2026	paper	Training-free active visual search for high-resolution GUI grounding.
DRS-GUI	2026	paper	Dynamic region search that narrows cluttered screenshots without model fine-tuning.
WinDeskGround	2026	paper	Robust grounding benchmark for complex multi-window desktop interfaces.
PAGER	2026	paper	Studies point-precise geometric GUI control where small coordinate errors cascade.
AQuaUI	2026	paper	Adaptive-quadtree visual-token reduction for high-resolution GUI-agent screenshots.

Work	Year	Links	Contribution / Relevance
Mind2Web	2023	paper	Foundational benchmark for generalist web agents.
Android in the Wild	2023	paper	Large-scale Android device-control dataset with realistic gestures.
WebArena	2023	paper, code	Realistic web-agent environment with execution-based tasks.
AutoDroid	2023	paper	Early Android task-automation system and benchmark that remains relevant as a mobile-agent baseline.
MM-Navigator	2023	paper	Early GPT-4V smartphone GUI navigation agent with zero-shot screen interaction.
AppAgent	2023	paper	Smartphone agent that learns app operation from autonomous exploration or demonstrations.
SeeAct	2024	paper	Web agent showing why grounding matters for GPT-4V-style agents.
Mobile-Agent	2024	paper, code	Vision-centric mobile device agent using visual perception tools and stepwise planning.
VisualWebArena	2024	paper, code	Adds visually grounded tasks to realistic web-agent evaluation.
WebVoyager	2024	paper	End-to-end multimodal web agent evaluated on live websites.
WebLINX	2024	paper, project	Large benchmark of multi-turn conversational web navigation with screenshots and action history.
OmniACT	2024	paper	Desktop and web benchmark where agents generate executable automation scripts.
WorkArena	2024	paper, code	Enterprise workflow benchmark for knowledge-work agents.
MMInA	2024	paper, code	Multihop multimodal Internet-agent benchmark on evolving real websites.
B-MoCA	2024	paper	Mobile device-control benchmark across diverse configurations.
OSWorld	2024	paper, code	Flagship benchmark for open-ended tasks in real desktop environments.
AndroidWorld	2024	paper, code	Dynamic Android benchmark with broad task diversity.
Mobile-Agent-v2	2024	paper, code	Multi-agent mobile operation assistant with planning, decision, and reflection roles.
MobileAgentBench	2024	paper	Practical benchmark for mobile LLM agents.
WebCanvas	2024	paper	Online web-agent benchmark and framework built around Mind2Web-Live.
Agent S	2024	paper, code	Open agentic framework for using computers through GUI actions.
Windows Agent Arena	2024	paper, code	Scalable evaluation environment for Windows OS agents.
SPA-Bench	2024	paper	Comprehensive smartphone-agent evaluation benchmark.
AndroidLab	2024	paper	Android training and benchmarking environment with virtual devices and task suites.
VideoWebArena	2024	paper	Long-context video understanding inside web-agent workflows.
MageBench	2024	paper, code	Lightweight visual-agent benchmark covering WebUI, Sokoban, and Football environments.
UI-TARS	2025	paper	Native GUI-agent model trained for perception, grounding, and action.
A3	2025	paper, project	Android Agent Arena for online mobile GUI-agent evaluation across real apps.
Agent S2	2025	paper, code	Generalist-specialist framework for computer-use agents.
UI-Evol	2025	paper	Plug-in knowledge-evolution module that improves OSWorld execution reliability for CUAs.
ZeroGUI	2025	paper	Online GUI-agent learning with task generation and reward estimation.
OpenCUA	2025	paper, code	Open foundation stack for computer-use agents.
ScaleCUA	2025	paper, code	Cross-platform data scaling for open-source computer-use agents.
WebGym	2026	paper	Large-scale training environment for realistic visual web agents.
C-World	2026	paper	Environment creator for scalable computer-use-agent training.
OS-Symphony	2026	paper, code	Framework for robust and generalist computer-use agents.
OmegaUse	2026	paper	General-purpose GUI agent for autonomous task execution.
OS-Marathon	2026	paper	Benchmark for long-horizon repetitive professional computer-use workflows.
Continual GUI Agents	2026	paper	Continual-learning setup and RL recipe for shifting GUI domains and resolutions.
CUA-Skill	2026	paper	Structured skill base for reusable computer-use procedures and composition graphs.
DynaWeb	2026	paper	Model-based RL framework that trains web agents inside learned web world models.
Avenir-Web	2026	paper	Multimodal web agent with grounding experts, experience imitation, and memory.
Agent Alpha	2026	paper	Uses step-level MCTS to unify GUI-agent generation, exploration, and evaluation.
UI-Mem	2026	paper	Hierarchical experience memory for online RL in mobile GUI agents.
MemGUI-Bench	2026	paper	Evaluates memory across mobile GUI sessions and changing environments.
ActionEngine	2026	paper	State-machine memory for more structured GUI automation.
SecAgent	2026	paper	Efficient 3B mobile GUI agent with semantic context compression and Chinese mobile data.
ContractSkill	2026	paper	Treats web-agent skills as repairable contracts that can be verified and reused.
PSPA-Bench	2026	paper	Personalized smartphone GUI-agent benchmark with process-level evaluation.
GPA	2026	paper	Demonstration-based GUI process automation with local deterministic replay.
ClawGUI	2026	paper	Unified framework for training, evaluating, and deploying GUI agents.
RiskWebWorld	2026	paper	Realistic interactive benchmark for e-commerce risk-management GUI agents.
UI-Copilot	2026	paper, code	Long-horizon GUI automation with tool-integrated policy optimization.
DynamicGUIBench	2026	paper	Stress-tests agents in dynamic, evolving GUI environments.
OmniGUI	2026	paper, project	Smartphone GUI benchmark with synchronized visual, audio, and video context.
UI-Verse	2026	paper	Interface-design perspective on making CUAs more reliable.
How Mobile World Model Guides GUI Agents?	2026	paper	Analyzes which mobile world-model representations help GUI-agent training and test-time guidance.
Executable Agentic Memory	2026	paper	Converts GUI experience into executable memory graphs for retrieval-and-execution planning.
SaaS-Bench	2026	paper, code	Long-horizon benchmark over real deployable SaaS systems and professional workflows.
ShopGym	2026	paper	Realistic, controllable e-commerce simulation and benchmark for web agents.
ScreenSearch	2026	paper	Ambiguity-aware large-scale desktop OS exploration with deduplicated state graphs.
Skim	2026	paper	Speculative execution framework for faster web-agent workflows on structured sites.
SE-GA	2026	paper, code	Memory-augmented self-evolving GUI agent for dynamic long-horizon tasks.
DocOS	2026	paper	Proactive document-guided GUI-agent benchmark in open web environments.
MementoGUI	2026	paper	Plug-in multimodal memory controller for long-horizon GUI control.
OpenComputer	2026	paper	Verifiable software worlds with state verifiers, task generation, and auditable rewards.
CutVerse	2026	paper	Benchmark for professional media post-production editing with dense multimodal GUIs.
Agent JIT Compilation	2026	paper	Compiles web-agent plans into lower-latency executable schedules.
Weblica	2026	paper	Reproducible web-replica environments for scaling visual web-agent training.
TClone	2026	paper	Low-latency live GUI environment forking for parallel CUA rollouts and what-if execution.
PANDO	2026	paper	Online skill distillation that reduces token and action overhead for multimodal web agents.
SimuWoB	2026	paper	Synthetic realistic mobile-app benchmark for fast, faithful GUI-agent evaluation.
CUA-Gym	2026	paper	Scalable generation of verifiable environments, tasks, rewards, and models for CUA RLVR.
MobileGym	2026	paper	Parallel mobile GUI-agent simulator with structured state, deterministic judges, and RL rewards.
AndroidDaily	2026	paper	Real-world closed-source Android benchmark with process-aware visual trajectory grading.
LearnWeak	2026	paper	Student-aware data synthesis and specialization for small computer-use agents.
PRO-CUA	2026	paper	Step-level process-reward optimization for computer-use agents on live web tasks.
GUITestScape	2026	paper	Open-set exploratory GUI testing benchmark for MLLM agents.
Multi-Agent Computer Use	2026	paper	Multi-agent CUA architecture with DAG decomposition, parallel execution, and replanning.
OpenWebRL	2026	paper	Online multi-turn RL framework for training open visual web agents on live websites.
ColorBrowserAgent	2026	paper	Human-in-the-loop long-horizon web GUI agent with progress summarization and knowledge adaptation.
WebForge	2026	paper, code	Automated framework for generating scalable, reproducible browser-agent benchmark environments.
AgentLens	2026	paper	Mobile GUI agent with adaptive visual modalities for human-agent interaction during execution.
SimGym	2026	paper	Live-browser VLM-agent framework for simulating e-commerce A/B tests.
GUI-AC	2026	paper	Enhances continual GUI-agent learning with grounding-certainty-aware advantage and clipping.
Workflow-GYM	2026	paper	Long-horizon benchmark for professional GUI workflows in specialized software environments.
HiViG	2026	paper, code	History-aware visually grounded critic for test-time CUA action evaluation.
Naive Visual Memory is Not Enough	2026	paper	Failure-mode study of visual and experiential memory modules in GUI agents.
MyPCBench	2026	paper	Evaluates personally intelligent CUAs over user-specific digital context and accounts.
LabOSBench	2026	paper	Benchmarks CUAs on scientific instrument-control interfaces and feedback loops.
ProCUA-SFT	2026	paper	Desktop CUA supervised fine-tuning report with trajectory data and training recipes.
PreAct	2026	paper	Compiles successful screen interaction trajectories into guarded state-machine programs for repeat tasks.

Work	Year	Links	Contribution / Relevance
PerAct	2022	paper, project	Language-conditioned RGB-D manipulation agent that predicts voxel actions directly.
VIMA	2022	paper, project	Multimodal-prompt robot manipulation benchmark and transformer agent.
RT-1	2022	paper, project	Large-scale real-robot action model that anchors later RT/VLA work.
PaLM-E	2023	paper	Embodied multimodal language model connecting visual input to robot tasks.
RT-2	2023	paper	Canonical VLA model transferring web-scale vision-language knowledge to robot control.
Open X-Embodiment / RT-X	2023	paper	Large robot-learning dataset and RT-X model family.
Octo	2024	paper	Open-source generalist robot policy.
OpenVLA	2024	paper, code	Open-source VLA model and a common baseline for robot manipulation.
Pi-Zero	2024	paper	Flow-based VLA model for general robot control.
Magma	2025	paper, code	Bridges multimodal agents across digital and physical actions.
SafeVLA	2025	paper	Safety alignment for VLA models via constrained learning.
Interleave-VLA	2025	paper	Robot manipulation with interleaved image-text instructions.
ChatVLA-2	2025	paper	Open-world embodied reasoning from pretrained knowledge.
VLA^2	2025	paper	Agentic framework for unseen-concept manipulation.
World-Value-Action	2026	paper	Uses implicit planning and future-state value estimation for VLA systems.
VLAs-as-Tools	2026	paper	Splits long-horizon embodied tasks between a high-level VLM planner and specialized VLA tools.
SAGE	2026	paper, code	Agentically generates simulator-ready 3D scenes for embodied policy training.
StableVLA	2026	paper	Studies robustness of VLA models under unseen visual disturbances without extra data.
Dexora	2026	paper	Open-source VLA direction for high-DoF bimanual dexterous manipulation.
VLA-REPLICA	2026	paper	Low-cost reproducible real-world evaluation benchmark for VLA models.
Spatial Memory for Out-of-Vision Manipulation	2026	paper	Adds persistent spatial memory when manipulation targets leave the current camera view.
Pre-VLA	2026	paper	Preemptive runtime verification for VLA actions and world-model rollouts.
ActQuant	2026	paper	Action-guided mixed-precision quantization for deploying VLA models on constrained hardware.
Continuous Reasoning for VLA	2026	paper	Replaces token-style reasoning with shareable continuous latents aligned to action chunks.
VLAMotor	2026	paper	Test-guided failure discovery and agent-based synthetic data repair for VLA models.
FATE-VLA	2026	paper	Adaptive failure-aware test generation that searches high-risk embodied scenes for VLA failures.
Uni-LaViRA	2026	paper	Agentic language-vision-robot-action architecture for unified embodied navigation across robot types.
ProgVLA	2026	paper	Progress-aware compact VLA model for long-horizon and multi-object robot manipulation.
Mag-VLA	2026	paper	VLA policy for bimanual magnetically actuated microrobot manipulation.
Gaze2Act	2026	paper	Uses human gaze as a dynamic intent signal for interactive VLA robot manipulation.
DeMaVLA	2026	paper	VLA foundation model for deformable-object manipulation with real-world folding data.
PiL-World	2026	paper	Chunk-wise world model for closed-loop VLA policy-in-the-loop evaluation.
Learning What to Say to Your VLA	2026	paper	Searches and distills language feedback policies for steering frozen VLA models.
VLGA	2026	paper	Adds dense geometry supervision to vision-language-action models for autonomous driving.
ReactVLA	2026	paper	Fast lightweight reactive robot manipulation via improved mean-flow action generation.
Qwen-VLA	2026	paper	Unifies embodied decision-making across tasks, environments, and robot embodiments.
LabVLA	2026	paper	Grounds VLA models in scientific laboratory protocol execution and bench work.
ACE-Ego-0	2026	paper	Bridges egocentric human videos and robot trajectories for VLA pretraining.
WeaveLA	2026	paper	Adds event-driven cross-subtask latent memory for repetitive robot manipulation.
GeneralVLA-2	2026	paper	Uses geometry-aware reconstruction and governed memory for robot planning.
MuseVLA	2026	paper	Treats temperature, audio, radar, and other sensors as on-demand VLA tools.
Qwen-RobotManip	2026	paper	Scales Qwen-VL-based manipulation models through aligned heterogeneous robot and human data.
PearlVLA	2026	paper	Refines embodied action plans in latent space with future-guided process rewards.
ThinkingVLA	2026	paper	Interleaves visual forecasting, inverse reasoning, and action generation for long-horizon manipulation.
Uncertainty Quantification for Flow-Based VLAs	2026	paper, project	Estimates VLA epistemic uncertainty for failure detection and active fine-tuning.
WireCraft	2026	paper	Industrial deformable-linear-object manipulation benchmark with shared VLA evaluation.

Work	Year	Links	Contribution / Relevance
VISPROG	2022	paper, project	Foundational visual-programming approach for tool-composed visual reasoning and editing.
Visual ChatGPT	2023	paper, code	Early system connecting ChatGPT with visual foundation models for multi-step visual tasks.
ViperGPT	2023	paper, code	Uses Python execution to compose vision modules for interpretable visual reasoning.
LLaVA-Plus	2023	paper	Trains multimodal agents to select and use visual tools across understanding and generation.
DiffusionAgent	2024	paper	Routes prompts through expert diffusion models with tree-of-thought navigation and feedback memory.
GenArtist	2024	paper, code	MLLM-as-agent for image generation and editing through planning and tool use.
CIGEval	2025	paper	Agentic evaluation framework for conditional image generation.
DeepEyes	2025	paper	Reinforcement learning for active visual reasoning, grounding, and "thinking with images."
ImAgent	2025	paper	Test-time scalable multimodal agent framework for image generation.
GenAgent	2026	paper	Scales text-to-image generation through agentic multimodal reasoning.
Mind-Brush	2026	paper	Adds cognitive search and reasoning loops to image generation.
Agent Banana	2026	paper, code	High-fidelity image editing with planner-executor tooling.
M3	2026	paper	Multi-modal, multi-agent, multi-round reasoning for high-fidelity text-to-image generation.
VisionCreator	2026	paper	Native visual-generation agentic model with understanding, planning, and creation.
VisionCreator-R1	2026	paper	Adds explicit reflection and reflection-plan co-optimization for visual-generation agents.
Gen-Searcher	2026	paper, project	Reinforces agentic search for image generation.
GEMS	2026	paper, project	Multimodal generation with memory, skills, and iterative agent loops.
Visual Generation in the New Era	2026	paper	Helpful taxonomy for agentic world modeling and generation.
Visual Agentic Memory	2026	paper	Online indexing, hierarchical memory, and agentic retrieval for long video understanding.
GenEvolve	2026	paper	Self-evolving image-generation agent with tool-orchestrated visual experience distillation.
Generation Navigator	2026	paper	State-aware multi-turn text-to-image agent with trajectory-level RL for generation steering.
GUI Agents for Continual Game Generation	2026	paper, project	Uses GUI playtesting agents as evaluators and feedback providers for playable game generation.
GenClaw	2026	paper	Code-driven agentic image generation that plans, sketches with executable code, and refines with image models.
InterleaveThinker	2026	paper	Multi-agent planner-critic pipeline for agentic interleaved text-image generation.

General Visual Agents, Tool Use, and Visualization Agents

Work	Year	Links	Contribution / Relevance
AVA	2023	paper	Autonomous visualization agents with visual perception-driven decision making.
Visual Agents as Fast and Slow Thinkers	2024	paper	System-1/System-2 framing for visual-agent reasoning and action.
Visual Agentic AI for Spatial Reasoning	2025	paper	Dynamic-API visual agent for spatial reasoning in 3D scenes.
Visual Agentic Reinforcement Fine-Tuning	2025	paper	Trains VLMs to use visual tools and code for "thinking with images."
ParaView-MCP	2025	paper	Autonomous visualization agent with direct tool use in ParaView.
VisualToolAgent / VisTA	2025	paper	RL framework for dynamic visual tool selection and composition.
Evaluation-Centric Scientific Visualization Agents	2025	paper	Evaluation-first paradigm for scientific visualization agents.
DART	2025	paper	Uses multi-agent disagreement to recruit specialized visual tools.
Orion	2025	paper	Unified visual agent for multimodal perception, visual reasoning, and tool execution.
Kimi K2.5	2026	paper	Open-source multimodal agentic model optimized jointly for text and vision.
OmniStream	2026	paper	Streaming visual-agent representation for perception, reconstruction, and action.
VTC-Bench	2026	paper	Evaluates agentic multimodal models through compositional visual tool chaining.
SciVisAgentBench	2026	paper	Reproducible benchmark for scientific data analysis and visualization agents.
SASAV	2026	paper	Self-directed scientific analysis and visualization agent.
CANVAS	2026	paper	Visual agentic storyboarding for continuity-aware long-form visual narratives.
Progressive Online Video Understanding	2026	paper	Online visual agent that answers when enough streaming evidence appears.
Beyond Pixels	2026	paper	Introspective and interactive grounding for visualization agents.
AI-Gram	2026	paper	Live social platform populated by visual agents that create and respond to visual content.
DV-World	2026	paper	Real-world benchmark for data-visualization agents with native environment grounding.
Hierarchical Visual Agent / HierVA	2026	paper	Manages image-text contexts for multi-step chart reasoning across subplots.
Emergent Communication between Heterogeneous Visual Agents	2026	paper	Studies decentralized communication when visual agents have private representations.
MMSkills	2026	paper	Multimodal procedural skill packages for reusable visual-agent decision making.
Visual Agentic Memory	2026	paper	Training-free visual memory for online indexing, retrieval, and evidence verification.
MemEye	2026	paper	Visual-centric evaluation framework for long-term multimodal agent memory.
Diversity Over Frequency	2026	paper	Studies tool-use collapse and rollout diversity in visual Chain-of-Thought agents.
VESTA	2026	paper	Scientific visual exploration agent with dynamically generated statistical tools.
CV-Arena	2026	paper	Instructional computer-vision benchmark with agentic planning, editing, and verification.
Visual Skills	2026	paper	Multimodal reusable skill paradigm preserving visual evidence and spatial interaction traces.
TVIR	2026	paper	Text-visual interleaved deep-research benchmark and hierarchical multimodal report agent.
Active Exploring like a Pigeon	2026	paper	Agentic spatial reasoning with dynamic cognitive maps and verifiable spatial assertion codes.
PERIA	2026	paper	Tool-augmented visual agent for spatial reasoning across map reasoning, probing, and reconstruction tasks.
Orchestra-o1	2026	paper	Omnimodal agent orchestration with modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution.

Safety, Robustness, and Evaluation

Work	Year	Links	Contribution / Relevance
AGENTSAFE	2025	paper	Safety benchmark for embodied agents under hazardous instructions.
IS-Bench	2025	paper	Interactive safety benchmark for VLM-driven household agents.
VPI-Bench	2025	paper, code	Visual prompt-injection benchmark for computer-use agents.
OpenAgentSafety	2025	paper	Framework for evaluating real-world agent safety across risk categories.
OS-Sentinel	2025	paper	Hybrid validation for safer mobile GUI agents.
UI-CUBE	2025	paper	Enterprise CUA benchmark that measures operational reliability beyond task accuracy.
SafePred	2026	paper	Predictive guardrail for computer-using agents using world-model rollouts.
LPS-Bench	2026	paper	Safety-awareness benchmark for long-horizon CUA planning under benign and adversarial scenarios.
GUIGuard-Bench	2026	paper	Privacy-preserving GUI-agent evaluation.
CUAAudit	2026	paper	Tests whether VLMs can audit autonomous computer-use agents.
GUIDE	2026	paper	Hierarchical diagnostic evaluation for long GUI-agent trajectories.
VeriGUI	2026	paper	Action-effect verification and self-correction for robust GUI automation.
Semantic-level UI Element Injection	2026	paper	Red-teaming method that distracts GUI agents through benign-looking injected UI elements.
CORA	2026	paper	Conformal risk-controlled safeguard for mobile GUI-agent action execution.
OS-BLIND	2026	paper	Shows how benign-looking user instructions expose CUA vulnerabilities.
HazardArena	2026	paper	Semantic safety evaluation for VLA systems.
RedVLA	2026	paper	Physical red-teaming benchmark for VLA models.
GUI-Perturbed	2026	paper	Domain-randomization study exposing GUI-grounding brittleness.
OS-SPEAR	2026	paper	Toolkit for safety, performance, efficiency, and robustness analysis of OS agents.
Don't Click That	2026	paper	Benchmarks and mitigates deceptive UI elements for VLM-based web agents.
SafeManip	2026	paper	Temporal-safety benchmark for robotic manipulation using LTL-style monitors.
WARD	2026	paper	Robust defense for web agents against prompt injection in HTML and visual interfaces.
ProjGuard	2026	paper	Safety monitoring for computer-use agents via low-dimensional projections.
Pre-VLA	2026	paper	Runtime verification for risky VLA action generation and imagined rollouts.
AgentHijack	2026	paper	Benchmarks CUA robustness to realistic environment corruptions rather than direct adversarial prompts.
ROGUE	2026	paper	Corrigibility benchmark showing unsafe behavior can arise during ordinary computer-use tasks.
SafeVLA-Bench	2026	paper	Post-hoc safety benchmark exposing unsafe-success cases in VLA manipulation rollouts.
FATE-VLA	2026	paper	Failure-seeking VLA test generation for robustness evaluation before deployment.
MemVenom	2026	paper	Triggered poisoning attack against multimodal memory retrieval in web agents.
OSGuard	2026	paper	Dual-granularity CUA safety benchmark for unsafe shortcuts under benign instructions.
MIRAGE	2026	paper	Context-aware prompt injection against mobile GUI agents through user-generated content regions.
MaskClaw	2026	paper	Edge-side personalized privacy arbitration for GUI agents with behavior-driven skill evolution.
BraveGuard	2026	paper	Self-evolving guard training loop for safer computer-use-agent trajectories.
CAPED	2026	paper	Context-aware screenshot exposure control for mobile GUI-agent privacy.

Benchmarks and Environments

Area	Resource	Link	Primary Use
Web	MiniWoB++	code	Compact browser-interaction environments for controlled RL-style experiments.
Web	Mind2Web	paper	Offline web-agent action prediction and grounding.
Web	WebArena	paper, code	Realistic web navigation with execution-based grading.
Web	WebArena-Verified	code	Audited WebArena task set with deterministic offline evaluation.
Web	VisualWebArena	paper, code	Visually grounded web tasks where screenshots matter.
Web	WebLINX	paper, project	Conversational web navigation from expert demonstrations.
Web	WorkArena	paper, code	Enterprise workflow automation in ServiceNow-style environments.
Web	MMInA	paper, code	Multihop multimodal tasks over evolving real websites.
Web	WebCanvas	paper	Online web-agent evaluation with Mind2Web-Live.
Web	WebGym	paper	Large-scale realistic training environment for visual web agents.
Web	DocOS	paper	Document-guided GUI-agent tasks in dynamic open-web environments.
Web	RiskWebWorld	paper	Realistic e-commerce risk-management tasks for GUI agents.
Web	SaaS-Bench	paper, code	Long-horizon professional workflows across deployable SaaS systems.
Web	ShopGym	paper	Controllable e-commerce simulation with realistic layouts, catalogs, policies, and tasks.
Desktop	OSWorld	paper, code	Open-ended desktop tasks in real operating systems.
Desktop	Windows Agent Arena	paper, code	Windows-specific scaling and reproducible OS-agent evaluation.
Desktop	OmniACT	paper	Evaluating executable automation rather than only low-level clicks.
Desktop	OS-Marathon	paper	Long-horizon repetitive professional workflows.
Desktop	OpenComputer	paper	Verifiable software worlds with state verifiers and auditable partial-credit rewards.
Desktop	CutVerse	paper	Media post-production editing tasks across professional creative applications.
Mobile	Android in the Wild	paper	Large-scale Android device-control demonstrations with screen observations.
Mobile	B-MoCA	paper	Mobile control across diverse device configurations.
Mobile	AndroidWorld	paper, code	Dynamic Android tasks with broad app coverage.
Mobile	AndroidControl	paper	Diverse Android control dataset for studying scale and generalization.
Mobile	MobileAgentBench	paper	Efficient mobile-agent evaluation across open-source apps.
Mobile	SPA-Bench	paper	Smartphone-agent testing with comprehensive task coverage.
Mobile	AndroidLab	paper	Training and systematic benchmarking on Android virtual devices.
Mobile	A3	paper, project	Real-app online evaluation for mobile GUI agents.
Mobile	SecAgent	paper	Chinese mobile GUI dataset, benchmark, and compact semantic-context agent.
Mobile	PSPA-Bench	paper	Personalized smartphone GUI-agent tasks with process-aware evaluation.
Mobile	OmniGUI	paper, project	Omni-modal smartphone action prediction with visual, audio, and video cues.
Computer use	C-World	paper	On-demand environment creation for computer-use-agent training.
Grounding	ScreenSpot-Pro	paper	High-resolution professional-screen grounding.
Grounding	WinDeskGround	paper	Multi-window desktop grounding under realistic visual clutter.
Grounding	PAGER	paper	Point-precise GUI control for geometric construction tasks.
Visual-agent reasoning	MageBench	paper, code	Lightweight environments for vision-in-the-chain agent reasoning.
Visual-agent reasoning	VTC-Bench	paper	Compositional visual tool chaining for agentic multimodal models.
Visualization	SciVisAgentBench	paper	Scientific data analysis and visualization-agent evaluation.
Visualization	DV-World	paper	Real-world data visualization tasks with environment grounding and intent alignment.
Memory	MemGUI-Bench	paper	Cross-session and cross-temporal mobile GUI memory.
Memory	MementoGUI-Bench	paper	Long-horizon GUI decision-making with memory consistency diagnostics.
Memory	Visual Agentic Memory	paper	Online indexing and evidence retrieval for long video understanding.
Dynamic GUI	DynamicGUIBench	paper	Robustness under evolving interfaces and dynamic UI changes.
Exploration	ScreenSearch	paper	Large-scale desktop state-graph exploration under partial observability.
Enterprise reliability	UI-CUBE	paper	Deployment-readiness diagnostics beyond simple task success.
Security	VPI-Bench	paper, code	Visual prompt injection for GUI and computer-use agents.
Safety	AGENTSAFE	paper	Hazardous-instruction safety for embodied agents.
Safety	HazardArena	paper	Semantic safety evaluation for VLA systems.
Safety	SafeManip	paper	Temporal safety properties for robotic manipulation rollouts.
Embodied	LIBERO	code	Lifelong robot manipulation tasks.
Embodied	RLBench	code	Simulation-based manipulation benchmark.
Embodied	VLA-REPLICA	paper	Low-cost reproducible real-world VLA evaluation.
Web	Weblica	paper	Scalable reproducible web-replica environments for visual web-agent training.
Web	CUA-Gym	paper	Verifiable RLVR task/environment/reward generation for computer-use agents.
Web	OpenWebRL	paper	Online multi-turn RL framework for live visual web agents.
Mobile	SimuWoB	paper	Synthetic high-fidelity mobile apps with automatic rewards.
Mobile	MobileGym	paper	Highly parallel mobile GUI simulator with deterministic state-based judging.
Mobile	AndroidDaily	paper	Closed-source real-app Android benchmark with visual process evaluation.
Desktop	TClone	paper	Low-latency forking of live GUI environments for CUA execution and evaluation.
GUI testing	GUITestScape	paper	Open-set exploratory GUI testing with interaction and display defects.
Visual-agent memory	MemEye	paper	Evaluates whether multimodal agent memory preserves visual evidence.
Visualization	VESTA / DAWN	paper	Statistical modeling benchmark and visual tool-agent framework.
Visual editing	CV-Arena	paper	Instructional computer-vision task benchmark with human-AI preference evaluation.
Embodied safety	SafeVLA-Bench	paper	Success-safety gap evaluation for VLA manipulation policies.
Web	WebForge-Bench	paper, code	Automatically generated self-contained browser-agent benchmark environments.
Web	SimGym	paper	E-commerce A/B-test simulation with traffic-grounded live-browser VLM agents.
Web	MemVenom	paper	Memory-poisoning threat model for long-horizon web agents with multimodal retrieval.
GUI game generation	PlaytestArena	paper, project	Browser-game generation benchmark evaluated by GUI playtesting agents.
Security	MIRAGE	paper	Mobile GUI prompt-injection samples placed in realistic user-generated content.
Safety	OSGuard	paper	Computer-use-agent safety benchmark for unsafe shortcuts during normal tasks.
Privacy	MaskClaw	paper	Edge-side privacy arbitration benchmark and skill-evolution scenarios for GUI agents.
Privacy	CAPED	paper	Context-aware mobile GUI screenshot exposure defense.
Embodied evaluation	PiL-World	paper	Closed-loop VLA policy-in-the-loop evaluation with imagined action-conditioned observations.
Desktop	Workflow-GYM	paper	Long-horizon GUI workflows in professional software fields.
Desktop	MyPCBench	paper	Personal computer-use benchmark with user-specific context and account state.
Scientific instruments	LabOSBench	paper	Scientific instrument-control interfaces for computer-use-agent evaluation.
Embodied evaluation	WireCraft	paper	Industrial wire and cable manipulation benchmark with VLA policy baselines.

Skills, Tools, and Engineering Resources

Skill and Prompt Libraries

Resource	Type	Link	Primary Use
OpenAI Skills guide	docs	Docs	Understanding skill-style packaging for reusable agent capabilities.
Agent Skills for Large Language Models	survey	Paper	Architecture, acquisition, and security framing for skill-based agents.
CUA-Skill	skill base	Paper	Reusable computer-use procedures with parameterized execution graphs.
MMSkills	multimodal skill framework	Paper	Reusable visual procedures with multimodal state, progress, and failure evidence.
awesome-agent-skills	collection	GitHub	Finding reusable agent skills across browsing, coding, documents, and visual tasks.
awesome-gpt-image-2	collection	GitHub	Tracking prompt patterns and workflows around modern image generation.
gpt_image_2_skill	skill package	GitHub	Example of packaging image-generation workflows as reusable skills.
ToDiagram skills	skill collection	GitHub	Diagram and visual-communication skills that pair well with visual agents.

Models, Parsers, and Grounding Tools

Resource	Type	Link	Primary Use
OmniParser	parser	GitHub	Converting screenshots into candidate interactable regions.
ShowUI	GUI model	GitHub	Screenshot-conditioned GUI action modeling and demonstration pipelines.
UGround	grounding model	GitHub	Pure-vision GUI grounding without accessibility trees.
OS-ATLAS	action model	Paper	Cross-platform GUI action grounding.
GUI-G1	grounding model	GitHub	Studying RL recipes and evaluation pitfalls for GUI grounding.
UI-Zoomer	grounding tool	GitHub	Adaptive zoom-in when the target UI element is hard to localize.
Phi-Ground	grounding model	Paper	Compact GUI grounding baseline for resource-constrained settings.
SafeGround	grounding calibrator	Paper	Estimating grounding risk before executing high-impact GUI actions.
AutoFocus	grounding tool	Paper	Training-free uncertainty-aware active visual search on high-resolution screens.
AQuaUI	token reducer	Paper	Adaptive quadtree compression for GUI screenshots at inference time.
Orion	visual agent	Paper	Tool-augmented visual reasoning and execution across images, videos, and documents.
Kimi K2.5	visual agentic model	Paper	Open-source multimodal agentic intelligence model with joint text-vision optimization.
VisualToolAgent	tool selector	Paper	RL-based selection and composition of visual tools.
VTC-Bench	tool-chain benchmark	Paper	Evaluating compositional visual tool use in agentic multimodal models.

Agent Runtimes and Operator Stacks

Resource	Type	Link	Primary Use
UI-TARS Desktop	desktop agent	GitHub	Running multimodal desktop agents locally.
Agent S	runtime	GitHub	General computer-use experiments with a practical open framework.
Cua	operator stack	GitHub	Infrastructure for running and evaluating computer-use agents.
OpenAdapt	generative RPA stack	GitHub	Recording GUI demonstrations, training models, and evaluating agents from a unified CLI.
HIDAgent	HID toolkit	Paper	Enabling visual UI agents on HID-compatible devices.
GPA	demo replay stack	Paper	Local GUI process automation from a single demonstration.
browser-use	browser runtime	GitHub	Browser automation workflows when DOM/tool access is acceptable.
Stagehand	browser runtime	GitHub	Hybrid code-plus-natural-language browser automation for production workflows.
Playwright MCP	browser MCP server	GitHub	Giving agents browser automation tools through the Model Context Protocol.
BrowserGym	browser harness	GitHub	Reproducible browser-agent experiments and benchmark orchestration.
AgentLab	experiment framework	GitHub	Running, comparing, and analyzing web-agent experiments.
OpenAdapt Desktop	desktop capture/runtime	GitHub	Capturing human demonstrations and replaying desktop workflows.
ScreenPipe	local data capture	GitHub	Recording local screen/audio context for personal or research agents.

Data Capture, Training, and Evaluation Stacks

Resource	Type	Link	Primary Use
OSWorld	desktop environment	GitHub	Standard desktop benchmark and environment.
AndroidWorld	mobile environment	GitHub	Dynamic Android environment for mobile agents.
AndroidControl	mobile dataset	Paper	Large Android control demonstrations for training and data-scaling studies.
Windows Agent Arena	desktop environment	GitHub	Windows-specific OS-agent evaluation.
WebArena	web benchmark	GitHub	Realistic web tasks with execution-based grading.
WebArena-Verified	web benchmark	GitHub	Audited and deterministic WebArena evaluation.
VisualWebArena	visual web benchmark	GitHub	Web tasks where screenshots and visual grounding matter.
WorkArena	enterprise benchmark	GitHub	Enterprise-style workflow automation.
OpenCUA	open CUA stack	GitHub	Data, models, and evaluation foundations for computer-use agents.
ScaleCUA	scaling stack	GitHub	Cross-platform CUA data scaling and evaluation.
WebGym	visual web environment	Paper	Large-scale realistic training tasks for visual web agents.
C-World	environment creator	Paper	Creating diverse computer-use environments on demand.
OpenComputer	verifiable worlds	Paper	State verifiers, synthetic desktop tasks, and auditable rewards.
ShopGym	e-commerce simulator	Paper	Realistic and controllable e-commerce web-agent evaluation.
SciVisAgentBench	visualization benchmark	Paper	Evaluating scientific visualization agents on executable analysis tasks.
DV-World	data-visualization benchmark	Paper	Real-world visualization-agent scenarios with evolving environments.
Visual Agentic Memory	video memory framework	Paper	Training-free long-video indexing, hierarchical memory, and retrieval.
CUA-Suite	data suite	Paper	Large human-annotated video demonstrations for CUA research.
ShowUI-Aloha	data pipeline	Paper, code	Turning screen recordings into GUI-agent training trajectories.
Video2GUI	data pipeline	Paper	Synthesizing GUI trajectories from instructional videos.
ScreenSearch	exploration corpus	Paper	Building desktop GUI state graphs through ambiguity-aware exploration.
CutVerse	creative-workflow benchmark	Paper	Professional media post-production GUI trajectories and evaluation.
lmms-eval	eval toolkit	GitHub	Static multimodal evaluation that can complement closed-loop agent tests.
WebForge	browser benchmark generator	Paper, code	Automatically generating reproducible browser-agent benchmark environments.
SimGym	e-commerce simulator	Paper	Simulating visually driven e-commerce A/B tests with live-browser VLM agents.
PlaytestArena	game-generation benchmark	Paper, project	Using GUI agents to playtest generated browser games.
Workflow-GYM	professional GUI benchmark	Paper	Evaluating long-horizon computer-use agents in specialized professional software.
MyPCBench	personal CUA benchmark	Paper	Testing computer-use agents in personalized digital environments.
LabOSBench	scientific-instrument benchmark	Paper	Evaluating computer-use agents on scientific instrument-control workflows.
ProCUA-SFT	desktop CUA data	Paper	Supervised fine-tuning data and recipes for desktop computer-use agents.

Embodied and Robotics Tooling

Resource	Type	Link	Primary Use
OpenVLA	VLA model	GitHub	Common open baseline for VLA robot manipulation.
LeRobot	robotics toolkit	GitHub	Robot-learning datasets, policies, training, and deployment tooling.
LIBERO	robotics benchmark	GitHub	Lifelong robot manipulation tasks.
RLBench	robotics benchmark	GitHub	Simulation-based manipulation evaluation.
SAGE	3D scene engine	GitHub	Agentic 3D scene generation for embodied policy training.
Magma	foundation model	GitHub	Bridging digital computer use and physical action.
VLA-REPLICA	robot benchmark	Paper	Low-cost reproducible real-world VLA evaluation.
SafeManip	safety benchmark	Paper	Temporal-safety monitors for robotic manipulation rollouts.
ReactVLA	VLA model	Paper, project	Low-latency reactive VLA policy for real-time robot manipulation.
PiL-World	VLA evaluation	Paper	Closed-loop policy-in-the-loop evaluation without executing every rollout on a real robot.
Qwen-VLA	VLA model	Paper	Unified embodied decision-making across tasks, environments, and robot embodiments.
Qwen-RobotManip	VLA model	Paper	Scaled robotic manipulation foundation model built on Qwen-VL.
LabVLA	laboratory VLA	Paper	Grounding VLA models in scientific laboratory protocol execution.
ACE-Ego-0	VLA pretraining data	Paper	Unifying egocentric human video and robot trajectories for VLA pretraining.
MuseVLA	multisensory VLA	Paper	Invoking non-RGB sensors as adaptive tools for robotic manipulation.
WireCraft	manipulation benchmark	Paper	Industrial deformable-linear-object manipulation benchmark with shared evaluation.

Workflow Stacks

Workflow	Practical stack
GUI grounding research	ScreenSpot-Pro + OmniParser + UGround + GUI-G1 + SafeGround + UI-Zoomer + AutoFocus + PAGER + AQuaUI
Browser-agent experiments	BrowserGym + AgentLab + WebArena + WebArena-Verified + VisualWebArena + WebLINX + MMInA + WebGym + ShopGym + SaaS-Bench
Desktop computer-use agents	UI-TARS Desktop + Agent S + Cua + OSWorld + Windows Agent Arena + OS-Marathon + OpenComputer + ScreenSearch + CutVerse
Mobile GUI agents	Android in the Wild + AndroidControl + AndroidWorld + A3 + MobileAgentBench + SPA-Bench + MemGUI-Bench + PSPA-Bench + OmniGUI + UI-Mem
Demonstration and data pipelines	OpenAdapt + OpenAdapt Desktop + ScreenPipe + ShowUI-Aloha + CUA-Suite + Video2GUI + C-World + OpenComputer
Agentic visual creation	gpt_image_2_skill + DiffusionAgent + GenArtist + DeepEyes + Agent Banana + VisionCreator + VisionCreator-R1 + GEMS + GenEvolve
General visual agents and tool use	Visual Agentic RFT + VisualToolAgent + Orion + Kimi K2.5 + VTC-Bench + MMSkills + Visual Agentic Memory
Visualization and chart agents	AVA + ParaView-MCP + SciVisAgentBench + SASAV + Beyond Pixels + DV-World + HierVA
Embodied VLA research	OpenVLA + LeRobot + LIBERO + RLBench + SAGE + Magma + VLAs-as-Tools + VLA-REPLICA + SafeManip + Pre-VLA
Reliability and security testing	VPI-Bench + OpenAgentSafety + UI-CUBE + GUIDE + CORA + OS-BLIND + HazardArena + OS-SPEAR + WARD + Pre-VLA

Official Docs and Engineering Notes

Resource	Link	Why read it
OpenAI Computer Use guide	Docs	Developer-facing guide for building with computer-use tooling.
OpenAI Computer-Using Agent	Article	Product and research framing for modern CUAs.
OpenAI Skills guide	Docs	Practical reference for reusable agent skills.
OpenAI MCP and Connectors guide	Docs	Reference for connecting external tools and services to agents.
Anthropic: Developing a computer use model	Article	Strong public engineering writeup on GUI-agent training and evaluation.
Anthropic: Introducing computer use	Article	System framing and deployment context for computer-use models.
Google DeepMind: Gemini Robotics	Article	Industry view on embodied visual agents.
Google DeepMind: Gemini Robotics On-Device	Article	Notes on low-latency, local VLA deployment.

Related Lists

Repository	Link	Notes
Awesome-GUI-Agents	GitHub	Focused companion index for GUI grounding and automation papers.
GUI-Agents-Paper-List	GitHub	Systematic paper index focused on GUI agents.
awesome-ui-agents	GitHub	Neighboring index for UI-agent papers and projects.
Evolving Visual Generation	GitHub	Adjacent map for visual-generation systems.
Awesome Multimodal Modeling	GitHub	Broader multimodal modeling list beyond the stricter agent boundary here.

Contents

Selection Boundary

Research Taxonomy

Curation Rubric

Reading Pathways

Recent Additions

Research Map

Surveys and Landscape

GUI Grounding and Screen Perception

Computer-Use Agents and Environments

Embodied Vision-Language-Action Agents

Agentic Visual Reasoning, Generation, and World Building

General Visual Agents, Tool Use, and Visualization Agents

Safety, Robustness, and Evaluation

Benchmarks and Environments

Skills, Tools, and Engineering Resources

Skill and Prompt Libraries

Models, Parsers, and Grounding Tools

Agent Runtimes and Operator Stacks

Data Capture, Training, and Evaluation Stacks

Embodied and Robotics Tooling

Workflow Stacks

Official Docs and Engineering Notes

Related Lists

Contributing

Maintenance Policy

Citation
