OpenEnvision/Awesome-Visual-Agent

Awesome Visual Agent

A curated research index for visual agents that perceive, ground, plan, act, create, and evaluate in visually grounded environments.

Visual agents occupy the intersection of multimodal perception, grounded reasoning, tool use, interaction, and control. This repository curates papers, benchmarks, datasets, runtimes, and engineering resources for systems that close the loop between visual observation and purposeful action.

The list is selective rather than exhaustive. It prioritizes works that introduce a clear agent loop, action space, evaluation protocol, data engine, safety finding, or reusable implementation artifact, while excluding generic multimodal models and one-shot visual-generation systems without an agentic mechanism.
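As a point of reference for what "a clear agent loop" and "action space" mean here, the following is a minimal sketch rather than any listed system's implementation. The `Action` type, the `run_agent_loop` function, and the toy counter environment are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step in a hypothetical action space (names are illustrative)."""
    name: str            # e.g. "click", "type", or "done"
    argument: str = ""   # e.g. a target element or text to type


def run_agent_loop(
    observe: Callable[[], object],
    policy: Callable[[object, List[Action]], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe, choose an action, execute it, and repeat until done."""
    history: List[Action] = []
    for _ in range(max_steps):
        observation = observe()                # e.g. a screenshot or camera frame
        action = policy(observation, history)  # grounding and planning happen here
        history.append(action)
        if action.name == "done":
            break
        act(action)                            # changes the environment
    return history


# Toy environment (an assumption): a counter stands in for the screen state.
state = {"clicks": 0}


def toy_observe() -> int:
    return state["clicks"]


def toy_policy(observation: int, history: List[Action]) -> Action:
    # Stop once three clicks have been observed; otherwise click again.
    return Action("done") if observation >= 3 else Action("click", "button")


def toy_act(action: Action) -> None:
    state["clicks"] += 1


trace = run_agent_loop(toy_observe, toy_policy, toy_act)
print([a.name for a in trace])  # ['click', 'click', 'click', 'done']
```

A prompt-in/artifact-out generator would call its model once and return; the distinguishing feature sketched above is that each new observation reflects the effect of the previous action.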

Selection Boundary

Included areas:

  • GUI, web, desktop, and mobile agents that perceive screens and produce executable actions.
  • Visual grounding work that is clearly tied to downstream agent control.
  • Embodied vision-language-action systems for robot manipulation, navigation, and physical-world interaction.
  • Agentic visual reasoning and generation systems with search, planning, memory, tools, critique, or iterative refinement.
  • Benchmarks, data engines, simulators, safety suites, and toolchains that support visual-agent construction and evaluation.

Excluded by default:

  • Broad multimodal foundation models with no visual-agent evaluation.
  • Generic OCR, captioning, visual question answering, or layout parsing without an action or agent setting.
  • Image, video, or 3D generators that are only prompt-in/artifact-out.
  • Unverified arXiv IDs, placeholder-looking entries, product rumors, and duplicate rows.
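The last exclusion rule lends itself to a mechanical pre-check. The sketch below is an assumption about how a maintainer might lint rows, not tooling that ships with this repository: the `lint_entries` function and the `work` / `arxiv_id` keys are hypothetical. It only checks that an ID matches the new-style arXiv format (`YYMM.NNNN` or `YYMM.NNNNN`, with an optional version suffix); it does not confirm that the ID resolves to the right paper.

```python
import re
from typing import Dict, List, Optional, Tuple

# New-style arXiv identifiers: four digits, a dot, four or five digits,
# and an optional version suffix such as "v2".
ARXIV_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def lint_entries(entries: List[Dict[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Return (work, problem) pairs for duplicate rows and malformed IDs."""
    problems: List[Tuple[str, str]] = []
    seen = set()
    for entry in entries:
        work = entry["work"] or ""
        key = work.strip().casefold()  # compare names case-insensitively
        if key in seen:
            problems.append((work, "duplicate row"))
        seen.add(key)

        arxiv_id = entry.get("arxiv_id")
        if arxiv_id is not None and not ARXIV_NEW_STYLE.match(arxiv_id):
            problems.append((work, "malformed arXiv ID"))
    return problems


# Made-up rows for illustration only; these are not real entries or IDs.
rows = [
    {"work": "Example Agent", "arxiv_id": "2401.00001"},
    {"work": "example agent ", "arxiv_id": "2401.00001v2"},
    {"work": "Placeholder Work", "arxiv_id": "TBD"},
    {"work": "Tool Without Paper", "arxiv_id": None},
]
print(lint_entries(rows))
```

A format check like this catches placeholders and typos early, but a row still needs a manual look to confirm the link points at the intended work.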

Research Taxonomy

| Track | Research question | Representative works |
| --- | --- | --- |
| Screen grounding | Can the model localize text, widgets, controls, and regions well enough to act? | Set-of-Mark, SeeClick, OmniParser, UGround, ScreenSpot-Pro, GUI-Eyes, SafeGround, PAGER |
| Computer use | Can the agent complete tasks in real websites, desktops, or phones over multiple steps? | WebArena, WebLINX, AppAgent, Mobile-Agent, OSWorld, AndroidWorld, Agent S, UI-TARS, WebGym, OpenComputer |
| Embodied VLA | Can visual observations and language be converted into safe physical actions? | PerAct, VIMA, RT-1, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, VLA-REPLICA, Pre-VLA |
| Agentic reasoning and creation | Can the system plan, search, critique, edit, or generate visual artifacts through a loop? | VISPROG, ViperGPT, DiffusionAgent, GenArtist, DeepEyes, Visual Agentic RFT, Agent Banana, VisionCreator, GEMS, GenEvolve |
| General visual agents | Can multimodal agents build reusable visual skills, use visual tools, reason over video/charts/visualizations, and coordinate across visual contexts? | Orion, Kimi K2.5, MMSkills, VTC-Bench, VisualToolAgent, Visual Agentic Memory, HierVA, DV-World |
| Reliability and safety | Can we measure brittleness, privacy risk, prompt injection, unsafe actions, and deployment readiness? | VPI-Bench, OpenAgentSafety, OS-BLIND, HazardArena, UI-CUBE, GUIDE, CORA, WARD |
| Infrastructure | Which tools and environments support reproducible training, deployment, and evaluation? | BrowserGym, AgentLab, Stagehand, Playwright MCP, Agent S, Cua, OpenCUA, ScaleCUA, WebGym, C-World, LeRobot |

Curation Rubric

An item should usually satisfy at least one of these conditions:

  • It defines a new visual-agent capability, benchmark, data engine, training recipe, runtime, or safety evaluation.
  • It evaluates closed-loop behavior rather than only static recognition or one-shot generation.
  • It is widely used as a baseline, benchmark, dataset, environment, or builder tool.
  • It has a stable paper, official code, project page, or documentation that readers can inspect.

An item is removed or left out when the visual-agent connection is weak, the link is unverifiable, the arXiv ID is wrong, the row duplicates a better entry, or the contribution is mostly a product announcement without enough technical detail.
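Read as a decision procedure, the rubric above amounts to "any removal reason excludes; otherwise at least one inclusion condition admits." The sketch below encodes that reading under stated assumptions: the `Candidate` fields and the `should_include` function are hypothetical names, and the rubric's "usually" means a maintainer's judgment can still override the result.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Inclusion conditions (at least one should usually hold).
    defines_new_contribution: bool = False   # capability, benchmark, data engine, recipe, runtime, or safety evaluation
    evaluates_closed_loop: bool = False      # not only static recognition or one-shot generation
    widely_used: bool = False                # baseline, benchmark, dataset, environment, or builder tool
    has_inspectable_artifact: bool = False   # stable paper, official code, project page, or documentation
    # Removal reasons (any one is enough to leave the item out).
    weak_agent_connection: bool = False
    unverifiable_link: bool = False
    wrong_arxiv_id: bool = False
    duplicates_better_entry: bool = False
    announcement_without_detail: bool = False


def should_include(c: Candidate) -> bool:
    """Apply removal reasons first, then require one inclusion condition."""
    removal_reasons = (
        c.weak_agent_connection,
        c.unverifiable_link,
        c.wrong_arxiv_id,
        c.duplicates_better_entry,
        c.announcement_without_detail,
    )
    if any(removal_reasons):
        return False
    inclusion_conditions = (
        c.defines_new_contribution,
        c.evaluates_closed_loop,
        c.widely_used,
        c.has_inspectable_artifact,
    )
    return any(inclusion_conditions)


print(should_include(Candidate(evaluates_closed_loop=True)))                       # True
print(should_include(Candidate(evaluates_closed_loop=True, wrong_arxiv_id=True)))  # False
```

Checking removal reasons first mirrors the text: a strong contribution with a wrong arXiv ID is still left out until the row is corrected.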

Reading Pathways

GUI and computer use. Start with SeeClick, OmniParser, OSWorld, UI-TARS, Agent S2, OpenCUA, UI-Copilot, WebGym, MementoGUI, and OpenComputer.

Mobile GUI agents. Read Android in the Wild, MM-Navigator, AppAgent, Mobile-Agent, Mobile-Agent-v2, A3, MemGUI-Bench, PSPA-Bench, OmniGUI, and "How Mobile World Model Guides GUI Agents?"

Grounding and perception. Read ScreenAI, Ferret-UI, UGround, ScreenSpot-Pro, GUI-Actor, Phi-Ground, GUI-Eyes, UI-Zoomer, SafeGround, AutoFocus, and PAGER.

Embodied VLA. Start with PerAct, VIMA, RT-1, PaLM-E, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, World-Value-Action, VLAs-as-Tools, VLA-REPLICA, and Pre-VLA.

Agentic visual reasoning and creation. Read VISPROG, Visual ChatGPT, ViperGPT, LLaVA-Plus, DiffusionAgent, GenArtist, DeepEyes, Agent Banana, VisionCreator, Visual Agentic Memory, GEMS, and GenEvolve.

General visual agents. Read Visual Agentic Reinforcement Fine-Tuning, VisualToolAgent, Orion, Kimi K2.5, VTC-Bench, MMSkills, Visual Agentic Memory, HierVA, and DV-World.

Recent Additions

Work Date Contribution / Relevance
GUI-Eyes 2026-01 Active visual perception for GUI grounding with learned crop/zoom tool use.
ShowUI-Aloha 2026-01 Converts human screen recordings into structured GUI-agent supervision.
OS-Symphony 2026-01 Holistic framework for robust computer-using agents.
Kimi K2.5 2026-02 Open-source multimodal model focused on visual agentic intelligence.
Agent Banana 2026-02 Agentic image editing with planning and tool execution rather than one-shot editing.
SAGE 2026-02 Agentic 3D scene generation for embodied-AI policy training.
ActionEngine 2026-02 Uses state-machine memory to make GUI agents more programmatic and recoverable.
OmniStream 2026-03 Streaming visual-agent representation for perception, reconstruction, and action.
VTC-Bench 2026-03 Evaluates compositional visual tool chaining in agentic multimodal models.
CUA-Suite 2026-03 Large human-annotated video demonstrations for computer-use agents.
GEMS 2026-03 Multimodal generation loop with memory, skills, and iterative agent refinement.
SciVisAgentBench 2026-03-31 Benchmark for scientific data analysis and visualization agents.
SASAV 2026-04-03 Self-directed agent for scientific analysis and visualization workflows.
UI-Copilot 2026-04 Long-horizon GUI automation with tool-integrated policy optimization.
UI-Zoomer 2026-04 Uncertainty-driven zoom-in for hard GUI grounding cases.
CANVAS 2026-04-15 Agentic storyboarding for continuity-aware long-form visual narratives.
Progressive Online Video Understanding 2026-04-20 Streaming visual-agent setting where answers trigger when enough evidence appears.
Beyond Pixels 2026-04-22 Interactive grounding for visualization agents beyond static pixel reading.
AI-Gram 2026-04-23 Deployed AI-native social network where visual agents create and respond to visual content.
DynamicGUIBench 2026-04 Evaluates GUI agents in high-dynamic interfaces rather than static screenshots.
DV-World 2026-04-28 Real-world benchmark for data-visualization agents with grounding and intent alignment.
UI-Verse 2026-05 Studies interface design heuristics that improve computer-use-agent reliability.
HierVA 2026-05-05 Hierarchical visual agent for chart reasoning across image-text contexts.
Securing Computer-Use Agents 2026-05 Connects CUA architecture, lifecycle, permission scope, and runtime reliability.
Don't Click That 2026-05 Deception-aware web-agent benchmark and defense for misleading interface elements.
VLAs-as-Tools 2026-05-13 Long-horizon embodied-agent strategy that delegates bounded physical subtasks to specialized VLA tools.
MMSkills 2026-05-13 Multimodal skill packages for reusable procedural knowledge in general visual agents.
Video2GUI 2026-05 Synthesizes GUI interaction trajectories from instructional videos.
SaaS-Bench 2026-05-15 Real-world SaaS workflow benchmark for long-horizon computer-use agents.
ScreenSearch 2026-05-15 Ambiguity-aware OS exploration for building large desktop GUI state graphs.
ShopGym 2026-05-15 Realistic, controllable e-commerce simulation and benchmark for web agents.
Visual Agentic Memory 2026-05-15 Online indexing, hierarchical memory, and agentic retrieval for long video understanding.
SE-GA 2026-05-16 Memory-augmented self-evolution framework for long-horizon GUI agents.
DocOS 2026-05-18 Benchmark for GUI agents that proactively retrieve documentation and ground it into actions.
MementoGUI 2026-05-18 Learned multimodal memory controller for long-horizon GUI-agent trajectories.
AQuaUI 2026-05-19 Adaptive quadtree visual-token reduction for high-resolution GUI-agent screenshots.
CutVerse 2026-05-19 GUI-agent benchmark for professional media post-production editing workflows.
OpenComputer 2026-05-19 Verifier-grounded software worlds and auditable rewards for computer-use agents.
VLA-REPLICA 2026-05-20 Low-cost, reproducible real-world benchmark for VLA model evaluation.
Agent JIT Compilation 2026-05-20 Compiles web-agent plans into lower-latency executable schedules.
GenEvolve 2026-05-20 Self-evolving image-generation agent using tool-orchestrated visual experience distillation.
Pre-VLA 2026-05-21 Runtime verification for risky VLA actions and world-model rollouts before execution.
Spatial Memory for Out-of-Vision Manipulation 2026-05-21 Adds persistent spatial memory to VLA policies when targets leave the camera view.
Generation Navigator 2026-05-18 State-aware multi-turn text-to-image agent trained with trajectory-level RL.
SimGym 2026-05-19 Traffic-grounded VLM browser agents for e-commerce A/B-test simulation.
GUI Agents for Continual Game Generation 2026-05-27 Uses GUI playtesting agents to evaluate and iteratively improve playable browser-game generation.
ProgVLA 2026-05-27 Compact progress-aware VLA policy for long-horizon robot manipulation.
MIRAGE 2026-05-27 Context-aware prompt-injection pipeline for mobile GUI agents through user-generated content.
Mag-VLA 2026-05-27 Bimanual magnetically actuated microrobot manipulation with a VLA policy.
MaskClaw 2026-05-27 Edge-side personalized privacy arbitration and skill evolution for screenshot-based GUI agents.
GenClaw 2026-05-28 Code-driven agentic image generation with reasoning, executable sketches, and generative refinement.
Qwen-VLA 2026-05-29 Unified VLA modeling across embodied tasks, environments, and robot embodiments.
Gaze2Act 2026-05-28 Gaze-conditioned VLA policies for interactive real-robot manipulation.
DeMaVLA 2026-05-29 VLA foundation model for real-world deformable-object manipulation.
BraveGuard 2026-05-31 Self-evolving safety defense trained from open-world threats and realistic computer-use trajectories.
PiL-World 2026-06-04 Chunk-wise world model for closed-loop VLA policy-in-the-loop evaluation.
GUI-AC 2026-06-09 Continual-learning method for GUI agents using adaptive advantage and dynamic clipping.
MemVenom 2026-06-09 Triggered poisoning attack against multimodal memories in long-horizon web agents.
Workflow-GYM 2026-06-10 Long-horizon benchmark for professional GUI workflows across specialized software domains.
HiViG 2026-06-10 History-aware visually grounded critic for pre-execution CUA action evaluation.
Learning What to Say to Your VLA 2026-06-10 Test-time language steering for frozen VLA policies with conformal harmlessness control.
VLGA 2026-06-10 Vision-language-geometry-action model for geometry-grounded autonomous driving.
Orchestra-o1 2026-06-10 Omnimodal agent orchestration framework with modality-aware decomposition, sub-agent specialization, and parallel execution.
CAPED 2026-06-10 Context-aware privacy exposure defense for screenshot-based mobile GUI agents.
PERIA 2026-06-11 Tool-augmented visual agent for spatial reasoning through perception and interaction tools.
InterleaveThinker 2026-06-11 Multi-agent planner-critic pipeline for interleaved text-image generation.
ReactVLA 2026-06-12 Low-latency reactive VLA framework for closed-loop robot manipulation.
Naive Visual Memory is Not Enough 2026-06-12 Failure-mode study of experiential and visual memory in GUI agents.
LabVLA 2026-06-12 Grounds VLA models in scientific laboratory protocol execution.
OSGuard 2026-06-13 Safety benchmark for computer-use agents that distinguishes task success from unsafe shortcuts.
MyPCBench 2026-06-15 Benchmark for personally intelligent computer-use agents over user-specific digital contexts.
LabOSBench 2026-06-15 Computer-use-agent benchmark for scientific instrument control interfaces.
ACE-Ego-0 2026-06-16 Unifies egocentric human video and robotic trajectories for VLA pretraining.
ProCUA-SFT 2026-06-16 Technical report on supervised fine-tuning data and recipes for desktop computer-use agents.
WeaveLA 2026-06-16 Event-driven latent memory weaving for repetitive long-horizon robot manipulation.
GeneralVLA-2 2026-06-16 Geometry-aware reconstruction and governed memory for robot planning.
MuseVLA 2026-06-16 Adaptive multimodal sensing VLA that invokes non-RGB sensors as task tools.
Qwen-RobotManip 2026-06-16 Qwen-VL-based robotic manipulation foundation model scaled with aligned heterogeneous data.
PearlVLA 2026-06-16 Progressive embodied action-plan refinement in latent space for efficient VLA deliberation.
PreAct 2026-06-16 Compiles successful computer-use trajectories into screen-checked state-machine programs.
ThinkingVLA 2026-06-16 Interleaves visual forecasting and language reasoning for long-horizon robotic manipulation.
Uncertainty Quantification for Flow-Based VLAs 2026-06-16 Uses velocity-field disagreement for failure detection and active fine-tuning of flow-based VLAs.
WireCraft 2026-06-16 Industrial deformable-linear-object manipulation benchmark with VLA baselines.

Research Map

Surveys and Landscape

Work Year Links Contribution / Relevance
A Comprehensive Survey of Agents for Computer Use 2025 paper Broad map of computer-use-agent domains, agent loops, and evaluation bottlenecks.
GUI Agents: A Survey 2024 paper Practical survey of GUI-agent architectures, datasets, benchmarks, and failure modes.
A Survey on (M)LLM-Based GUI Agents 2025 paper Focused entry point for planning, grounding, memory, and GUI-agent evaluation.
Towards Trustworthy GUI Agents 2025 paper Reliability and safety framing for deployment-facing GUI agents.
Large Multimodal Agents: A Survey 2024 paper Contextual background on LLM-driven multimodal agent components.
A Survey on Vision-Language-Action Models for Embodied AI 2024 paper Early VLA survey covering embodied perception, planning, and action.
Vision-Language-Action in Robotics 2026 paper Data-centric survey of VLA datasets, benchmarks, and data engines.
Vision-Language-Action Safety 2026 paper Focused taxonomy of threats, evaluations, and defenses for VLA systems.
Safety in Embodied AI 2026 paper Wider safety survey across perception, planning, action, and interaction.
Visual Generation in the New Era 2026 paper Conceptual lens for when visual generation becomes agentic world modeling.
Securing Computer-Use Agents 2026 paper Deployment-grounded view of CUA reliability across architecture, lifecycle, permissions, and oversight.
GUI Agents with Reinforcement Learning 2026 paper RL-centered survey of GUI-agent rewards, data efficiency, continual learning, and deployment risks.
Agentic World Modeling 2026 paper Taxonomy for predictive world models across physical, digital, social, and scientific agents.
World Action Models 2026 paper Defines embodied models that jointly predict future states and actions rather than actions alone.

GUI Grounding and Screen Perception

Work Year Links Contribution / Relevance
CogAgent 2023 paper, code Early high-resolution VLM built explicitly for GUI understanding and navigation.
Set-of-Mark Prompting 2023 paper, code Simple visual marking strategy that became a practical grounding primitive for LMM agents.
SeeClick 2024 paper, code Shows that GUI grounding is a core bottleneck for visual GUI agents.
ScreenAI 2024 paper Strong foundation for screen, document, infographic, and layout-heavy visual understanding.
Ferret-UI 2024 paper, code Region-aware mobile UI understanding with explicit grounding.
OmniParser 2024 paper, code Practical screenshot-to-interactable-region parser for pure-vision GUI agents.
UGround 2024 paper, code Strong pure-vision grounding baseline without relying on accessibility trees.
OS-ATLAS 2024 paper Foundation action model for generalist GUI agents.
ShowUI 2024 paper, code Unifies screenshot-conditioned GUI perception and action modeling.
Aguvis 2024 paper Pure-vision GUI agent direction with autonomous interface interaction.
UI-E2I-Synth 2025 paper Synthetic instruction pipeline for scaling GUI grounding supervision.
ScreenSpot-Pro 2025 paper Hard high-resolution grounding benchmark for professional computer-use screens.
GUI-G1 2025 paper, code Careful analysis of RL pitfalls in GUI grounding.
Enhancing Visual Grounding via Self-Evolutionary RL 2025 paper Data-efficient RL recipe for high-resolution GUI grounding.
GUI-Actor 2025 paper Coordinate-free grounding with an action head and verifier.
Phi-Ground 2025 paper Strong empirical report on training compact GUI grounding models.
Test-Time RL for GUI Grounding 2025 paper Test-time adaptation using region consistency.
Explicit Position-to-Coordinate Mapping 2025 paper Addresses coordinate generation as a concrete grounding bottleneck.
GUI-Eyes 2026 paper Learns when and how to call visual tools such as crop and zoom.
SafeGround 2026 paper Calibrates GUI-grounding uncertainty before risky or irreversible actions.
UI-Zoomer 2026 paper, code Uses uncertainty to decide where to zoom for GUI grounding.
AutoFocus 2026 paper Training-free active visual search for high-resolution GUI grounding.
DRS-GUI 2026 paper Dynamic region search that narrows cluttered screenshots without model fine-tuning.
WinDeskGround 2026 paper Robust grounding benchmark for complex multi-window desktop interfaces.
PAGER 2026 paper Studies point-precise geometric GUI control where small coordinate errors cascade.
AQuaUI 2026 paper Adaptive-quadtree visual-token reduction for high-resolution GUI-agent screenshots.

Computer-Use Agents and Environments

Work Year Links Contribution / Relevance
Mind2Web 2023 paper Foundational benchmark for generalist web agents.
Android in the Wild 2023 paper Large-scale Android device-control dataset with realistic gestures.
WebArena 2023 paper, code Realistic web-agent environment with execution-based tasks.
AutoDroid 2023 paper Early Android task-automation system and benchmark that remains relevant as a mobile-agent baseline.
MM-Navigator 2023 paper Early GPT-4V smartphone GUI navigation agent with zero-shot screen interaction.
AppAgent 2023 paper Smartphone agent that learns app operation from autonomous exploration or demonstrations.
SeeAct 2024 paper Web agent showing why grounding matters for GPT-4V-style agents.
Mobile-Agent 2024 paper, code Vision-centric mobile device agent using visual perception tools and stepwise planning.
VisualWebArena 2024 paper, code Adds visually grounded tasks to realistic web-agent evaluation.
WebVoyager 2024 paper End-to-end multimodal web agent evaluated on live websites.
WebLINX 2024 paper, project Large benchmark of multi-turn conversational web navigation with screenshots and action history.
OmniACT 2024 paper Desktop and web benchmark where agents generate executable automation scripts.
WorkArena 2024 paper, code Enterprise workflow benchmark for knowledge-work agents.
MMInA 2024 paper, code Multihop multimodal Internet-agent benchmark on evolving real websites.
B-MoCA 2024 paper Mobile device-control benchmark across diverse configurations.
OSWorld 2024 paper, code Flagship benchmark for open-ended tasks in real desktop environments.
AndroidWorld 2024 paper, code Dynamic Android benchmark with broad task diversity.
Mobile-Agent-v2 2024 paper, code Multi-agent mobile operation assistant with planning, decision, and reflection roles.
MobileAgentBench 2024 paper Practical benchmark for mobile LLM agents.
WebCanvas 2024 paper Online web-agent benchmark and framework built around Mind2Web-Live.
Agent S 2024 paper, code Open agentic framework for using computers through GUI actions.
Windows Agent Arena 2024 paper, code Scalable evaluation environment for Windows OS agents.
SPA-Bench 2024 paper Comprehensive smartphone-agent evaluation benchmark.
AndroidLab 2024 paper Android training and benchmarking environment with virtual devices and task suites.
VideoWebArena 2024 paper Long-context video understanding inside web-agent workflows.
MageBench 2024 paper, code Lightweight visual-agent benchmark covering WebUI, Sokoban, and Football environments.
UI-TARS 2025 paper Native GUI-agent model trained for perception, grounding, and action.
A3 2025 paper, project Android Agent Arena for online mobile GUI-agent evaluation across real apps.
Agent S2 2025 paper, code Generalist-specialist framework for computer-use agents.
UI-Evol 2025 paper Plug-in knowledge-evolution module that improves OSWorld execution reliability for CUAs.
ZeroGUI 2025 paper Online GUI-agent learning with task generation and reward estimation.
OpenCUA 2025 paper, code Open foundation stack for computer-use agents.
ScaleCUA 2025 paper, code Cross-platform data scaling for open-source computer-use agents.
WebGym 2026 paper Large-scale training environment for realistic visual web agents.
C-World 2026 paper Environment creator for scalable computer-use-agent training.
OS-Symphony 2026 paper, code Framework for robust and generalist computer-use agents.
OmegaUse 2026 paper General-purpose GUI agent for autonomous task execution.
OS-Marathon 2026 paper Benchmark for long-horizon repetitive professional computer-use workflows.
Continual GUI Agents 2026 paper Continual-learning setup and RL recipe for shifting GUI domains and resolutions.
CUA-Skill 2026 paper Structured skill base for reusable computer-use procedures and composition graphs.
DynaWeb 2026 paper Model-based RL framework that trains web agents inside learned web world models.
Avenir-Web 2026 paper Multimodal web agent with grounding experts, experience imitation, and memory.
Agent Alpha 2026 paper Uses step-level MCTS to unify GUI-agent generation, exploration, and evaluation.
UI-Mem 2026 paper Hierarchical experience memory for online RL in mobile GUI agents.
MemGUI-Bench 2026 paper Evaluates memory across mobile GUI sessions and changing environments.
ActionEngine 2026 paper State-machine memory for more structured GUI automation.
SecAgent 2026 paper Efficient 3B mobile GUI agent with semantic context compression and Chinese mobile data.
ContractSkill 2026 paper Treats web-agent skills as repairable contracts that can be verified and reused.
PSPA-Bench 2026 paper Personalized smartphone GUI-agent benchmark with process-level evaluation.
GPA 2026 paper Demonstration-based GUI process automation with local deterministic replay.
ClawGUI 2026 paper Unified framework for training, evaluating, and deploying GUI agents.
RiskWebWorld 2026 paper Realistic interactive benchmark for e-commerce risk-management GUI agents.
UI-Copilot 2026 paper, code Long-horizon GUI automation with tool-integrated policy optimization.
DynamicGUIBench 2026 paper Stress-tests agents in dynamic, evolving GUI environments.
OmniGUI 2026 paper, project Smartphone GUI benchmark with synchronized visual, audio, and video context.
UI-Verse 2026 paper Interface-design perspective on making CUAs more reliable.
How Mobile World Model Guides GUI Agents? 2026 paper Analyzes which mobile world-model representations help GUI-agent training and test-time guidance.
Executable Agentic Memory 2026 paper Converts GUI experience into executable memory graphs for retrieval-and-execution planning.
SaaS-Bench 2026 paper, code Long-horizon benchmark over real deployable SaaS systems and professional workflows.
ShopGym 2026 paper Realistic, controllable e-commerce simulation and benchmark for web agents.
ScreenSearch 2026 paper Ambiguity-aware large-scale desktop OS exploration with deduplicated state graphs.
Skim 2026 paper Speculative execution framework for faster web-agent workflows on structured sites.
SE-GA 2026 paper, code Memory-augmented self-evolving GUI agent for dynamic long-horizon tasks.
DocOS 2026 paper Proactive document-guided GUI-agent benchmark in open web environments.
MementoGUI 2026 paper Plug-in multimodal memory controller for long-horizon GUI control.
OpenComputer 2026 paper Verifiable software worlds with state verifiers, task generation, and auditable rewards.
CutVerse 2026 paper Benchmark for professional media post-production editing with dense multimodal GUIs.
Agent JIT Compilation 2026 paper Compiles web-agent plans into lower-latency executable schedules.
Weblica 2026 paper Reproducible web-replica environments for scaling visual web-agent training.
TClone 2026 paper Low-latency live GUI environment forking for parallel CUA rollouts and what-if execution.
PANDO 2026 paper Online skill distillation that reduces token and action overhead for multimodal web agents.
SimuWoB 2026 paper Synthetic realistic mobile-app benchmark for fast, faithful GUI-agent evaluation.
CUA-Gym 2026 paper Scalable generation of verifiable environments, tasks, rewards, and models for CUA RLVR.
MobileGym 2026 paper Parallel mobile GUI-agent simulator with structured state, deterministic judges, and RL rewards.
AndroidDaily 2026 paper Real-world closed-source Android benchmark with process-aware visual trajectory grading.
LearnWeak 2026 paper Student-aware data synthesis and specialization for small computer-use agents.
PRO-CUA 2026 paper Step-level process-reward optimization for computer-use agents on live web tasks.
GUITestScape 2026 paper Open-set exploratory GUI testing benchmark for MLLM agents.
Multi-Agent Computer Use 2026 paper Multi-agent CUA architecture with DAG decomposition, parallel execution, and replanning.
OpenWebRL 2026 paper Online multi-turn RL framework for training open visual web agents on live websites.
ColorBrowserAgent 2026 paper Human-in-the-loop long-horizon web GUI agent with progress summarization and knowledge adaptation.
WebForge 2026 paper, code Automated framework for generating scalable, reproducible browser-agent benchmark environments.
AgentLens 2026 paper Mobile GUI agent with adaptive visual modalities for human-agent interaction during execution.
SimGym 2026 paper Live-browser VLM-agent framework for simulating e-commerce A/B tests.
GUI-AC 2026 paper Enhances continual GUI-agent learning with grounding-certainty-aware advantage and clipping.
Workflow-GYM 2026 paper Long-horizon benchmark for professional GUI workflows in specialized software environments.
HiViG 2026 paper, code History-aware visually grounded critic for test-time CUA action evaluation.
Naive Visual Memory is Not Enough 2026 paper Failure-mode study of visual and experiential memory modules in GUI agents.
MyPCBench 2026 paper Evaluates personally intelligent CUAs over user-specific digital context and accounts.
LabOSBench 2026 paper Benchmarks CUAs on scientific instrument-control interfaces and feedback loops.
ProCUA-SFT 2026 paper Desktop CUA supervised fine-tuning report with trajectory data and training recipes.
PreAct 2026 paper Compiles successful screen interaction trajectories into guarded state-machine programs for repeat tasks.

Embodied Vision-Language-Action Agents

Work Year Links Contribution / Relevance
PerAct 2022 paper, project Language-conditioned RGB-D manipulation agent that predicts voxel actions directly.
VIMA 2022 paper, project Multimodal-prompt robot manipulation benchmark and transformer agent.
RT-1 2022 paper, project Large-scale real-robot action model that anchors later RT/VLA work.
PaLM-E 2023 paper Embodied multimodal language model connecting visual input to robot tasks.
RT-2 2023 paper Canonical VLA model transferring web-scale vision-language knowledge to robot control.
Open X-Embodiment / RT-X 2023 paper Large robot-learning dataset and RT-X model family.
Octo 2024 paper Open-source generalist robot policy.
OpenVLA 2024 paper, code Open-source VLA model and a common baseline for robot manipulation.
Pi-Zero 2024 paper Flow-based VLA model for general robot control.
Magma 2025 paper, code Bridges multimodal agents across digital and physical actions.
SafeVLA 2025 paper Safety alignment for VLA models via constrained learning.
Interleave-VLA 2025 paper Robot manipulation with interleaved image-text instructions.
ChatVLA-2 2025 paper Open-world embodied reasoning from pretrained knowledge.
VLA^2 2025 paper Agentic framework for unseen-concept manipulation.
World-Value-Action 2026 paper Uses implicit planning and future-state value estimation for VLA systems.
VLAs-as-Tools 2026 paper Splits long-horizon embodied tasks between a high-level VLM planner and specialized VLA tools.
SAGE 2026 paper, code Agentically generates simulator-ready 3D scenes for embodied policy training.
StableVLA 2026 paper Studies robustness of VLA models under unseen visual disturbances without extra data.
Dexora 2026 paper Open-source VLA direction for high-DoF bimanual dexterous manipulation.
VLA-REPLICA 2026 paper Low-cost reproducible real-world evaluation benchmark for VLA models.
Spatial Memory for Out-of-Vision Manipulation 2026 paper Adds persistent spatial memory when manipulation targets leave the current camera view.
Pre-VLA 2026 paper Preemptive runtime verification for VLA actions and world-model rollouts.
ActQuant 2026 paper Action-guided mixed-precision quantization for deploying VLA models on constrained hardware.
Continuous Reasoning for VLA 2026 paper Replaces token-style reasoning with shareable continuous latents aligned to action chunks.
VLAMotor 2026 paper Test-guided failure discovery and agent-based synthetic data repair for VLA models.
FATE-VLA 2026 paper Adaptive failure-aware test generation that searches high-risk embodied scenes for VLA failures.
Uni-LaViRA 2026 paper Agentic language-vision-robot-action architecture for unified embodied navigation across robot types.
ProgVLA 2026 paper Progress-aware compact VLA model for long-horizon and multi-object robot manipulation.
Mag-VLA 2026 paper VLA policy for bimanual magnetically actuated microrobot manipulation.
Gaze2Act 2026 paper Uses human gaze as a dynamic intent signal for interactive VLA robot manipulation.
DeMaVLA 2026 paper VLA foundation model for deformable-object manipulation with real-world folding data.
PiL-World 2026 paper Chunk-wise world model for closed-loop VLA policy-in-the-loop evaluation.
Learning What to Say to Your VLA 2026 paper Searches and distills language feedback policies for steering frozen VLA models.
VLGA 2026 paper Adds dense geometry supervision to vision-language-action models for autonomous driving.
ReactVLA 2026 paper Fast lightweight reactive robot manipulation via improved mean-flow action generation.
Qwen-VLA 2026 paper Unifies embodied decision-making across tasks, environments, and robot embodiments.
LabVLA 2026 paper Grounds VLA models in scientific laboratory protocol execution and bench work.
ACE-Ego-0 2026 paper Bridges egocentric human videos and robot trajectories for VLA pretraining.
WeaveLA 2026 paper Adds event-driven cross-subtask latent memory for repetitive robot manipulation.
GeneralVLA-2 2026 paper Uses geometry-aware reconstruction and governed memory for robot planning.
MuseVLA 2026 paper Treats temperature, audio, radar, and other sensors as on-demand VLA tools.
Qwen-RobotManip 2026 paper Scales Qwen-VL-based manipulation models through aligned heterogeneous robot and human data.
PearlVLA 2026 paper Refines embodied action plans in latent space with future-guided process rewards.
ThinkingVLA 2026 paper Interleaves visual forecasting, inverse reasoning, and action generation for long-horizon manipulation.
Uncertainty Quantification for Flow-Based VLAs 2026 paper, project Estimates VLA epistemic uncertainty for failure detection and active fine-tuning.
WireCraft 2026 paper Industrial deformable-linear-object manipulation benchmark with shared VLA evaluation.

Back to top

Agentic Visual Reasoning, Generation, and World Building

Work Year Links Contribution / Relevance
VISPROG 2022 paper, project Foundational visual-programming approach for tool-composed visual reasoning and editing.
Visual ChatGPT 2023 paper, code Early system connecting ChatGPT with visual foundation models for multi-step visual tasks.
ViperGPT 2023 paper, code Uses Python execution to compose vision modules for interpretable visual reasoning.
LLaVA-Plus 2023 paper Trains multimodal agents to select and use visual tools across understanding and generation.
DiffusionAgent 2024 paper Routes prompts through expert diffusion models with tree-of-thought navigation and feedback memory.
GenArtist 2024 paper, code MLLM-as-agent for image generation and editing through planning and tool use.
CIGEval 2025 paper Agentic evaluation framework for conditional image generation.
DeepEyes 2025 paper Reinforcement learning for active visual reasoning, grounding, and "thinking with images."
ImAgent 2025 paper Test-time scalable multimodal agent framework for image generation.
GenAgent 2026 paper Scales text-to-image generation through agentic multimodal reasoning.
Mind-Brush 2026 paper Adds cognitive search and reasoning loops to image generation.
Agent Banana 2026 paper, code High-fidelity image editing with planner-executor tooling.
M3 2026 paper Multi-modal, multi-agent, multi-round reasoning for high-fidelity text-to-image generation.
VisionCreator 2026 paper Native visual-generation agentic model with understanding, planning, and creation.
VisionCreator-R1 2026 paper Adds explicit reflection and reflection-plan co-optimization for visual-generation agents.
Gen-Searcher 2026 paper, project Reinforces agentic search for image generation.
GEMS 2026 paper, project Multimodal generation with memory, skills, and iterative agent loops.
Visual Generation in the New Era 2026 paper Taxonomy for agentic world modeling and generation.
Visual Agentic Memory 2026 paper Online indexing, hierarchical memory, and agentic retrieval for long video understanding.
GenEvolve 2026 paper Self-evolving image-generation agent with tool-orchestrated visual experience distillation.
Generation Navigator 2026 paper State-aware multi-turn text-to-image agent with trajectory-level RL for generation steering.
GUI Agents for Continual Game Generation 2026 paper, project Uses GUI playtesting agents as evaluators and feedback providers for playable game generation.
GenClaw 2026 paper Code-driven agentic image generation that plans, sketches with executable code, and refines with image models.
InterleaveThinker 2026 paper Multi-agent planner-critic pipeline for agentic interleaved text-image generation.

Back to top

General Visual Agents, Tool Use, and Visualization Agents

Work Year Links Contribution / Relevance
AVA 2023 paper Autonomous visualization agents with visual perception-driven decision making.
Visual Agents as Fast and Slow Thinkers 2024 paper System-1/System-2 framing for visual-agent reasoning and action.
Visual Agentic AI for Spatial Reasoning 2025 paper Dynamic-API visual agent for spatial reasoning in 3D scenes.
Visual Agentic Reinforcement Fine-Tuning 2025 paper Trains VLMs to use visual tools and code for "thinking with images."
ParaView-MCP 2025 paper Autonomous visualization agent with direct tool use in ParaView.
VisualToolAgent / VisTA 2025 paper RL framework for dynamic visual tool selection and composition.
Evaluation-Centric Scientific Visualization Agents 2025 paper Evaluation-first paradigm for scientific visualization agents.
DART 2025 paper Uses multi-agent disagreement to recruit specialized visual tools.
Orion 2025 paper Unified visual agent for multimodal perception, visual reasoning, and tool execution.
Kimi K2.5 2026 paper Open-source multimodal agentic model optimized jointly for text and vision.
OmniStream 2026 paper Streaming visual-agent representation for perception, reconstruction, and action.
VTC-Bench 2026 paper Evaluates agentic multimodal models through compositional visual tool chaining.
SciVisAgentBench 2026 paper Reproducible benchmark for scientific data analysis and visualization agents.
SASAV 2026 paper Self-directed scientific analysis and visualization agent.
CANVAS 2026 paper Visual agentic storyboarding for continuity-aware long-form visual narratives.
Progressive Online Video Understanding 2026 paper Online visual agent that answers when enough streaming evidence appears.
Beyond Pixels 2026 paper Introspective and interactive grounding for visualization agents.
AI-Gram 2026 paper Live social platform populated by visual agents that create and respond to visual content.
DV-World 2026 paper Real-world benchmark for data-visualization agents with native environment grounding.
Hierarchical Visual Agent / HierVA 2026 paper Manages image-text contexts for multi-step chart reasoning across subplots.
Emergent Communication between Heterogeneous Visual Agents 2026 paper Studies decentralized communication when visual agents have private representations.
MMSkills 2026 paper Multimodal procedural skill packages for reusable visual-agent decision making.
Visual Agentic Memory 2026 paper Training-free visual memory for online indexing, retrieval, and evidence verification.
MemEye 2026 paper Visual-centric evaluation framework for long-term multimodal agent memory.
Diversity Over Frequency 2026 paper Studies tool-use collapse and rollout diversity in visual Chain-of-Thought agents.
VESTA 2026 paper Scientific visual exploration agent with dynamically generated statistical tools.
CV-Arena 2026 paper Instructional computer-vision benchmark with agentic planning, editing, and verification.
Visual Skills 2026 paper Multimodal reusable skill paradigm preserving visual evidence and spatial interaction traces.
TVIR 2026 paper Text-visual interleaved deep-research benchmark and hierarchical multimodal report agent.
Active Exploring like a Pigeon 2026 paper Agentic spatial reasoning with dynamic cognitive maps and verifiable spatial assertion codes.
PERIA 2026 paper Tool-augmented visual agent for spatial reasoning across map reasoning, probing, and reconstruction tasks.
Orchestra-o1 2026 paper Omnimodal agent orchestration with modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution.

Back to top

Safety, Robustness, and Evaluation

Work Year Links Contribution / Relevance
AGENTSAFE 2025 paper Safety benchmark for embodied agents under hazardous instructions.
IS-Bench 2025 paper Interactive safety benchmark for VLM-driven household agents.
VPI-Bench 2025 paper, code Visual prompt-injection benchmark for computer-use agents.
OpenAgentSafety 2025 paper Framework for evaluating real-world agent safety across risk categories.
OS-Sentinel 2025 paper Hybrid validation for safer mobile GUI agents.
UI-CUBE 2025 paper Enterprise CUA benchmark that measures operational reliability beyond task accuracy.
SafePred 2026 paper Predictive guardrail for computer-using agents using world-model rollouts.
LPS-Bench 2026 paper Safety-awareness benchmark for long-horizon CUA planning under benign and adversarial scenarios.
GUIGuard-Bench 2026 paper Privacy-preserving GUI-agent evaluation.
CUAAudit 2026 paper Tests whether VLMs can audit autonomous computer-use agents.
GUIDE 2026 paper Hierarchical diagnostic evaluation for long GUI-agent trajectories.
VeriGUI 2026 paper Action-effect verification and self-correction for robust GUI automation.
Semantic-level UI Element Injection 2026 paper Red-teaming method that distracts GUI agents through benign-looking injected UI elements.
CORA 2026 paper Conformal risk-controlled safeguard for mobile GUI-agent action execution.
OS-BLIND 2026 paper Shows how benign-looking user instructions expose CUA vulnerabilities.
HazardArena 2026 paper Semantic safety evaluation for VLA systems.
RedVLA 2026 paper Physical red-teaming benchmark for VLA models.
GUI-Perturbed 2026 paper Domain-randomization study exposing GUI-grounding brittleness.
OS-SPEAR 2026 paper Toolkit for safety, performance, efficiency, and robustness analysis of OS agents.
Don't Click That 2026 paper Benchmarks and mitigates deceptive UI elements for VLM-based web agents.
SafeManip 2026 paper Temporal-safety benchmark for robotic manipulation using LTL-style monitors.
WARD 2026 paper Robust defense for web agents against prompt injection in HTML and visual interfaces.
ProjGuard 2026 paper Safety monitoring for computer-use agents via low-dimensional projections.
Pre-VLA 2026 paper Runtime verification for risky VLA action generation and imagined rollouts.
AgentHijack 2026 paper Benchmarks CUA robustness to realistic environment corruptions rather than direct adversarial prompts.
ROGUE 2026 paper Corrigibility benchmark showing unsafe behavior can arise during ordinary computer-use tasks.
SafeVLA-Bench 2026 paper Post-hoc safety benchmark exposing unsafe-success cases in VLA manipulation rollouts.
FATE-VLA 2026 paper Failure-seeking VLA test generation for robustness evaluation before deployment.
MemVenom 2026 paper Triggered poisoning attack against multimodal memory retrieval in web agents.
OSGuard 2026 paper Dual-granularity CUA safety benchmark for unsafe shortcuts under benign instructions.
MIRAGE 2026 paper Context-aware prompt injection against mobile GUI agents through user-generated content regions.
MaskClaw 2026 paper Edge-side personalized privacy arbitration for GUI agents with behavior-driven skill evolution.
BraveGuard 2026 paper Self-evolving guard training loop for safer computer-use-agent trajectories.
CAPED 2026 paper Context-aware screenshot exposure control for mobile GUI-agent privacy.

Back to top

Benchmarks and Environments

Area Resource Link Primary Use
Web MiniWoB++ code Compact browser-interaction environments for controlled RL-style experiments.
Web Mind2Web paper Offline web-agent action prediction and grounding.
Web WebArena paper, code Realistic web navigation with execution-based grading.
Web WebArena-Verified code Audited WebArena task set with deterministic offline evaluation.
Web VisualWebArena paper, code Visually grounded web tasks where screenshots matter.
Web WebLINX paper, project Conversational web navigation from expert demonstrations.
Web WorkArena paper, code Enterprise workflow automation in ServiceNow-style environments.
Web MMInA paper, code Multihop multimodal tasks over evolving real websites.
Web WebCanvas paper Online web-agent evaluation with Mind2Web-Live.
Web WebGym paper Large-scale realistic training environment for visual web agents.
Web DocOS paper Document-guided GUI-agent tasks in dynamic open-web environments.
Web RiskWebWorld paper Realistic e-commerce risk-management tasks for GUI agents.
Web SaaS-Bench paper, code Long-horizon professional workflows across deployable SaaS systems.
Web ShopGym paper Controllable e-commerce simulation with realistic layouts, catalogs, policies, and tasks.
Desktop OSWorld paper, code Open-ended desktop tasks in real operating systems.
Desktop Windows Agent Arena paper, code Windows-specific scaling and reproducible OS-agent evaluation.
Desktop OmniACT paper Evaluating executable automation rather than only low-level clicks.
Desktop OS-Marathon paper Long-horizon repetitive professional workflows.
Desktop OpenComputer paper Verifiable software worlds with state verifiers and auditable partial-credit rewards.
Desktop CutVerse paper Media post-production editing tasks across professional creative applications.
Mobile Android in the Wild paper Large-scale Android device-control demonstrations with screen observations.
Mobile B-MoCA paper Mobile control across diverse device configurations.
Mobile AndroidWorld paper, code Dynamic Android tasks with broad app coverage.
Mobile AndroidControl paper Diverse Android control dataset for studying scale and generalization.
Mobile MobileAgentBench paper Efficient mobile-agent evaluation across open-source apps.
Mobile SPA-Bench paper Smartphone-agent testing with comprehensive task coverage.
Mobile AndroidLab paper Training and systematic benchmarking on Android virtual devices.
Mobile A3 paper, project Real-app online evaluation for mobile GUI agents.
Mobile SecAgent paper Chinese mobile GUI dataset, benchmark, and compact semantic-context agent.
Mobile PSPA-Bench paper Personalized smartphone GUI-agent tasks with process-aware evaluation.
Mobile OmniGUI paper, project Omni-modal smartphone action prediction with visual, audio, and video cues.
Computer use C-World paper On-demand environment creation for computer-use-agent training.
Grounding ScreenSpot-Pro paper High-resolution professional-screen grounding.
Grounding WinDeskGround paper Multi-window desktop grounding under realistic visual clutter.
Grounding PAGER paper Point-precise GUI control for geometric construction tasks.
Visual-agent reasoning MageBench paper, code Lightweight environments for vision-in-the-chain agent reasoning.
Visual-agent reasoning VTC-Bench paper Compositional visual tool chaining for agentic multimodal models.
Visualization SciVisAgentBench paper Scientific data analysis and visualization-agent evaluation.
Visualization DV-World paper Real-world data visualization tasks with environment grounding and intent alignment.
Memory MemGUI-Bench paper Cross-session and cross-temporal mobile GUI memory.
Memory MementoGUI-Bench paper Long-horizon GUI decision-making with memory consistency diagnostics.
Memory Visual Agentic Memory paper Online indexing and evidence retrieval for long video understanding.
Dynamic GUI DynamicGUIBench paper Robustness under evolving interfaces and dynamic UI changes.
Exploration ScreenSearch paper Large-scale desktop state-graph exploration under partial observability.
Enterprise reliability UI-CUBE paper Deployment-readiness diagnostics beyond simple task success.
Security VPI-Bench paper, code Visual prompt injection for GUI and computer-use agents.
Safety AGENTSAFE paper Hazardous-instruction safety for embodied agents.
Safety HazardArena paper Semantic safety evaluation for VLA systems.
Safety SafeManip paper Temporal safety properties for robotic manipulation rollouts.
Embodied LIBERO code Lifelong robot manipulation tasks.
Embodied RLBench code Simulation-based manipulation benchmark.
Embodied VLA-REPLICA paper Low-cost reproducible real-world VLA evaluation.
Web Weblica paper Scalable reproducible web-replica environments for visual web-agent training.
Web CUA-Gym paper Verifiable RLVR task/environment/reward generation for computer-use agents.
Web OpenWebRL paper Online multi-turn RL framework for live visual web agents.
Mobile SimuWoB paper Synthetic high-fidelity mobile apps with automatic rewards.
Mobile MobileGym paper Highly parallel mobile GUI simulator with deterministic state-based judging.
Mobile AndroidDaily paper Closed-source real-app Android benchmark with visual process evaluation.
Desktop TClone paper Low-latency forking of live GUI environments for CUA execution and evaluation.
GUI testing GUITestScape paper Open-set exploratory GUI testing with interaction and display defects.
Visual-agent memory MemEye paper Evaluates whether multimodal agent memory preserves visual evidence.
Visualization VESTA / DAWN paper Statistical modeling benchmark and visual tool-agent framework.
Visual editing CV-Arena paper Instructional computer-vision task benchmark with human-AI preference evaluation.
Embodied safety SafeVLA-Bench paper Success-safety gap evaluation for VLA manipulation policies.
Web WebForge-Bench paper, code Automatically generated self-contained browser-agent benchmark environments.
Web SimGym paper E-commerce A/B-test simulation with traffic-grounded live-browser VLM agents.
Web MemVenom paper Memory-poisoning threat model for long-horizon web agents with multimodal retrieval.
GUI game generation PlaytestArena paper, project Browser-game generation benchmark evaluated by GUI playtesting agents.
Security MIRAGE paper Mobile GUI prompt-injection samples placed in realistic user-generated content.
Safety OSGuard paper Computer-use-agent safety benchmark for unsafe shortcuts during normal tasks.
Privacy MaskClaw paper Edge-side privacy arbitration benchmark and skill-evolution scenarios for GUI agents.
Privacy CAPED paper Context-aware mobile GUI screenshot exposure defense.
Embodied evaluation PiL-World paper Closed-loop VLA policy-in-the-loop evaluation with imagined action-conditioned observations.
Desktop Workflow-GYM paper Long-horizon GUI workflows in professional software fields.
Desktop MyPCBench paper Personal computer-use benchmark with user-specific context and account state.
Scientific instruments LabOSBench paper Scientific instrument-control interfaces for computer-use-agent evaluation.
Embodied evaluation WireCraft paper Industrial wire and cable manipulation benchmark with VLA policy baselines.

Back to top

Skills, Tools, and Engineering Resources

These resources are intentionally separated from research papers: they are implementation and evaluation artifacts, and not every one is a standalone research contribution.

Skill and Prompt Libraries

Resource Type Link Primary Use
OpenAI Skills guide docs Docs Understanding skill-style packaging for reusable agent capabilities.
Agent Skills for Large Language Models survey Paper Architecture, acquisition, and security framing for skill-based agents.
CUA-Skill skill base Paper Reusable computer-use procedures with parameterized execution graphs.
MMSkills multimodal skill framework Paper Reusable visual procedures with multimodal state, progress, and failure evidence.
awesome-agent-skills collection GitHub Finding reusable agent skills across browsing, coding, documents, and visual tasks.
awesome-gpt-image-2 collection GitHub Tracking prompt patterns and workflows around modern image generation.
gpt_image_2_skill skill package GitHub Example of packaging image-generation workflows as reusable skills.
ToDiagram skills skill collection GitHub Diagram and visual-communication skills that pair well with visual agents.

Models, Parsers, and Grounding Tools

Resource Type Link Primary Use
OmniParser parser GitHub Converting screenshots into candidate interactable regions.
ShowUI GUI model GitHub Screenshot-conditioned GUI action modeling and demonstration pipelines.
UGround grounding model GitHub Pure-vision GUI grounding without accessibility trees.
OS-ATLAS action model Paper Cross-platform GUI action grounding.
GUI-G1 grounding model GitHub Studying RL recipes and evaluation pitfalls for GUI grounding.
UI-Zoomer grounding tool GitHub Adaptive zoom-in when the target UI element is hard to localize.
Phi-Ground grounding model Paper Compact GUI grounding baseline for resource-constrained settings.
SafeGround grounding calibrator Paper Estimating grounding risk before executing high-impact GUI actions.
AutoFocus grounding tool Paper Training-free uncertainty-aware active visual search on high-resolution screens.
AQuaUI token reducer Paper Adaptive quadtree compression for GUI screenshots at inference time.
Orion visual agent Paper Tool-augmented visual reasoning and execution across images, videos, and documents.
Kimi K2.5 visual agentic model Paper Open-source multimodal agentic intelligence model with joint text-vision optimization.
VisualToolAgent tool selector Paper RL-based selection and composition of visual tools.
VTC-Bench tool-chain benchmark Paper Evaluating compositional visual tool use in agentic multimodal models.

Agent Runtimes and Operator Stacks

Resource Type Link Primary Use
UI-TARS Desktop desktop agent GitHub Running multimodal desktop agents locally.
Agent S runtime GitHub General computer-use experiments with a practical open framework.
Cua operator stack GitHub Infrastructure for running and evaluating computer-use agents.
OpenAdapt generative RPA stack GitHub Recording GUI demonstrations, training models, and evaluating agents from a unified CLI.
HIDAgent HID toolkit Paper Enabling visual UI agents on HID-compatible devices.
GPA demo replay stack Paper Local GUI process automation from a single demonstration.
browser-use browser runtime GitHub Browser automation workflows when DOM/tool access is acceptable.
Stagehand browser runtime GitHub Hybrid code-plus-natural-language browser automation for production workflows.
Playwright MCP browser MCP server GitHub Gives agents browser automation tools through the Model Context Protocol.
BrowserGym browser harness GitHub Reproducible browser-agent experiments and benchmark orchestration.
AgentLab experiment framework GitHub Running, comparing, and analyzing web-agent experiments.
OpenAdapt Desktop desktop capture/runtime GitHub Capturing human demonstrations and replaying desktop workflows.
ScreenPipe local data capture GitHub Recording local screen/audio context for personal or research agents.

Data Capture, Training, and Evaluation Stacks

Resource Type Link Primary Use
OSWorld desktop environment GitHub Standard desktop benchmark and environment.
AndroidWorld mobile environment GitHub Dynamic Android environment for mobile agents.
AndroidControl mobile dataset Paper Large Android control demonstrations for training and data-scaling studies.
Windows Agent Arena desktop environment GitHub Windows-specific OS-agent evaluation.
WebArena web benchmark GitHub Realistic web tasks with execution-based grading.
WebArena-Verified web benchmark GitHub Audited and deterministic WebArena evaluation.
VisualWebArena visual web benchmark GitHub Web tasks where screenshots and visual grounding matter.
WorkArena enterprise benchmark GitHub Enterprise-style workflow automation.
OpenCUA open CUA stack GitHub Data, models, and evaluation foundations for computer-use agents.
ScaleCUA scaling stack GitHub Cross-platform CUA data scaling and evaluation.
WebGym visual web environment Paper Large-scale realistic training tasks for visual web agents.
C-World environment creator Paper Creating diverse computer-use environments on demand.
OpenComputer verifiable worlds Paper State verifiers, synthetic desktop tasks, and auditable rewards.
ShopGym e-commerce simulator Paper Realistic and controllable e-commerce web-agent evaluation.
SciVisAgentBench visualization benchmark Paper Evaluating scientific visualization agents on executable analysis tasks.
DV-World data-visualization benchmark Paper Real-world visualization-agent scenarios with evolving environments.
Visual Agentic Memory video memory framework Paper Training-free long-video indexing, hierarchical memory, and retrieval.
CUA-Suite data suite Paper Large human-annotated video demonstrations for CUA research.
ShowUI-Aloha data pipeline Paper, code Turning screen recordings into GUI-agent training trajectories.
Video2GUI data pipeline Paper Synthesizing GUI trajectories from instructional videos.
ScreenSearch exploration corpus Paper Building desktop GUI state graphs through ambiguity-aware exploration.
CutVerse creative-workflow benchmark Paper Professional media post-production GUI trajectories and evaluation.
lmms-eval eval toolkit GitHub Static multimodal evaluation that can complement closed-loop agent tests.
WebForge browser benchmark generator Paper, code Automatically generating reproducible browser-agent benchmark environments.
SimGym e-commerce simulator Paper Simulating visually driven e-commerce A/B tests with live-browser VLM agents.
PlaytestArena game-generation benchmark Paper, project Using GUI agents to playtest generated browser games.
Workflow-GYM professional GUI benchmark Paper Evaluating long-horizon computer-use agents in specialized professional software.
MyPCBench personal CUA benchmark Paper Testing computer-use agents in personalized digital environments.
LabOSBench scientific-instrument benchmark Paper Evaluating computer-use agents on scientific instrument-control workflows.
ProCUA-SFT desktop CUA data Paper Supervised fine-tuning data and recipes for desktop computer-use agents.

Embodied and Robotics Tooling

Resource Type Link Primary Use
OpenVLA VLA model GitHub Common open baseline for VLA robot manipulation.
LeRobot robotics toolkit GitHub Robot-learning datasets, policies, training, and deployment tooling.
LIBERO robotics benchmark GitHub Lifelong robot manipulation tasks.
RLBench robotics benchmark GitHub Simulation-based manipulation evaluation.
SAGE 3D scene engine GitHub Agentic 3D scene generation for embodied policy training.
Magma foundation model GitHub Bridging digital computer use and physical action.
VLA-REPLICA robot benchmark Paper Low-cost reproducible real-world VLA evaluation.
SafeManip safety benchmark Paper Temporal-safety monitors for robotic manipulation rollouts.
ReactVLA VLA model Paper, project Low-latency reactive VLA policy for real-time robot manipulation.
PiL-World VLA evaluation Paper Closed-loop policy-in-the-loop evaluation without executing every rollout on a real robot.
Qwen-VLA VLA model Paper Unified embodied decision-making across tasks, environments, and robot embodiments.
Qwen-RobotManip VLA model Paper Scaled robotic manipulation foundation model built on Qwen-VL.
LabVLA laboratory VLA Paper Grounding VLA models in scientific laboratory protocol execution.
ACE-Ego-0 VLA pretraining data Paper Unifying egocentric human video and robot trajectories for VLA pretraining.
MuseVLA multisensory VLA Paper Invoking non-RGB sensors as adaptive tools for robotic manipulation.
WireCraft manipulation benchmark Paper Industrial deformable-linear-object manipulation benchmark with shared evaluation.

Back to top

Workflow Stacks

Workflow Practical stack
GUI grounding research ScreenSpot-Pro + OmniParser + UGround + GUI-G1 + SafeGround + UI-Zoomer + AutoFocus + PAGER + AQuaUI
Browser-agent experiments BrowserGym + AgentLab + WebArena + WebArena-Verified + VisualWebArena + WebLINX + MMInA + WebGym + ShopGym + SaaS-Bench
Desktop computer-use agents UI-TARS Desktop + Agent S + Cua + OSWorld + Windows Agent Arena + OS-Marathon + OpenComputer + ScreenSearch + CutVerse
Mobile GUI agents Android in the Wild + AndroidControl + AndroidWorld + A3 + MobileAgentBench + SPA-Bench + MemGUI-Bench + PSPA-Bench + OmniGUI + UI-Mem
Demonstration and data pipelines OpenAdapt + OpenAdapt Desktop + ScreenPipe + ShowUI-Aloha + CUA-Suite + Video2GUI + C-World + OpenComputer
Agentic visual creation gpt_image_2_skill + DiffusionAgent + GenArtist + DeepEyes + Agent Banana + VisionCreator + VisionCreator-R1 + GEMS + GenEvolve
General visual agents and tool use Visual Agentic RFT + VisualToolAgent + Orion + Kimi K2.5 + VTC-Bench + MMSkills + Visual Agentic Memory
Visualization and chart agents AVA + ParaView-MCP + SciVisAgentBench + SASAV + Beyond Pixels + DV-World + HierVA
Embodied VLA research OpenVLA + LeRobot + LIBERO + RLBench + SAGE + Magma + VLAs-as-Tools + VLA-REPLICA + SafeManip + Pre-VLA
Reliability and security testing VPI-Bench + OpenAgentSafety + UI-CUBE + GUIDE + CORA + OS-BLIND + HazardArena + OS-SPEAR + WARD + Pre-VLA

Back to top

Official Docs and Engineering Notes

Resource Link Why read it
OpenAI Computer Use guide Docs Developer-facing guide for building with computer-use tooling.
OpenAI Computer-Using Agent Article Product and research framing for modern CUAs.
OpenAI Skills guide Docs Practical reference for reusable agent skills.
OpenAI MCP and Connectors guide Docs Reference for connecting external tools and services to agents.
Anthropic: Developing a computer use model Article Strong public engineering writeup on GUI-agent training and evaluation.
Anthropic: Introducing computer use Article System framing and deployment context for computer-use models.
Google DeepMind: Gemini Robotics Article Industry view on embodied visual agents.
Google DeepMind: Gemini Robotics On-Device Article Notes on low-latency, local VLA deployment.

Back to top

Related Lists

Repository Link Notes
Awesome-GUI-Agents GitHub Focused companion index for GUI grounding and automation papers.
GUI-Agents-Paper-List GitHub Systematic paper index focused on GUI agents.
awesome-ui-agents GitHub Neighboring index for UI-agent papers and projects.
Evolving Visual Generation GitHub Adjacent map for visual-generation systems.
Awesome Multimodal Modeling GitHub Broader multimodal modeling list beyond the stricter agent boundary here.

Back to top

Contributing

Pull requests are welcome when they improve precision rather than volume.

Recommended metadata:

  • The paper title or project name.
  • Official paper, code, project page, or documentation link.
  • The best category for the item.
  • One sentence explaining the visual-agent loop, benchmark role, or builder value.
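The recommended metadata can be captured as a small record and checked before opening a pull request. The following is a minimal sketch, not a tool shipped with this repository: the `Entry` class, its field names, the `validate` helper, and the example values are all hypothetical, and the one-sentence check is only a rough heuristic.

```python
from dataclasses import dataclass, field


# Hypothetical record for one proposed row; field names are illustrative only.
@dataclass
class Entry:
    title: str
    category: str
    summary: str
    links: dict[str, str] = field(default_factory=dict)  # e.g. {"paper": "https://..."}


def validate(entry: Entry) -> list[str]:
    """Return human-readable problems; an empty list means the entry looks complete."""
    problems = []
    if not entry.title.strip():
        problems.append("missing title or project name")
    if not entry.category.strip():
        problems.append("missing category")
    if not entry.links:
        problems.append("no official link provided")
    for kind, url in entry.links.items():
        if not url.startswith("https://"):
            problems.append(f"{kind} link is not an https URL")
    body = entry.summary.strip()
    if not body:
        problems.append("missing one-sentence summary")
    elif ". " in body:
        # Rough proxy: a period followed by a space suggests a second sentence.
        problems.append("summary looks longer than one sentence")
    return problems


example = Entry(
    title="Example Agent",
    category="Benchmarks and Environments",
    summary="Hypothetical benchmark row used only to illustrate the check.",
    links={"paper": "https://example.org/paper"},
)
print(validate(example))  # []
```

Returning a list of problems rather than raising on the first one lets a contributor fix everything in a single pass.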

Out of scope:

  • Generic multimodal model releases with no visual-agent evaluation.
  • One-shot generation papers without planning, tools, search, critique, or interaction.
  • Duplicate benchmark rows unless the new row adds a distinct environment or protocol.
  • Unverified arXiv IDs, placeholder links, and marketing-only announcements.

Back to top

Maintenance Policy

This repository is maintained as a precision-oriented research map:

  • Prefer primary sources: official papers, project pages, code repositories, datasets, benchmarks, and technical documentation.
  • Keep research entries, benchmarks, and engineering resources separated when their roles differ.
  • Add recent work only when it improves the conceptual coverage, empirical coverage, or builder utility of the map.
  • Verify arXiv identifiers, project links, and benchmark names before adding new entries.
  • Prune duplicate, weakly scoped, or marketing-only entries even when they are recent.
  • Preserve a strict visual-agent boundary: perception alone is not sufficient without grounding, planning, tool use, interaction, control, or agent-oriented evaluation.
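Two of the checks above, identifier verification and duplicate pruning, lend themselves to a quick script. The sketch below is an assumption about how a maintainer might automate them, not the repository's actual tooling; `looks_like_arxiv_id`, `find_duplicates`, and the sample rows are hypothetical. The pattern only checks the modern arXiv identifier shape (`YYMM.NNNNN` with an optional version suffix) and does not confirm that the paper exists.

```python
import re
from collections import defaultdict

# Modern arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def looks_like_arxiv_id(identifier: str) -> bool:
    """Shape check only; a well-formed ID can still point to nothing."""
    return ARXIV_ID.fullmatch(identifier.strip()) is not None


def find_duplicates(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (section, normalized name) pairs that occur more than once in the same section."""
    seen = defaultdict(int)
    for section, name in rows:
        seen[(section, name.strip().casefold())] += 1
    return sorted(key for key, count in seen.items() if count > 1)


# Hypothetical rows: the first two collide, the third is a cross-listing in another section.
rows = [
    ("Benchmarks", "Example-Bench"),
    ("Benchmarks", "example-bench "),
    ("Safety", "Example-Bench"),
]
print(find_duplicates(rows))                 # [('Benchmarks', 'example-bench')]
print(looks_like_arxiv_id("2401.12345v2"))   # True
print(looks_like_arxiv_id("arXiv-todo"))     # False
```

Duplicates are keyed by section as well as name, so a work that is deliberately listed in two sections with different roles is not flagged.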

Back to top

Citation

If you use this curated index in research or engineering work, please cite it as:

@misc{awesome-visual-agent,
  title        = {Awesome Visual Agent},
  author       = {OpenEnvision and contributors},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenEnvision/Awesome-Visual-Agent}},
  note         = {Curated list of visual-agent papers, benchmarks, and tooling}
}

Back to top
