A curated research index for visual agents that perceive, ground, plan, act, create, and evaluate in visually grounded environments.
Visual agents occupy the intersection of multimodal perception, grounded reasoning, tool use, interaction, and control. This repository curates papers, benchmarks, datasets, runtimes, and engineering resources for systems that close the loop between visual observation and purposeful action.
The list is selective rather than exhaustive. It prioritizes works that introduce a clear agent loop, action space, evaluation protocol, data engine, safety finding, or reusable implementation artifact, while excluding generic multimodal models and one-shot visual-generation systems without an agentic mechanism.
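As a rough illustration of what "agent loop" and "action space" mean here, the sketch below runs a toy observe-decide-act cycle. Everything in it is hypothetical and not drawn from any listed work: `ToyScreenEnv`, `Action`, `policy`, and `run_episode` are illustrative names, and a rule-based policy stands in for a real vision-language model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Action:
    """One element of a tiny, hypothetical action space."""
    kind: str                         # "type", "click", or "done"
    target: Tuple[int, int] = (0, 0)  # screen coordinates of the target widget
    text: str = ""                    # payload for "type" actions


class ToyScreenEnv:
    """Hypothetical stand-in for a GUI: widgets are named screen coordinates."""

    def __init__(self) -> None:
        self.widgets = {"search_box": (120, 40), "submit": (300, 40)}
        self.typed = ""
        self.submitted = False

    def observe(self) -> dict:
        # A real agent would receive a screenshot; a dict keeps the sketch small.
        return {"widgets": dict(self.widgets),
                "typed": self.typed,
                "submitted": self.submitted}

    def step(self, action: Action) -> None:
        if action.kind == "type" and action.target == self.widgets["search_box"]:
            self.typed = action.text
        elif (action.kind == "click"
              and action.target == self.widgets["submit"]
              and self.typed):
            self.submitted = True


def policy(obs: dict, goal: str) -> Action:
    """Rule-based placeholder for a learned policy: ground a widget, pick an action."""
    if obs["submitted"]:
        return Action("done")
    if obs["typed"] != goal:
        return Action("type", obs["widgets"]["search_box"], goal)
    return Action("click", obs["widgets"]["submit"])


def run_episode(env: ToyScreenEnv, goal: str, max_steps: int = 5) -> Tuple[List[str], bool]:
    """Closed loop: every action is chosen from a fresh observation."""
    trace: List[str] = []
    for _ in range(max_steps):
        action = policy(env.observe(), goal)
        trace.append(action.kind)
        if action.kind == "done":
            break
        env.step(action)
    return trace, env.submitted


if __name__ == "__main__":
    print(run_episode(ToyScreenEnv(), "visual agents"))
```

A one-shot system would map the goal to a single output without re-observing; the loop in `run_episode` is the part this list treats as the agentic mechanism.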
- Selection Boundary
- Research Taxonomy
- Curation Rubric
- Reading Pathways
- Recent Additions
- Research Map
- Benchmarks and Environments
- Skills, Tools, and Engineering Resources
- Workflow Stacks
- Official Docs and Engineering Notes
- Related Lists
- Contributing
- Maintenance Policy
- Citation
Included areas:
- GUI, web, desktop, and mobile agents that perceive screens and produce executable actions.
- Visual grounding work that is clearly tied to downstream agent control.
- Embodied vision-language-action systems for robot manipulation, navigation, and physical-world interaction.
- Agentic visual reasoning and generation systems with search, planning, memory, tools, critique, or iterative refinement.
- Benchmarks, data engines, simulators, safety suites, and toolchains that support visual-agent construction and evaluation.
Excluded by default:
- Broad multimodal foundation models with no visual-agent evaluation.
- Generic OCR, captioning, visual question answering, or layout parsing without an action or agent setting.
- Image, video, or 3D generators that are only prompt-in/artifact-out.
- Unverified arXiv IDs, placeholder-looking entries, product rumors, and duplicate rows.
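The last exclusion can be partly automated. The sketch below is one possible pre-merge check, not the repository's actual tooling: `lint_rows` and the row dictionaries are hypothetical, and the sample IDs are made up. It only tests that an ID has the shape of a post-2007 arXiv identifier (`YYMM.NNNN` or `YYMM.NNNNN`, optionally with a version suffix) and that no work name appears twice; it cannot confirm that an ID points to the right paper.

```python
import re
from collections import Counter
from typing import Dict, List

# Shape of a new-style arXiv identifier: two-digit year, valid month,
# a 4- or 5-digit sequence number, and an optional version suffix.
ARXIV_ID = re.compile(r"^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?$")


def lint_rows(rows: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems found in candidate table rows."""
    problems: List[str] = []

    # Flag IDs that cannot be valid arXiv identifiers; empty IDs are allowed.
    for row in rows:
        arxiv_id = row.get("arxiv_id", "").strip()
        if arxiv_id and not ARXIV_ID.match(arxiv_id):
            problems.append(f"{row['work'].strip()}: malformed arXiv ID {arxiv_id!r}")

    # Flag duplicate rows, comparing names case-insensitively.
    counts = Counter(row["work"].strip().lower() for row in rows)
    reported = set()
    for row in rows:
        key = row["work"].strip().lower()
        if counts[key] > 1 and key not in reported:
            reported.add(key)
            problems.append(f"{row['work'].strip()}: listed {counts[key]} times")

    return problems


if __name__ == "__main__":
    # Hypothetical rows with made-up identifiers, for illustration only.
    sample = [
        {"work": "Example Agent", "arxiv_id": "2401.12345"},
        {"work": "Example Bench", "arxiv_id": "2413.00001"},  # month 13 is invalid
        {"work": "example agent", "arxiv_id": ""},            # duplicate name
    ]
    for problem in lint_rows(sample):
        print(problem)
```

A passing check still leaves the manual step of opening each link and confirming that the title matches the row.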
| Track | Research question | Representative works |
|---|---|---|
| Screen grounding | Can the model localize text, widgets, controls, and regions well enough to act? | Set-of-Mark, SeeClick, OmniParser, UGround, ScreenSpot-Pro, GUI-Eyes, SafeGround, PAGER |
| Computer use | Can the agent complete tasks in real websites, desktops, or phones over multiple steps? | WebArena, WebLINX, AppAgent, Mobile-Agent, OSWorld, AndroidWorld, Agent S, UI-TARS, WebGym, OpenComputer |
| Embodied VLA | Can visual observations and language be converted into safe physical actions? | PerAct, VIMA, RT-1, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, VLA-REPLICA, Pre-VLA |
| Agentic reasoning and creation | Can the system plan, search, critique, edit, or generate visual artifacts through a loop? | VISPROG, ViperGPT, DiffusionAgent, GenArtist, DeepEyes, Visual Agentic RFT, Agent Banana, VisionCreator, GEMS, GenEvolve |
| General visual agents | Can multimodal agents build reusable visual skills, use visual tools, reason over video/charts/visualizations, and coordinate across visual contexts? | Orion, Kimi K2.5, MMSkills, VTC-Bench, VisualToolAgent, Visual Agentic Memory, HierVA, DV-World |
| Reliability and safety | Can we measure brittleness, privacy risk, prompt injection, unsafe actions, and deployment readiness? | VPI-Bench, OpenAgentSafety, OS-BLIND, HazardArena, UI-CUBE, GUIDE, CORA, WARD |
| Infrastructure | Which tools and environments support reproducible training, deployment, and evaluation? | BrowserGym, AgentLab, Stagehand, Playwright MCP, Agent S, Cua, OpenCUA, ScaleCUA, WebGym, C-World, LeRobot |
An item should usually satisfy at least one of these conditions:
- It defines a new visual-agent capability, benchmark, data engine, training recipe, runtime, or safety evaluation.
- It evaluates closed-loop behavior rather than only static recognition or one-shot generation.
- It is widely used as a baseline, benchmark, dataset, environment, or builder tool.
- It has a stable paper, official code, project page, or documentation that readers can inspect.
An item is removed or left out when the visual-agent connection is weak, the link is unverifiable, the arXiv ID is wrong, the row duplicates a better entry, or the contribution is mostly a product announcement without enough technical detail.
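The rubric can be read as a simple predicate: at least one inclusion condition holds and no removal condition applies. The sketch below encodes that reading. The `Candidate` fields and `should_include` are hypothetical names chosen for illustration, and because the rubric says "usually", the result is better treated as a prompt for reviewer judgment than as a hard gate.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical review record for one proposed entry."""
    name: str
    # Inclusion conditions (at least one should usually hold).
    defines_new_artifact: bool = False    # capability, benchmark, data engine, recipe, runtime, safety eval
    evaluates_closed_loop: bool = False   # not only static recognition or one-shot generation
    widely_used: bool = False             # baseline, benchmark, dataset, environment, or builder tool
    has_inspectable_source: bool = False  # stable paper, official code, project page, or docs
    # Removal conditions (any one is enough to leave the item out).
    weak_agent_connection: bool = False
    unverifiable_link: bool = False
    wrong_arxiv_id: bool = False
    duplicates_better_entry: bool = False
    announcement_only: bool = False


def should_include(c: Candidate) -> bool:
    """True when some inclusion condition holds and no removal condition applies."""
    qualifies = any([
        c.defines_new_artifact,
        c.evaluates_closed_loop,
        c.widely_used,
        c.has_inspectable_source,
    ])
    excluded = any([
        c.weak_agent_connection,
        c.unverifiable_link,
        c.wrong_arxiv_id,
        c.duplicates_better_entry,
        c.announcement_only,
    ])
    return qualifies and not excluded


if __name__ == "__main__":
    print(should_include(Candidate("closed-loop benchmark", evaluates_closed_loop=True)))
    print(should_include(Candidate("popular tool, dead link", widely_used=True, unverifiable_link=True)))
```

Listing the removal conditions as separate fields keeps the reason for a rejection visible in review.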
GUI and computer use. Start with SeeClick, OmniParser, OSWorld, UI-TARS, Agent S2, OpenCUA, UI-Copilot, WebGym, MementoGUI, and OpenComputer.
Mobile GUI agents. Read Android in the Wild, MM-Navigator, AppAgent, Mobile-Agent, Mobile-Agent-v2, A3, MemGUI-Bench, PSPA-Bench, OmniGUI, and "How Mobile World Model Guides GUI Agents?"
Grounding and perception. Read ScreenAI, Ferret-UI, UGround, ScreenSpot-Pro, GUI-Actor, Phi-Ground, GUI-Eyes, UI-Zoomer, SafeGround, AutoFocus, and PAGER.
Embodied VLA. Start with PerAct, VIMA, RT-1, PaLM-E, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, World-Value-Action, VLAs-as-Tools, VLA-REPLICA, and Pre-VLA.
Agentic visual reasoning and creation. Read VISPROG, Visual ChatGPT, ViperGPT, LLaVA-Plus, DiffusionAgent, GenArtist, DeepEyes, Agent Banana, VisionCreator, Visual Agentic Memory, GEMS, and GenEvolve.
General visual agents. Read Visual Agentic Reinforcement Fine-Tuning, VisualToolAgent, Orion, Kimi K2.5, VTC-Bench, MMSkills, Visual Agentic Memory, HierVA, and DV-World.
| Work | Date | Contribution / Relevance |
|---|---|---|
| GUI-Eyes | 2026-01 | Active visual perception for GUI grounding with learned crop/zoom tool use. |
| ShowUI-Aloha | 2026-01 | Converts human screen recordings into structured GUI-agent supervision. |
| OS-Symphony | 2026-01 | Holistic framework for robust computer-using agents. |
| Kimi K2.5 | 2026-02 | Open-source multimodal model focused on visual agentic intelligence. |
| Agent Banana | 2026-02 | Agentic image editing with planning and tool execution rather than one-shot editing. |
| SAGE | 2026-02 | Agentic 3D scene generation for embodied-AI policy training. |
| ActionEngine | 2026-02 | Uses state-machine memory to make GUI agents more programmatic and recoverable. |
| OmniStream | 2026-03 | Streaming visual-agent representation for perception, reconstruction, and action. |
| VTC-Bench | 2026-03 | Evaluates compositional visual tool chaining in agentic multimodal models. |
| CUA-Suite | 2026-03 | Large human-annotated video demonstrations for computer-use agents. |
| GEMS | 2026-03 | Multimodal generation loop with memory, skills, and iterative agent refinement. |
| SciVisAgentBench | 2026-03-31 | Benchmark for scientific data analysis and visualization agents. |
| SASAV | 2026-04-03 | Self-directed agent for scientific analysis and visualization workflows. |
| UI-Copilot | 2026-04 | Long-horizon GUI automation with tool-integrated policy optimization. |
| UI-Zoomer | 2026-04 | Uncertainty-driven zoom-in for hard GUI grounding cases. |
| CANVAS | 2026-04-15 | Agentic storyboarding for continuity-aware long-form visual narratives. |
| Progressive Online Video Understanding | 2026-04-20 | Streaming visual-agent setting where answers trigger when enough evidence appears. |
| Beyond Pixels | 2026-04-22 | Interactive grounding for visualization agents beyond static pixel reading. |
| AI-Gram | 2026-04-23 | Deployed AI-native social network where visual agents create and respond to visual content. |
| DynamicGUIBench | 2026-04 | Evaluates GUI agents in high-dynamic interfaces rather than static screenshots. |
| DV-World | 2026-04-28 | Real-world benchmark for data-visualization agents with grounding and intent alignment. |
| UI-Verse | 2026-05 | Studies interface design heuristics that improve computer-use-agent reliability. |
| HierVA | 2026-05-05 | Hierarchical visual agent for chart reasoning across image-text contexts. |
| Securing Computer-Use Agents | 2026-05 | Connects computer-use-agent (CUA) architecture, lifecycle, permission scope, and runtime reliability. |
| Don't Click That | 2026-05 | Deception-aware web-agent benchmark and defense for misleading interface elements. |
| VLAs-as-Tools | 2026-05-13 | Long-horizon embodied-agent strategy that delegates bounded physical subtasks to specialized VLA tools. |
| MMSkills | 2026-05-13 | Multimodal skill packages for reusable procedural knowledge in general visual agents. |
| Video2GUI | 2026-05 | Synthesizes GUI interaction trajectories from instructional videos. |
| SaaS-Bench | 2026-05-15 | Real-world SaaS workflow benchmark for long-horizon computer-use agents. |
| ScreenSearch | 2026-05-15 | Ambiguity-aware OS exploration for building large desktop GUI state graphs. |
| ShopGym | 2026-05-15 | Realistic, controllable e-commerce simulation and benchmark for web agents. |
| Visual Agentic Memory | 2026-05-15 | Online indexing, hierarchical memory, and agentic retrieval for long video understanding. |
| SE-GA | 2026-05-16 | Memory-augmented self-evolution framework for long-horizon GUI agents. |
| DocOS | 2026-05-18 | Benchmark for GUI agents that proactively retrieve documentation and ground it into actions. |
| MementoGUI | 2026-05-18 | Learned multimodal memory controller for long-horizon GUI-agent trajectories. |
| AQuaUI | 2026-05-19 | Adaptive quadtree visual-token reduction for high-resolution GUI-agent screenshots. |
| CutVerse | 2026-05-19 | GUI-agent benchmark for professional media post-production editing workflows. |
| OpenComputer | 2026-05-19 | Verifier-grounded software worlds and auditable rewards for computer-use agents. |
| VLA-REPLICA | 2026-05-20 | Low-cost, reproducible real-world benchmark for VLA model evaluation. |
| Agent JIT Compilation | 2026-05-20 | Compiles web-agent plans into lower-latency executable schedules. |
| GenEvolve | 2026-05-20 | Self-evolving image-generation agent using tool-orchestrated visual experience distillation. |
| Pre-VLA | 2026-05-21 | Runtime verification for risky VLA actions and world-model rollouts before execution. |
| Spatial Memory for Out-of-Vision Manipulation | 2026-05-21 | Adds persistent spatial memory to VLA policies when targets leave the camera view. |
| Generation Navigator | 2026-05-18 | State-aware multi-turn text-to-image agent trained with trajectory-level RL. |
| SimGym | 2026-05-19 | Traffic-grounded VLM browser agents for e-commerce A/B-test simulation. |
| GUI Agents for Continual Game Generation | 2026-05-27 | Uses GUI playtesting agents to evaluate and iteratively improve playable browser-game generation. |
| ProgVLA | 2026-05-27 | Compact progress-aware VLA policy for long-horizon robot manipulation. |
| MIRAGE | 2026-05-27 | Context-aware prompt-injection pipeline for mobile GUI agents through user-generated content. |
| Mag-VLA | 2026-05-27 | Bimanual magnetically actuated microrobot manipulation with a VLA policy. |
| MaskClaw | 2026-05-27 | Edge-side personalized privacy arbitration and skill evolution for screenshot-based GUI agents. |
| GenClaw | 2026-05-28 | Code-driven agentic image generation with reasoning, executable sketches, and generative refinement. |
| Qwen-VLA | 2026-05-29 | Unified VLA modeling across embodied tasks, environments, and robot embodiments. |
| Gaze2Act | 2026-05-28 | Gaze-conditioned VLA policies for interactive real-robot manipulation. |
| DeMaVLA | 2026-05-29 | VLA foundation model for real-world deformable-object manipulation. |
| BraveGuard | 2026-05-31 | Self-evolving safety defense trained from open-world threats and realistic computer-use trajectories. |
| PiL-World | 2026-06-04 | Chunk-wise world model for closed-loop VLA policy-in-the-loop evaluation. |
| GUI-AC | 2026-06-09 | Continual-learning method for GUI agents using adaptive advantage and dynamic clipping. |
| MemVenom | 2026-06-09 | Triggered poisoning attack against multimodal memories in long-horizon web agents. |
| Workflow-GYM | 2026-06-10 | Long-horizon benchmark for professional GUI workflows across specialized software domains. |
| HiViG | 2026-06-10 | History-aware visually grounded critic for pre-execution CUA action evaluation. |
| Learning What to Say to Your VLA | 2026-06-10 | Test-time language steering for frozen VLA policies with conformal harmlessness control. |
| VLGA | 2026-06-10 | Vision-language-geometry-action model for geometry-grounded autonomous driving. |
| Orchestra-o1 | 2026-06-10 | Omnimodal agent orchestration framework with modality-aware decomposition, sub-agent specialization, and parallel execution. |
| CAPED | 2026-06-10 | Context-aware privacy exposure defense for screenshot-based mobile GUI agents. |
| PERIA | 2026-06-11 | Tool-augmented visual agent for spatial reasoning through perception and interaction tools. |
| InterleaveThinker | 2026-06-11 | Multi-agent planner-critic pipeline for interleaved text-image generation. |
| ReactVLA | 2026-06-12 | Low-latency reactive VLA framework for closed-loop robot manipulation. |
| Naive Visual Memory is Not Enough | 2026-06-12 | Failure-mode study of experiential and visual memory in GUI agents. |
| LabVLA | 2026-06-12 | Grounds VLA models in scientific laboratory protocol execution. |
| OSGuard | 2026-06-13 | Safety benchmark for computer-use agents that distinguishes task success from unsafe shortcuts. |
| MyPCBench | 2026-06-15 | Benchmark for personally intelligent computer-use agents over user-specific digital contexts. |
| LabOSBench | 2026-06-15 | Computer-use-agent benchmark for scientific instrument control interfaces. |
| ACE-Ego-0 | 2026-06-16 | Unifies egocentric human video and robotic trajectories for VLA pretraining. |
| ProCUA-SFT | 2026-06-16 | Technical report on supervised fine-tuning data and recipes for desktop computer-use agents. |
| WeaveLA | 2026-06-16 | Event-driven latent memory weaving for repetitive long-horizon robot manipulation. |
| GeneralVLA-2 | 2026-06-16 | Geometry-aware reconstruction and governed memory for robot planning. |
| MuseVLA | 2026-06-16 | Adaptive multimodal sensing VLA that invokes non-RGB sensors as task tools. |
| Qwen-RobotManip | 2026-06-16 | Qwen-VL-based robotic manipulation foundation model scaled with aligned heterogeneous data. |
| PearlVLA | 2026-06-16 | Progressive embodied action-plan refinement in latent space for efficient VLA deliberation. |
| PreAct | 2026-06-16 | Compiles successful computer-use trajectories into screen-checked state-machine programs. |
| ThinkingVLA | 2026-06-16 | Interleaves visual forecasting and language reasoning for long-horizon robotic manipulation. |
| Uncertainty Quantification for Flow-Based VLAs | 2026-06-16 | Uses velocity-field disagreement for failure detection and active fine-tuning of flow-based VLAs. |
| WireCraft | 2026-06-16 | Industrial deformable-linear-object manipulation benchmark with VLA baselines. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| A Comprehensive Survey of Agents for Computer Use | 2025 | paper | Broad map of computer-use-agent domains, agent loops, and evaluation bottlenecks. |
| GUI Agents: A Survey | 2024 | paper | Practical survey of GUI-agent architectures, datasets, benchmarks, and failure modes. |
| A Survey on (M)LLM-Based GUI Agents | 2025 | paper | Focused entry point for planning, grounding, memory, and GUI-agent evaluation. |
| Towards Trustworthy GUI Agents | 2025 | paper | Reliability and safety framing for deployment-facing GUI agents. |
| Large Multimodal Agents: A Survey | 2024 | paper | Contextual background on LLM-driven multimodal agent components. |
| A Survey on Vision-Language-Action Models for Embodied AI | 2024 | paper | Early VLA survey covering embodied perception, planning, and action. |
| Vision-Language-Action in Robotics | 2026 | paper | Data-centric survey of VLA datasets, benchmarks, and data engines. |
| Vision-Language-Action Safety | 2026 | paper | Focused taxonomy of threats, evaluations, and defenses for VLA systems. |
| Safety in Embodied AI | 2026 | paper | Wider safety survey across perception, planning, action, and interaction. |
| Visual Generation in the New Era | 2026 | paper | Conceptual lens for when visual generation becomes agentic world modeling. |
| Securing Computer-Use Agents | 2026 | paper | Deployment-grounded view of CUA reliability across architecture, lifecycle, permissions, and oversight. |
| GUI Agents with Reinforcement Learning | 2026 | paper | RL-centered survey of GUI-agent rewards, data efficiency, continual learning, and deployment risks. |
| Agentic World Modeling | 2026 | paper | Taxonomy for predictive world models across physical, digital, social, and scientific agents. |
| World Action Models | 2026 | paper | Defines embodied models that jointly predict future states and actions rather than actions alone. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| CogAgent | 2023 | paper, code | Early high-resolution VLM built explicitly for GUI understanding and navigation. |
| Set-of-Mark Prompting | 2023 | paper, code | Simple visual marking strategy that became a practical grounding primitive for LMM agents. |
| SeeClick | 2024 | paper, code | Shows that GUI grounding is a core bottleneck for visual GUI agents. |
| ScreenAI | 2024 | paper | Strong foundation for screen, document, infographic, and layout-heavy visual understanding. |
| Ferret-UI | 2024 | paper, code | Region-aware mobile UI understanding with explicit grounding. |
| OmniParser | 2024 | paper, code | Practical screenshot-to-interactable-region parser for pure-vision GUI agents. |
| UGround | 2024 | paper, code | Strong pure-vision grounding baseline without relying on accessibility trees. |
| OS-ATLAS | 2024 | paper | Foundation action model for generalist GUI agents. |
| ShowUI | 2024 | paper, code | Unifies screenshot-conditioned GUI perception and action modeling. |
| Aguvis | 2024 | paper | Pure-vision GUI agent direction with autonomous interface interaction. |
| UI-E2I-Synth | 2025 | paper | Synthetic instruction pipeline for scaling GUI grounding supervision. |
| ScreenSpot-Pro | 2025 | paper | Hard high-resolution grounding benchmark for professional computer-use screens. |
| GUI-G1 | 2025 | paper, code | Careful analysis of RL pitfalls in GUI grounding. |
| Enhancing Visual Grounding via Self-Evolutionary RL | 2025 | paper | Data-efficient RL recipe for high-resolution GUI grounding. |
| GUI-Actor | 2025 | paper | Coordinate-free grounding with an action head and verifier. |
| Phi-Ground | 2025 | paper | Strong empirical report on training compact GUI grounding models. |
| Test-Time RL for GUI Grounding | 2025 | paper | Test-time adaptation using region consistency. |
| Explicit Position-to-Coordinate Mapping | 2025 | paper | Addresses coordinate generation as a concrete grounding bottleneck. |
| GUI-Eyes | 2026 | paper | Learns when and how to call visual tools such as crop and zoom. |
| SafeGround | 2026 | paper | Calibrates GUI-grounding uncertainty before risky or irreversible actions. |
| UI-Zoomer | 2026 | paper, code | Uses uncertainty to decide where to zoom for GUI grounding. |
| AutoFocus | 2026 | paper | Training-free active visual search for high-resolution GUI grounding. |
| DRS-GUI | 2026 | paper | Dynamic region search that narrows cluttered screenshots without model fine-tuning. |
| WinDeskGround | 2026 | paper | Robust grounding benchmark for complex multi-window desktop interfaces. |
| PAGER | 2026 | paper | Studies point-precise geometric GUI control where small coordinate errors cascade. |
| AQuaUI | 2026 | paper | Adaptive-quadtree visual-token reduction for high-resolution GUI-agent screenshots. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| Mind2Web | 2023 | paper | Foundational benchmark for generalist web agents. |
| Android in the Wild | 2023 | paper | Large-scale Android device-control dataset with realistic gestures. |
| WebArena | 2023 | paper, code | Realistic web-agent environment with execution-based tasks. |
| AutoDroid | 2023 | paper | Early Android task-automation system and benchmark that remains relevant as a mobile-agent baseline. |
| MM-Navigator | 2023 | paper | Early GPT-4V smartphone GUI navigation agent with zero-shot screen interaction. |
| AppAgent | 2023 | paper | Smartphone agent that learns app operation from autonomous exploration or demonstrations. |
| SeeAct | 2024 | paper | Web agent showing why grounding matters for GPT-4V-style agents. |
| Mobile-Agent | 2024 | paper, code | Vision-centric mobile device agent using visual perception tools and stepwise planning. |
| VisualWebArena | 2024 | paper, code | Adds visually grounded tasks to realistic web-agent evaluation. |
| WebVoyager | 2024 | paper | End-to-end multimodal web agent evaluated on live websites. |
| WebLINX | 2024 | paper, project | Large benchmark of multi-turn conversational web navigation with screenshots and action history. |
| OmniACT | 2024 | paper | Desktop and web benchmark where agents generate executable automation scripts. |
| WorkArena | 2024 | paper, code | Enterprise workflow benchmark for knowledge-work agents. |
| MMInA | 2024 | paper, code | Multihop multimodal Internet-agent benchmark on evolving real websites. |
| B-MoCA | 2024 | paper | Mobile device-control benchmark across diverse configurations. |
| OSWorld | 2024 | paper, code | Flagship benchmark for open-ended tasks in real desktop environments. |
| AndroidWorld | 2024 | paper, code | Dynamic Android benchmark with broad task diversity. |
| Mobile-Agent-v2 | 2024 | paper, code | Multi-agent mobile operation assistant with planning, decision, and reflection roles. |
| MobileAgentBench | 2024 | paper | Practical benchmark for mobile LLM agents. |
| WebCanvas | 2024 | paper | Online web-agent benchmark and framework built around Mind2Web-Live. |
| Agent S | 2024 | paper, code | Open agentic framework for using computers through GUI actions. |
| Windows Agent Arena | 2024 | paper, code | Scalable evaluation environment for Windows OS agents. |
| SPA-Bench | 2024 | paper | Comprehensive smartphone-agent evaluation benchmark. |
| AndroidLab | 2024 | paper | Android training and benchmarking environment with virtual devices and task suites. |
| VideoWebArena | 2024 | paper | Long-context video understanding inside web-agent workflows. |
| MageBench | 2024 | paper, code | Lightweight visual-agent benchmark covering WebUI, Sokoban, and Football environments. |
| UI-TARS | 2025 | paper | Native GUI-agent model trained for perception, grounding, and action. |
| A3 | 2025 | paper, project | Android Agent Arena for online mobile GUI-agent evaluation across real apps. |
| Agent S2 | 2025 | paper, code | Generalist-specialist framework for computer-use agents. |
| UI-Evol | 2025 | paper | Plug-in knowledge-evolution module that improves OSWorld execution reliability for CUAs. |
| ZeroGUI | 2025 | paper | Online GUI-agent learning with task generation and reward estimation. |
| OpenCUA | 2025 | paper, code | Open foundation stack for computer-use agents. |
| ScaleCUA | 2025 | paper, code | Cross-platform data scaling for open-source computer-use agents. |
| WebGym | 2026 | paper | Large-scale training environment for realistic visual web agents. |
| C-World | 2026 | paper | Environment creator for scalable computer-use-agent training. |
| OS-Symphony | 2026 | paper, code | Framework for robust and generalist computer-use agents. |
| OmegaUse | 2026 | paper | General-purpose GUI agent for autonomous task execution. |
| OS-Marathon | 2026 | paper | Benchmark for long-horizon repetitive professional computer-use workflows. |
| Continual GUI Agents | 2026 | paper | Continual-learning setup and RL recipe for shifting GUI domains and resolutions. |
| CUA-Skill | 2026 | paper | Structured skill base for reusable computer-use procedures and composition graphs. |
| DynaWeb | 2026 | paper | Model-based RL framework that trains web agents inside learned web world models. |
| Avenir-Web | 2026 | paper | Multimodal web agent with grounding experts, experience imitation, and memory. |
| Agent Alpha | 2026 | paper | Uses step-level MCTS to unify GUI-agent generation, exploration, and evaluation. |
| UI-Mem | 2026 | paper | Hierarchical experience memory for online RL in mobile GUI agents. |
| MemGUI-Bench | 2026 | paper | Evaluates memory across mobile GUI sessions and changing environments. |
| ActionEngine | 2026 | paper | State-machine memory for more structured GUI automation. |
| SecAgent | 2026 | paper | Efficient 3B mobile GUI agent with semantic context compression and Chinese mobile data. |
| ContractSkill | 2026 | paper | Treats web-agent skills as repairable contracts that can be verified and reused. |
| PSPA-Bench | 2026 | paper | Personalized smartphone GUI-agent benchmark with process-level evaluation. |
| GPA | 2026 | paper | Demonstration-based GUI process automation with local deterministic replay. |
| ClawGUI | 2026 | paper | Unified framework for training, evaluating, and deploying GUI agents. |
| RiskWebWorld | 2026 | paper | Realistic interactive benchmark for e-commerce risk-management GUI agents. |
| UI-Copilot | 2026 | paper, code | Long-horizon GUI automation with tool-integrated policy optimization. |
| DynamicGUIBench | 2026 | paper | Stress-tests agents in dynamic, evolving GUI environments. |
| OmniGUI | 2026 | paper, project | Smartphone GUI benchmark with synchronized visual, audio, and video context. |
| UI-Verse | 2026 | paper | Interface-design perspective on making CUAs more reliable. |
| How Mobile World Model Guides GUI Agents? | 2026 | paper | Analyzes which mobile world-model representations help GUI-agent training and test-time guidance. |
| Executable Agentic Memory | 2026 | paper | Converts GUI experience into executable memory graphs for retrieval-and-execution planning. |
| SaaS-Bench | 2026 | paper, code | Long-horizon benchmark over real deployable SaaS systems and professional workflows. |
| ShopGym | 2026 | paper | Realistic, controllable e-commerce simulation and benchmark for web agents. |
| ScreenSearch | 2026 | paper | Ambiguity-aware large-scale desktop OS exploration with deduplicated state graphs. |
| Skim | 2026 | paper | Speculative execution framework for faster web-agent workflows on structured sites. |
| SE-GA | 2026 | paper, code | Memory-augmented self-evolving GUI agent for dynamic long-horizon tasks. |
| DocOS | 2026 | paper | Proactive document-guided GUI-agent benchmark in open web environments. |
| MementoGUI | 2026 | paper | Plug-in multimodal memory controller for long-horizon GUI control. |
| OpenComputer | 2026 | paper | Verifiable software worlds with state verifiers, task generation, and auditable rewards. |
| CutVerse | 2026 | paper | Benchmark for professional media post-production editing with dense multimodal GUIs. |
| Agent JIT Compilation | 2026 | paper | Compiles web-agent plans into lower-latency executable schedules. |
| Weblica | 2026 | paper | Reproducible web-replica environments for scaling visual web-agent training. |
| TClone | 2026 | paper | Low-latency live GUI environment forking for parallel CUA rollouts and what-if execution. |
| PANDO | 2026 | paper | Online skill distillation that reduces token and action overhead for multimodal web agents. |
| SimuWoB | 2026 | paper | Synthetic realistic mobile-app benchmark for fast, faithful GUI-agent evaluation. |
| CUA-Gym | 2026 | paper | Scalable generation of verifiable environments, tasks, rewards, and models for CUA RLVR. |
| MobileGym | 2026 | paper | Parallel mobile GUI-agent simulator with structured state, deterministic judges, and RL rewards. |
| AndroidDaily | 2026 | paper | Real-world closed-source Android benchmark with process-aware visual trajectory grading. |
| LearnWeak | 2026 | paper | Student-aware data synthesis and specialization for small computer-use agents. |
| PRO-CUA | 2026 | paper | Step-level process-reward optimization for computer-use agents on live web tasks. |
| GUITestScape | 2026 | paper | Open-set exploratory GUI testing benchmark for MLLM agents. |
| Multi-Agent Computer Use | 2026 | paper | Multi-agent CUA architecture with DAG decomposition, parallel execution, and replanning. |
| OpenWebRL | 2026 | paper | Online multi-turn RL framework for training open visual web agents on live websites. |
| ColorBrowserAgent | 2026 | paper | Human-in-the-loop long-horizon web GUI agent with progress summarization and knowledge adaptation. |
| WebForge | 2026 | paper, code | Automated framework for generating scalable, reproducible browser-agent benchmark environments. |
| AgentLens | 2026 | paper | Mobile GUI agent with adaptive visual modalities for human-agent interaction during execution. |
| SimGym | 2026 | paper | Live-browser VLM-agent framework for simulating e-commerce A/B tests. |
| GUI-AC | 2026 | paper | Enhances continual GUI-agent learning with grounding-certainty-aware advantage and clipping. |
| Workflow-GYM | 2026 | paper | Long-horizon benchmark for professional GUI workflows in specialized software environments. |
| HiViG | 2026 | paper, code | History-aware visually grounded critic for test-time CUA action evaluation. |
| Naive Visual Memory is Not Enough | 2026 | paper | Failure-mode study of visual and experiential memory modules in GUI agents. |
| MyPCBench | 2026 | paper | Evaluates personally intelligent CUAs over user-specific digital context and accounts. |
| LabOSBench | 2026 | paper | Benchmarks CUAs on scientific instrument-control interfaces and feedback loops. |
| ProCUA-SFT | 2026 | paper | Desktop CUA supervised fine-tuning report with trajectory data and training recipes. |
| PreAct | 2026 | paper | Compiles successful screen interaction trajectories into guarded state-machine programs for repeat tasks. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| PerAct | 2022 | paper, project | Language-conditioned RGB-D manipulation agent that predicts voxel actions directly. |
| VIMA | 2022 | paper, project | Multimodal-prompt robot manipulation benchmark and transformer agent. |
| RT-1 | 2022 | paper, project | Large-scale real-robot action model that anchors later RT/VLA work. |
| PaLM-E | 2023 | paper | Embodied multimodal language model connecting visual input to robot tasks. |
| RT-2 | 2023 | paper | Canonical VLA model transferring web-scale vision-language knowledge to robot control. |
| Open X-Embodiment / RT-X | 2023 | paper | Large robot-learning dataset and RT-X model family. |
| Octo | 2024 | paper | Open-source generalist robot policy. |
| OpenVLA | 2024 | paper, code | Open-source VLA model and a common baseline for robot manipulation. |
| Pi-Zero | 2024 | paper | Flow-based VLA model for general robot control. |
| Magma | 2025 | paper, code | Bridges multimodal agents across digital and physical actions. |
| SafeVLA | 2025 | paper | Safety alignment for VLA models via constrained learning. |
| Interleave-VLA | 2025 | paper | Robot manipulation with interleaved image-text instructions. |
| ChatVLA-2 | 2025 | paper | Open-world embodied reasoning from pretrained knowledge. |
| VLA^2 | 2025 | paper | Agentic framework for unseen-concept manipulation. |
| World-Value-Action | 2026 | paper | Uses implicit planning and future-state value estimation for VLA systems. |
| VLAs-as-Tools | 2026 | paper | Splits long-horizon embodied tasks between a high-level VLM planner and specialized VLA tools. |
| SAGE | 2026 | paper, code | Agentically generates simulator-ready 3D scenes for embodied policy training. |
| StableVLA | 2026 | paper | Studies robustness of VLA models under unseen visual disturbances without extra data. |
| Dexora | 2026 | paper | Open-source VLA direction for high-DoF bimanual dexterous manipulation. |
| VLA-REPLICA | 2026 | paper | Low-cost reproducible real-world evaluation benchmark for VLA models. |
| Spatial Memory for Out-of-Vision Manipulation | 2026 | paper | Adds persistent spatial memory when manipulation targets leave the current camera view. |
| Pre-VLA | 2026 | paper | Preemptive runtime verification for VLA actions and world-model rollouts. |
| ActQuant | 2026 | paper | Action-guided mixed-precision quantization for deploying VLA models on constrained hardware. |
| Continuous Reasoning for VLA | 2026 | paper | Replaces token-style reasoning with shareable continuous latents aligned to action chunks. |
| VLAMotor | 2026 | paper | Test-guided failure discovery and agent-based synthetic data repair for VLA models. |
| FATE-VLA | 2026 | paper | Adaptive failure-aware test generation that searches high-risk embodied scenes for VLA failures. |
| Uni-LaViRA | 2026 | paper | Agentic language-vision-robot-action architecture for unified embodied navigation across robot types. |
| ProgVLA | 2026 | paper | Progress-aware compact VLA model for long-horizon and multi-object robot manipulation. |
| Mag-VLA | 2026 | paper | VLA policy for bimanual magnetically actuated microrobot manipulation. |
| Gaze2Act | 2026 | paper | Uses human gaze as a dynamic intent signal for interactive VLA robot manipulation. |
| DeMaVLA | 2026 | paper | VLA foundation model for deformable-object manipulation with real-world folding data. |
| PiL-World | 2026 | paper | Chunk-wise world model for closed-loop VLA policy-in-the-loop evaluation. |
| Learning What to Say to Your VLA | 2026 | paper | Searches and distills language feedback policies for steering frozen VLA models. |
| VLGA | 2026 | paper | Adds dense geometry supervision to vision-language-action models for autonomous driving. |
| ReactVLA | 2026 | paper | Fast lightweight reactive robot manipulation via improved mean-flow action generation. |
| Qwen-VLA | 2026 | paper | Unifies embodied decision-making across tasks, environments, and robot embodiments. |
| LabVLA | 2026 | paper | Grounds VLA models in scientific laboratory protocol execution and bench work. |
| ACE-Ego-0 | 2026 | paper | Bridges egocentric human videos and robot trajectories for VLA pretraining. |
| WeaveLA | 2026 | paper | Adds event-driven cross-subtask latent memory for repetitive robot manipulation. |
| GeneralVLA-2 | 2026 | paper | Uses geometry-aware reconstruction and governed memory for robot planning. |
| MuseVLA | 2026 | paper | Treats temperature, audio, radar, and other sensors as on-demand VLA tools. |
| Qwen-RobotManip | 2026 | paper | Scales Qwen-VL-based manipulation models through aligned heterogeneous robot and human data. |
| PearlVLA | 2026 | paper | Refines embodied action plans in latent space with future-guided process rewards. |
| ThinkingVLA | 2026 | paper | Interleaves visual forecasting, inverse reasoning, and action generation for long-horizon manipulation. |
| Uncertainty Quantification for Flow-Based VLAs | 2026 | paper, project | Estimates VLA epistemic uncertainty for failure detection and active fine-tuning. |
| WireCraft | 2026 | paper | Industrial deformable-linear-object manipulation benchmark with shared VLA evaluation. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| VISPROG | 2022 | paper, project | Foundational visual-programming approach for tool-composed visual reasoning and editing. |
| Visual ChatGPT | 2023 | paper, code | Early system connecting ChatGPT with visual foundation models for multi-step visual tasks. |
| ViperGPT | 2023 | paper, code | Uses Python execution to compose vision modules for interpretable visual reasoning. |
| LLaVA-Plus | 2023 | paper | Trains multimodal agents to select and use visual tools across understanding and generation. |
| DiffusionAgent | 2024 | paper | Routes prompts through expert diffusion models with tree-of-thought navigation and feedback memory. |
| GenArtist | 2024 | paper, code | MLLM-as-agent for image generation and editing through planning and tool use. |
| CIGEval | 2025 | paper | Agentic evaluation framework for conditional image generation. |
| DeepEyes | 2025 | paper | Reinforcement learning for active visual reasoning, grounding, and "thinking with images." |
| ImAgent | 2025 | paper | Test-time scalable multimodal agent framework for image generation. |
| GenAgent | 2026 | paper | Scales text-to-image generation through agentic multimodal reasoning. |
| Mind-Brush | 2026 | paper | Adds cognitive search and reasoning loops to image generation. |
| Agent Banana | 2026 | paper, code | High-fidelity image editing with planner-executor tooling. |
| M3 | 2026 | paper | Multi-modal, multi-agent, multi-round reasoning for high-fidelity text-to-image generation. |
| VisionCreator | 2026 | paper | Native visual-generation agentic model with understanding, planning, and creation. |
| VisionCreator-R1 | 2026 | paper | Adds explicit reflection and reflection-plan co-optimization for visual-generation agents. |
| Gen-Searcher | 2026 | paper, project | Reinforces agentic search for image generation. |
| GEMS | 2026 | paper, project | Multimodal generation with memory, skills, and iterative agent loops. |
| Visual Generation in the New Era | 2026 | paper | Helpful taxonomy for agentic world modeling and generation. |
| Visual Agentic Memory | 2026 | paper | Online indexing, hierarchical memory, and agentic retrieval for long video understanding. |
| GenEvolve | 2026 | paper | Self-evolving image-generation agent with tool-orchestrated visual experience distillation. |
| Generation Navigator | 2026 | paper | State-aware multi-turn text-to-image agent with trajectory-level RL for generation steering. |
| GUI Agents for Continual Game Generation | 2026 | paper, project | Uses GUI playtesting agents as evaluators and feedback providers for playable game generation. |
| GenClaw | 2026 | paper | Code-driven agentic image generation that plans, sketches with executable code, and refines with image models. |
| InterleaveThinker | 2026 | paper | Multi-agent planner-critic pipeline for agentic interleaved text-image generation. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| AVA | 2023 | paper | Autonomous visualization agents with visual perception-driven decision making. |
| Visual Agents as Fast and Slow Thinkers | 2024 | paper | System-1/System-2 framing for visual-agent reasoning and action. |
| Visual Agentic AI for Spatial Reasoning | 2025 | paper | Dynamic-API visual agent for spatial reasoning in 3D scenes. |
| Visual Agentic Reinforcement Fine-Tuning | 2025 | paper | Trains VLMs to use visual tools and code for "thinking with images." |
| ParaView-MCP | 2025 | paper | Autonomous visualization agent with direct tool use in ParaView. |
| VisualToolAgent / VisTA | 2025 | paper | RL framework for dynamic visual tool selection and composition. |
| Evaluation-Centric Scientific Visualization Agents | 2025 | paper | Evaluation-first paradigm for scientific visualization agents. |
| DART | 2025 | paper | Uses multi-agent disagreement to recruit specialized visual tools. |
| Orion | 2025 | paper | Unified visual agent for multimodal perception, visual reasoning, and tool execution. |
| Kimi K2.5 | 2026 | paper | Open-source multimodal agentic model optimized jointly for text and vision. |
| OmniStream | 2026 | paper | Streaming visual-agent representation for perception, reconstruction, and action. |
| VTC-Bench | 2026 | paper | Evaluates agentic multimodal models through compositional visual tool chaining. |
| SciVisAgentBench | 2026 | paper | Reproducible benchmark for scientific data analysis and visualization agents. |
| SASAV | 2026 | paper | Self-directed scientific analysis and visualization agent. |
| CANVAS | 2026 | paper | Visual agentic storyboarding for continuity-aware long-form visual narratives. |
| Progressive Online Video Understanding | 2026 | paper | Online visual agent that answers when enough streaming evidence appears. |
| Beyond Pixels | 2026 | paper | Introspective and interactive grounding for visualization agents. |
| AI-Gram | 2026 | paper | Live social platform populated by visual agents that create and respond to visual content. |
| DV-World | 2026 | paper | Real-world benchmark for data-visualization agents with native environment grounding. |
| Hierarchical Visual Agent / HierVA | 2026 | paper | Manages image-text contexts for multi-step chart reasoning across subplots. |
| Emergent Communication between Heterogeneous Visual Agents | 2026 | paper | Studies decentralized communication when visual agents have private representations. |
| MMSkills | 2026 | paper | Multimodal procedural skill packages for reusable visual-agent decision making. |
| Visual Agentic Memory | 2026 | paper | Training-free visual memory for online indexing, retrieval, and evidence verification. |
| MemEye | 2026 | paper | Visual-centric evaluation framework for long-term multimodal agent memory. |
| Diversity Over Frequency | 2026 | paper | Studies tool-use collapse and rollout diversity in visual Chain-of-Thought agents. |
| VESTA | 2026 | paper | Scientific visual exploration agent with dynamically generated statistical tools. |
| CV-Arena | 2026 | paper | Instructional computer-vision benchmark with agentic planning, editing, and verification. |
| Visual Skills | 2026 | paper | Multimodal reusable skill paradigm preserving visual evidence and spatial interaction traces. |
| TVIR | 2026 | paper | Text-visual interleaved deep-research benchmark and hierarchical multimodal report agent. |
| Active Exploring like a Pigeon | 2026 | paper | Agentic spatial reasoning with dynamic cognitive maps and verifiable spatial assertion codes. |
| PERIA | 2026 | paper | Tool-augmented visual agent for spatial reasoning across map reasoning, probing, and reconstruction tasks. |
| Orchestra-o1 | 2026 | paper | Omnimodal agent orchestration with modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| AGENTSAFE | 2025 | paper | Safety benchmark for embodied agents under hazardous instructions. |
| IS-Bench | 2025 | paper | Interactive safety benchmark for VLM-driven household agents. |
| VPI-Bench | 2025 | paper, code | Visual prompt-injection benchmark for computer-use agents. |
| OpenAgentSafety | 2025 | paper | Framework for evaluating real-world agent safety across risk categories. |
| OS-Sentinel | 2025 | paper | Hybrid validation for safer mobile GUI agents. |
| UI-CUBE | 2025 | paper | Enterprise CUA benchmark that measures operational reliability beyond task accuracy. |
| SafePred | 2026 | paper | Predictive guardrail for computer-using agents using world-model rollouts. |
| LPS-Bench | 2026 | paper | Safety-awareness benchmark for long-horizon CUA planning under benign and adversarial scenarios. |
| GUIGuard-Bench | 2026 | paper | Privacy-preserving GUI-agent evaluation. |
| CUAAudit | 2026 | paper | Tests whether VLMs can audit autonomous computer-use agents. |
| GUIDE | 2026 | paper | Hierarchical diagnostic evaluation for long GUI-agent trajectories. |
| VeriGUI | 2026 | paper | Action-effect verification and self-correction for robust GUI automation. |
| Semantic-level UI Element Injection | 2026 | paper | Red-teaming method that distracts GUI agents through benign-looking injected UI elements. |
| CORA | 2026 | paper | Conformal risk-controlled safeguard for mobile GUI-agent action execution. |
| OS-BLIND | 2026 | paper | Shows how benign-looking user instructions expose CUA vulnerabilities. |
| HazardArena | 2026 | paper | Semantic safety evaluation for VLA systems. |
| RedVLA | 2026 | paper | Physical red-teaming benchmark for VLA models. |
| GUI-Perturbed | 2026 | paper | Domain-randomization study exposing GUI-grounding brittleness. |
| OS-SPEAR | 2026 | paper | Toolkit for safety, performance, efficiency, and robustness analysis of OS agents. |
| Don't Click That | 2026 | paper | Benchmarks and mitigates deceptive UI elements for VLM-based web agents. |
| SafeManip | 2026 | paper | Temporal-safety benchmark for robotic manipulation using LTL-style monitors. |
| WARD | 2026 | paper | Robust defense for web agents against prompt injection in HTML and visual interfaces. |
| ProjGuard | 2026 | paper | Safety monitoring for computer-use agents via low-dimensional projections. |
| Pre-VLA | 2026 | paper | Runtime verification for risky VLA action generation and imagined rollouts. |
| AgentHijack | 2026 | paper | Benchmarks CUA robustness to realistic environment corruptions rather than direct adversarial prompts. |
| ROGUE | 2026 | paper | Corrigibility benchmark showing unsafe behavior can arise during ordinary computer-use tasks. |
| SafeVLA-Bench | 2026 | paper | Post-hoc safety benchmark exposing unsafe-success cases in VLA manipulation rollouts. |
| FATE-VLA | 2026 | paper | Failure-seeking VLA test generation for robustness evaluation before deployment. |
| MemVenom | 2026 | paper | Triggered poisoning attack against multimodal memory retrieval in web agents. |
| OSGuard | 2026 | paper | Dual-granularity CUA safety benchmark for unsafe shortcuts under benign instructions. |
| MIRAGE | 2026 | paper | Context-aware prompt injection against mobile GUI agents through user-generated content regions. |
| MaskClaw | 2026 | paper | Edge-side personalized privacy arbitration for GUI agents with behavior-driven skill evolution. |
| BraveGuard | 2026 | paper | Self-evolving guard training loop for safer computer-use-agent trajectories. |
| CAPED | 2026 | paper | Context-aware screenshot exposure control for mobile GUI-agent privacy. |
| Area | Resource | Link | Primary Use |
|---|---|---|---|
| Web | MiniWoB++ | code | Compact browser-interaction environments for controlled RL-style experiments. |
| Web | Mind2Web | paper | Offline web-agent action prediction and grounding. |
| Web | WebArena | paper, code | Realistic web navigation with execution-based grading. |
| Web | WebArena-Verified | code | Audited WebArena task set with deterministic offline evaluation. |
| Web | VisualWebArena | paper, code | Visually grounded web tasks where screenshots matter. |
| Web | WebLINX | paper, project | Conversational web navigation from expert demonstrations. |
| Web | WorkArena | paper, code | Enterprise workflow automation in ServiceNow-style environments. |
| Web | MMInA | paper, code | Multihop multimodal tasks over evolving real websites. |
| Web | WebCanvas | paper | Online web-agent evaluation with Mind2Web-Live. |
| Web | WebGym | paper | Large-scale realistic training environment for visual web agents. |
| Web | DocOS | paper | Document-guided GUI-agent tasks in dynamic open-web environments. |
| Web | RiskWebWorld | paper | Realistic e-commerce risk-management tasks for GUI agents. |
| Web | SaaS-Bench | paper, code | Long-horizon professional workflows across deployable SaaS systems. |
| Web | ShopGym | paper | Controllable e-commerce simulation with realistic layouts, catalogs, policies, and tasks. |
| Desktop | OSWorld | paper, code | Open-ended desktop tasks in real operating systems. |
| Desktop | Windows Agent Arena | paper, code | Windows-specific scaling and reproducible OS-agent evaluation. |
| Desktop | OmniACT | paper | Evaluating executable automation rather than only low-level clicks. |
| Desktop | OS-Marathon | paper | Long-horizon repetitive professional workflows. |
| Desktop | OpenComputer | paper | Verifiable software worlds with state verifiers and auditable partial-credit rewards. |
| Desktop | CutVerse | paper | Media post-production editing tasks across professional creative applications. |
| Mobile | Android in the Wild | paper | Large-scale Android device-control demonstrations with screen observations. |
| Mobile | B-MoCA | paper | Mobile control across diverse device configurations. |
| Mobile | AndroidWorld | paper, code | Dynamic Android tasks with broad app coverage. |
| Mobile | AndroidControl | paper | Diverse Android control dataset for studying scale and generalization. |
| Mobile | MobileAgentBench | paper | Efficient mobile-agent evaluation across open-source apps. |
| Mobile | SPA-Bench | paper | Smartphone-agent testing with comprehensive task coverage. |
| Mobile | AndroidLab | paper | Training and systematic benchmarking on Android virtual devices. |
| Mobile | A3 | paper, project | Real-app online evaluation for mobile GUI agents. |
| Mobile | SecAgent | paper | Chinese mobile GUI dataset, benchmark, and compact semantic-context agent. |
| Mobile | PSPA-Bench | paper | Personalized smartphone GUI-agent tasks with process-aware evaluation. |
| Mobile | OmniGUI | paper, project | Omni-modal smartphone action prediction with visual, audio, and video cues. |
| Computer use | C-World | paper | On-demand environment creation for computer-use-agent training. |
| Grounding | ScreenSpot-Pro | paper | High-resolution professional-screen grounding. |
| Grounding | WinDeskGround | paper | Multi-window desktop grounding under realistic visual clutter. |
| Grounding | PAGER | paper | Point-precise GUI control for geometric construction tasks. |
| Visual-agent reasoning | MageBench | paper, code | Lightweight environments for vision-in-the-chain agent reasoning. |
| Visual-agent reasoning | VTC-Bench | paper | Compositional visual tool chaining for agentic multimodal models. |
| Visualization | SciVisAgentBench | paper | Scientific data analysis and visualization-agent evaluation. |
| Visualization | DV-World | paper | Real-world data visualization tasks with environment grounding and intent alignment. |
| Memory | MemGUI-Bench | paper | Cross-session and cross-temporal mobile GUI memory. |
| Memory | MementoGUI-Bench | paper | Long-horizon GUI decision-making with memory consistency diagnostics. |
| Memory | Visual Agentic Memory | paper | Online indexing and evidence retrieval for long video understanding. |
| Dynamic GUI | DynamicGUIBench | paper | Robustness under evolving interfaces and dynamic UI changes. |
| Exploration | ScreenSearch | paper | Large-scale desktop state-graph exploration under partial observability. |
| Enterprise reliability | UI-CUBE | paper | Deployment-readiness diagnostics beyond simple task success. |
| Security | VPI-Bench | paper, code | Visual prompt injection for GUI and computer-use agents. |
| Safety | AGENTSAFE | paper | Hazardous-instruction safety for embodied agents. |
| Safety | HazardArena | paper | Semantic safety evaluation for VLA systems. |
| Safety | SafeManip | paper | Temporal safety properties for robotic manipulation rollouts. |
| Embodied | LIBERO | code | Lifelong robot manipulation tasks. |
| Embodied | RLBench | code | Simulation-based manipulation benchmark. |
| Embodied | VLA-REPLICA | paper | Low-cost reproducible real-world VLA evaluation. |
| Web | Weblica | paper | Scalable reproducible web-replica environments for visual web-agent training. |
| Web | CUA-Gym | paper | Verifiable RLVR task/environment/reward generation for computer-use agents. |
| Web | OpenWebRL | paper | Online multi-turn RL framework for live visual web agents. |
| Mobile | SimuWoB | paper | Synthetic high-fidelity mobile apps with automatic rewards. |
| Mobile | MobileGym | paper | Highly parallel mobile GUI simulator with deterministic state-based judging. |
| Mobile | AndroidDaily | paper | Closed-source real-app Android benchmark with visual process evaluation. |
| Desktop | TClone | paper | Low-latency forking of live GUI environments for CUA execution and evaluation. |
| GUI testing | GUITestScape | paper | Open-set exploratory GUI testing with interaction and display defects. |
| Visual-agent memory | MemEye | paper | Evaluates whether multimodal agent memory preserves visual evidence. |
| Visualization | VESTA / DAWN | paper | Statistical modeling benchmark and visual tool-agent framework. |
| Visual editing | CV-Arena | paper | Instructional computer-vision task benchmark with human-AI preference evaluation. |
| Embodied safety | SafeVLA-Bench | paper | Success-safety gap evaluation for VLA manipulation policies. |
| Web | WebForge-Bench | paper, code | Automatically generated self-contained browser-agent benchmark environments. |
| Web | SimGym | paper | E-commerce A/B-test simulation with traffic-grounded live-browser VLM agents. |
| Web | MemVenom | paper | Memory-poisoning threat model for long-horizon web agents with multimodal retrieval. |
| GUI game generation | PlaytestArena | paper, project | Browser-game generation benchmark evaluated by GUI playtesting agents. |
| Security | MIRAGE | paper | Mobile GUI prompt-injection samples placed in realistic user-generated content. |
| Safety | OSGuard | paper | Computer-use-agent safety benchmark for unsafe shortcuts during normal tasks. |
| Privacy | MaskClaw | paper | Edge-side privacy arbitration benchmark and skill-evolution scenarios for GUI agents. |
| Privacy | CAPED | paper | Context-aware mobile GUI screenshot exposure defense. |
| Embodied evaluation | PiL-World | paper | Closed-loop VLA policy-in-the-loop evaluation with imagined action-conditioned observations. |
| Desktop | Workflow-GYM | paper | Long-horizon GUI workflows in professional software fields. |
| Desktop | MyPCBench | paper | Personal computer-use benchmark with user-specific context and account state. |
| Scientific instruments | LabOSBench | paper | Scientific instrument-control interfaces for computer-use-agent evaluation. |
| Embodied evaluation | WireCraft | paper | Industrial wire and cable manipulation benchmark with VLA policy baselines. |
These resources are intentionally separated from research papers: they are implementation and evaluation artifacts, and not every one is a standalone research contribution.
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OpenAI Skills guide | docs | Docs | Understanding skill-style packaging for reusable agent capabilities. |
| Agent Skills for Large Language Models | survey | Paper | Architecture, acquisition, and security framing for skill-based agents. |
| CUA-Skill | skill base | Paper | Reusable computer-use procedures with parameterized execution graphs. |
| MMSkills | multimodal skill framework | Paper | Reusable visual procedures with multimodal state, progress, and failure evidence. |
| awesome-agent-skills | collection | GitHub | Finding reusable agent skills across browsing, coding, documents, and visual tasks. |
| awesome-gpt-image-2 | collection | GitHub | Tracking prompt patterns and workflows around modern image generation. |
| gpt_image_2_skill | skill package | GitHub | Example of packaging image-generation workflows as reusable skills. |
| ToDiagram skills | skill collection | GitHub | Diagram and visual-communication skills that pair well with visual agents. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OmniParser | parser | GitHub | Converting screenshots into candidate interactable regions. |
| ShowUI | GUI model | GitHub | Screenshot-conditioned GUI action modeling and demonstration pipelines. |
| UGround | grounding model | GitHub | Pure-vision GUI grounding without accessibility trees. |
| OS-ATLAS | action model | Paper | Cross-platform GUI action grounding. |
| GUI-G1 | grounding model | GitHub | Studying RL recipes and evaluation pitfalls for GUI grounding. |
| UI-Zoomer | grounding tool | GitHub | Adaptive zoom-in when the target UI element is hard to localize. |
| Phi-Ground | grounding model | Paper | Compact GUI grounding baseline for resource-constrained settings. |
| SafeGround | grounding calibrator | Paper | Estimating grounding risk before executing high-impact GUI actions. |
| AutoFocus | grounding tool | Paper | Training-free uncertainty-aware active visual search on high-resolution screens. |
| AQuaUI | token reducer | Paper | Adaptive quadtree compression for GUI screenshots at inference time. |
| Orion | visual agent | Paper | Tool-augmented visual reasoning and execution across images, videos, and documents. |
| Kimi K2.5 | visual agentic model | Paper | Open-source multimodal agentic intelligence model with joint text-vision optimization. |
| VisualToolAgent | tool selector | Paper | RL-based selection and composition of visual tools. |
| VTC-Bench | tool-chain benchmark | Paper | Evaluating compositional visual tool use in agentic multimodal models. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| UI-TARS Desktop | desktop agent | GitHub | Running multimodal desktop agents locally. |
| Agent S | runtime | GitHub | General computer-use experiments with a practical open framework. |
| Cua | operator stack | GitHub | Infrastructure for running and evaluating computer-use agents. |
| OpenAdapt | generative RPA stack | GitHub | Recording GUI demonstrations, training models, and evaluating agents from a unified CLI. |
| HIDAgent | HID toolkit | Paper | Enabling visual UI agents on HID-compatible devices. |
| GPA | demo replay stack | Paper | Local GUI process automation from a single demonstration. |
| browser-use | browser runtime | GitHub | Browser automation workflows when DOM/tool access is acceptable. |
| Stagehand | browser runtime | GitHub | Hybrid code-plus-natural-language browser automation for production workflows. |
| Playwright MCP | browser MCP server | GitHub | Gives agents browser automation tools through the Model Context Protocol. |
| BrowserGym | browser harness | GitHub | Reproducible browser-agent experiments and benchmark orchestration. |
| AgentLab | experiment framework | GitHub | Running, comparing, and analyzing web-agent experiments. |
| OpenAdapt Desktop | desktop capture/runtime | GitHub | Capturing human demonstrations and replaying desktop workflows. |
| ScreenPipe | local data capture | GitHub | Recording local screen/audio context for personal or research agents. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OSWorld | desktop environment | GitHub | Standard desktop benchmark and environment. |
| AndroidWorld | mobile environment | GitHub | Dynamic Android environment for mobile agents. |
| AndroidControl | mobile dataset | Paper | Large Android control demonstrations for training and data-scaling studies. |
| Windows Agent Arena | desktop environment | GitHub | Windows-specific OS-agent evaluation. |
| WebArena | web benchmark | GitHub | Realistic web tasks with execution-based grading. |
| WebArena-Verified | web benchmark | GitHub | Audited and deterministic WebArena evaluation. |
| VisualWebArena | visual web benchmark | GitHub | Web tasks where screenshots and visual grounding matter. |
| WorkArena | enterprise benchmark | GitHub | Enterprise-style workflow automation. |
| OpenCUA | open CUA stack | GitHub | Data, models, and evaluation foundations for computer-use agents. |
| ScaleCUA | scaling stack | GitHub | Cross-platform CUA data scaling and evaluation. |
| WebGym | visual web environment | Paper | Large-scale realistic training tasks for visual web agents. |
| C-World | environment creator | Paper | Creating diverse computer-use environments on demand. |
| OpenComputer | verifiable worlds | Paper | State verifiers, synthetic desktop tasks, and auditable rewards. |
| ShopGym | e-commerce simulator | Paper | Realistic and controllable e-commerce web-agent evaluation. |
| SciVisAgentBench | visualization benchmark | Paper | Evaluating scientific visualization agents on executable analysis tasks. |
| DV-World | data-visualization benchmark | Paper | Real-world visualization-agent scenarios with evolving environments. |
| Visual Agentic Memory | video memory framework | Paper | Training-free long-video indexing, hierarchical memory, and retrieval. |
| CUA-Suite | data suite | Paper | Large human-annotated video demonstrations for CUA research. |
| ShowUI-Aloha | data pipeline | Paper, Code | Turning screen recordings into GUI-agent training trajectories. |
| Video2GUI | data pipeline | Paper | Synthesizing GUI trajectories from instructional videos. |
| ScreenSearch | exploration corpus | Paper | Building desktop GUI state graphs through ambiguity-aware exploration. |
| CutVerse | creative-workflow benchmark | Paper | Professional media post-production GUI trajectories and evaluation. |
| lmms-eval | eval toolkit | GitHub | Static multimodal evaluation that can complement closed-loop agent tests. |
| WebForge | browser benchmark generator | Paper, Code | Automatically generating reproducible browser-agent benchmark environments. |
| SimGym | e-commerce simulator | Paper | Simulating visually driven e-commerce A/B tests with live-browser VLM agents. |
| PlaytestArena | game-generation benchmark | Paper, Project | Using GUI agents to playtest generated browser games. |
| Workflow-GYM | professional GUI benchmark | Paper | Evaluating long-horizon computer-use agents in specialized professional software. |
| MyPCBench | personal CUA benchmark | Paper | Testing computer-use agents in personalized digital environments. |
| LabOSBench | scientific-instrument benchmark | Paper | Evaluating computer-use agents on scientific instrument-control workflows. |
| ProCUA-SFT | desktop CUA data | Paper | Supervised fine-tuning data and recipes for desktop computer-use agents. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OpenVLA | VLA model | GitHub | Common open baseline for VLA robot manipulation. |
| LeRobot | robotics toolkit | GitHub | Robot-learning datasets, policies, training, and deployment tooling. |
| LIBERO | robotics benchmark | GitHub | Lifelong robot manipulation tasks. |
| RLBench | robotics benchmark | GitHub | Simulation-based manipulation evaluation. |
| SAGE | 3D scene engine | GitHub | Agentic 3D scene generation for embodied policy training. |
| Magma | foundation model | GitHub | Bridging digital computer use and physical action. |
| VLA-REPLICA | robot benchmark | Paper | Low-cost reproducible real-world VLA evaluation. |
| SafeManip | safety benchmark | Paper | Temporal-safety monitors for robotic manipulation rollouts. |
| ReactVLA | VLA model | Paper, Project | Low-latency reactive VLA policy for real-time robot manipulation. |
| PiL-World | VLA evaluation | Paper | Closed-loop policy-in-the-loop evaluation without executing every rollout on a real robot. |
| Qwen-VLA | VLA model | Paper | Unified embodied decision-making across tasks, environments, and robot embodiments. |
| Qwen-RobotManip | VLA model | Paper | Scaled robotic manipulation foundation model built on Qwen-VL. |
| LabVLA | laboratory VLA | Paper | Grounding VLA models in scientific laboratory protocol execution. |
| ACE-Ego-0 | VLA pretraining data | Paper | Unifying egocentric human video and robot trajectories for VLA pretraining. |
| MuseVLA | multisensory VLA | Paper | Invoking non-RGB sensors as adaptive tools for robotic manipulation. |
| WireCraft | manipulation benchmark | Paper | Industrial deformable-linear-object manipulation benchmark with shared evaluation. |
| Workflow | Practical stack |
|---|---|
| GUI grounding research | ScreenSpot-Pro + OmniParser + UGround + GUI-G1 + SafeGround + UI-Zoomer + AutoFocus + PAGER + AQuaUI |
| Browser-agent experiments | BrowserGym + AgentLab + WebArena + WebArena-Verified + VisualWebArena + WebLINX + MMInA + WebGym + ShopGym + SaaS-Bench |
| Desktop computer-use agents | UI-TARS Desktop + Agent S + Cua + OSWorld + Windows Agent Arena + OS-Marathon + OpenComputer + ScreenSearch + CutVerse |
| Mobile GUI agents | Android in the Wild + AndroidControl + AndroidWorld + A3 + MobileAgentBench + SPA-Bench + MemGUI-Bench + PSPA-Bench + OmniGUI + UI-Mem |
| Demonstration and data pipelines | OpenAdapt + OpenAdapt Desktop + ScreenPipe + ShowUI-Aloha + CUA-Suite + Video2GUI + C-World + OpenComputer |
| Agentic visual creation | gpt_image_2_skill + DiffusionAgent + GenArtist + DeepEyes + Agent Banana + VisionCreator + VisionCreator-R1 + GEMS + GenEvolve |
| General visual agents and tool use | Visual Agentic RFT + VisualToolAgent + Orion + Kimi K2.5 + VTC-Bench + MMSkills + Visual Agentic Memory |
| Visualization and chart agents | AVA + ParaView-MCP + SciVisAgentBench + SASAV + Beyond Pixels + DV-World + HierVA |
| Embodied VLA research | OpenVLA + LeRobot + LIBERO + RLBench + SAGE + Magma + VLAs-as-Tools + VLA-REPLICA + SafeManip + Pre-VLA |
| Reliability and security testing | VPI-Bench + OpenAgentSafety + UI-CUBE + GUIDE + CORA + OS-BLIND + HazardArena + OS-SPEAR + WARD + Pre-VLA |

| Resource | Link | Why read it |
|---|---|---|
| OpenAI Computer Use guide | Docs | Developer-facing guide for building with computer-use tooling. |
| OpenAI Computer-Using Agent | Article | Product and research framing for modern computer-using agents (CUAs). |
| OpenAI Skills guide | Docs | Practical reference for reusable agent skills. |
| OpenAI MCP and Connectors guide | Docs | Reference for connecting external tools and services to agents. |
| Anthropic: Developing a computer use model | Article | Strong public engineering writeup on GUI-agent training and evaluation. |
| Anthropic: Introducing computer use | Article | System framing and deployment context for computer-use models. |
| Google DeepMind: Gemini Robotics | Article | Industry view on embodied visual agents. |
| Google DeepMind: Gemini Robotics On-Device | Article | Notes on low-latency, local VLA deployment. |

| Repository | Link | Notes |
|---|---|---|
| Awesome-GUI-Agents | GitHub | Focused companion index for GUI grounding and automation papers. |
| GUI-Agents-Paper-List | GitHub | Systematic paper index focused on GUI agents. |
| awesome-ui-agents | GitHub | Neighboring index for UI-agent papers and projects. |
| Evolving Visual Generation | GitHub | Adjacent map for visual-generation systems. |
| Awesome Multimodal Modeling | GitHub | Broader multimodal modeling list beyond the stricter agent boundary here. |

Pull requests are welcome when they improve precision rather than volume.
Recommended metadata:
- The paper title or project name.
- Official paper, code, project page, or documentation link.
- The category in this index that best fits the item.
- One sentence explaining the visual-agent loop, benchmark role, or builder value.
Out of scope:
- Generic multimodal model releases with no visual-agent evaluation.
- One-shot generation papers without planning, tools, search, critique, or interaction.
- Duplicate benchmark rows unless the new row adds a distinct environment or protocol.
- Unverified arXiv IDs, placeholder links, and marketing-only announcements.
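
Several of the checks above can be run as a quick pre-submission script. The following is a minimal sketch under stated assumptions, not tooling that ships with this repository: it assumes four-column rows of the form `| Name | Type | Link | Note |` with a Markdown link in the third cell, and the names `parse_row`, `check_rows`, and `ARXIV_ID` are hypothetical. It only checks that an arXiv link has the shape of a modern identifier (`YYMM.NNNN` or `YYMM.NNNNN`, optionally with a version suffix). Confirming that the identifier resolves to the intended paper still requires a manual look.

```python
import re

# Assumption: only new-style arXiv identifiers are checked. Old-style
# identifiers such as cs/0601001 are out of scope for this sketch.
ARXIV_ID = re.compile(
    r"arxiv\.org/(?:abs|pdf)/\d{2}(?:0[1-9]|1[0-2])\.\d{4,5}(?!\d)(?:v\d+)?"
)


def parse_row(line):
    """Split one Markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def check_rows(lines):
    """Return (line_number, problem) pairs for a list of proposed rows."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        cells = parse_row(line)
        # Name, category, link, and one-sentence note must all be present.
        if len(cells) != 4 or not all(cells):
            problems.append((number, "expected four non-empty cells"))
            continue
        name, _category, link, _note = cells
        # Case-insensitive duplicate check on the entry name.
        key = name.casefold()
        if key in seen:
            problems.append((number, f"duplicate entry: {name}"))
        seen.add(key)
        # Format-only check: this does not prove the paper exists.
        if "arxiv.org" in link and not ARXIV_ID.search(link):
            problems.append((number, "arXiv link lacks a well-formed identifier"))
    return problems
```

Calling `check_rows` on the rows of a proposed change returns a list of `(line_number, problem)` pairs. An empty list means only that these format checks passed. Whether a repeated benchmark adds a distinct environment or protocol remains a reviewer decision.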
This repository is maintained as a precision-oriented research map:
- Prefer primary sources: official papers, project pages, code repositories, datasets, benchmarks, and technical documentation.
- Keep research entries, benchmarks, and engineering resources separated when their roles differ.
- Add recent work only when it improves the conceptual coverage, empirical coverage, or builder utility of the map.
- Verify arXiv identifiers, project links, and benchmark names before adding new entries.
- Prune duplicate, weakly scoped, or marketing-only entries even when they are recent.
- Preserve a strict visual-agent boundary: perception alone is not sufficient without grounding, planning, tool use, interaction, control, or agent-oriented evaluation.
If you use this curated index in research or engineering work, please cite it as:
@misc{awesome-visual-agent,
  title = {Awesome Visual Agent},
  author = {OpenEnvision and contributors},
  year = {2026},
  howpublished = {\url{https://github.com/OpenEnvision/Awesome-Visual-Agent}},
  note = {Curated list of visual-agent papers, benchmarks, and tooling}
}