Robust UI element localization for automation.
Turn flaky single-frame detections into stable, reliable element coordinates.
Vision models like OmniParser miss elements at random from frame to frame ("flickering"), and template matching breaks when resolution or theme changes.
Left: Raw detections showing frame-to-frame flickering
- Temporal Smoothing: Aggregate detections across frames, keep only stable elements
- Text Anchoring: Match elements by OCR text (resolution-independent)
Side-by-side: Raw flickering (left) vs Stabilized detection (right)
| Metric | Raw (30% dropout) | Stabilized |
|---|---|---|
| Avg Detection Rate | ~60-70% | 80-100% |
| Min Detection Rate | ~40% | Consistent |
| Consistency | Flickering | Stable |
| Scale | Resolution | Elements Found | Status |
|---|---|---|---|
| 1.0x | 800x600 | All | ✓ |
| 1.25x | 1000x750 | All | ✓ |
| 1.5x | 1200x900 | All | ✓ |
| 2.0x | 1600x1200 | All | ✓ |
Stable elements after temporal filtering
Install:

```bash
uv pip install openadapt-grounding
```

Build a registry from multiple frames of detections:

```python
from openadapt_grounding import RegistryBuilder, Element

# Add detections from multiple frames
builder = RegistryBuilder()
builder.add_frame([
    Element(bounds=(0.3, 0.2, 0.2, 0.05), text="Login"),
    Element(bounds=(0.3, 0.3, 0.2, 0.05), text="Cancel"),
])
# ... add more frames

# Build registry (keeps elements in >50% of frames)
registry = builder.build(min_stability=0.5)
registry.save("elements.json")
```

Locate elements at runtime:

```python
from openadapt_grounding import ElementLocator
from PIL import Image

locator = ElementLocator("elements.json")
screenshot = Image.open("current_screen.png")

result = locator.find("Login", screenshot)
if result.found:
    # Normalized coordinates (0-1)
    print(f"Found at ({result.x:.2f}, {result.y:.2f})")

    # Convert to pixels
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")
```

Run the demo:

```bash
uv run python -m openadapt_grounding.demo
```

Output:
```
============================================================
OpenAdapt Grounding Demo Results
============================================================

Registry: 5 stable elements

📊 Detection Stability:
  Raw (with 30% dropout): 70%
  Stabilized (filtered): 100%
  Improvement: +30%

📐 Resolution Robustness:
  ✓ 1.0x (800x600): 5 elements
  ✓ 1.25x (1000x750): 5 elements
  ✓ 1.5x (1200x900): 5 elements
  ✓ 2.0x (1600x1200): 5 elements

📁 Outputs saved to: demo_output/
```
Temporal smoothing aggregates detections across frames; each element's stability is the fraction of frames in which it was detected:

```
Frame 1: [Login ✓] [Cancel ✓] [Password ✗] → 2/3 detected
Frame 2: [Login ✓] [Cancel ✗] [Password ✓] → 2/3 detected
Frame 3: [Login ✓] [Cancel ✓] [Password ✓] → 3/3 detected
...

After 10 frames:
- "Login" seen 9/10 times → KEEP (90% stability)
- "Cancel" seen 7/10 times → KEEP (70% stability)
- "Password" seen 8/10 times → KEEP (80% stability)
```
At runtime, we use OCR to find text on screen, then match against the registry:
# Registry knows "Login" button exists
# OCR finds "Login" text at (0.45, 0.35)
# → Return those coordinates with high confidenceUse with OmniParser for real UI element detection:
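A minimal sketch of that text-matching step (illustrative only; `match_by_text` and the `ocr_detections` list of `(text, x, y)` tuples are assumptions, not the library's actual internals):

```python
# Illustrative only: match OCR'd text against a registry entry by normalized text.
def match_by_text(query: str, ocr_detections: list[tuple[str, float, float]]):
    """Return (x, y) of the OCR detection whose text matches `query`, else None."""
    target = query.strip().lower()
    for text, x, y in ocr_detections:
        if text.strip().lower() == target:
            return (x, y)  # normalized coordinates from OCR
    return None

# "Login" found by OCR at normalized (0.45, 0.35)
print(match_by_text("Login", [("Login", 0.45, 0.35), ("Cancel", 0.45, 0.45)]))
```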
Use with OmniParser for real UI element detection:

```bash
# Install deploy dependencies
uv pip install openadapt-grounding[deploy]

# Set AWS credentials (or use .env file)
cp .env.example .env
# Edit .env with your AWS credentials

# Deploy to EC2 (g6.xlarge with L4 GPU)
uv run python -m openadapt_grounding.deploy start

# Stop when done (terminates instance)
uv run python -m openadapt_grounding.deploy stop
```
```bash
# Check instance and server status
$ uv run python -m openadapt_grounding.deploy status
Instance: i-0f57529053cb507ca | State: running | URL: http://98.92.234.13:8000
Auto-shutdown: Enabled (60 min timeout)

# Show container status
$ uv run python -m openadapt_grounding.deploy ps
CONTAINER ID   IMAGE               CREATED       STATUS       PORTS                    NAMES
c9343a65e85b   omniparser:latest   2 hours ago   Up 2 hours   0.0.0.0:8000->8000/tcp   omniparser-container

# View container logs
$ uv run python -m openadapt_grounding.deploy logs --lines=5
INFO:     99.230.67.57:61252 - "POST /parse/ HTTP/1.1" 200 OK
start parsing...
image size: (1200, 779)
len(filtered_boxes): 160 124
time: 4.438266754150391

# Test endpoint with synthetic image
$ uv run python -m openadapt_grounding.deploy test
Server is healthy!
Sending test image to server...
Found 5 elements:
  - [text] "Login" at ['0.08', '0.10', '0.38', '0.23']
  - [text] "Cancel" at ['0.08', '0.30', '0.38', '0.43']
  ...
```

```bash
uv run python -m openadapt_grounding.deploy build   # Rebuild Docker image
uv run python -m openadapt_grounding.deploy run     # Start container
uv run python -m openadapt_grounding.deploy ssh     # SSH into instance
```

Real screenshot parsed by OmniParser:
| Input | Output (160 elements detected) |
|---|---|
| ![]() | ![]() |
Synthetic UI test:
| Input | Output |
|---|---|
| ![]() | ![]() |
```bash
# Run test with synthetic UI
uv run python -m openadapt_grounding.deploy test --save_output
```

Build a stable registry from a deployed OmniParser server:

```python
from openadapt_grounding import OmniParserClient, collect_frames
from PIL import Image

# Connect to deployed server
client = OmniParserClient("http://<server-ip>:8000")

# Take a screenshot
screenshot = Image.open("screen.png")

# Run parser 10 times, keep elements in >50% of frames
registry = collect_frames(client, screenshot, num_frames=10, min_stability=0.5)
registry.save("stable_elements.json")

print(f"Found {len(registry)} stable elements")
```

Analyze per-element detection stability:

```python
from openadapt_grounding import OmniParserClient, analyze_stability

client = OmniParserClient("http://<server-ip>:8000")
stats = analyze_stability(client, screenshot, num_frames=10)

print(f"Average stability: {stats['avg_stability']:.0%}")
for elem in stats['elements']:
    print(f"  {elem['text']}: {elem['stability']:.0%}")
```

UI-TARS 1.5 is ByteDance's SOTA UI grounding model (61.6% on ScreenSpot-Pro). Use it for direct element localization by instruction.
```bash
# Install dependencies
uv pip install openadapt-grounding[deploy,uitars]

# Deploy to EC2 (g6.2xlarge with L4 GPU)
uv run python -m openadapt_grounding.deploy.uitars start

# Check status
uv run python -m openadapt_grounding.deploy.uitars status

# Test grounding
uv run python -m openadapt_grounding.deploy.uitars test

# Stop when done
uv run python -m openadapt_grounding.deploy.uitars stop
```

```python
from openadapt_grounding import UITarsClient
from PIL import Image

# Connect to deployed server
client = UITarsClient("http://<server-ip>:8001/v1")

# Load screenshot
screenshot = Image.open("screen.png")

# Ground element by instruction
result = client.ground(screenshot, "Click on the Login button")
if result.found:
    # Normalized coordinates (0-1)
    print(f"Found at ({result.x:.2f}, {result.y:.2f})")

    # Convert to pixels
    px, py = result.to_pixels(width=1920, height=1080)
    print(f"Click at ({px}, {py})")

# Optional: View model's reasoning
if result.thought:
    print(f"Thought: {result.thought}")
```

| Feature | OmniParser | UI-TARS |
|---|---|---|
| Approach | Parse all elements | Ground by query |
| Output | List of bboxes | Single click point |
| Best for | Enumeration, registry building | Direct element finding |
| Detection Rate (our benchmark) | 99.3% | 70.6% |
| Latency (per element) | ~1.4s | ~6.9s |
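One workable way to combine the two, sketched under the assumption that you have both a prebuilt registry and a deployed UI-TARS endpoint (the `locate` helper below is hypothetical, not part of the package):

```python
from PIL import Image
from openadapt_grounding import ElementLocator, UITarsClient

# Illustrative fallback pattern, not a prescribed workflow:
# try the prebuilt registry first, then ask UI-TARS directly.
def locate(query: str, screenshot: Image.Image, locator: ElementLocator, uitars: UITarsClient):
    result = locator.find(query, screenshot)
    if result.found:
        return result.x, result.y  # fast path: registry + OCR match
    grounded = uitars.ground(screenshot, f"Click on the {query} element")
    if grounded.found:
        return grounded.x, grounded.y  # slower path: direct grounding
    return None
```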
We provide a comprehensive evaluation framework to compare UI grounding methods.
Evaluated on synthetic dataset (100 samples, 1922 UI elements):
| Method | Detection Rate | IoU | Latency | Attempts |
|---|---|---|---|---|
| OmniParser + screenseeker | 99.3% | 0.690 | 1418ms | 2.0 |
| OmniParser + fixed | 98.1% | 0.681 | 1486ms | 2.2 |
| OmniParser baseline | 97.4% | 0.648 | 724ms | 1.0 |
| UI-TARS + screenseeker | 70.6% | - | 6914ms | 2.3 |
| UI-TARS + fixed | 66.9% | - | 6891ms | 2.4 |
| UI-TARS baseline | 36.1% | - | 2724ms | 1.0 |
On Harder Synthetic Data (48 samples, 1035 elements - more dense, smaller targets):
| Method | Detection Rate | Change from Standard |
|---|---|---|
| OmniParser + fixed | 98.2% | +0.1% |
| OmniParser + screenseeker | 96.6% | -2.7% |
| OmniParser baseline | 90.1% | -7.3% |
Key Findings:
- Cropping strategies dramatically improve UI-TARS accuracy (+96% with screenseeker) but have minimal effect on OmniParser on standard data
- On harder data, cropping becomes essential - OmniParser baseline drops 7.3% but fixed cropping maintains 98%+ accuracy
- OmniParser is 3.8-5x faster than UI-TARS while being significantly more accurate
Small elements (<32px) are hardest for UI-TARS (28.6% → 50% with cropping), while OmniParser maintains ~100% across all sizes.
OmniParser offers the best accuracy-latency tradeoff, with near-perfect detection at <1.5s per element.
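The fixed-cropping idea referenced above can be sketched as follows (illustrative only; the crop sizes follow the 200/300/500px settings in the method table below, and `crops_around` is a hypothetical helper, not the evaluation harness's API):

```python
from PIL import Image

# Illustrative sketch of multi-scale fixed cropping around a rough location.
# Re-running detection on progressively smaller crops centered on a prior
# guess makes small targets occupy more of the model's input.
CROP_SIZES = (500, 300, 200)  # px, per the method table below

def crops_around(screenshot: Image.Image, cx: int, cy: int):
    """Yield (crop, (left, top)) squares of decreasing size centered on (cx, cy)."""
    for size in CROP_SIZES:
        half = size // 2
        left = max(0, min(cx - half, screenshot.width - size))
        top = max(0, min(cy - half, screenshot.height - size))
        yield screenshot.crop((left, top, left + size, top + size)), (left, top)

# A detector would then run on each crop, and any hit is mapped back to
# full-image coordinates by adding the (left, top) offset.
```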
The evaluation uses programmatically generated UI screenshots with ground truth:
| Easy (3-8 elements) | Hard (20-50 elements, dark theme) |
|---|---|
| ![]() | ![]() |
```bash
# Install dependencies
uv pip install openadapt-grounding[eval]

# Generate synthetic dataset
uv run python -m openadapt_grounding.eval generate --type synthetic --count 100

# Run evaluation (requires deployed servers)
uv run python -m openadapt_grounding.eval run --method omniparser --dataset synthetic
uv run python -m openadapt_grounding.eval run --method uitars --dataset synthetic

# With cropping strategies
uv run python -m openadapt_grounding.eval run --method omniparser-screenseeker --dataset synthetic
uv run python -m openadapt_grounding.eval run --method uitars-screenseeker --dataset synthetic

# Generate comparison charts
uv run python -m openadapt_grounding.eval compare --charts-dir evaluation/charts
```

| Method | Description |
|---|---|
| `omniparser` | OmniParser baseline (full image) |
| `omniparser-fixed` | OmniParser + fixed cropping (200, 300, 500px) |
| `omniparser-screenseeker` | OmniParser + heuristic UI region cropping |
| `uitars` | UI-TARS baseline (full image) |
| `uitars-fixed` | UI-TARS + fixed cropping |
| `uitars-screenseeker` | UI-TARS + heuristic UI region cropping |
See Evaluation Documentation for methodology and metrics.
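As one example of those metrics, the IoU column in the results above is the standard intersection-over-union between predicted and ground-truth boxes. A minimal sketch (illustrative, not the harness's code):

```python
# Illustrative IoU between two boxes given as (x1, y1, x2, y2) in normalized coords.
def iou(a: tuple[float, float, float, float], b: tuple[float, float, float, float]) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0.1, 0.1, 0.4, 0.2), (0.15, 0.1, 0.4, 0.2)))  # ≈ 0.83
```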
RegistryBuilder:
- `add_frame(elements)` - Add a frame's detections
- `build(min_stability=0.5)` - Build registry, filtering unstable elements

ElementLocator:
- `find(query, screenshot)` - Find element by text
- `find_by_uid(uid, screenshot)` - Find element by registry UID

Locator results:
- `found: bool` - Whether element was found
- `x, y: float` - Normalized coordinates (0-1)
- `confidence: float` - Match confidence
- `to_pixels(w, h)` - Convert to pixel coordinates

OmniParserClient:
- `is_available()` - Check if server is running
- `parse(image)` - Parse screenshot, return elements
- `parse_with_metadata(image)` - Parse with latency info

UITarsClient:
- `is_available()` - Check if server is running
- `ground(image, instruction)` - Find element by instruction, return `GroundingResult`

GroundingResult:
- `found: bool` - Whether element was found
- `x, y: float` - Normalized coordinates (0-1)
- `confidence: float` - Match confidence
- `thought: str` - Model's reasoning (if `include_thought=True`)
- `to_pixels(w, h)` - Convert to pixel coordinates
Helper functions:
- `collect_frames(client, screenshot, num_frames, min_stability)` - Run parser multiple times, build stable registry
- `analyze_stability(client, screenshot, num_frames)` - Report per-element detection stability
| Document | Description |
|---|---|
| Evaluation Findings | Analysis of why OmniParser outperforms UI-TARS on our task |
| Literature Review | SOTA analysis: UI-TARS (61.6%), OmniParser (39.6%), ScreenSeekeR cropping |
| Experiment Plan | Comparison methodology: 6 methods, 3 datasets, evaluation metrics |
| Evaluation Harness | Benchmarking framework, dataset formats, CLI usage |
| UI-TARS Deployment | UI-TARS deployment design, vLLM setup, API format |
From our benchmark on synthetic UI data:
- OmniParser dominates on our task: 97-99% detection vs UI-TARS's 36-70%
- Cropping becomes essential on harder data: OmniParser baseline drops to 90%, but fixed cropping maintains 98%+
- OmniParser is 3.8-5x faster than UI-TARS while being more accurate
- Literature benchmarks don't transfer directly: UI-TARS leads on ScreenSpot-Pro (complex instruction-following) but OmniParser wins on element detection
- Small elements (<32px) remain hardest for UI-TARS (28.6% baseline → 50% with cropping)
See Evaluation Findings for analysis of why results differ from literature benchmarks.
```bash
git clone https://github.com/OpenAdaptAI/openadapt-grounding
cd openadapt-grounding
uv sync
uv run pytest
```

License: MIT