Motivation / Background
AskUI Vision Agent already lets users hot-swap between PTA-1, Qwen-VL, Claude Sonnet, etc. (see model matrix in the README).
Since 10 June 2025 the community has an even stronger OSS option:
- Holo-1 (3 B & 7 B) – best-in-class open-weight UI VLM, ∼76 % screen-element F1.
- Runner H / Surfer H – thin REST layer that turns Holo-1 into a fully autonomous computer-use act model (large action model, LAM), matching Claude’s “Computer Use” API at roughly a quarter of the cost.
Adding first-class support will:
- Give users an enterprise-friendly, self-hostable alternative to Anthropic & PTA-1.
- Offer higher accuracy on real-world web UIs than OS-Atlas / ShowUI.
- Keep Vision Agent the canonical “orchestrator” for every serious UI-automation model out there.
Proposed Design
Vision Agent’s architecture already encourages pluggable models via ActModel, GetModel, LocateModel, and the ModelRegistry (see “Custom Models” section).
We simply need two thin adapters:
| File | Type | Registers as | Notes |
| --- | --- | --- | --- |
| src/askui/models/holo1.py | LocateModel | "holo-1" | Pure-Python HF adapter; uses 🤗 AutoModelForVision2Seq + AutoProcessor. |
| src/askui/models/runnerh.py | ActModel & (LocateModel passthrough) | "runner-h" (or "surfer-h") | REST client to Runner H SaaS or self-hosted container. Mimics Claude Computer-Use JSON schema so upstream logic stays unchanged. |
Both classes are <200 LOC and require no change to core agent logic.
Implementation Steps
- Dependencies (pyproject.toml)
    transformers = {extras = ["vision"], version = ">=4.42"}
    huggingface-hub = ">=0.23"
    httpx = {version = ">=0.27", extras = ["http2"]}
- src/askui/models/holo1.py
    import os
    import re

    import torch
    from PIL import Image
    from transformers import AutoModelForVision2Seq, AutoProcessor

    from askui import ImageSource, LocateModel, Locator, Point


    class Holo1LocateModel(LocateModel):
        def __init__(self, model_id: str | None = None, device: str | None = None):
            self.model_id = model_id or os.getenv("HOLO1_MODEL_ID", "HCompany/Holo1-7B")
            self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
            self.processor = AutoProcessor.from_pretrained(self.model_id)
            # Half precision on GPU, full precision on CPU.
            dtype = torch.float16 if self.device == "cuda" else torch.float32
            self.model = AutoModelForVision2Seq.from_pretrained(
                self.model_id, torch_dtype=dtype
            ).to(self.device).eval()

        def locate(self,
                   locator: str | Locator,
                   image: ImageSource,
                   model_choice: str | None = None) -> Point:
            prompt = locator if isinstance(locator, str) else locator.description
            # Accept either a PIL image or an object exposing to_numpy().
            pil: Image.Image = (
                Image.fromarray(image.to_numpy()) if hasattr(image, "to_numpy") else image
            )
            inputs = self.processor(text=prompt, images=pil, return_tensors="pt").to(self.device)
            # Inference only; 8 new tokens may truncate multi-digit coordinates, so leave headroom.
            with torch.no_grad():
                tokens = self.model.generate(**inputs, max_new_tokens=32)[0]
            text = self.processor.decode(tokens, skip_special_tokens=True)
            # Use the last two numbers so an echoed prompt or wrapper text does not break parsing.
            numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
            if len(numbers) < 2:
                raise ValueError(f"Holo-1 returned no coordinates: {text!r}")
            x_str, y_str = numbers[-2:]
            return int(float(x_str)), int(float(y_str))
- src/askui/models/runnerh.py
    import os
    import time

    import httpx

    from askui import ActModel, MessageParam, OnMessageCb


    class RunnerHActModel(ActModel):
        def __init__(self, base_url: str | None = None, api_key: str | None = None):
            self.base_url = base_url or os.getenv("RUNNERH_URL", "https://api.runnerh.com/v1")
            self.api_key = api_key or os.getenv("RUNNERH_API_KEY")
            # Only send the Authorization header when a key is set, instead of "Bearer None".
            headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}
            self.client = httpx.Client(base_url=self.base_url, headers=headers)

        def act(self, messages: list[MessageParam], model_choice: str, on_message: OnMessageCb | None = None):
            # The first message carries the user's goal.
            goal = messages[0].content
            resp = self.client.post("/agent/act", json={"goal": goal, "ui_mode": "vision"})
            resp.raise_for_status()
            task_id = resp.json()["id"]
            # Poll the task once per second until it finishes or fails.
            while True:
                r = self.client.get(f"/agent/status/{task_id}")
                r.raise_for_status()
                state = r.json()
                if on_message:
                    on_message(role="assistant", content=state["last_event"])
                if state["status"] == "finished":
                    return
                elif state["status"] == "failed":
                    raise RuntimeError(state["error"])
                time.sleep(1)
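The loop above polls until the task reports finished or failed, with no upper bound on how long it waits. A minimal sketch of the same polling pattern with a deadline follows, assuming the status payload carries the "status" and "error" keys used above. `poll_until_done`, `fetch_status`, `timeout_s` and `interval_s` are hypothetical names, and the canned replies stand in for the HTTP call.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the task finishes, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_status()
        if state["status"] == "finished":
            return state
        if state["status"] == "failed":
            raise RuntimeError(state.get("error", "task failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {state['status']!r} after {timeout_s} s")
        sleep(interval_s)


# Stand-in for GET /agent/status/{task_id}: two "running" replies, then "finished".
_replies = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "finished"},
])
result = poll_until_done(lambda: next(_replies), interval_s=0.0)
```

Inside `RunnerHActModel.act`, `fetch_status` would wrap the status request, so a stalled task surfaces as a `TimeoutError` rather than blocking the agent indefinitely.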
- Export registry helpers (src/askui/models/__init__.py)
    from .holo1 import Holo1LocateModel  # noqa
    from .runnerh import RunnerHActModel  # noqa
    # ...
    DEFAULT_MODELS.update({
        "holo-1": Holo1LocateModel(),
        "runner-h": RunnerHActModel(),
    })
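Because the registry entries above are constructed when the module is imported, importing askui.models would load the Holo-1 weights even if no one selects that model. One way around this is to register factories and build each adapter on first use. The sketch below is self-contained and does not use the askui API: `LazyRegistry`, `register`, `get` and `FakeHolo1` are hypothetical names, and whether Vision Agent's ModelRegistry accepts callables should be checked against the “Custom Models” section.

```python
from typing import Any, Callable


class LazyRegistry:
    """Maps model names to factories and builds each model once, on first lookup."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Any]] = {}
        self._instances: dict[str, Any] = {}

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        self._factories[name] = factory

    def get(self, name: str) -> Any:
        if name not in self._instances:
            if name not in self._factories:
                raise KeyError(f"unknown model: {name!r}")
            self._instances[name] = self._factories[name]()
        return self._instances[name]


class FakeHolo1:
    """Stand-in for an expensive adapter; counts how often it is constructed."""

    loads = 0

    def __init__(self) -> None:
        FakeHolo1.loads += 1


registry = LazyRegistry()
registry.register("holo-1", FakeHolo1)  # nothing is constructed yet
loads_before = FakeHolo1.loads
first = registry.get("holo-1")          # constructed on first lookup
second = registry.get("holo-1")         # cached instance is reused
```

With this shape, the cost of loading weights is paid only by users who actually request "holo-1".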
- README refresh
  Add Holo-1 & Runner H rows to the model matrix with short descriptions & example usage:
    with VisionAgent(model={"click": "holo-1", "act": "runner-h"}) as agent:
        agent.act("Book a flight from JNB to CPT next Friday")
- Integration tests
  - tests/test_holo1_integration.py – smoke test that feeds a static screenshot & asserts coordinates are within expected bounds (skip if HOLO1_SKIP_TEST env var set).
  - tests/test_runnerh_async.py – mocked HTTP server verifying polling logic.
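The bounds check in the Holo-1 smoke test can be expressed as a Euclidean pixel distance between the predicted point and the centre of the expected element. A minimal sketch, using the 45 px figure from the acceptance criteria as the default tolerance; `pixel_error` and `within_tolerance` are hypothetical helpers and the coordinates are illustrative values, not measurements.

```python
import math

Point = tuple[int, int]


def pixel_error(predicted: Point, expected: Point) -> float:
    """Euclidean distance in pixels between two screen points."""
    return math.hypot(predicted[0] - expected[0], predicted[1] - expected[1])


def within_tolerance(predicted: Point, expected: Point, max_error_px: float = 45.0) -> bool:
    """True when the prediction lands strictly closer than max_error_px to the target."""
    return pixel_error(predicted, expected) < max_error_px


# Illustrative values only: centre of a button and a nearby prediction.
expected_center = (640, 360)
predicted = (652, 344)
error = pixel_error(predicted, expected_center)  # 12 px right, 16 px up -> 20.0
```

In the real test, `predicted` would come from the locate call on the static screenshot, and the assertion would be `within_tolerance(predicted, expected_center)`.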
Open Questions
- Licensing – Holo-1 is Apache-2.0, fully compatible; Runner H SaaS is in free beta, but its REST API is under CC-BY-SA. Any legal blockers?
- Weights caching – should we instruct users to set HF_HOME or rely on the default cache?
- GPU vs CPU – default to half precision (FP16) on CUDA; otherwise full FP32 on CPU (≈ 3 s inference @ 7 B).
Acceptance Criteria
- pip install askui[holo1] pulls new deps & passes CI.
- VisionAgent(..., model="holo-1") can click() on a test screenshot with <45 px error.
- agent.act(..., model="runner-h") succeeds end-to-end on the public Runner H demo page.
Happy to put together a PR if this direction is approved!