Support open-weight Holo-1 VLM & Runner H UI-agent (high-accuracy, OSS computer-use models) #81

@tristdrum

Motivation / Background

AskUI Vision Agent already lets users hot-swap between PTA-1, Qwen-VL, Claude Sonnet, etc. (see model matrix in the README).

Since 10 June 2025 the community has had an even stronger OSS option:

  • Holo-1 (3B and 7B) – best-in-class open-weight UI VLM, ~76% screen-element F1.
  • Runner H / Surfer H – thin REST layer that turns Holo-1 into a fully autonomous computer-use act model (LAM), matching Claude’s “Computer Use” API at roughly a quarter of the cost.

Adding first-class support will:

  • Give users an enterprise-friendly, self-hostable alternative to Anthropic & PTA-1.
  • Offer higher accuracy on real-world web UIs than OS-Atlas / ShowUI.
  • Keep Vision Agent the canonical “orchestrator” for every serious UI-automation model out there.

Proposed Design

Vision Agent’s architecture already encourages pluggable models via ActModel, GetModel, LocateModel, and the ModelRegistry (see “Custom Models” section).
We simply need two thin adapters:

| File | Type | Registers as | Notes |
| --- | --- | --- | --- |
| src/askui/models/holo1.py | LocateModel | "holo-1" | Pure-Python HF adapter; uses 🤗 AutoModelForVision2Seq + AutoProcessor. |
| src/askui/models/runnerh.py | ActModel & (LocateModel passthrough) | "runner-h" (or "surfer-h") | REST client to Runner H SaaS or self-hosted container. Mimics Claude Computer-Use JSON schema so upstream logic stays unchanged. |

Both classes are <200 LOC and require no change to core agent logic.


Implementation Steps

  1. Dependencies
    pyproject.toml

    transformers = {extras = ["vision"], version = ">=4.42"}
    huggingface-hub = ">=0.23"
    httpx = {version = ">=0.27", extras = ["http2"]}
  2. src/askui/models/holo1.py

    import os
    import re

    import torch
    from PIL import Image
    from transformers import AutoModelForVision2Seq, AutoProcessor

    from askui import ImageSource, LocateModel, Locator, Point


    class Holo1LocateModel(LocateModel):
        """Locate UI elements with a locally loaded Holo-1 checkpoint."""

        def __init__(self,
                     model_id: str | None = None,
                     device: str | None = None):
            self.model_id = model_id or os.getenv("HOLO1_MODEL_ID", "HCompany/Holo1-7B")
            self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
            self.processor = AutoProcessor.from_pretrained(self.model_id)
            # Half precision on CUDA, full FP32 on CPU.
            self.model = AutoModelForVision2Seq.from_pretrained(
                self.model_id,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            ).to(self.device).eval()

        def locate(self,
                   locator: str | Locator,
                   image: ImageSource,
                   model_choice: str | None = None) -> Point:
            prompt = locator if isinstance(locator, str) else locator.description
            # Assumes ImageSource exposes to_numpy(); otherwise it is treated as a PIL image.
            pil: Image.Image = Image.fromarray(image.to_numpy()) if hasattr(image, "to_numpy") else image
            inputs = self.processor(text=prompt, images=pil, return_tensors="pt").to(self.device)
            with torch.inference_mode():  # no gradients needed at inference time
                tokens = self.model.generate(**inputs, max_new_tokens=8)[0]
            text = self.processor.decode(tokens, skip_special_tokens=True)
            # Assumes the completion ends with an "x y" pair. Taking the last two
            # numbers also works if the decoded text echoes the prompt.
            numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
            if len(numbers) < 2:
                raise ValueError(f"Could not parse coordinates from model output: {text!r}")
            return int(float(numbers[-2])), int(float(numbers[-1]))
  3. src/askui/models/runnerh.py

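    import os
    import time

    import httpx

    from askui import ActModel, MessageParam, OnMessageCb


    class RunnerHActModel(ActModel):
        """REST client for a Runner H endpoint (SaaS or self-hosted)."""

        def __init__(self, base_url: str | None = None, api_key: str | None = None,
                     timeout_s: float = 600.0):  # timeout_s: assumed default, tune as needed
            self.base_url = base_url or os.getenv("RUNNERH_URL", "https://api.runnerh.com/v1")
            self.api_key = api_key or os.getenv("RUNNERH_API_KEY")
            self.timeout_s = timeout_s
            # Only send the header when a key is set (avoids "Bearer None").
            headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}
            self.client = httpx.Client(base_url=self.base_url, headers=headers)

        def act(self, messages: list[MessageParam], model_choice: str, on_message: OnMessageCb | None = None):
            goal = messages[0].content
            resp = self.client.post("/agent/act", json={"goal": goal, "ui_mode": "vision"})
            resp.raise_for_status()
            task_id = resp.json()["id"]

            # Poll until the task reaches a terminal state or the deadline passes.
            deadline = time.monotonic() + self.timeout_s
            while True:
                r = self.client.get(f"/agent/status/{task_id}")
                r.raise_for_status()
                state = r.json()
                if on_message:
                    on_message(role="assistant", content=state["last_event"])
                if state["status"] == "finished":
                    return
                if state["status"] == "failed":
                    raise RuntimeError(state["error"])
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"Runner H task {task_id} did not finish in {self.timeout_s} s")
                time.sleep(1)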
  4. Export registry helpers
    src/askui/models/__init__.py

    from .holo1 import Holo1LocateModel     # noqa
    from .runnerh import RunnerHActModel    # noqa
    # ...
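    # NOTE: instantiating here loads the Holo-1 weights at import time;
    # a lazy factory may be preferable if the registry supports one.
    DEFAULT_MODELS.update({
        "holo-1": Holo1LocateModel(),
        "runner-h": RunnerHActModel(),
    })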
  5. README refresh
    Add Holo-1 & Runner H rows to the model matrix with short descriptions & example usage:

    with VisionAgent(model={"click": "holo-1", "act": "runner-h"}) as agent:
        agent.act("Book a flight from JNB to CPT next Friday")
  6. Integration tests
    tests/test_holo1_integration.py – smoke test that feeds a static screenshot & asserts coordinates are within expected bounds (skip if HOLO1_SKIP_TEST env var set).
    tests/test_runnerh_async.py – mocked HTTP server verifying polling logic.
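
The polling logic that step 6 wants to verify can be exercised without a network or real sleeps if the loop takes its status source, sleep function, and clock as parameters. The following is a minimal sketch under that assumption, reusing the "finished" / "failed" status values from step 3. `poll_until_finished`, `PollTimeoutError`, and the `fetch_status`, `sleep`, and `clock` parameters are hypothetical names for illustration, not part of AskUI or Runner H.

```python
import time
from typing import Any, Callable, Mapping


class PollTimeoutError(TimeoutError):
    """Raised when the task does not reach a terminal state in time."""


def poll_until_finished(
    fetch_status: Callable[[], Mapping[str, Any]],
    *,
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Mapping[str, Any]:
    """Call fetch_status() until it reports "finished" or "failed".

    Returns the final state on success, raises RuntimeError on failure,
    and raises PollTimeoutError once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_status()
        status = state.get("status")
        if status == "finished":
            return state
        if status == "failed":
            raise RuntimeError(state.get("error", "task failed"))
        if clock() >= deadline:
            raise PollTimeoutError(f"no terminal status after {timeout_s} s")
        sleep(interval_s)
```

In a test, `fetch_status` can be a lambda over a canned list of states and `sleep` can be a list's `append`, so the test asserts on the number of polls instead of waiting on wall-clock time.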


Open Questions

  • Licensing – Holo-1 is Apache-2.0 and fully compatible; the Runner H SaaS is in free beta, but its REST API is under CC-BY-SA. Any legal blockers?
  • Weights caching – should we instruct users to set HF_HOME or rely on the default cache?
  • GPU vs CPU – default to half-precision on CUDA; otherwise full FP32 on CPU (≈ 3 s inference @ 7 B).
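
For the weights-caching question, it may help to see where the default cache ends up. The sketch below approximates the precedence that huggingface_hub documents for its environment variables (HF_HUB_CACHE, then HF_HOME, then XDG_CACHE_HOME, then ~/.cache). `resolve_hf_hub_cache` is a hypothetical helper written for this issue, not a library function, and it ignores the older HUGGINGFACE_HUB_CACHE alias.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where hub downloads would be cached.

    Precedence (approximation of the documented behaviour):
    HF_HUB_CACHE > $HF_HOME/hub > $XDG_CACHE_HOME/huggingface/hub
    > ~/.cache/huggingface/hub
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg).expanduser() if xdg else Path.home() / ".cache"
    return base / "huggingface" / "hub"
```

If the default location is acceptable for most users, the docs could simply mention HF_HOME as the override for machines with a small home partition.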

Acceptance Criteria

  • pip install askui[holo1] pulls new deps & passes CI.
  • VisionAgent(..., model="holo-1") can click() on a test screenshot with <45 px error.
  • agent.act(..., model="runner-h") succeeds end-to-end on the public Runner H demo page.
  • Docs + README updated; model table lists accuracy / cost figures.
  • Telemetry respects existing opt-out env var.
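
The "<45 px error" criterion needs a definition of error before it can be asserted in a test. One possible reading, sketched below, is the Euclidean distance between the predicted click point and a hand-labelled target. That interpretation is an assumption, and `pixel_error` and `within_tolerance` are hypothetical helper names.

```python
import math
from typing import Tuple

PointXY = Tuple[int, int]


def pixel_error(predicted: PointXY, expected: PointXY) -> float:
    """Euclidean distance in pixels between two (x, y) points."""
    return math.hypot(predicted[0] - expected[0], predicted[1] - expected[1])


def within_tolerance(predicted: PointXY, expected: PointXY,
                     max_error_px: float = 45.0) -> bool:
    """True if the prediction is strictly closer than max_error_px."""
    return pixel_error(predicted, expected) < max_error_px
```

The integration test could then assert `within_tolerance(agent_point, labelled_point)` for each static screenshot.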

Happy to put together a PR if this direction is approved!
