Support open-weight Holo-1 VLM & Runner H UI-agent (high-accuracy, OSS computer-use models)

### **Motivation / Background**

AskUI Vision Agent already lets users hot-swap between PTA-1, Qwen-VL, Claude Sonnet, etc. (see model matrix in the README).

Since 10 June 2025, the community has had an even stronger OSS option:

* **Holo-1** (3 B & 7 B) – best-in-class open-weight UI VLM, ∼76 % screen-element F1.
* **Runner H / Surfer H** – thin REST layer that turns Holo-1 into a fully autonomous computer-use *act* model (LAM), matching Claude’s “Computer Use” API at \~¼ cost.

Adding first-class support will:

* Give users an enterprise-friendly, self-hostable alternative to Anthropic & PTA-1.
* Offer higher accuracy on real-world web UIs than OS-Atlas / ShowUI.
* Keep Vision Agent the canonical “orchestrator” for every serious UI-automation model out there.

---

### **Proposed Design**

Vision Agent’s architecture already encourages pluggable models via `ActModel`, `GetModel`, `LocateModel`, and the `ModelRegistry` (see “Custom Models” section).
We simply need two thin adapters:

| File                          | Type                                     | Registers as                   | Notes                                                                                                                                   |
| ----------------------------- | ---------------------------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
| `src/askui/models/holo1.py`   | `LocateModel`                            | `"holo-1"`                     | Pure-Python HF adapter<br>uses 🤗 `AutoModelForVision2Seq` + `AutoProcessor`.                                                           |
| `src/askui/models/runnerh.py` | `ActModel` & (`LocateModel` passthrough) | `"runner-h"` (or `"surfer-h"`) | REST client to Runner H SaaS **or** self-hosted container.<br>Mimics Claude Computer-Use JSON schema so upstream logic stays unchanged. |

Both classes are <200 LOC and require **no** change to core agent logic.
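
One design point worth sketching: the Holo-1 adapter loads its weights in `__init__`, so it is probably better to register it lazily and only pay that cost when the model is first requested. The snippet below is a plain-Python illustration of that idea, not AskUI's actual `ModelRegistry`; `LazyRegistry` and `FakeLocateModel` are hypothetical names used only here.

```python
from typing import Callable, Union


class FakeLocateModel:
    """Hypothetical stand-in for an expensive adapter such as Holo1LocateModel."""

    instances = 0  # counts how many times the constructor ran

    def __init__(self) -> None:
        FakeLocateModel.instances += 1


# An entry is either a ready-made model instance or a zero-argument factory.
Entry = Union[object, Callable[[], object]]


class LazyRegistry:
    """Resolves names to models, building factory-backed entries on first use."""

    def __init__(self, entries: dict[str, Entry]) -> None:
        self._entries = dict(entries)
        self._cache: dict[str, object] = {}

    def get(self, name: str) -> object:
        if name not in self._cache:
            entry = self._entries[name]
            # Classes and functions are callable; plain instances are stored as-is.
            self._cache[name] = entry() if callable(entry) else entry
        return self._cache[name]


registry = LazyRegistry({"holo-1": FakeLocateModel})  # nothing is constructed yet
```

Registering the class itself (rather than `FakeLocateModel()`) means importing the package stays cheap, and repeated `registry.get("holo-1")` calls reuse a single instance.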

---

### **Implementation Steps**

1. **Dependencies**
   *`pyproject.toml`*

   ```toml
   transformers = {extras = ["vision"], version = ">=4.42"}
   huggingface-hub = ">=0.23"
   httpx = {version = ">=0.27", extras = ["http2"]}
   ```
2. **`src/askui/models/holo1.py`**

   ```python
   import os
   import re

   import torch
   from askui import ImageSource, LocateModel, Locator, Point
   from PIL import Image
   from transformers import AutoModelForVision2Seq, AutoProcessor


   class Holo1LocateModel(LocateModel):
       def __init__(self,
                    model_id: str | None = None,
                    device: str | None = None):
           self.model_id = model_id or os.getenv("HOLO1_MODEL_ID", "HCompany/Holo1-7B")
           self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
           self.processor = AutoProcessor.from_pretrained(self.model_id)
           # Half precision on GPU, full precision on CPU.
           dtype = torch.float16 if self.device == "cuda" else torch.float32
           self.model = AutoModelForVision2Seq.from_pretrained(
               self.model_id, torch_dtype=dtype
           ).to(self.device).eval()

       def locate(self,
                  locator: str | Locator,
                  image: ImageSource,
                  model_choice: str | None = None) -> Point:
           prompt = locator if isinstance(locator, str) else locator.description
           # Assumption: ImageSource may expose `to_numpy()`; otherwise treat it as a PIL image.
           pil: Image.Image = Image.fromarray(image.to_numpy()) if hasattr(image, "to_numpy") else image
           inputs = self.processor(text=prompt, images=pil, return_tensors="pt").to(self.device)
           with torch.inference_mode():  # no gradients needed at inference time
               output = self.model.generate(**inputs, max_new_tokens=16)[0]
           # Decoder-only VLMs return prompt + completion; keep only the newly generated tokens.
           new_tokens = output[inputs["input_ids"].shape[1]:]
           text = self.processor.decode(new_tokens, skip_special_tokens=True)
           # Use the first two numbers in the output as (x, y), whatever the surrounding format.
           numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
           if len(numbers) < 2:
               raise ValueError(f"Could not parse coordinates from model output: {text!r}")
           return int(float(numbers[0])), int(float(numbers[1]))
   ```
3. **`src/askui/models/runnerh.py`**

   ```python
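   import os
   import time

   import httpx
   from askui import ActModel, MessageParam, OnMessageCb


   class RunnerHActModel(ActModel):
       # Endpoint paths and payload fields below are assumptions of this proposal;
       # they need to be confirmed against the Runner H API.
       def __init__(self, base_url: str | None = None, api_key: str | None = None,
                    timeout_s: float = 600.0):
           self.base_url = base_url or os.getenv("RUNNERH_URL", "https://api.runnerh.com/v1")
           self.api_key = api_key or os.getenv("RUNNERH_API_KEY")
           self.timeout_s = timeout_s  # upper bound on total polling time
           # Only send an Authorization header when a key is configured (avoids "Bearer None").
           headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}
           self.client = httpx.Client(base_url=self.base_url, headers=headers)

       def act(self, messages: list[MessageParam], model_choice: str, on_message: OnMessageCb | None = None):
           goal = messages[0].content
           resp = self.client.post("/agent/act", json={"goal": goal, "ui_mode": "vision"})
           resp.raise_for_status()
           task_id = resp.json()["id"]

           deadline = time.monotonic() + self.timeout_s
           while True:
               r = self.client.get(f"/agent/status/{task_id}")
               r.raise_for_status()
               state = r.json()
               if on_message and state.get("last_event") is not None:
                   on_message(role="assistant", content=state["last_event"])
               if state["status"] == "finished":
                   return
               if state["status"] == "failed":
                   raise RuntimeError(state.get("error", "Runner H task failed"))
               if time.monotonic() >= deadline:
                   # Without a deadline a stuck task would block the agent forever.
                   raise TimeoutError(f"Runner H task {task_id} did not finish in {self.timeout_s}s")
               time.sleep(1)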
   ```
4. **Export registry helpers**
   *`src/askui/models/__init__.py`*

   ```python
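   from .holo1 import Holo1LocateModel     # noqa
   from .runnerh import RunnerHActModel    # noqa
   # ...
   # Register the classes (zero-argument factories) rather than instances, so the
   # Holo-1 weights are not downloaded and loaded at import time. This assumes the
   # registry accepts factories for lazy initialization.
   DEFAULT_MODELS.update({
       "holo-1": Holo1LocateModel,
       "runner-h": RunnerHActModel,
   })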
   ```
5. **README refresh**
   Add Holo-1 & Runner H rows to the model matrix with short descriptions & example usage:

   ```python
   with VisionAgent(model={"locate": "holo-1", "act": "runner-h"}) as agent:
       agent.act("Book a flight from JNB to CPT next Friday")
   ```
6. **Integration tests**
   *`tests/test_holo1_integration.py`* – smoke test that feeds a static screenshot & asserts coordinates are within expected bounds (skip if `HOLO1_SKIP_TEST` env var set).
   *`tests/test_runnerh_async.py`* – mocked HTTP server verifying polling logic.
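
The polling behaviour that `tests/test_runnerh_async.py` is meant to verify can be exercised without any network if the loop takes its status source, sleep function, and clock as parameters. The following is a minimal sketch under that assumption; `poll_until_done` and its parameters are hypothetical names, and the `status` / `error` fields mirror the response shape assumed in this proposal rather than a confirmed Runner H schema.

```python
import time
from typing import Any, Callable, Mapping


def poll_until_done(
    fetch_status: Callable[[], Mapping[str, Any]],
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Mapping[str, Any]:
    """Call fetch_status() until it reports 'finished', 'failed', or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_status()
        status = state.get("status")
        if status == "finished":
            return state
        if status == "failed":
            raise RuntimeError(state.get("error", "task failed"))
        if clock() >= deadline:
            raise TimeoutError(f"task not finished after {timeout_s}s")
        sleep(interval_s)


# Usage with a scripted status sequence and a no-op sleep, as a test would do.
scripted = iter([{"status": "running"}, {"status": "finished", "result": "ok"}])
final_state = poll_until_done(lambda: next(scripted), sleep=lambda _: None)
```

Injecting `sleep` and `clock` keeps the test fast and deterministic; the real adapter would pass a function that performs the HTTP GET as `fetch_status`.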

---

### **Open Questions**

* **Licensing** – Holo-1 is Apache-2.0 and fully compatible; the Runner H SaaS is in free beta, but its REST API is licensed under CC-BY-SA. Any legal blockers?
* **Weights caching** – should we instruct users to set `HF_HOME` or rely on the default cache?
* **GPU vs CPU** – default to half-precision on CUDA; otherwise full FP32 on CPU (≈ 3 s inference @ 7 B).

---

### **Acceptance Criteria**

* [ ] `pip install "askui[holo1]"` pulls new deps & passes CI.
* [ ] `VisionAgent(..., model="holo-1")` can `click()` on a test screenshot with <45 px error.
* [ ] `agent.act(..., model="runner-h")` succeeds end-to-end on the public Runner H demo page.
* [ ] Docs + README updated; model table lists accuracy / cost figures.
* [ ] Telemetry respects existing opt-out env var.
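
For the "<45 px error" criterion, the integration test needs a concrete definition of error. A minimal sketch, assuming error means Euclidean distance between the predicted click point and a hand-labelled target; `pixel_error` and `within_tolerance` are hypothetical helper names.

```python
import math

Point = tuple[int, int]


def pixel_error(predicted: Point, expected: Point) -> float:
    """Euclidean distance in pixels between two (x, y) points."""
    return math.hypot(predicted[0] - expected[0], predicted[1] - expected[1])


def within_tolerance(predicted: Point, expected: Point, max_error_px: float = 45.0) -> bool:
    """True if the prediction is strictly closer than max_error_px to the target."""
    return pixel_error(predicted, expected) < max_error_px


# A prediction 3 px right of and 4 px below the target is 5 px off.
example_error = pixel_error((103, 204), (100, 200))
```

A per-axis bound (for example, inside the element's bounding box) would be an equally reasonable reading of the criterion; whichever is chosen should be stated in the test.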

Happy to put together a PR if this direction is approved!
