- Provide a hands-off, deterministic “vision” substrate that reads CI/PR/HIL/IDE panes and emits tiny, typed facts for the orchestrator. No clicking, no heuristics — just facts and JSON Patch deltas that gate automation.
- End-to-end firmware autopilot for K1-09: task intake → agentic coding → CI → HIL → merge, with human override only. Reliability comes from gates: compile/tests, HIL proofs, and this Vision layer’s ground truth.
- Task intake → structured issue spec
- Orchestrator (n8n) fans out, loops on failures, applies merge policy
- CI runs (PlatformIO, static checks, size budget)
- HIL: flash → serial/logic/camera → metrics
- Vision layer (this repo): AX/DOM first, OCR second; emits RFC-6902 JSON Patch deltas and optional RFC-7464 JSON-seq stream
- Sensors: macOS Accessibility (AX) for native panes; Playwright DOM for web UIs. OCR (Apple Vision) is fallback only.
- Capture: ScreenCaptureKit (stream or still). Avoid CGWindow APIs.
- Geometry mapping: Use per-frame attachments: ContentRect (points), ScaleFactor (display scale), ContentScale (backing scale). This yields visible rect and stable crop pixels on retina/multi-display systems.
- Parsing: deterministic regex/state machines → typed facts → JSON Patch deltas; stream via JSON Text Sequences when tailing.
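As a rough illustration of the "deterministic regex → typed facts" idea, the sketch below parses a CI summary string into a typed fact. The banner format, the `CiSummaryFact` type, and `parse_ci_summary` are assumptions made for this example; they are not taken from `parserd`.

```python
import re
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical banner format; the text of a real pane will differ.
_CI_SUMMARY_RE = re.compile(
    r"^(?P<passed>\d+) passed, (?P<failed>\d+) failed, (?P<pending>\d+) pending$"
)


@dataclass(frozen=True)
class CiSummaryFact:
    """Illustrative typed fact for a CI summary pane."""
    status: str  # "success", "failure", or "pending"
    passed: int
    failed: int
    pending: int


def parse_ci_summary(text: str) -> Optional[CiSummaryFact]:
    """Return a typed fact, or None when the text does not match exactly."""
    m = _CI_SUMMARY_RE.match(text.strip())
    if m is None:
        return None  # no guessing: unmatched text yields no fact
    passed, failed, pending = (int(m[k]) for k in ("passed", "failed", "pending"))
    if failed > 0:
        status = "failure"
    elif pending > 0:
        status = "pending"
    else:
        status = "success"
    return CiSummaryFact(status, passed, failed, pending)


if __name__ == "__main__":
    print(asdict(parse_ci_summary("12 passed, 0 failed, 1 pending")))
```

Returning `None` on any mismatch, rather than a best-effort guess, keeps the parser deterministic: the caller decides whether to fall back to another sensor.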
- CI: `CI_SUMMARY`, `CHECKS_LIST`, `CI_LOGS_DETAIL` (DOM)
- PR: `PR_BANNER`, `PR_DIFF_SUMMARY`, `PR_THREAD_SUMMARY` (DOM)
- IDE: `IDE_PROBLEMS`, `IDE_TERMINAL` (AX)
- HIL: `HIL_CHART`, `HIL_LOGS` (DOM/native as applicable)
- Locate
  - AX: traverse the `AXUIElement` tree, read roles/titles, get `kAXFrame`
  - DOM: Playwright locators (`getByRole`, `getByTestId`). `locator.boundingBox()` is viewport-relative, or null when the element is hidden
- Capture
  - Stream (`SCStream`) or still (`SCScreenshotManager`) with `SCContentFilter` (`SCWindow`)
  - Read per-frame attachments: ContentRect, ScaleFactor, ContentScale
- Map DOM/AX → pixels (per frame)
  - Given DOM bbox in CSS px and viewport {Wv,Hv}, and SCK visible rect {Vx,Vy,Vw,Vh} in pixels:
    - `sx = Vw/Wv`, `sy = Vh/Hv`
    - `crop.x = Vx + bbox.x * sx`, `crop.y = Vy + bbox.y * sy`
    - `crop.w = bbox.width * sx`, `crop.h = bbox.height * sy`
- Read Text (fallback)
  - Apple Vision `VNRecognizeTextRequest` with explicit languages, `.accurate`, and tuned `minimumTextHeight`
- Parse → facts → deltas
  - Source-specific parsers generate typed facts; emit RFC-6902 patches only on change; optional RFC-7464 streaming
- Confidence & debouncing
  - Fuse confidence: `min(structured_conf, ocr_conf, parse_score)`; only multi-read when confidence dips or AX/DOM disagrees with OCR
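Two of the steps above are plain arithmetic and can be sketched directly: mapping a DOM bounding box (CSS px) into capture pixels, and fusing confidence with `min(...)`. This is a minimal sketch, not the daemon's implementation. The `Rect` type, the function names, the `0.8` threshold, and the use of string equality as the "disagreement" test are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; units depend on context (CSS px or capture px)."""
    x: float
    y: float
    w: float
    h: float


def map_bbox_to_crop(bbox: Rect, viewport_w: float, viewport_h: float,
                     visible: Rect) -> Rect:
    """Map a viewport-relative DOM bbox (CSS px) into capture pixels.

    `visible` is the visible rect {Vx,Vy,Vw,Vh} in pixels for the same frame.
    """
    if viewport_w <= 0 or viewport_h <= 0:
        raise ValueError("viewport must have positive size")
    sx = visible.w / viewport_w  # sx = Vw/Wv
    sy = visible.h / viewport_h  # sy = Vh/Hv
    return Rect(
        x=visible.x + bbox.x * sx,
        y=visible.y + bbox.y * sy,
        w=bbox.w * sx,
        h=bbox.h * sy,
    )


def fuse_confidence(structured_conf: float, ocr_conf: float,
                    parse_score: float) -> float:
    """The weakest signal bounds the fused confidence."""
    return min(structured_conf, ocr_conf, parse_score)


def needs_multi_read(fused: float, structured_text: str, ocr_text: str,
                     threshold: float = 0.8) -> bool:
    """Re-read only when confidence dips or the two sources disagree.

    The threshold value is a placeholder, not a tuned setting.
    """
    return fused < threshold or structured_text != ocr_text


if __name__ == "__main__":
    # A 1280x720 CSS viewport captured as 2560x1440 px, offset 56 px down.
    crop = map_bbox_to_crop(Rect(100, 50, 200, 40), 1280, 720,
                            Rect(0, 56, 2560, 1440))
    print(crop)
    print(needs_multi_read(fuse_confidence(1.0, 0.92, 0.75), "passing", "passing"))
```

Because the scale factors are recomputed from each frame's visible rect, the same function covers Retina and non-Retina displays without special cases.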
- ScreenCaptureKit replaces deprecated CGWindow and supplies per-frame geometry
- Playwright locators stabilize DOM reads; bbox semantics are documented
- OCR is local and used only when structure is unavailable
- JSON Patch + JSON-seq keep OA bandwidth minimal and traceable
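To show why deltas stay small, here is a minimal sketch of diffing two fact snapshots into RFC-6902 operations. It assumes facts are held as a flat dictionary keyed by names such as `ci.status`; that layout and the `diff_facts` helper are hypothetical and may not match how `parserd` stores facts.

```python
from typing import Any, Dict, List


def _pointer(key: str) -> str:
    """Build an RFC 6901 JSON Pointer for a top-level key ('~' and '/' escaped)."""
    return "/" + key.replace("~", "~0").replace("/", "~1")


def diff_facts(old: Dict[str, Any], new: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return RFC-6902 operations that turn `old` into `new`.

    Only top-level keys are compared; unchanged keys produce no operation,
    so an unchanged snapshot yields an empty patch.
    """
    ops: List[Dict[str, Any]] = []
    for key in sorted(old.keys() | new.keys()):  # sorted for stable output
        if key not in new:
            ops.append({"op": "remove", "path": _pointer(key)})
        elif key not in old:
            ops.append({"op": "add", "path": _pointer(key), "value": new[key]})
        elif old[key] != new[key]:
            ops.append({"op": "replace", "path": _pointer(key), "value": new[key]})
    return ops


if __name__ == "__main__":
    before = {"ci.status": "pending", "ci.failed": 0}
    after = {"ci.status": "failure", "ci.failed": 2, "pr.mergeable": False}
    for op in diff_facts(before, after):
        print(op)
```

An empty result means nothing changed, which is the signal to emit nothing at all.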
- `vision/visiond-swift`: Swift daemon using ScreenCaptureKit to resolve panes, map geometry, capture frames, and serve HTTP
- `vision/dom-bridge`: Playwright-based DOM bridge that returns stable viewport bboxes and text for web panes
- `vision/parserd`: Python service that fuses AX/DOM/OCR, parses deterministic facts, and emits JSON Patch/JSON-seq
- `vision/schemas`: JSON Schemas validating targets and fact payloads
- `vision/tests`: fixtures, goldens, and a smoke test harness
- Bootstrap
  - `bash vision/scripts/bootstrap_mac.sh`
  - Grant Screen Recording and Accessibility permissions (see `vision/scripts/permissions.md`)
- DOM auth (optional, for GitHub panes)
  - `cd vision/dom-bridge && ./scripts/bootstrap_dom_bridge.sh && npm run login`
- Run the stack
  - `bash vision/scripts/run_all.sh`
- `visiond` (Swift)
  - `GET /healthz`: stream status (fps, latency, engine mix)
  - `POST /capture_once {pane_id}`: returns sensors + OCR tokens + optional PNG
- `parserd` (Python)
  - `GET /healthz`: parse/emit metrics
  - `POST /analyze_once {pane_id}`: returns `{facts, confidence, observation}`
  - `POST /watch {pane_id,fps}`: JSON Text Sequence (default) or SSE stream of deltas
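A client call to `POST /analyze_once` might look like the sketch below, using only the Python standard library. The base URL and port, the JSON request body shape `{"pane_id": ...}`, and the helper names are assumptions; check `vision/README.md` for the actual local API details.

```python
import json
import urllib.request

# Hypothetical address; the real host and port are not stated in this README.
BASE_URL = "http://127.0.0.1:8000"


def build_analyze_request(pane_id: str,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /analyze_once request.

    Assumes the service accepts a JSON body of the form {"pane_id": "..."}.
    """
    body = json.dumps({"pane_id": pane_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze_once",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_once(pane_id: str, base_url: str = BASE_URL,
                 timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_analyze_request(pane_id, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running parserd at BASE_URL.
    result = analyze_once("CI_SUMMARY")
    print(result.get("facts"), result.get("confidence"))
```

Splitting request construction from sending keeps the payload easy to inspect without a running service.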
- Facts: typed, source-specific fields (e.g., `ci.status`, `pr.mergeable`)
- Deltas: RFC-6902 JSON Patch, emitted only when fields change
- Streaming: RFC-7464 JSON Text Sequences for tailing and low-latency orchestration
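For reference, RFC 7464 framing is simple: each JSON text is preceded by an ASCII Record Separator (0x1E) and followed by a line feed. The sketch below encodes and decodes such a sequence from an in-memory buffer. The function names are illustrative, and a real tailing client would decode incrementally rather than from a complete byte string.

```python
import json
from typing import Any, Iterable, Iterator

RS = b"\x1e"  # ASCII Record Separator, starts every record
LF = b"\n"    # line feed, ends every record


def encode_json_seq(items: Iterable[Any]) -> bytes:
    """Frame each item as RS + compact JSON + LF."""
    return b"".join(
        RS + json.dumps(item, separators=(",", ":")).encode("utf-8") + LF
        for item in items
    )


def decode_json_seq(data: bytes) -> Iterator[Any]:
    """Yield each parsable record; skip truncated or corrupt ones."""
    for chunk in data.split(RS):
        if not chunk.strip():
            continue  # empty segment before the first RS or between separators
        try:
            yield json.loads(chunk.decode("utf-8"))
        except ValueError:
            continue  # covers JSON and UTF-8 decode errors


if __name__ == "__main__":
    patches = [
        [{"op": "replace", "path": "/ci.status", "value": "failure"}],
        [{"op": "add", "path": "/pr.mergeable", "value": False}],
    ]
    wire = encode_json_seq(patches)
    print(list(decode_json_seq(wire)) == patches)
```

Because every record starts with RS, a reader that joins mid-stream or sees a truncated record can resynchronize at the next separator instead of failing the whole stream.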
- Prefer AX/DOM over OCR; treat OCR as a last resort
- Use role/test-id locators before raw CSS/text to resist UI drift
- Capture fixtures for light/dark themes and minor zoom variations (0.9–1.1) to harden parsers
- Expand pane coverage with fixtures and goldens
- Wire OA webhook (`OA_WEBHOOK_URL`) and merge gating policies
- Add CI to lint Swift/Python/Node and to replay fixtures for regression
- `vision/README.md` for component-level details and local APIs
- `docs/architecture.md` for diagrams and data-path specifics
- Apache License 2.0. See `LICENSE` and `NOTICE`.