xpath: implement XPath 1.0 (Document.evaluate, XPathResult, DOM.performSearch) by navidemad · Pull Request #2305 · lightpanda-io/browser

navidemad · 2026-04-28T17:20:51Z

Summary

Ports the capybara-lightpanda XPath 1.0 polyfill into Lightpanda's Zig codebase. Exposes the WHATWG Document.evaluate / XPathResult / XPathEvaluator / XPathExpression surface and routes CDP DOM.performSearch XPath queries through the new evaluator. Motivation: the capybara-lightpanda gem currently injects a ~700-line JS polyfill on every page load to make document.evaluate work; this PR moves that into the engine so the gem can drop the polyfill in its next release.

Full behavior spec + the 91-case acceptance battery this port implements: capybara-lightpanda/XPATH_COMPLIANCE.md.

Note

Downstream coordination: capybara-lightpanda will drop its ~700-line JS polyfill in the same release that bumps MINIMUM_NIGHTLY_BUILD past the merge build of this PR. No action needed from this repo's reviewers — flagged for visibility.

Scope

New src/browser/xpath/ module: Tokenizer, Parser, Ast, Evaluator, Functions, Result (~3,000 LOC).
New src/browser/webapi/{XPathResult,XPathExpression,XPathEvaluator}.zig WHATWG types (~470 LOC).
Document.evaluate, Document.createExpression, Document.createNSResolver wired via the JS bridge.
DOM.performSearch heuristic-based XPath/CSS branching — gives Playwright/Puppeteer/Capybara XPath-via-CDP for free.
Behavior fixtures: 91-case conformance battery, result-type surface, evaluator-API surface, CDP perform-search.

`Document.evaluate` flow

sequenceDiagram
    participant JS
    participant Document
    participant XPathResult
    participant Parser
    participant Evaluator

    JS->>Document: evaluate(expr, ctx, resolver, type, result)
    Document->>XPathResult: fromExpression(expr, ctx, type, frame)
    XPathResult->>Parser: parse(arena, expr)
    Parser-->>XPathResult: *Ast.Expr
    XPathResult->>Evaluator: evaluate(arena, frame, ast, ctx)
    Evaluator-->>XPathResult: Result.Result
    XPathResult-->>JS: XPathResult { resultType, ... }

`DOM.performSearch` query-type detection

flowchart LR
    A[CDP query] --> B{isXPathQuery?}
    B -->|path operator or :: axis| C[xpath.searchAll]
    B -->|otherwise| D[Selector.querySelectorAll]
    C --> E[finishSearch]
    D --> E
    E --> F[CDP result: searchId, resultCount]

The full heuristic: query starts with /, .//, (/, (./, or contains :: substring (axis specifier) → XPath; otherwise CSS.
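The heuristic in the paragraph above can be sketched in a few lines of Zig. This is a minimal illustration of the rule as stated there, not the code from the PR: the function name `looksLikeXPath` is hypothetical, and the sketch does no whitespace trimming or axis-name validation.

```zig
const std = @import("std");

/// Hypothetical stand-in for the dispatch check: treat the query as XPath
/// when it starts with a path operator or contains "::" anywhere.
fn looksLikeXPath(query: []const u8) bool {
    // "/" also covers queries that start with "//".
    const prefixes = [_][]const u8{ "/", ".//", "(/", "(./" };
    for (prefixes) |prefix| {
        if (std.mem.startsWith(u8, query, prefix)) return true;
    }
    // Bare substring check, exactly as the paragraph describes it.
    return std.mem.indexOf(u8, query, "::") != null;
}

test "prefix and axis heuristic" {
    try std.testing.expect(looksLikeXPath("//*[@id='x']"));
    try std.testing.expect(looksLikeXPath(".//span"));
    try std.testing.expect(looksLikeXPath("(./bar)[2]"));
    try std.testing.expect(looksLikeXPath("descendant::p"));
    try std.testing.expect(!looksLikeXPath("div.alpha > span"));
}
```

Everything that fails both checks is handed to the CSS selector engine, so a false positive here means a valid CSS query gets evaluated as XPath.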

Coverage

Full table in XPATH_COMPLIANCE.md.

| Area | Implemented | Stubs |
| --- | --- | --- |
| Path expressions | 11/11 | — |
| Axes | 11/12 | `namespace::` (returns `[]`) |
| Node tests | 4 type tests + name + wildcard | `processing-instruction('name')` consumes target literal but matches any PI |
| Operators | union, `or`/`and`, `=`/`!=`, `<`/`<=`/`>`/`>=`, `+`/`-`, `*`/`div`/`mod`, unary `-` | — |
| Node-set fns | `last`, `position`, `count`, `id`, `local-name`, `name` | `namespace-uri()` always `""` |
| String fns | `string`, `concat`, `starts-with`, `contains`, `substring-before`, `substring-after`, `substring`, `string-length`, `normalize-space`, `translate` | — |
| Boolean fns | `boolean`, `not`, `true`, `false` | `lang()` always `false` |
| Number fns | `number`, `sum`, `floor`, `ceiling`, `round` | — |
| Variable refs | parsed | `$name` always `""` |

Intentional stubs (HTML pragmatism — polyfill parity)

These match the prior capybara-lightpanda polyfill, which is the original motivation for this PR (HTML pragmatism over strict XPath 1.0):

lang(string) → always false
namespace:: axis → always []
namespace-uri() → always ""
processing-instruction('target') → target literal consumed but matches any PI
name() / local-name() → lowercased (HTML-friendly, not source-case as the spec demands)
Variable references $name → always ""
Tokenizer skips unknown chars silently
DOM.performSearch top-level scalar XPath → empty (matches the polyfill's xpathFind semantics — the API is for finding nodes)

Test plan

make test — 601/601 pass
30 Tokenizer + 40 Parser (parametric over the 91-case battery) + 11 Evaluator + 15 Functions + 9 Result Zig unit tests
4 HTML behavior fixtures: document_evaluate.html, xpath_conformance.html (91 cases), xpath_result.html, xpath_evaluator.html
CDP fixture perform_search_xpath.html + Zig test driving 5 query shapes
Custom test runner detects leaks under debug build; clean run confirms no leaks

karlseguin

My biggest concern is about the performance of common queries, especially with respect to memory usage. I'll be honest, I only have a high level understanding of xpath and of your implementation. In our CSS selector, we apply some optimizations, including mostly doing right-to-left processing, to minimize the in-memory size. Claude informs me that probably doesn't work so well with xpath since they simply allow for a much broader set of instructions.

I also don't want to make the PR or the code too complicated. So tell me what you think of these possible optimizations. I don't think any of these are mutually exclusive, but possibly some of them don't make sense or are impractical.

1 - Optimizing for common shapes that have easy optimizations. e.g. //foo, //foo[@id='x'] and //*[@id='x']

2 - right-to-left processing where predicates can be evaluated independently per node, without knowing their position, e.g. //foo[bar]

3 - Introducing a re-usable arena so that scratch allocations made within each step, or ideally in the per-ctx inner loop of a step, don't go into the main arena. The arena can be reset between each iteration.

4 - Using an iterator for the axis (rather than collecting them in memory).
- Allows for further optimizations (early exit)
- last() adds complexity to this, but it can be detected at parse time, and that could trigger materialization while still exposing a consistent iterator API to everything.

5 - Whether it's an iterator or we continue to collect the full axis, there are common cases where the dedupe list is already in document order. This can be detected (not by iterating the list, but based on the xpath itself) and, when true, dedupe.keys() does not need to be sorted.

karlseguin · 2026-05-06T08:11:38Z

+
+/// Public entry. Returns the AST's value; node-sets are sorted into
+/// document order before return per XPath spec §3.3.
+pub fn evaluate(arena: Allocator, frame: *Frame, expr: *const Ast.Expr, context_node: *Node) Error!Result.Result {


frame is generally the last parameter. There's no strict reason for this, but since the js bridge enforces this for webapis, it's something we tend to follow throughout.

Done — moved frame to last on Evaluator.evaluate, Evaluator.searchAll, Functions.call, and Functions.idFn. Call sites updated in webapi/XPath{Result,Expression}.zig and cdp/domains/dom.zig.

navidemad · 2026-05-06T17:19:40Z

@karlseguin Thanks — spent some time on measurement and Point 1 (the cheapest, highest-value lever from your list). Recap below, then numbers.

Benchmark fixture

Added src/browser/tests/xpath/xpath_perf.html: deterministic 500-node DOM (mix of div/span/p × alpha/beta/gamma, decorrelated periods so //div[@class='alpha'] isn't a degenerate restatement of //div), 50 iterations per query after a 3-iter warmup, asserts result counts so a regression in correctness can't hide behind a timing line.

Run: TEST_VERBOSE=false TEST_FILTER="XPath perf" zig build test.

Point 1 — id-lookup fast path

tryIdLookupFastPath at the top of Evaluator.evalPath. Recognizes:

//tag[@id='x']    →  ds::node() / child::tag[@id='literal']
.//tag[@id='x']   →  self::node() / ds::node() / child::tag[@id='literal']

…and serves them via frame.getElementByIdFromNode (case-insensitive tag check, containment check for relative paths, accepts the literal on either side of =). Falls through to the general path on any deviation: extra step, extra predicate, non-eq, non-literal RHS, or unresolvable search root.

Inherits the same compromise webapi/selector/List.zig:optimizeSelector already ships for querySelector(All) — the id-map only stores the first element per ID in document order, so a page with duplicate IDs returns one match where a strict tree walk would find all. Capybara/Selenium hot paths assume unique IDs and the same compromise has been deployed in CSS for years.
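To make the shape check concrete, here is a minimal Zig sketch of recognizing the two lowered forms listed above. All types (`Step`, `NodeTest`, `Operand`, `Predicate`, `IdLookup`) and the function `matchIdLookup` are hypothetical stand-ins invented for illustration; the PR's real AST and its `tryIdLookupFastPath` are not shown in this thread and will differ. The sketch only decides whether the path qualifies; the lookup itself, the case-insensitive tag check and the containment check are left to the caller.

```zig
const std = @import("std");

// Hypothetical, heavily simplified AST (not the PR's real types).
const Axis = enum { self, child, descendant_or_self, other };
const NodeTest = union(enum) { any_node, wildcard, name: []const u8 };
const Operand = union(enum) { attribute: []const u8, literal: []const u8, other };
const Predicate = struct { is_equality: bool, lhs: Operand, rhs: Operand };
const Step = struct {
    axis: Axis,
    node_test: NodeTest,
    predicates: []const Predicate = &[_]Predicate{},
};

const IdLookup = struct { tag: ?[]const u8, id: []const u8, relative: bool };

/// Returns the id literal of `@id = 'x'` or `'x' = @id`, else null.
fn idLiteral(p: Predicate) ?[]const u8 {
    if (!p.is_equality) return null;
    if (p.lhs == .attribute and p.rhs == .literal and
        std.mem.eql(u8, p.lhs.attribute, "id")) return p.rhs.literal;
    if (p.rhs == .attribute and p.lhs == .literal and
        std.mem.eql(u8, p.rhs.attribute, "id")) return p.lhs.literal;
    return null;
}

/// Matches [self::node() /] descendant-or-self::node() / child::T[@id='lit'].
/// Any deviation returns null so the caller can use the general path.
fn matchIdLookup(steps: []const Step) ?IdLookup {
    var rest = steps;
    var relative = false;
    if (rest.len == 3) {
        const first = rest[0];
        if (first.axis != .self or first.node_test != .any_node or
            first.predicates.len != 0) return null;
        relative = true;
        rest = rest[1..];
    }
    if (rest.len != 2) return null;
    const ds = rest[0];
    if (ds.axis != .descendant_or_self or ds.node_test != .any_node or
        ds.predicates.len != 0) return null;
    const last = rest[1];
    if (last.axis != .child or last.predicates.len != 1) return null;
    const tag: ?[]const u8 = switch (last.node_test) {
        .name => |n| n,
        .wildcard => null, // `*` matches any tag
        .any_node => return null,
    };
    const id = idLiteral(last.predicates[0]) orelse return null;
    return IdLookup{ .tag = tag, .id = id, .relative = relative };
}

test "recognizes //span[@id='x']" {
    const pred = Predicate{
        .is_equality = true,
        .lhs = .{ .attribute = "id" },
        .rhs = .{ .literal = "x" },
    };
    const steps = [_]Step{
        .{ .axis = .descendant_or_self, .node_test = .any_node },
        .{ .axis = .child, .node_test = .{ .name = "span" }, .predicates = &[_]Predicate{pred} },
    };
    const m = matchIdLookup(&steps).?;
    try std.testing.expectEqualStrings("x", m.id);
    try std.testing.expectEqualStrings("span", m.tag.?);
    try std.testing.expect(!m.relative);
}
```

Returning null on every mismatch keeps the fast path purely additive: a query that is not recognized behaves exactly as it did before.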

Results — debug build, 500-node tree, 50 iters/query

| label | baseline µs | after µs | speedup |
| --- | --- | --- | --- |
| `//*[@id='target']` | 3231 | 22.6 | 143× |
| `//span[@id='target']` (hit) | 2931 | 22.4 | 131× |
| `//div[@id='target']` (miss, tag mismatch) | 2089 | 21.7 | 96× |
| `//div` | 2964 | 1133 | flat* |
| `//span` | 2659 | 3354 | flat* |
| `//*` | 8986 | 6595 | flat* |
| `//*[@class='alpha']` | 7210 | 5751 | flat* |
| `//div[@class='alpha']` | 3128 | 3272 | flat |
| `(//div)[1]` | 2631 | 3518 | flat* |
| `(//div)[last()]` | 2331 | 2561 | flat |
| `//div[contains(@class,'alpha')]` | 4005 | 4390 | flat |
| `//div[starts-with(@id,'n')]` | 4637 | 4724 | flat |
| `count(//div)` | 2880 | 1962 | flat* |

* ±30% across runs (debug build, 50 iters). Fast path falls through immediately for these shapes — any movement is system noise. The 22µs floor on ID lookups is parser + bridge round-trip; at that point we're measuring V8 ↔ Zig overhead, not XPath work.

All 79 XPath tests still pass, including cases the fast path correctly rejects at the predicate-shape check (e.g. //*[@id='heading' and @class='primary']).

Next

Point 4 (iterator axes with parse-time needs_size detection): folds in Point 2's logic and has the biggest payoff for (//x)[1] / count(//x) / //foo[bar]. Worth it if the noise-level numbers above bother you in real Capybara runs. The biggest remaining cost is //*[@class='x'] (~7ms) and //* (~9ms), both of which touch the full tree.
Point 3 (per-step scratch arena): defer until profilers point at peak alloc.
Point 5 (skip dedupe sort when already in doc order): skip. Detection is fragile (single ctx + all forward axes + no overlapping subtrees) and the savings are bounded by sort cost, which is sub-ms in practice.

navidemad · 2026-05-06T17:43:31Z

Update — pushed f61be05f extending the Point 1 fast path to non-positional descendant queries.

tryFusedDescendantFastPath accepts the same AST shape (//<test> / .//<test> lowering) but allows any boolean/node-set predicate that doesn't depend on outer position. Walks the search root's descendants once in document order, applies the node test + predicates inline. No per-step axis materialization, no dedup hash map (single-context forward walk preserves doc order).

Safety gate rejects predicates whose top-level expression is numeric (number literal, neg, arithmetic binop, numeric-returning fn-call) and any predicate containing position() or last() anywhere. Conservative — a nested sub-path's local positional predicate ([bar[1]]) is rejected even though [1] is scoped to bar's axis. Easy to relax later if it shows up in real usage.
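A rough sketch of such a safety gate, assuming a much smaller AST than the real one. The `Expr`, `Binop`, `Call` and `Op` types and the function names are hypothetical; in particular, representing a sub-path as just the list of its predicates is a simplification made for this example. The list of numeric-returning functions is the XPath 1.0 core set.

```zig
const std = @import("std");

// Hypothetical miniature AST, just enough to express the gate.
const Op = enum { add, sub, mul, div, mod, eq, ne, lt, le, gt, ge, logical_and, logical_or, set_union };
const Binop = struct { op: Op, lhs: *const Expr, rhs: *const Expr };
const Call = struct { name: []const u8, args: []const Expr };
const Expr = union(enum) {
    number: f64,
    string: []const u8,
    neg: *const Expr,
    binop: Binop,
    call: Call,
    /// A nested sub-path, reduced here to the predicates it carries.
    path: []const Expr,
};

// XPath 1.0 core functions whose result type is number.
const numeric_fns = std.StaticStringMap(void).initComptime(.{
    .{"last"},  .{"position"}, .{"count"},   .{"string-length"}, .{"number"},
    .{"sum"},   .{"floor"},    .{"ceiling"}, .{"round"},
});

fn isArithmetic(op: Op) bool {
    return switch (op) {
        .add, .sub, .mul, .div, .mod => true,
        else => false,
    };
}

/// A numeric predicate value means "position() = value", so it is positional.
fn topLevelIsNumeric(expr: Expr) bool {
    return switch (expr) {
        .number, .neg => true,
        .binop => |b| isArithmetic(b.op),
        .call => |c| numeric_fns.has(c.name),
        .string, .path => false,
    };
}

/// True if position()/last() or a nested positional predicate appears anywhere.
fn usesPosition(expr: Expr) bool {
    switch (expr) {
        .number, .string => return false,
        .neg => |inner| return usesPosition(inner.*),
        .binop => |b| return usesPosition(b.lhs.*) or usesPosition(b.rhs.*),
        .call => |c| {
            if (std.mem.eql(u8, c.name, "position") or std.mem.eql(u8, c.name, "last")) return true;
            for (c.args) |arg| {
                if (usesPosition(arg)) return true;
            }
            return false;
        },
        .path => |preds| {
            // Conservative: a nested [1] is rejected even though it is local.
            for (preds) |p| {
                if (topLevelIsNumeric(p) or usesPosition(p)) return true;
            }
            return false;
        },
    }
}

fn isFusablePredicate(expr: Expr) bool {
    return !topLevelIsNumeric(expr) and !usesPosition(expr);
}

test "safety gate" {
    const attr = Expr{ .path = &[_]Expr{} };
    const lit = Expr{ .string = "alpha" };
    const eq = Expr{ .binop = .{ .op = .eq, .lhs = &attr, .rhs = &lit } };
    try std.testing.expect(isFusablePredicate(eq)); // [@class='alpha']

    const one = Expr{ .number = 1 };
    try std.testing.expect(!isFusablePredicate(one)); // [1]

    const pos = Expr{ .call = .{ .name = "position", .args = &[_]Expr{} } };
    const two = Expr{ .number = 2 };
    const gt = Expr{ .binop = .{ .op = .gt, .lhs = &pos, .rhs = &two } };
    try std.testing.expect(!isFusablePredicate(gt)); // [position() > 2]

    const bar = Expr{ .path = &[_]Expr{.{ .number = 1 }} };
    try std.testing.expect(!isFusablePredicate(bar)); // [bar[1]]
}
```

Erring on the side of rejection is cheap here, because a rejected predicate only costs the speedup and never changes the result.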

tryIdLookupFastPath refactored to share matchDescendantPathShape with the new fast path; behavior unchanged.

Numbers (debug build, 500-node tree, 50 iters/query)

| label | baseline (µs) | after Point 1 (µs) | after fused (µs) | speedup |
| --- | --- | --- | --- | --- |
| `//*[@id='target']` | 3231 | 22.6 | 22.6 | 143× |
| `//span[@id='target']` (hit) | 2931 | 22.4 | 22.8 | 129× |
| `//div[@id='target']` (tag miss) | 2089 | 21.7 | 21.7 | 96× |
| `//div` | 2964 | 1133 | 386 | 8× |
| `//span` | 2659 | 3354 | 384 | 7× |
| `//*` | 8986 | 6595 | 1018 | 9× |
| `//*[@class='alpha']` | 7210 | 5751 | 1498 | 5× |
| `//div[@class='alpha']` | 3128 | 3272 | 625 | 5× |
| `(//div)[1]` | 2631 | 3518 | 87 | 30× |
| `(//div)[last()]` | 2331 | 2561 | 94 | 25× |
| `//div[contains(@class,'alpha')]` | 4005 | 4390 | 704 | 6× |
| `//div[starts-with(@id,'n')]` | 4637 | 4724 | 794 | 6× |
| `count(//div)` | 2880 | 1962 | 90 | 32× |

(//div)[1], (//div)[last()], and count(//div) pick up the speedup transitively — //div is now a fused-walk shape, so the inner expression is fast even though the outer filter/count is unchanged.

All 79 XPath tests still pass. Conformance cases that correctly fall through to the general path because their predicates are positional or reference position()/last(): //li[1], //li[last()], //li[position() > 2], //li[position() mod 2 = 1], (//section)[2].

navidemad · 2026-05-06T17:52:05Z

Two commits in, the fast paths land 5–143× wins across the benchmark — see the previous comment for the full table. The fused-walk extension absorbed most of what I'd originally pegged for Point 4, since (//x)[1] / count(//x) benefit transitively from the inner path being fast. Points 2, 4, and 5 from your list are effectively done or skipped.

One remaining lever I'd consider: Point 3 (per-predicate-iteration scratch arena). Worth more than I implied earlier — the XPath result arena (XPathResult._arena) lives until JS releases the result via XPathResult.deinit, not just for the duration of the call, so peak per-call allocation persists across calls if a script holds results. Point 3 would target that. Structurally more invasive than the surgical shape-detection though: need to thread a separate scratch arena through evalStep / predicate evaluation.

@karlseguin How do you want to proceed?

Land as-is, file Point 3 as a separate follow-up.
Push deeper on this branch — Point 3 before merge, bigger diff but more complete perf coverage.
Re-run the benchmark in release mode first to see if the gains are still material outside debug builds.

Lightly leaning toward the first option — your original framing was "common shapes that have easy optimizations" and we're past that. But happy to keep going if you'd rather not bifurcate the work.

karlseguin · 2026-05-08T00:31:08Z

We can merge this as-is. I do think a follow up to leverage a scratch arena is worth it, those temp allocations can be quite large (entire DOM tree) and they're going to live until v8 frees them - which might be only on page end. While the new fast paths should help, I imagine some scripts do a lot of these xpath queries to fetch various fields and it could add up.

@krichprollsch still want to review it more?

karlseguin · 2026-05-08T00:52:37Z

+    try std.testing.expect(isXPathQuery("(./bar)[2]"));
+    try std.testing.expect(isXPathQuery("descendant::p"));
+    try std.testing.expect(isXPathQuery("ancestor-or-self::*"));
+    try std.testing.expect(isXPathQuery("//*[@id='x']"));


I think if you add cases like a::before it'll fail. You should use a StaticStringMap and match against the specific axis names.

Good catch — a::before did indeed pass (a matches [a-zA-Z-]). Fixed in d4de5e6: walks back the identifier run before :: and looks it up in a StaticStringMap of the 13 XPath 1.0 axis names. Test now also covers a::before, div::after, p::first-line, input::placeholder, and [data-x="x::y"].
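The walk-back-and-lookup approach described in this reply might look roughly like the following. It is a sketch under assumptions, not the code from d4de5e6: the names `axis_names`, `isAxisChar` and `hasAxisSpecifier` are made up for the example, while `std.StaticStringMap` and `std.mem.indexOfPos` are standard-library APIs. The 13 entries are the axis names defined by XPath 1.0.

```zig
const std = @import("std");

// The 13 XPath 1.0 axis names; V is void, so only keys are listed.
const axis_names = std.StaticStringMap(void).initComptime(.{
    .{"ancestor"},          .{"ancestor-or-self"},   .{"attribute"},
    .{"child"},             .{"descendant"},         .{"descendant-or-self"},
    .{"following"},         .{"following-sibling"},  .{"namespace"},
    .{"parent"},            .{"preceding"},          .{"preceding-sibling"},
    .{"self"},
});

fn isAxisChar(c: u8) bool {
    return std.ascii.isAlphabetic(c) or c == '-';
}

/// True if some "::" in the query is directly preceded by a real axis name.
fn hasAxisSpecifier(query: []const u8) bool {
    var from: usize = 0;
    while (std.mem.indexOfPos(u8, query, from, "::")) |pos| {
        // Walk back over the identifier run that ends at "::".
        var start = pos;
        while (start > 0 and isAxisChar(query[start - 1])) start -= 1;
        if (axis_names.has(query[start..pos])) return true;
        from = pos + 2;
    }
    return false;
}

test "axis names match, CSS pseudo-elements do not" {
    try std.testing.expect(hasAxisSpecifier("descendant::p"));
    try std.testing.expect(hasAxisSpecifier("ancestor-or-self::*"));
    try std.testing.expect(!hasAxisSpecifier("a::before"));
    try std.testing.expect(!hasAxisSpecifier("div::after"));
    try std.testing.expect(!hasAxisSpecifier("p::first-line"));
    try std.testing.expect(!hasAxisSpecifier("input::placeholder"));
    try std.testing.expect(!hasAxisSpecifier("[data-x=\"x::y\"]"));
}
```

A comptime string map keeps the check allocation-free, and because the lookup is an exact match, an identifier that merely looks like an axis name is not accepted.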

Ports the capybara-lightpanda XPath 1.0 polyfill into Lightpanda. Exposes the WHATWG Document.evaluate / XPathResult / XPathEvaluator / XPathExpression surface and routes CDP DOM.performSearch XPath queries through the new evaluator. The capybara-lightpanda gem can drop its ~700-line JS polyfill in the next release. New module src/browser/xpath/ (Tokenizer, Parser, Ast, Evaluator, Functions, Result). New webapi types XPathResult, XPathExpression, XPathEvaluator. Coverage and stubs match the polyfill 1:1 — see capybara-lightpanda/XPATH_COMPLIANCE.md for the full spec. Tests: 91-case conformance + result-API + evaluator-API + CDP fixtures, plus the engine's Zig unit suite (601/601 pass).

The Parser borrows string slices from its input for AST literals, names, and var refs. Without duping, the AST holds slices into the JS call_arena, which is reset when the top-level call returns — every subsequent evaluate() of a cached XPathExpression would dereference freed memory.

A bare indexOf("::") matched CSS pseudo-elements (a::before) and attribute values containing '::' ([data-x="x::y"]), misrouting them to the XPath evaluator. Require an axis-name shape ([a-zA-Z-]) immediately before '::' so only real axis specifiers like descendant::p are dispatched to XPath.

The attribute axis was calling Entry.toAttribute on every visit, materializing fresh *Attribute structs (plus duped name/value strings) into page-lifetime storage. Repeated XPath queries (the Capybara/Selenium polling pattern this PR targets) accumulated unbounded copies for the same DOM entries. Route through frame._attribute_lookup so each Entry resolves to a single cached *Attribute, matching List.getAttribute and NamedNodeMap.getAtIndex.

Per XPath 1.0 §5.7, the data model has no CDATASection node — CDATA content is part of the text node value. The text() node test was only matching DOM nodeType 3 (Text), silently excluding CDATA sections (nodeType 4) parsed via DOMParser/XMLDocument and inline foreign content like SVG with embedded scripts.

- Rename Result.zig / Ast.zig / Functions.zig to snake_case (no top-level fields per Zig style guide)
- Restructure imports across xpath module: lib (std/lp) → relative (further → nearer) → aliases
- Move `frame` to last parameter on Evaluator.evaluate, searchAll, Functions.call, idFn (matches js bridge convention); call sites updated in webapi/XPath{Result,Expression}.zig and cdp/domains/dom.zig
- Local-pos style in XPathResult.iterateNext

- Document.evaluate / XPathEvaluator.evaluate / XPathExpression.evaluate: result_type / requested_type now optional u16 defaulting to ANY_TYPE (matches WHATWG: `optional unsigned short type = 0`). context_node stays nullable with a fallback to the document, which preserves the polyfill's behavior asserted by the `default_context` fixture
- ast.zig NodeTest: clarify that namespaced names (`prefix:*`, `prefix:local`) are stored verbatim and fall through to a literal match against the node name, consistent with the `namespace::` axis stub (decision lightpanda-io#3). Adds a TODO for if the polyfill ever drops the stub
- Parser: cap recursive descent at depth 64 with new error.MaxDepthExceeded; depth tracked across parseExpr (parens, predicates, function args) and parseUnaryExpr (chained `-`). Two regression tests cover deep parenthesization and deep unary minus
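The depth cap on recursive descent can be illustrated with a toy grammar. This sketch is an assumption-laden simplification: the `Parser` struct, its `enter`/`leave` helpers and the three-rule grammar are invented for the example, and it folds parenthesized and unary-minus recursion into one function where the PR tracks depth across parseExpr and parseUnaryExpr. Only the limit of 64 and the error name `MaxDepthExceeded` come from the commit message.

```zig
const std = @import("std");

const max_depth = 64;
const ParseError = error{ MaxDepthExceeded, UnexpectedToken };

/// Toy recognizer for: expr := '-' expr | '(' expr ')' | digit+
const Parser = struct {
    input: []const u8,
    pos: usize = 0,
    depth: usize = 0,

    fn enter(self: *Parser) ParseError!void {
        if (self.depth >= max_depth) return error.MaxDepthExceeded;
        self.depth += 1;
    }

    fn leave(self: *Parser) void {
        self.depth -= 1;
    }

    fn parseExpr(self: *Parser) ParseError!void {
        // Every recursive entry counts against the cap; leave() undoes it.
        try self.enter();
        defer self.leave();

        if (self.pos >= self.input.len) return error.UnexpectedToken;
        switch (self.input[self.pos]) {
            '-' => {
                self.pos += 1;
                try self.parseExpr();
            },
            '(' => {
                self.pos += 1;
                try self.parseExpr();
                if (self.pos >= self.input.len or self.input[self.pos] != ')')
                    return error.UnexpectedToken;
                self.pos += 1;
            },
            '0'...'9' => {
                while (self.pos < self.input.len and std.ascii.isDigit(self.input[self.pos]))
                    self.pos += 1;
            },
            else => return error.UnexpectedToken,
        }
    }
};

test "shallow input parses, deep nesting is rejected" {
    var ok = Parser{ .input = "-(-(42))" };
    try ok.parseExpr();

    const deep_parens = ("(" ** 100) ++ "1" ++ (")" ** 100);
    var p1 = Parser{ .input = deep_parens };
    try std.testing.expectError(error.MaxDepthExceeded, p1.parseExpr());

    const deep_minus = ("-" ** 100) ++ "1";
    var p2 = Parser{ .input = deep_minus };
    try std.testing.expectError(error.MaxDepthExceeded, p2.parseExpr());
}
```

Failing with a dedicated error keeps a hostile or malformed expression from exhausting the native stack, and the caller can still report it like any other parse failure.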

@id

evalPath recognizes //tag[@id='x'] and .//tag[@id='x'] (plus the //*[@id='x'] wildcard) and serves them via frame.getElementByIdFromNode. ~100-150x speedup on ID lookups (3231us -> 22.6us for //*[@id='target'] in the new benchmark). Falls through to general path on any deviation (extra step, extra predicate, non-eq, non-literal RHS). Inherits the same duplicate-ID compromise selector/List.zig ships for querySelector(All): the id-map stores only the first element per ID in document order. Capybara/Selenium hot paths assume unique IDs. tests/xpath/xpath_perf.html is the 13-query micro-benchmark used to collect the numbers; batched console.warn output survives test runner interleaving.

@id

Generalizes 8733e33's //tag[@id='x'] shape: tryFusedDescendantFastPath handles any //tag[safe] or .//tag[safe] where the predicates are non-positional boolean/node-set checks. Walks the search root's descendants once in document order, applies node test + predicates inline, no per-step materialization, no dedup. 5-9x on //div, //*, //*[@class='x'], //div[contains(...)]; ~25x on (//div)[1] and count(//div) where the inner path is the shape. Safety gate rejects predicates that could produce a number at the top level (number, neg, arithmetic binop, numeric-returning fn-call) and any predicate containing position()/last() anywhere. Conservative: a nested sub-path's local positional predicate is rejected even though it's scoped to its own axis.

Disable xpath_perf benchmark from test run as it's quite verbose.

The previous `::` heuristic accepted any identifier-like character before `::`, which misrouted CSS pseudo-elements (`a::before`, `div::after`) to the XPath evaluator. Walk back the run of [a-zA-Z-] characters and look the candidate up in a StaticStringMap of the 13 XPath 1.0 named axes, so only real axis names match.

Strip mentions of the private gem and its internal paths from xpath module docstrings, the conformance test header, and the dom dispatch heuristic. Comments now describe behavior directly without pointing at sources public readers can't access.

krichprollsch · 2026-05-11T08:02:22Z

Thanks @navidemad for this hard work, it's a nice and useful addition to the browser 🙏
Thanks @karlseguin for the review 🙏

navidemad mentioned this pull request Apr 28, 2026

review: XPath 1.0 evaluator (mirror of upstream #2305) navidemad/browser#2

Closed

krichprollsch self-requested a review May 4, 2026 07:49

karlseguin reviewed May 6, 2026

karlseguin reviewed May 8, 2026

navidemad and others added 11 commits May 8, 2026 08:44

Naming convention fixes

9830da0

Disable xpath_perf benchmark from test run as it's quite verbose.

navidemad force-pushed the feat/xpath-1.0-evaluator branch from d4de5e6 to 0b0a34c Compare May 8, 2026 06:46

krichprollsch approved these changes May 11, 2026

krichprollsch merged commit d2151b6 into lightpanda-io:main May 11, 2026
35 checks passed

github-actions Bot locked and limited conversation to collaborators May 11, 2026
