
xpath: implement XPath 1.0 (Document.evaluate, XPathResult, DOM.performSearch) #2305

Merged: krichprollsch merged 12 commits into lightpanda-io:main from navidemad:feat/xpath-1.0-evaluator on May 11, 2026

@navidemad (Contributor) commented Apr 28, 2026

Summary

Ports the capybara-lightpanda XPath 1.0 polyfill into Lightpanda's Zig codebase. Exposes the WHATWG Document.evaluate / XPathResult / XPathEvaluator / XPathExpression surface and routes CDP DOM.performSearch XPath queries through the new evaluator. Motivation: the capybara-lightpanda gem currently injects a ~700-line JS polyfill on every page load to make document.evaluate work; this PR moves that into the engine so the gem can drop the polyfill in its next release.

Full behavior spec + the 91-case acceptance battery this port implements: capybara-lightpanda/XPATH_COMPLIANCE.md.

Note

Downstream coordination: capybara-lightpanda will drop its ~700-line JS polyfill in the same release that bumps MINIMUM_NIGHTLY_BUILD past the merge build of this PR. No action needed from this repo's reviewers — flagged for visibility.

Scope

  • New src/browser/xpath/ module: Tokenizer, Parser, Ast, Evaluator, Functions, Result (~3,000 LOC).
  • New src/browser/webapi/{XPathResult,XPathExpression,XPathEvaluator}.zig WHATWG types (~470 LOC).
  • Document.evaluate, Document.createExpression, Document.createNSResolver wired via the JS bridge.
  • DOM.performSearch heuristic-based XPath/CSS branching — gives Playwright/Puppeteer/Capybara XPath-via-CDP for free.
  • Behavior fixtures: 91-case conformance battery, result-type surface, evaluator-API surface, CDP perform-search.

Document.evaluate flow

sequenceDiagram
    participant JS
    participant Document
    participant XPathResult
    participant Parser
    participant Evaluator

    JS->>Document: evaluate(expr, ctx, resolver, type, result)
    Document->>XPathResult: fromExpression(expr, ctx, type, frame)
    XPathResult->>Parser: parse(arena, expr)
    Parser-->>XPathResult: *Ast.Expr
    XPathResult->>Evaluator: evaluate(arena, frame, ast, ctx)
    Evaluator-->>XPathResult: Result.Result
    XPathResult-->>JS: XPathResult { resultType, ... }

DOM.performSearch query-type detection

flowchart LR
    A[CDP query] --> B{isXPathQuery?}
    B -->|path operator or :: axis| C[xpath.searchAll]
    B -->|otherwise| D[Selector.querySelectorAll]
    C --> E[finishSearch]
    D --> E
    E --> F[CDP result: searchId, resultCount]

The full heuristic: query starts with /, .//, (/, (./, or contains :: substring (axis specifier) → XPath; otherwise CSS.
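The heuristic above can be sketched in a few lines. This is an illustrative Python model, not the Zig source; the name `looks_like_xpath` is hypothetical, and the sketch assumes the query is checked as given, with no trimming.

```python
# Illustrative model of the query-type heuristic described above.
# Prefixes that mark a query as XPath; "/" also covers "//".
XPATH_PREFIXES = ("/", ".//", "(/", "(./")


def looks_like_xpath(query: str) -> bool:
    """Return True when the query should go to the XPath evaluator."""
    return query.startswith(XPATH_PREFIXES) or "::" in query


for q in ["//div", ".//a", "(./bar)[2]", "descendant::p", "div.alpha", "#main > p"]:
    print(q, "->", "xpath" if looks_like_xpath(q) else "css")
```

Anything that neither starts with one of the path prefixes nor contains `::` is treated as a CSS selector.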

Coverage

Full table in XPATH_COMPLIANCE.md.

| Area | Implemented | Stubs |
| --- | --- | --- |
| Path expressions | 11/11 | |
| Axes | 11/12 | namespace:: (returns []) |
| Node tests | 4 type tests + name + wildcard | processing-instruction('name') consumes target literal but matches any PI |
| Operators | union, or/and, =/!=, </<=/>/>=, +/-, */div/mod, unary - | |
| Node-set fns | last, position, count, id, local-name, name | namespace-uri() always "" |
| String fns | string, concat, starts-with, contains, substring-before, substring-after, substring, string-length, normalize-space, translate | |
| Boolean fns | boolean, not, true, false | lang() always false |
| Number fns | number, sum, floor, ceiling, round | |
| Variable refs | parsed | $name always "" |

Intentional stubs (HTML pragmatism — polyfill parity)

These match the prior capybara-lightpanda polyfill, which is the original motivation for this PR (HTML pragmatism over strict XPath 1.0):

  • lang(string) → always false
  • namespace:: axis → always []
  • namespace-uri() → always ""
  • processing-instruction('target') → target literal consumed but matches any PI
  • name() / local-name() → lowercased (HTML-friendly, not source-case as the spec demands)
  • Variable references $name → always ""
  • Tokenizer skips unknown chars silently
  • DOM.performSearch top-level scalar XPath → empty (matches the polyfill's xpathFind semantics — the API is for finding nodes)

Test plan

  • make test — 601/601 pass
  • 30 Tokenizer + 40 Parser (parametric over the 91-case battery) + 11 Evaluator + 15 Functions + 9 Result Zig unit tests
  • 4 HTML behavior fixtures: document_evaluate.html, xpath_conformance.html (91 cases), xpath_result.html, xpath_evaluator.html
  • CDP fixture perform_search_xpath.html + Zig test driving 5 query shapes
  • Custom test runner detects leaks under debug build; clean run confirms no leaks

@karlseguin (Collaborator) left a comment

My biggest concern is about the performance of common queries, especially with respect to memory usage. I'll be honest, I only have a high level understanding of xpath and of your implementation. In our CSS selector, we apply some optimizations, including mostly doing right-to-left processing, to minimize the in-memory size. Claude informs me that probably doesn't work so well with xpath since they simply allow for a much broader set of instructions.

I also don't want to make the PR or the code too complicated. So tell me what you think of these possible optimizations. I don't think any of these are mutually exclusive, but possibly some of them don't make sense or are impractical.

1 - Optimizing for common shapes that have easy optimizations. e.g. //foo, //foo[@id='x'] and //*[@id='x']

2 - right-to-left processing where predicates can be evaluated independently per node, without knowing their position, e.g. //foo[bar]

3 - Introducing a re-usable arena so that scratch allocations made within each step, or ideally in the per-ctx inner loop of a step, don't go into the main arena. The arena can be reset between each iteration.

4 - Using an iterator for the axis (rather than collecting them in memory).
- Allows for further optimizations (early exit)
- last() adds complexity to this, but it can be detected at parse time, and that could trigger materialization while still exposing a consistent iterator API to everything.

5 - Whether it's an iterator or we continue to collect the full axis, there are common cases where the dedupe list is already in document order. This can be detected (not by iterating the list, but based on the xpath itself) and, when true, dedupe.keys() does not need to be sorted.

Comment thread src/browser/webapi/Document.zig
Comment thread src/browser/webapi/XPathResult.zig Outdated
Comment thread src/browser/webapi/XPathExpression.zig Outdated
Comment thread src/browser/xpath/Evaluator.zig
Comment thread src/browser/xpath/Evaluator.zig Outdated

/// Public entry. Returns the AST's value; node-sets are sorted into
/// document order before return per XPath spec §3.3.
pub fn evaluate(arena: Allocator, frame: *Frame, expr: *const Ast.Expr, context_node: *Node) Error!Result.Result {

Collaborator

frame is generally the last parameter. There's no strict reason for this, but since the js bridge enforces this for webapis, it's something we tend to follow throughout.

Contributor Author

Done — moved frame to last on Evaluator.evaluate, Evaluator.searchAll, Functions.call, and Functions.idFn. Call sites updated in webapi/XPath{Result,Expression}.zig and cdp/domains/dom.zig.

Comment thread src/browser/xpath/result.zig
Comment thread src/browser/xpath/Ast.zig Outdated
Comment thread src/browser/xpath/Parser.zig

@navidemad (Contributor Author) commented May 6, 2026

@karlseguin Thanks — spent some time on measurement and Point 1 (the cheapest, highest-value lever from your list). Recap below, then numbers.

Benchmark fixture

Added src/browser/tests/xpath/xpath_perf.html: deterministic 500-node DOM (mix of div/span/p × alpha/beta/gamma, decorrelated periods so //div[@class='alpha'] isn't a degenerate restatement of //div), 50 iterations per query after a 3-iter warmup, asserts result counts so a regression in correctness can't hide behind a timing line.

Run: TEST_VERBOSE=false TEST_FILTER="XPath perf" zig build test.

Point 1 — id-lookup fast path

tryIdLookupFastPath at the top of Evaluator.evalPath. Recognizes:

//tag[@id='x']    →  ds::node() / child::tag[@id='literal']
.//tag[@id='x']   →  self::node() / ds::node() / child::tag[@id='literal']

…and serves them via frame.getElementByIdFromNode (case-insensitive tag check, containment check for relative paths, accepts the literal on either side of =). Falls through to the general path on any deviation: extra step, extra predicate, non-eq, non-literal RHS, or unresolvable search root.

Inherits the same compromise webapi/selector/List.zig:optimizeSelector already ships for querySelector(All) — the id-map only stores the first element per ID in document order, so a page with duplicate IDs returns one match where a strict tree walk would find all. Capybara/Selenium hot paths assume unique IDs and the same compromise has been deployed in CSS for years.
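A minimal sketch of the shape check and fallthrough described above, written in Python as a model rather than the Zig implementation. The `Step` and `Predicate` types, `eval_path`, and the dict-based `id_map` are hypothetical stand-ins; only the absolute `//tag[@id='x']` form is modeled, so the containment check for `.//` is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Predicate:
    op: str                  # comparison operator, e.g. "=" or "!="
    attr: Optional[str]      # attribute operand such as "id", if any
    literal: Optional[str]   # string-literal operand, if any


@dataclass
class Step:
    axis: str
    test: str                # tag name, "*" or "node()"
    predicates: list = field(default_factory=list)


def id_literal_for_fast_path(steps):
    """Return the id literal when steps have the //tag[@id='x'] shape."""
    if len(steps) != 2:
        return None  # extra step
    head, tail = steps
    if (head.axis, head.test) != ("descendant-or-self", "node()") or head.predicates:
        return None
    if tail.axis != "child" or len(tail.predicates) != 1:
        return None  # extra (or missing) predicate
    pred = tail.predicates[0]
    if pred.op != "=" or pred.attr != "id" or pred.literal is None:
        return None  # non-eq or non-literal operand
    return pred.literal


def eval_path(steps, id_map, general_eval):
    literal = id_literal_for_fast_path(steps)
    if literal is None:
        return general_eval(steps)  # any deviation uses the general path
    element = id_map.get(literal)   # the map keeps one element per id
    if element is None:
        return []
    tag = steps[1].test
    if tag != "*" and element["tag"].lower() != tag.lower():
        return []                   # case-insensitive tag mismatch
    return [element]


def slash_slash(tag, *predicates):
    """Build the two-step lowering of //tag[...]."""
    return [Step("descendant-or-self", "node()"), Step("child", tag, list(predicates))]


id_map = {"target": {"tag": "SPAN", "id": "target"}}
by_id = Predicate("=", "id", "target")
general = lambda steps: "general path"

print(eval_path(slash_slash("span", by_id), id_map, general))         # hit
print(eval_path(slash_slash("div", by_id), id_map, general))          # tag mismatch
print(eval_path(slash_slash("span", by_id, by_id), id_map, general))  # falls through
```

Because the map holds a single element per id, a page with duplicate ids yields one match here, which mirrors the compromise described above.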

Results — debug build, 500-node tree, 50 iters/query

| label | baseline µs | after µs | speedup |
| --- | --- | --- | --- |
| //*[@id='target'] | 3231 | 22.6 | 143× |
| //span[@id='target'] (hit) | 2931 | 22.4 | 131× |
| //div[@id='target'] (miss, tag mismatch) | 2089 | 21.7 | 96× |
| //div | 2964 | 1133 | flat* |
| //span | 2659 | 3354 | flat* |
| //* | 8986 | 6595 | flat* |
| //*[@class='alpha'] | 7210 | 5751 | flat* |
| //div[@class='alpha'] | 3128 | 3272 | flat |
| (//div)[1] | 2631 | 3518 | flat* |
| (//div)[last()] | 2331 | 2561 | flat |
| //div[contains(@class,'alpha')] | 4005 | 4390 | flat |
| //div[starts-with(@id,'n')] | 4637 | 4724 | flat |
| count(//div) | 2880 | 1962 | flat* |

* ±30% across runs (debug build, 50 iters). Fast path falls through immediately for these shapes — any movement is system noise. The 22µs floor on ID lookups is parser + bridge round-trip; at that point we're measuring V8 ↔ Zig overhead, not XPath work.

All 79 XPath tests still pass, including cases the fast path correctly rejects at the predicate-shape check (e.g. //*[@id='heading' and @class='primary']).

Next

Pushed Point 1 to this PR in 8733e33b. My read on the rest:

  • Point 4 (iterator axes with parse-time needs_size detection) — folds in Point 2's logic, biggest payoff for (//x)[1] / count(//x) / //foo[bar]. Worth it if the noise-level numbers above bother you in real Capybara runs. The biggest remaining cost is //*[@class='x'] (~7ms) and //* (~9ms), both of which touch the full tree.
  • Point 3 (per-step scratch arena) — defer until profilers point at peak alloc.
  • Point 5 (skip dedupe sort when already in doc order) — skip. Detection is fragile (single ctx + all forward axes + no overlapping subtrees) and the savings are bounded by sort cost, which is sub-ms in practice.

@navidemad (Contributor Author)

Update — pushed f61be05f extending the Point 1 fast path to non-positional descendant queries.

tryFusedDescendantFastPath accepts the same AST shape (//<test> / .//<test> lowering) but allows any boolean/node-set predicate that doesn't depend on outer position. Walks the search root's descendants once in document order, applies the node test + predicates inline. No per-step axis materialization, no dedup hash map (single-context forward walk preserves doc order).
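The single-pass walk can be modeled as a pre-order traversal that tests each node as it is visited. The sketch below is Python rather than Zig, with nodes as plain dicts and predicates as callables; `descendants` and `fused_descendant_query` are hypothetical names used only for illustration.

```python
def descendants(node):
    """Yield the node's descendants in document (pre-order) order."""
    for child in node.get("children", []):
        yield child
        yield from descendants(child)


def fused_descendant_query(root, tag, predicates=()):
    """Apply the node test and predicates inline during one walk."""
    matches = []
    for node in descendants(root):
        if tag != "*" and node["tag"] != tag:
            continue
        if all(pred(node) for pred in predicates):
            matches.append(node)  # already in document order, no dedup needed
    return matches


tree = {"tag": "html", "children": [
    {"tag": "div", "class": "alpha", "children": [{"tag": "span", "class": "beta"}]},
    {"tag": "div", "class": "gamma"},
]}
is_alpha = lambda n: n.get("class") == "alpha"

print([n["tag"] for n in fused_descendant_query(tree, "*")])
print(len(fused_descendant_query(tree, "div", [is_alpha])))
```

Each node is visited exactly once from a single context, so the output order is the traversal order and no intermediate node list or hash map is built.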

Safety gate rejects predicates whose top-level expression is numeric (number literal, neg, arithmetic binop, numeric-returning fn-call) and any predicate containing position() or last() anywhere. Conservative — a nested sub-path's local positional predicate ([bar[1]]) is rejected even though [1] is scoped to bar's axis. Easy to relax later if it shows up in real usage.
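One way to picture the gate is as a recursive check over the predicate AST. The Python sketch below uses a hypothetical tuple encoding (`("num", 1)`, `("call", name, *args)`, `("binop", op, lhs, rhs)`, `("child", name, *predicates)`); the real AST types differ, and the recursion into nested predicates is an assumption made to reproduce the conservative `[bar[1]]` rejection described above.

```python
# Functions that return a number in XPath 1.0.
NUMERIC_FNS = {"last", "position", "count", "string-length",
               "number", "sum", "floor", "ceiling", "round"}
ARITH_OPS = {"+", "-", "*", "div", "mod"}


def is_numeric_top_level(expr):
    """True when the predicate's top-level expression yields a number."""
    kind = expr[0]
    if kind in ("num", "neg"):
        return True
    if kind == "binop" and expr[1] in ARITH_OPS:
        return True
    return kind == "call" and expr[1] in NUMERIC_FNS


def contains_positional(expr):
    """True when position() or last() appears anywhere inside expr."""
    if not isinstance(expr, tuple):
        return False
    if expr[0] == "call" and expr[1] in ("position", "last"):
        return True
    return any(contains_positional(child) for child in expr[1:])


def nested_predicates(expr):
    """Yield predicates attached to sub-path steps inside expr."""
    if not isinstance(expr, tuple):
        return
    if expr[0] == "child":
        yield from expr[2:]
    for child in expr[1:]:
        yield from nested_predicates(child)


def is_safe_predicate(expr):
    if is_numeric_top_level(expr) or contains_positional(expr):
        return False
    # Conservative: a nested step's own positional predicate also rejects.
    return all(is_safe_predicate(p) for p in nested_predicates(expr))


print(is_safe_predicate(("binop", "=", ("attr", "class"), ("str", "alpha"))))
print(is_safe_predicate(("num", 1)))
print(is_safe_predicate(("child", "bar", ("num", 1))))
```

A predicate that fails the gate is not an error; the query simply takes the general evaluation path.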

tryIdLookupFastPath refactored to share matchDescendantPathShape with the new fast path; behavior unchanged.

Numbers (debug build, 500-node tree, 50 iters/query)

| label | baseline µs | after Point 1 µs | after fused µs | speedup |
| --- | --- | --- | --- | --- |
| //*[@id='target'] | 3231 | 22.6 | 22.6 | 143× |
| //span[@id='target'] (hit) | 2931 | 22.4 | 22.8 | 129× |
| //div[@id='target'] (tag miss) | 2089 | 21.7 | 21.7 | 96× |
| //div | 2964 | 1133 | 386 | |
| //span | 2659 | 3354 | 384 | |
| //* | 8986 | 6595 | 1018 | |
| //*[@class='alpha'] | 7210 | 5751 | 1498 | |
| //div[@class='alpha'] | 3128 | 3272 | 625 | |
| (//div)[1] | 2631 | 3518 | 87 | 30× |
| (//div)[last()] | 2331 | 2561 | 94 | 25× |
| //div[contains(@class,'alpha')] | 4005 | 4390 | 704 | |
| //div[starts-with(@id,'n')] | 4637 | 4724 | 794 | |
| count(//div) | 2880 | 1962 | 90 | 32× |

(//div)[1], (//div)[last()], and count(//div) pick up the speedup transitively — //div is now a fused-walk shape, so the inner expression is fast even though the outer filter/count is unchanged.
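The speedup column is the baseline time divided by the time after the fused walk. A short Python check of that arithmetic, using figures copied from the table (the `speedup` helper is a hypothetical name):

```python
# (baseline µs, after-fused µs) pairs copied from the table above.
timings_us = {
    "//*[@id='target']": (3231, 22.6),
    "(//div)[1]": (2631, 87),
    "(//div)[last()]": (2331, 94),
    "count(//div)": (2880, 90),
    "//div": (2964, 386),
}


def speedup(baseline_us, after_us):
    """Ratio of baseline time to optimized time."""
    return baseline_us / after_us


for label, (baseline, after) in timings_us.items():
    print(f"{label}: {speedup(baseline, after):.1f}x")
```

The same ratio can be read off for rows whose speedup cell is blank; for `//div` it comes to roughly 7.7.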

All 79 XPath tests still pass. Conformance cases that correctly fall through to the general path because their predicates are positional or reference position()/last(): //li[1], //li[last()], //li[position() > 2], //li[position() mod 2 = 1], (//section)[2].

@navidemad (Contributor Author) commented May 6, 2026

Two commits in, the fast paths land 5–143× wins across the benchmark — see the previous comment for the full table. The fused-walk extension absorbed most of what I'd originally pegged for Point 4, since (//x)[1] / count(//x) benefit transitively from the inner path being fast. Points 2, 4, and 5 from your list are effectively done or skipped.

One remaining lever I'd consider: Point 3 (per-predicate-iteration scratch arena). Worth more than I implied earlier — the XPath result arena (XPathResult._arena) lives until JS releases the result via XPathResult.deinit, not just for the duration of the call, so peak per-call allocation persists across calls if a script holds results. Point 3 would target that. Structurally more invasive than the surgical shape-detection though: need to thread a separate scratch arena through evalStep / predicate evaluation.

@karlseguin How do you want to proceed?

  • Land as-is, file Point 3 as a separate follow-up.
  • Push deeper on this branch — Point 3 before merge, bigger diff but more complete perf coverage.
  • Re-run the benchmark in release mode first to see if the gains are still material outside debug builds.

Lightly leaning toward the first option — your original framing was "common shapes that have easy optimizations" and we're past that. But happy to keep going if you'd rather not bifurcate the work.

@karlseguin (Collaborator)

We can merge this as-is. I do think a follow up to leverage a scratch arena is worth it, those temp allocations can be quite large (entire DOM tree) and they're going to live until v8 frees them - which might be only on page end. While the new fast paths should help, I imagine some scripts do a lot of these xpath queries to fetch various fields and it could add up.

@krichprollsch still want to review it more?

Comment thread src/cdp/domains/dom.zig
try std.testing.expect(isXPathQuery("(./bar)[2]"));
try std.testing.expect(isXPathQuery("descendant::p"));
try std.testing.expect(isXPathQuery("ancestor-or-self::*"));
try std.testing.expect(isXPathQuery("//*[@id='x']"));

Collaborator

I think if you add cases like a::before it'll fail. You should use a StaticStringMap and match against the specific axis names.

Contributor Author

Good catch — a::before did indeed pass (a matches [a-zA-Z-]). Fixed in d4de5e6: walks back the identifier run before :: and looks it up in a StaticStringMap of the 13 XPath 1.0 axis names. Test now also covers a::before, div::after, p::first-line, input::placeholder, and [data-x="x::y"].
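The walk-back check can be sketched as follows. This is a Python model of the idea, not the Zig code: a `frozenset` stands in for the StaticStringMap, `has_axis_specifier` is a hypothetical name, and scanning every `::` occurrence (rather than only the first) is an assumption.

```python
import string

# The 13 axis names defined by XPath 1.0.
XPATH_AXES = frozenset({
    "ancestor", "ancestor-or-self", "attribute", "child", "descendant",
    "descendant-or-self", "following", "following-sibling", "namespace",
    "parent", "preceding", "preceding-sibling", "self",
})
AXIS_CHARS = frozenset(string.ascii_letters + "-")


def has_axis_specifier(query: str) -> bool:
    """True when some '::' is directly preceded by a real axis name."""
    start = 0
    while True:
        idx = query.find("::", start)
        if idx == -1:
            return False
        # Walk back over the run of [a-zA-Z-] characters before "::".
        begin = idx
        while begin > 0 and query[begin - 1] in AXIS_CHARS:
            begin -= 1
        if query[begin:idx] in XPATH_AXES:
            return True
        start = idx + 2


for q in ["descendant::p", "ancestor-or-self::*", "a::before", '[data-x="x::y"]']:
    print(q, "->", has_axis_specifier(q))
```

CSS pseudo-elements fail the lookup because the identifier before `::` is a tag name such as `a` or `div`, not one of the axis names.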

navidemad and others added 11 commits May 8, 2026 08:44

Ports the capybara-lightpanda XPath 1.0 polyfill into Lightpanda.
Exposes the WHATWG Document.evaluate / XPathResult / XPathEvaluator
/ XPathExpression surface and routes CDP DOM.performSearch XPath
queries through the new evaluator. The capybara-lightpanda gem can
drop its ~700-line JS polyfill in the next release.

New module src/browser/xpath/ (Tokenizer, Parser, Ast, Evaluator,
Functions, Result). New webapi types XPathResult,
XPathExpression, XPathEvaluator. Coverage and stubs match the
polyfill 1:1 — see capybara-lightpanda/XPATH_COMPLIANCE.md for
the full spec.

Tests: 91-case conformance + result-API + evaluator-API + CDP
fixtures, plus the engine's Zig unit suite (601/601 pass).

The Parser borrows string slices from its input for AST literals,
names, and var refs. Without duping, the AST holds slices into the JS
call_arena, which is reset when the top-level call returns — every
subsequent evaluate() of a cached XPathExpression would dereference
freed memory.

A bare indexOf("::") matched CSS pseudo-elements (a::before) and
attribute values containing '::' ([data-x="x::y"]), misrouting them
to the XPath evaluator. Require an axis-name shape ([a-zA-Z-])
immediately before '::' so only real axis specifiers like
descendant::p are dispatched to XPath.

The attribute axis was calling Entry.toAttribute on every visit,
materializing fresh *Attribute structs (plus duped name/value strings)
into page-lifetime storage. Repeated XPath queries — the Capybara/
Selenium polling pattern this PR targets — accumulated unbounded
copies for the same DOM entries. Route through frame._attribute_lookup
so each Entry resolves to a single cached *Attribute, matching
List.getAttribute and NamedNodeMap.getAtIndex.

Per XPath 1.0 §5.7, the data model has no CDATASection node; CDATA
content is part of the text node value. The text() node test was only
matching DOM nodeType 3 (Text), silently excluding CDATA sections
(nodeType 4) parsed via DOMParser/XMLDocument and inline foreign
content like SVG with embedded scripts.

- Rename Result.zig / Ast.zig / Functions.zig to snake_case (no
  top-level fields per Zig style guide)
- Restructure imports across xpath module: lib (std/lp) → relative
  (further → nearer) → aliases
- Move `frame` to last parameter on Evaluator.evaluate, searchAll,
  Functions.call, idFn (matches js bridge convention); call sites
  updated in webapi/XPath{Result,Expression}.zig and cdp/domains/dom.zig
- Local-pos style in XPathResult.iterateNext
- Document.evaluate / XPathEvaluator.evaluate / XPathExpression.evaluate:
  result_type / requested_type now optional u16 defaulting to ANY_TYPE
  (matches WHATWG: `optional unsigned short type = 0`). context_node
  stays nullable with a fallback to the document — preserves the
  polyfill's behavior asserted by the `default_context` fixture
- ast.zig NodeTest: clarify that namespaced names (`prefix:*`,
  `prefix:local`) are stored verbatim and fall through to a literal
  match against the node name — consistent with the `namespace::` axis
  stub (decision #3). Adds a TODO for if the polyfill ever drops the
  stub
- Parser: cap recursive descent at depth 64 with new
  error.MaxDepthExceeded; depth tracked across parseExpr (parens,
  predicates, function args) and parseUnaryExpr (chained `-`). Two
  regression tests cover deep parenthesization and deep unary minus

evalPath recognizes //tag[@id='x'] and .//tag[@id='x'] (plus the
//*[@id='x'] wildcard) and serves them via frame.getElementByIdFromNode.
~100-150x speedup on ID lookups (3231us -> 22.6us for //*[@id='target']
in the new benchmark). Falls through to general path on any deviation
(extra step, extra predicate, non-eq, non-literal RHS).

Inherits the same duplicate-ID compromise selector/List.zig ships for
querySelector(All): the id-map stores only the first element per ID in
document order. Capybara/Selenium hot paths assume unique IDs.

tests/xpath/xpath_perf.html is the 13-query micro-benchmark used to
collect the numbers; batched console.warn output survives test runner
interleaving.

Generalizes 8733e33's //tag[@id='x'] shape: tryFusedDescendantFastPath
handles any //tag[safe] or .//tag[safe] where the predicates are
non-positional boolean/node-set checks. Walks the search root's
descendants once in document order, applies node test + predicates
inline, no per-step materialization, no dedup.

5-9x on //div, //*, //*[@class='x'], //div[contains(...)]; ~25x on
(//div)[1] and count(//div) where the inner path is the shape.

Safety gate rejects predicates that could produce a number at the
top level (number, neg, arithmetic binop, numeric-returning fn-call)
and any predicate containing position()/last() anywhere. Conservative:
a nested sub-path's local positional predicate is rejected even though
it's scoped to its own axis.

Disable xpath_perf benchmark from the test run as it's quite verbose.

The previous `::` heuristic accepted any identifier-like character before
`::`, which misrouted CSS pseudo-elements (`a::before`, `div::after`) to
the XPath evaluator. Walk back the run of [a-zA-Z-] characters and look
the candidate up in a StaticStringMap of the 13 XPath 1.0 named axes,
so only real axis names match.

@navidemad force-pushed the feat/xpath-1.0-evaluator branch from d4de5e6 to 0b0a34c on May 8, 2026 06:46

Strip mentions of the private gem and its internal paths from xpath
module docstrings, the conformance test header, and the dom dispatch
heuristic. Comments now describe behavior directly without pointing at
sources public readers can't access.

@krichprollsch merged commit d2151b6 into lightpanda-io:main on May 11, 2026
35 checks passed
@github-actions (bot) locked and limited conversation to collaborators on May 11, 2026

@krichprollsch (Member)

Thanks @navidemad for this hard work, it's a nice and useful addition to the browser 🙏
Thanks @karlseguin for the review 🙏
