xpath: implement XPath 1.0 (Document.evaluate, XPathResult, DOM.performSearch) #2305
karlseguin
left a comment
My biggest concern is about the performance of common queries, especially with respect to memory usage. I'll be honest, I only have a high level understanding of xpath and of your implementation. In our CSS selector, we apply some optimizations, including mostly doing right-to-left processing, to minimize the in-memory size. Claude informs me that probably doesn't work so well with xpath since they simply allow for a much broader set of instructions.
I also don't want to make the PR or the code too complicated. So tell me what you think of these possible optimizations. I don't think any of these are mutually exclusive, but possibly some of them don't make sense or are impractical.
1 - Optimizing for common shapes that have easy optimizations, e.g. //foo, //foo[@id='x'] and //*[@id='x']
2 - Right-to-left processing where predicates can be evaluated independently per node, without knowing their position, e.g. //foo[bar]
3 - Introducing a re-usable arena so that scratch allocations made within each step, or ideally in the per-ctx inner loop of a step, don't go into the main arena. The arena can be reset between each iteration.
4 - Using an iterator for the axis (rather than collecting them in memory).
- Allows for further optimizations (early exit)
- last() adds complexity to this, but it can be detected at parse time, and that could trigger materialization while still exposing a consistent iterator API to everything.
5 - Whether it's an iterator or we continue to collect the full axis, there are common cases where the dedupe list is already in document order. This can be detected (not by iterating the list, but based on the xpath itself) and, when true, dedupe.keys() does not need to be sorted.
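To make point 4 concrete, here is a minimal sketch of a descendant-axis iterator with early exit. The `Node` struct, `DescendantIterator`, and `firstByTag` are hypothetical stand-ins written for illustration; they are not the types used in this PR.

```zig
const std = @import("std");

// Hypothetical minimal node; the real DOM Node type in the PR is different.
const Node = struct {
    tag: []const u8,
    parent: ?*Node = null,
    first_child: ?*Node = null,
    next_sibling: ?*Node = null,
};

// Yields the descendants of `root` in document order, one per next() call,
// without collecting them into a list first.
const DescendantIterator = struct {
    root: *Node,
    current: ?*Node,

    fn init(root: *Node) DescendantIterator {
        return .{ .root = root, .current = root };
    }

    fn next(self: *DescendantIterator) ?*Node {
        var node = self.current orelse return null;
        // Prefer descending into the first child.
        if (node.first_child) |child| {
            self.current = child;
            return child;
        }
        // Otherwise climb until a next sibling exists, never leaving `root`.
        while (node != self.root) {
            if (node.next_sibling) |sibling| {
                self.current = sibling;
                return sibling;
            }
            node = node.parent orelse break;
        }
        self.current = null;
        return null;
    }
};

// Early exit: stop walking as soon as the first matching descendant is found.
fn firstByTag(root: *Node, tag: []const u8) ?*Node {
    var it = DescendantIterator.init(root);
    while (it.next()) |node| {
        if (std.mem.eql(u8, node.tag, tag)) return node;
    }
    return null;
}
```

A caller that needs last() could still drain the same iterator into a buffer, which is one way to keep a single iterator API while materializing only when the expression requires it.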
/// Public entry. Returns the AST's value; node-sets are sorted into
/// document order before return per XPath spec §3.3.
pub fn evaluate(arena: Allocator, frame: *Frame, expr: *const Ast.Expr, context_node: *Node) Error!Result.Result {
frame is generally the last parameter. There's no strict reason for this, but since the js bridge enforces this for webapis, it's something we tend to follow throughout.
Done — moved frame to last on Evaluator.evaluate, Evaluator.searchAll, Functions.call, and Functions.idFn. Call sites updated in webapi/XPath{Result,Expression}.zig and cdp/domains/dom.zig.
@karlseguin Thanks. I spent some time on measurement and on Point 1 (the cheapest, highest-value lever from your list). Recap below, then numbers.
Benchmark fixture
Added
Run:
Point 1: id-lookup fast path
…and serves them via
Inherits the same compromise
Results (debug build, 500-node tree, 50 iters/query)
* ±30% across runs (debug build, 50 iters). Fast path falls through immediately for these shapes, so any movement is system noise. The 22µs floor on ID lookups is parser + bridge round-trip; at that point we're measuring V8 ↔ Zig overhead, not XPath work. All 79 XPath tests still pass, including cases the fast path correctly rejects at the predicate-shape check (e.g.
Next
Pushed Point 1 to this PR in
Update — pushed
Safety gate rejects predicates whose top-level expression is numeric (number literal, neg, arithmetic binop, numeric-returning fn-call) and any predicate containing
Numbers (debug build, 500-node tree, 50 iters/query)
All 79 XPath tests still pass. Conformance cases that correctly fall through to the general path because their predicates are positional or reference
Two commits in, the fast paths land 5–143× wins across the benchmark; see the previous comment for the full table. The fused-walk extension absorbed most of what I'd originally pegged for Point 4, since
One remaining lever I'd consider: Point 3 (per-predicate-iteration scratch arena). Worth more than I implied earlier: the XPath result arena (
@karlseguin How do you want to proceed?
Lightly leaning toward the first option: your original framing was "common shapes that have easy optimizations" and we're past that. But happy to keep going if you'd rather not bifurcate the work.
We can merge this as-is. I do think a follow-up to leverage a scratch arena is worth it: those temp allocations can be quite large (entire DOM tree) and they're going to live until v8 frees them, which might be only on page end. While the new fast paths should help, I imagine some scripts do a lot of these xpath queries to fetch various fields and it could add up.
@krichprollsch still want to review it more?
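A rough sketch of what the scratch-arena follow-up could look like, using `std.heap.ArenaAllocator` and resetting it after every iteration. `countMatches` and `evaluateAll` are hypothetical names invented for this example; the real evaluator's structure may differ.

```zig
const std = @import("std");

// Hypothetical per-context work: it allocates temporaries from `scratch`
// and returns only a scalar, so nothing has to outlive the iteration.
fn countMatches(scratch: std.mem.Allocator, n: usize) !usize {
    const tmp = try scratch.alloc(usize, n);
    for (tmp, 0..) |*slot, i| slot.* = i;
    var evens: usize = 0;
    for (tmp) |v| {
        if (v % 2 == 0) evens += 1;
    }
    return evens;
}

// Runs one unit of work per entry in `sizes`, reusing a single scratch arena.
fn evaluateAll(backing: std.mem.Allocator, sizes: []const usize) !usize {
    var scratch = std.heap.ArenaAllocator.init(backing);
    defer scratch.deinit();

    var total: usize = 0;
    for (sizes) |n| {
        total += try countMatches(scratch.allocator(), n);
        // Drop this iteration's temporaries but keep the buffer for reuse,
        // so peak memory is bounded by the largest single iteration.
        _ = scratch.reset(.retain_capacity);
    }
    return total;
}
```

Anything that must survive the iteration (the result node-set, for instance) would still be allocated from the longer-lived arena; only throwaway work goes through the scratch allocator.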
try std.testing.expect(isXPathQuery("(./bar)[2]"));
try std.testing.expect(isXPathQuery("descendant::p"));
try std.testing.expect(isXPathQuery("ancestor-or-self::*"));
try std.testing.expect(isXPathQuery("//*[@id='x']"));
I think if you add cases like a::before it'll fail. You should use a StaticStringMap and match against the specific axis names.
Good catch — a::before did indeed pass (a matches [a-zA-Z-]). Fixed in d4de5e6: walks back the identifier run before :: and looks it up in a StaticStringMap of the 13 XPath 1.0 axis names. Test now also covers a::before, div::after, p::first-line, input::placeholder, and [data-x="x::y"].
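The walk-back-and-lookup approach described here can be sketched as follows. This is an illustration, not the code from d4de5e6: `hasAxisSpecifier` and `isAxisChar` are hypothetical names, and `std.StaticStringMap` assumes Zig 0.13 or later.

```zig
const std = @import("std");

// The 13 axis names defined by XPath 1.0.
const axis_names = std.StaticStringMap(void).initComptime(.{
    .{"ancestor"},
    .{"ancestor-or-self"},
    .{"attribute"},
    .{"child"},
    .{"descendant"},
    .{"descendant-or-self"},
    .{"following"},
    .{"following-sibling"},
    .{"namespace"},
    .{"parent"},
    .{"preceding"},
    .{"preceding-sibling"},
    .{"self"},
});

fn isAxisChar(c: u8) bool {
    return std.ascii.isAlphabetic(c) or c == '-';
}

// Returns true only if some "::" is directly preceded by a real axis name.
fn hasAxisSpecifier(query: []const u8) bool {
    var from: usize = 0;
    while (std.mem.indexOfPos(u8, query, from, "::")) |idx| {
        // Walk back over the identifier run that ends right before "::".
        var start = idx;
        while (start > 0 and isAxisChar(query[start - 1])) start -= 1;
        if (axis_names.has(query[start..idx])) return true;
        from = idx + 2;
    }
    return false;
}
```

With this check, `descendant::p` is recognized while `a::before` and `[data-x="x::y"]` are not, because `a` and `x` are not axis names. A quoted attribute value that happens to contain a real axis name would still match in this sketch.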
Ports the capybara-lightpanda XPath 1.0 polyfill into Lightpanda. Exposes the WHATWG Document.evaluate / XPathResult / XPathEvaluator / XPathExpression surface and routes CDP DOM.performSearch XPath queries through the new evaluator. The capybara-lightpanda gem can drop its ~700-line JS polyfill in the next release. New module src/browser/xpath/ (Tokenizer, Parser, Ast, Evaluator, Functions, Result). New webapi types XPathResult, XPathExpression, XPathEvaluator. Coverage and stubs match the polyfill 1:1 — see capybara-lightpanda/XPATH_COMPLIANCE.md for the full spec. Tests: 91-case conformance + result-API + evaluator-API + CDP fixtures, plus the engine's Zig unit suite (601/601 pass).
The Parser borrows string slices from its input for AST literals, names, and var refs. Without duping, the AST holds slices into the JS call_arena, which is reset when the top-level call returns — every subsequent evaluate() of a cached XPathExpression would dereference freed memory.
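As an illustration of the fix described above, a minimal sketch of copying a borrowed slice into an arena that outlives the input buffer. `Literal` and `makeLiteral` are hypothetical names; the PR's actual AST types are not shown here.

```zig
const std = @import("std");

// Hypothetical AST literal that must outlive the parser's input buffer.
const Literal = struct { value: []const u8 };

fn makeLiteral(arena: std.mem.Allocator, borrowed: []const u8) !Literal {
    // Copy the bytes into memory owned by `arena` instead of keeping `borrowed`,
    // which may point into a buffer that is reset after the call returns.
    return .{ .value = try arena.dupe(u8, borrowed) };
}
```

Once the literal owns its bytes, the input buffer can be reused or freed without invalidating a cached expression.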
A bare indexOf("::") matched CSS pseudo-elements (a::before) and
attribute values containing '::' ([data-x="x::y"]), misrouting them
to the XPath evaluator. Require an axis-name shape ([a-zA-Z-])
immediately before '::' so only real axis specifiers like
descendant::p are dispatched to XPath.
The attribute axis was calling Entry.toAttribute on every visit, materializing fresh *Attribute structs (plus duped name/value strings) into page-lifetime storage. Repeated XPath queries — the Capybara/ Selenium polling pattern this PR targets — accumulated unbounded copies for the same DOM entries. Route through frame._attribute_lookup so each Entry resolves to a single cached *Attribute, matching List.getAttribute and NamedNodeMap.getAtIndex.
Per XPath 1.0 §5.7, the data model has no CDATASection node — CDATA content is part of the text node value. The text() node test was only matching DOM nodeType 3 (Text), silently excluding CDATA sections (nodeType 4) parsed via DOMParser/XMLDocument and inline foreign content like SVG with embedded scripts.
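A minimal sketch of the corrected node test, assuming a simplified `NodeType` enum that mirrors the DOM nodeType constants. `matchesTextTest` is a hypothetical name for illustration.

```zig
const std = @import("std");

// Subset of DOM nodeType values (ELEMENT_NODE = 1, TEXT_NODE = 3,
// CDATA_SECTION_NODE = 4, COMMENT_NODE = 8).
const NodeType = enum(u8) {
    element = 1,
    text = 3,
    cdata_section = 4,
    comment = 8,
};

// text() matches both Text and CDATASection nodes, since the XPath 1.0
// data model treats CDATA content as ordinary text.
fn matchesTextTest(node_type: NodeType) bool {
    return node_type == .text or node_type == .cdata_section;
}
```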
- Rename Result.zig / Ast.zig / Functions.zig to snake_case (no
top-level fields per Zig style guide)
- Restructure imports across xpath module: lib (std/lp) → relative
(further → nearer) → aliases
- Move `frame` to last parameter on Evaluator.evaluate, searchAll,
Functions.call, idFn (matches js bridge convention); call sites
updated in webapi/XPath{Result,Expression}.zig and cdp/domains/dom.zig
- Local-pos style in XPathResult.iterateNext
- Document.evaluate / XPathEvaluator.evaluate / XPathExpression.evaluate: result_type / requested_type now optional u16 defaulting to ANY_TYPE (matches WHATWG: `optional unsigned short type = 0`). context_node stays nullable with a fallback to the document, which preserves the polyfill's behavior asserted by the `default_context` fixture
- ast.zig NodeTest: clarify that namespaced names (`prefix:*`, `prefix:local`) are stored verbatim and fall through to a literal match against the node name, consistent with the `namespace::` axis stub (decision lightpanda-io#3). Adds a TODO for if the polyfill ever drops the stub
- Parser: cap recursive descent at depth 64 with new error.MaxDepthExceeded; depth tracked across parseExpr (parens, predicates, function args) and parseUnaryExpr (chained `-`). Two regression tests cover deep parenthesization and deep unary minus
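The recursion cap in the last item can be illustrated with a toy parser. This is a sketch under stated assumptions: the `Parser` below only accepts a one-character operand wrapped in balanced parentheses, and apart from `error.MaxDepthExceeded` and the limit of 64, every name is hypothetical.

```zig
const std = @import("std");

const ParseError = error{ MaxDepthExceeded, UnexpectedEnd };
const max_depth = 64;

// Hypothetical toy parser: accepts a single-character operand wrapped in any
// number of balanced parentheses, e.g. "((x))".
const Parser = struct {
    input: []const u8,
    pos: usize = 0,
    depth: usize = 0,

    fn parseExpr(self: *Parser) ParseError!void {
        // Refuse to recurse further once the cap is reached.
        if (self.depth >= max_depth) return error.MaxDepthExceeded;
        self.depth += 1;
        defer self.depth -= 1;

        if (self.pos >= self.input.len) return error.UnexpectedEnd;
        if (self.input[self.pos] == '(') {
            self.pos += 1;
            try self.parseExpr();
            if (self.pos >= self.input.len or self.input[self.pos] != ')') {
                return error.UnexpectedEnd;
            }
            self.pos += 1;
            return;
        }
        self.pos += 1; // consume the operand
    }
};
```

Tracking depth in one counter that every recursive entry point increments means a deeply nested input fails with an ordinary error instead of exhausting the native stack.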
evalPath recognizes //tag[@id='x'] and .//tag[@id='x'] (plus the //*[@id='x'] wildcard) and serves them via frame.getElementByIdFromNode. ~100-150x speedup on ID lookups (3231us -> 22.6us for //*[@id='target'] in the new benchmark). Falls through to general path on any deviation (extra step, extra predicate, non-eq, non-literal RHS). Inherits the same duplicate-ID compromise selector/List.zig ships for querySelector(All): the id-map stores only the first element per ID in document order. Capybara/Selenium hot paths assume unique IDs. tests/xpath/xpath_perf.html is the 13-query micro-benchmark used to collect the numbers; batched console.warn output survives test runner interleaving.
Generalizes 8733e33's //tag[@id='x'] shape: tryFusedDescendantFastPath handles any //tag[safe] or .//tag[safe] where the predicates are non-positional boolean/node-set checks. Walks the search root's descendants once in document order, applies node test + predicates inline, no per-step materialization, no dedup. 5-9x on //div, //*, //*[@class='x'], //div[contains(...)]; ~25x on (//div)[1] and count(//div) where the inner path is the shape. Safety gate rejects predicates that could produce a number at the top level (number, neg, arithmetic binop, numeric-returning fn-call) and any predicate containing position()/last() anywhere. Conservative: a nested sub-path's local positional predicate is rejected even though it's scoped to its own axis.
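The safety gate can be sketched roughly as below. The `Expr` union is a hypothetical, heavily simplified AST (its `path` variant carries no nested predicates), the list of numeric-returning functions is an assumption, and none of these names come from the PR.

```zig
const std = @import("std");

// Hypothetical, simplified predicate AST.
const Expr = union(enum) {
    number: f64,
    string: []const u8,
    neg: *const Expr,
    binop: struct { op: Op, lhs: *const Expr, rhs: *const Expr },
    call: struct { name: []const u8, args: []const Expr },
    path, // opaque stand-in for a location path

    const Op = enum { add, sub, mul, div, mod, eq, ne, lt, gt, and_, or_ };
};

// Assumed set of functions whose result is a number.
fn returnsNumber(name: []const u8) bool {
    const numeric = [_][]const u8{
        "count", "sum",           "floor",    "ceiling", "round",
        "number", "string-length", "position", "last",
    };
    for (numeric) |n| {
        if (std.mem.eql(u8, n, name)) return true;
    }
    return false;
}

// A numeric top-level predicate such as [2] is shorthand for a position test.
fn isNumericTopLevel(e: Expr) bool {
    return switch (e) {
        .number, .neg => true,
        .binop => |b| switch (b.op) {
            .add, .sub, .mul, .div, .mod => true,
            else => false,
        },
        .call => |c| returnsNumber(c.name),
        .string, .path => false,
    };
}

// True if position() or last() appears anywhere in the expression.
fn mentionsPosition(e: Expr) bool {
    switch (e) {
        .number, .string, .path => return false,
        .neg => |inner| return mentionsPosition(inner.*),
        .binop => |b| return mentionsPosition(b.lhs.*) or mentionsPosition(b.rhs.*),
        .call => |c| {
            if (std.mem.eql(u8, c.name, "position") or std.mem.eql(u8, c.name, "last")) return true;
            for (c.args) |arg| {
                if (mentionsPosition(arg)) return true;
            }
            return false;
        },
    }
}

fn isSafePredicate(e: Expr) bool {
    return !isNumericTopLevel(e) and !mentionsPosition(e);
}
```

Rejecting too much only costs speed, since rejected shapes fall through to the general path, whereas accepting a positional predicate would change results. That asymmetry is why a conservative gate is a reasonable default.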
Disable xpath_perf benchmark from test run as it's quite verbose.
The previous `::` heuristic accepted any identifier-like character before `::`, which misrouted CSS pseudo-elements (`a::before`, `div::after`) to the XPath evaluator. Walk back the run of [a-zA-Z-] characters and look the candidate up in a StaticStringMap of the 13 XPath 1.0 named axes, so only real axis names match.
d4de5e6 to 0b0a34c
Strip mentions of the private gem and its internal paths from xpath module docstrings, the conformance test header, and the dom dispatch heuristic. Comments now describe behavior directly without pointing at sources public readers can't access.
Thanks @navidemad for this hard work, it's a nice and useful addition to the browser 🙏
Summary
Ports the capybara-lightpanda XPath 1.0 polyfill into Lightpanda's Zig codebase. Exposes the WHATWG Document.evaluate / XPathResult / XPathEvaluator / XPathExpression surface and routes CDP DOM.performSearch XPath queries through the new evaluator. Motivation: the capybara-lightpanda gem currently injects a ~700-line JS polyfill on every page load to make document.evaluate work; this PR moves that into the engine so the gem can drop the polyfill in its next release.
Full behavior spec + the 91-case acceptance battery this port implements: capybara-lightpanda/XPATH_COMPLIANCE.md.
Note
Downstream coordination: capybara-lightpanda will drop its ~700-line JS polyfill in the same release that bumps MINIMUM_NIGHTLY_BUILD past the merge build of this PR. No action needed from this repo's reviewers; flagged for visibility.
Scope
- src/browser/xpath/ module: Tokenizer, Parser, Ast, Evaluator, Functions, Result (~3,000 LOC).
- src/browser/webapi/{XPathResult,XPathExpression,XPathEvaluator}.zig WHATWG types (~470 LOC).
- Document.evaluate, Document.createExpression, Document.createNSResolver wired via the JS bridge.
- DOM.performSearch heuristic-based XPath/CSS branching, which gives Playwright/Puppeteer/Capybara XPath-via-CDP for free.
Document.evaluate flow
sequenceDiagram
    participant JS
    participant Document
    participant XPathResult
    participant Parser
    participant Evaluator
    JS->>Document: evaluate(expr, ctx, resolver, type, result)
    Document->>XPathResult: fromExpression(expr, ctx, type, frame)
    XPathResult->>Parser: parse(arena, expr)
    Parser-->>XPathResult: *Ast.Expr
    XPathResult->>Evaluator: evaluate(arena, frame, ast, ctx)
    Evaluator-->>XPathResult: Result.Result
    XPathResult-->>JS: XPathResult { resultType, ... }
DOM.performSearch query-type detection
flowchart LR
    A[CDP query] --> B{isXPathQuery?}
    B -->|path operator or :: axis| C[xpath.searchAll]
    B -->|otherwise| D[Selector.querySelectorAll]
    C --> E[finishSearch]
    D --> E
    E --> F[CDP result: searchId, resultCount]
The full heuristic: query starts with /, .//, (/, (./, or contains a :: substring (axis specifier) → XPath; otherwise CSS.
Coverage
Full table in XPATH_COMPLIANCE.md.
namespace:: (returns [])
processing-instruction('name') consumes target literal but matches any PI
or / and, = / !=, < / <= / > / >=, + / -, * / div / mod, unary -
last, position, count, id, local-name, name; namespace-uri() always ""
string, concat, starts-with, contains, substring-before, substring-after, substring, string-length, normalize-space, translate
boolean, not, true, false; lang() always false
number, sum, floor, ceiling, round
$name always ""
Intentional stubs (HTML pragmatism, polyfill parity)
These match the prior capybara-lightpanda polyfill, which is the original motivation for this PR (HTML pragmatism over strict XPath 1.0):
- lang(string) → always false
- namespace:: axis → always []
- namespace-uri() → always ""
- processing-instruction('target') → target literal consumed but matches any PI
- name() / local-name() → lowercased (HTML-friendly, not source-case as the spec demands)
- $name → always ""
- DOM.performSearch top-level scalar XPath → empty (matches the polyfill's xpathFind semantics; the API is for finding nodes)
Test plan
- make test: 601/601 pass
- document_evaluate.html, xpath_conformance.html (91 cases), xpath_result.html, xpath_evaluator.html
- perform_search_xpath.html + Zig test driving 5 query shapes