Bound CDP macrotask drains so commands aren't queued behind page work by navidemad · Pull Request #2405 · lightpanda-io/browser

navidemad · 2026-05-09T13:30:44Z

What this fixes

On pages with sustained JS activity (Angular SPAs in change-detection / requestAnimationFrame chains), every session-scoped CDP command — Runtime.evaluate, DOM.getDocument, DOM.getOuterHTML — stalls for 14–20 seconds in serve mode. fetch --dump html of the same page works fine because --wait-ms caps it; CDP has no equivalent ceiling.

The cleanest signal in the issue's reproducer is that Runtime.evaluate('1+1') — a zero-cost roundtrip — takes 14.7 seconds. Whatever the command is, it isn't slow because of what it does; it's slow because it can't be read off the WebSocket. See #2402 for the full trace.

Root cause

Runner._tick (CDP mode, .html / .complete branch) calls browser.runMacrotasks() before yielding to socket I/O via http_client.tick(...). Browser.runMacrotasks drains three loops back-to-back, none of which yield:

env.runMacrotasks() → Scheduler.runQueue's while (queue.peek()) — every ready user-scheduled task (timers, setTimeout, requestAnimationFrame, …)
env.pumpMessageLoop() → while (v8__Platform__PumpMessageLoop(...)) {} — every V8 platform task
env.runMicrotasks() — every queued microtask

On the OPSWAT page this drain runs for 14–20 s before returning. CDP commands sent during that window sit in the kernel WebSocket buffer the whole time. Once the drain finishes, the very next http_client.tick(0) reads them and they execute in ~0 ms — confirming the bottleneck is the drain, not the command.

What this PR changes

Threads an optional monotonic-clock deadline through the drain chain. Inner loops check it between tasks and yield to the caller when it has elapsed, leaving still-ready tasks queued for the next pass.

The hot edit is in Runner.zig:

// In CDP mode, bound the drain so a long Angular-style macrotask
// chain doesn't block us from polling the WebSocket below. In
// non-CDP mode (fetch), let the drain run to completion — there
// is no socket to service and `wait_ms` already caps wall time.
const macrotask_deadline: ?u64 = if (comptime is_cdp)
    milliTimestamp(.monotonic) + CDP_MACROTASK_BUDGET_MS
else
    null;

try browser.runMacrotasks(macrotask_deadline);

The deadline is plumbed through Browser.runMacrotasks → Env.runMacrotasks / Env.pumpMessageLoop → Scheduler.run / Scheduler.runQueue as deadline_ms: ?u64. All five gain the parameter. null preserves the existing unbounded behavior — that's what every non-CDP caller passes (worker Local.runMacrotasks, Context.deinit, the direct scheduler.run in ScriptManagerBase, and Runner._tick in fetch mode).
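
To make the mechanism concrete, here is a minimal sketch of a deadline-bounded drain loop. It is not the PR's code: `FakeClock`, the `Scheduler` struct shown here, and the `costs_ms` slice are hypothetical stand-ins, and a fake clock replaces `milliTimestamp(.monotonic)` so the behavior is deterministic. It assumes the deadline is checked after each task, so a `null` deadline drains everything and an already-elapsed deadline still runs exactly one task, which are the two cases the unit tests in this PR cover.

```zig
const std = @import("std");

/// Hypothetical fake monotonic clock; running a task advances it by the task's cost.
const FakeClock = struct {
    now_ms: u64 = 0,
};

/// Hypothetical stand-in for the scheduler's ready queue (not the real Scheduler).
const Scheduler = struct {
    clock: *FakeClock,
    costs_ms: []const u64, // cost of each ready task, in queue order
    next: usize = 0, // index of the next task to run

    /// Runs ready tasks in order and returns how many were executed.
    /// A null deadline drains the whole queue. Otherwise the loop yields
    /// once the clock reaches the deadline. The check happens after each
    /// task, so a task is never interrupted midway.
    fn run(self: *Scheduler, deadline_ms: ?u64) usize {
        var ran: usize = 0;
        while (self.next < self.costs_ms.len) {
            self.clock.now_ms += self.costs_ms[self.next]; // "execute" the task
            self.next += 1;
            ran += 1;
            if (deadline_ms) |deadline| {
                if (self.clock.now_ms >= deadline) break; // yield; the rest stays queued
            }
        }
        return ran;
    }

    fn pending(self: *const Scheduler) usize {
        return self.costs_ms.len - self.next;
    }
};

test "null deadline drains every queued task" {
    var clock = FakeClock{};
    const costs = [_]u64{10} ** 50;
    var s = Scheduler{ .clock = &clock, .costs_ms = &costs };
    try std.testing.expectEqual(@as(usize, 50), s.run(null));
    try std.testing.expectEqual(@as(usize, 0), s.pending());
}

test "elapsed deadline runs one task and leaves the rest queued" {
    var clock = FakeClock{ .now_ms = 1000 };
    const costs = [_]u64{10} ** 50;
    var s = Scheduler{ .clock = &clock, .costs_ms = &costs };
    try std.testing.expectEqual(@as(usize, 1), s.run(1)); // deadline already in the past
    try std.testing.expectEqual(@as(usize, 49), s.pending());
    try std.testing.expectEqual(@as(usize, 49), s.run(null)); // follow-up drain
}
```

The sketch can be exercised with `zig test`. Checking after the task rather than before it guarantees forward progress on every pass, at the cost of always running at least one task even when the budget is already spent.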

Why 50 ms

It's a trade-off between page progress and CDP responsiveness. The deadline is checked between tasks, so:

Smaller (e.g. 5 ms) → finer-grained yielding, more time spent re-entering the tick loop than running JS.
Larger (e.g. 500 ms) → coarser yielding; on a page where individual callbacks are short, you'd see CDP commands wait that long behind the next batch.
50 ms → reasonable middle ground. On a page with short callbacks, CDP commands are picked up within ~50 ms. On a page with long single callbacks (like OPSWAT, see verification below), the per-command floor is set by the longest individual callback, not the budget; sub-second response on those pages needs ask #3 (V8 RequestInterrupt).

Hard-coded as CDP_MACROTASK_BUDGET_MS in Runner.zig. Could be exposed as a serve flag later — the issue reporter is comfortable with anything well under 1 s, so a follow-up is fine.
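
The trade-off can be illustrated with a small model. This is a sketch under assumptions, not a measurement: `drainDurationMs` is a hypothetical helper, the 2 ms callback costs are made-up inputs (only the ~2.75 s figure comes from the OPSWAT verification), and the cost of re-entering the tick loop is ignored. It computes how long one bounded drain runs before yielding to socket I/O when the budget is only checked between callbacks.

```zig
const std = @import("std");

/// Hypothetical model: how long (in ms) a single bounded drain runs before
/// it yields, given per-callback costs and a budget that is only checked
/// between callbacks.
fn drainDurationMs(callback_costs_ms: []const u64, budget_ms: u64) u64 {
    var elapsed: u64 = 0;
    for (callback_costs_ms) |cost| {
        elapsed += cost; // a callback always runs to completion
        if (elapsed >= budget_ms) break; // budget spent: yield to socket I/O
    }
    return elapsed;
}

test "short callbacks: the wait tracks the budget" {
    const costs = [_]u64{2} ** 1000; // 1000 callbacks of 2 ms each (assumed)
    try std.testing.expectEqual(@as(u64, 50), drainDurationMs(&costs, 50));
    try std.testing.expectEqual(@as(u64, 500), drainDurationMs(&costs, 500));
}

test "one long callback sets the floor regardless of the budget" {
    const costs = [_]u64{ 2750, 2, 2 }; // one ~2.75 s callback, then short ones
    try std.testing.expectEqual(@as(u64, 2750), drainDurationMs(&costs, 50));
    try std.testing.expectEqual(@as(u64, 2750), drainDurationMs(&costs, 5));
}
```

In this model, shrinking the budget below the longest callback buys nothing, which is why the caveat about long single callbacks matters more than the exact value chosen.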

Caveats

A single long synchronous callback still blocks for its full duration. The deadline is checked between tasks, not inside one. On the OPSWAT page in #2402, individual change-detection callbacks run ~2.75 s, so the per-command latency floor lands there even with the budget. Driving sub-second response on those pages needs V8 RequestInterrupt (ask #3 in the issue), which is the right long-term answer.
Microtasks (env.runMicrotasks) are not bounded. Bounding them requires RequestInterrupt. They're typically cheap.
This is ask #1 only. Asks #2 (CDP Lightpanda.stopJs / --terminate-ms equivalent), #3 (V8 RequestInterrupt), and #4 (msToNextMacrotask exposure) are separate features and not addressed here.

Where to focus review

The new parameter on Browser.runMacrotasks / Env.runMacrotasks / Scheduler.run / Scheduler.runQueue is mechanical — each had exactly one caller before this PR.

Env.pumpMessageLoop had three callers, and the two non-Browser ones run in different lifecycles. They both pass null (i.e. no behavior change), but I'd appreciate a second read on:

src/browser/js/Local.zig:111 — worker context, called from ScriptManagerBase, WorkerGlobalScope, Worker
src/browser/js/Context.zig:229 — context deinit, drains residual platform tasks before MicrotaskQueue deletion

Both should be unchanged. But worker / shutdown paths are exactly the kind of edge case where it would be easy to miss something.

Test plan

make test — 523/523 pass, including two new Scheduler.run unit tests
Scheduler.run(null) drains all 50 queued tasks
Scheduler.run(1) (deadline already in the past) runs exactly 1 task, leaves the rest queued; a follow-up Scheduler.run(null) drains them
Manual verification against the OPSWAT reproducer in #2402: every session-scoped command moves from "TIMEOUT 20 s" (current main) to ~2.75 s (this PR). See the verification comment below for the side-by-side probe output.

Refs #2402

Pages with sustained JS activity (Angular RAF chains, change detection on heavy SPAs) hold the V8 thread inside `Browser.runMacrotasks` for many seconds at a time. Because `Runner._tick` only polls the WebSocket *after* the drain returns, queued CDP commands sit in the kernel buffer for the duration: every session-scoped command stalled 14–20s on the reproducer in lightpanda-io#2402, even `Runtime.evaluate('1+1')`.

Thread an optional monotonic-clock deadline through `Browser.runMacrotasks` → `Env.runMacrotasks` / `Env.pumpMessageLoop` → `Scheduler.run` / `Scheduler.runQueue`. Inner loops check the deadline after each task and yield back to the caller when it elapses, leaving still-ready tasks in the queue.

`Runner._tick` sets a 50ms deadline in CDP mode and `null` (unbounded) in fetch mode, preserving existing fetch behavior. Other `pumpMessageLoop` callers (worker context, Context.deinit) and the single direct `scheduler.run` call in ScriptManagerBase pass `null`.

Adds two unit tests on `Scheduler.run` covering the no-deadline drain and the elapsed-deadline yield behavior.

Refs lightpanda-io#2402

navidemad · 2026-05-09T13:54:47Z

Verified against the issue's cdp-probe.mjs reproducer on the OPSWAT Angular SPA (https://www.opswat.com/docs/mdmft/metadefender-mft), macOS / Darwin 25.4.0:

| Command | Unpatched (`6e9156a8`) | Patched (`2bdc0ae1`) |
| --- | --- | --- |
| `Runtime.evaluate document.documentElement.outerHTML` | TIMEOUT 20 s | OK in 3352 ms |
| `Runtime.evaluate document.documentElement` (returnByValue:false) | TIMEOUT 20 s | OK in 2776 ms |
| `DOM.getDocument {depth:0}` | TIMEOUT 20 s | OK in 2773 ms |
| `DOM.getDocument {}` (default depth=3) | TIMEOUT 20 s | OK in 2791 ms |
| `Runtime.evaluate('1+1')` (sanity) | TIMEOUT 20 s | OK in 2750 ms |
| `DOM.getDocument retry {depth:0}` | TIMEOUT 20 s | OK in 2744 ms |
| `Target.closeTarget` | TIMEOUT 5 s | OK in 2744 ms |

Every session-scoped command moves from "never returns" to ~2.75 s. The per-command latency floor matches the duration of a single Angular change-detection callback on this page — exactly the caveat about long single tasks I called out in the body. Sub-second response on pages like this needs V8 RequestInterrupt (issue #2402 ask #3, separate change).

Full probe output

Unpatched:

[+0ms]      connected
[+17ms]     Target.createTarget: OK
[+19ms]     Target.attachToTarget: OK
[+20ms]     Network.enable: OK
[+20ms]     Page.enable: OK
[+20ms]     Emulation.setUserAgentOverride: OK
[+282ms]    Page.navigate: OK
[+282ms]    sleeping 8s post-navigate
[+28284ms]  Runtime.evaluate document.documentElement.outerHTML (returnByValue:true): FAIL in 20000ms — TIMEOUT
[+48286ms]  Runtime.evaluate document.documentElement (returnByValue:false):          FAIL in 20002ms — TIMEOUT
[+68286ms]  DOM.getDocument {depth:0}:                                                FAIL in 20000ms — TIMEOUT
[+88287ms]  DOM.getDocument {} (default depth=3):                                     FAIL in 20001ms — TIMEOUT
[+108288ms] Runtime.evaluate 1+1 (sanity):                                            FAIL in 20000ms — TIMEOUT
[+128289ms] DOM.getDocument retry {depth:0}:                                          FAIL in 20001ms — TIMEOUT
[+133290ms] Target.closeTarget:                                                       FAIL in 5001ms — TIMEOUT

Patched:

[+0ms]      connected
[+2ms]      Target.createTarget: OK
[+3ms]      Target.attachToTarget: OK
[+3ms]      Network.enable: OK
[+3ms]      Page.enable: OK
[+4ms]      Emulation.setUserAgentOverride: OK
[+106ms]    Page.navigate: OK
[+106ms]    sleeping 8s post-navigate
[+11459ms]  Runtime.evaluate document.documentElement.outerHTML (returnByValue:true): OK in 3352ms
[+14236ms]  Runtime.evaluate document.documentElement (returnByValue:false):          OK in 2776ms
[+17009ms]  DOM.getDocument {depth:0}:                                                OK in 2773ms
[+19800ms]  DOM.getDocument {} (default depth=3):                                     OK in 2791ms
[+22550ms]  Runtime.evaluate 1+1 (sanity):                                            OK in 2750ms
[+25294ms]  DOM.getDocument retry {depth:0}:                                          OK in 2744ms
[+28038ms]  Target.closeTarget:                                                       OK in 2744ms

karlseguin · 2026-05-10T11:32:01Z

I think #2393 does the job, while being much simpler.

navidemad · 2026-05-11T12:52:25Z

Closing as redundant — PR #2393 (Add timeslice to scheduler, merged 2026-05-06, in nightly 6105) already resolved the underlying stall. The reporter of #2402 confirmed against 6105 that Runtime.evaluate 1+1 returns in ~627 ms vs. 14.7 s pre-fix, and #2402 was closed as fixed. This PR's macrotask-drain cap improved things to ~2.75 s/command, but the scheduler timeslice is a more fundamental fix and supersedes this approach. No need to land both.

navidemad marked this pull request as ready for review May 9, 2026 13:39

This was referenced May 9, 2026

CDP commands stall for 15-20s on pages with sustained JS activity, even when fetch --dump returns the page in 5s #2402

Closed

Segmentation Fault #2373

Closed

navidemad closed this May 11, 2026

github-actions Bot locked and limited conversation to collaborators May 11, 2026

navidemad wants to merge 1 commit into lightpanda-io:main from navidemad:worktree-fix-2402-cdp-macrotask-budget
