Bound CDP macrotask drains so commands aren't queued behind page work#2405
Bound CDP macrotask drains so commands aren't queued behind page work#2405navidemad wants to merge 1 commit into
Conversation
Pages with sustained JS activity (Angular RAF chains, change detection on heavy SPAs) hold the V8 thread inside `Browser.runMacrotasks` for many seconds at a time. Because `Runner._tick` only polls the WebSocket *after* the drain returns, queued CDP commands sit in the kernel buffer for the duration — every session-scoped command stalled 14–20s on the reproducer in lightpanda-io#2402, even `Runtime.evaluate('1+1')`. Thread an optional monotonic-clock deadline through `Browser.runMacrotasks` → `Env.runMacrotasks` / `Env.pumpMessageLoop` → `Scheduler.run` / `Scheduler.runQueue`. Inner loops check the deadline after each task and yield back to the caller when it elapses, leaving still-ready tasks in the queue. `Runner._tick` sets a 50ms deadline in CDP mode and `null` (unbounded) in fetch mode, preserving existing fetch behavior. Other `pumpMessageLoop` callers (worker context, Context.deinit) and the single direct `scheduler.run` call in ScriptManagerBase pass `null`. Adds two unit tests on `Scheduler.run` covering the no-deadline drain and the elapsed-deadline yield behavior. Refs lightpanda-io#2402
|
Verified against the issue's
Every session-scoped command moves from "never returns" to ~2.75 s. The per-command latency floor matches the duration of a single Angular change-detection callback on this page — exactly the caveat about long single tasks I called out in the body. Sub-second response on pages like this needs V8 Full probe outputUnpatched: Patched: |
|
I think #2393 does the job, while being much simpler. |
|
Closing as redundant — PR #2393 ( |
What this fixes
On pages with sustained JS activity (Angular SPAs in change-detection /
requestAnimationFramechains), every session-scoped CDP command —Runtime.evaluate,DOM.getDocument,DOM.getOuterHTML— stalls for 14–20 seconds inservemode.fetch --dump htmlof the same page works fine because--wait-mscaps it; CDP has no equivalent ceiling.The cleanest signal in the issue's reproducer is that
Runtime.evaluate('1+1')— a zero-cost roundtrip — takes 14.7 seconds. Whatever the command is, it isn't slow because of what it does; it's slow because it can't be read off the WebSocket. See #2402 for the full trace.Root cause
Runner._tick(CDP mode,.html/.completebranch) callsbrowser.runMacrotasks()before yielding to socket I/O viahttp_client.tick(...).Browser.runMacrotasksdrains three loops back-to-back, none of which yield:env.runMacrotasks()→Scheduler.runQueue'swhile (queue.peek())— every ready user-scheduled task (timers,setTimeout,requestAnimationFrame, …)env.pumpMessageLoop()→while (v8__Platform__PumpMessageLoop(...)) {}— every V8 platform taskenv.runMicrotasks()— every queued microtaskOn the OPSWAT page this drain runs for 14–20 s before returning. CDP commands sent during that window sit in the kernel WebSocket buffer the whole time. Once the drain finishes, the very next
http_client.tick(0)reads them and they execute in ~0 ms — confirming the bottleneck is the drain, not the command.What this PR changes
Threads an optional monotonic-clock deadline through the drain chain. Inner loops check it between tasks and yield to the caller when it has elapsed, leaving still-ready tasks queued for the next pass.
The hot edit is in
Runner.zig:The deadline is plumbed through
Browser.runMacrotasks→Env.runMacrotasks/Env.pumpMessageLoop→Scheduler.run/Scheduler.runQueueasdeadline_ms: ?u64. All five gain the parameter.nullpreserves the existing unbounded behavior — that's what every non-CDP caller passes (workerLocal.runMacrotasks,Context.deinit, the directscheduler.runinScriptManagerBase, andRunner._tickin fetch mode).Why 50 ms
It's a trade-off between page progress and CDP responsiveness. The deadline is checked between tasks, so:
Hard-coded as
CDP_MACROTASK_BUDGET_MSinRunner.zig. Could be exposed as aserveflag later — the issue reporter is comfortable with anything well under 1 s, so a follow-up is fine.Caveats
fetch --dumpreturns the page in 5s #2402, individual change-detection callbacks run ~2.75 s, so the per-command latency floor lands there even with the budget. Driving sub-second response on those pages needs V8RequestInterrupt(ask make: fix help w/ linux #3 in the issue), which is the right long-term answer.env.runMicrotasks) are not bounded. Bounding them requiresRequestInterrupt. They're typically cheap.Lightpanda.stopJs/--terminate-msequivalent), make: fix help w/ linux #3 (V8RequestInterrupt), Install: add build and test instructions #4 (msToNextMacrotaskexposure) are separate features and not addressed here.Where to focus review
The new parameter on
Browser.runMacrotasks/Env.runMacrotasks/Scheduler.run/Scheduler.runQueueis mechanical — each had exactly one caller before this PR.Env.pumpMessageLoophad three callers, and the two non-Browserones run in different lifecycles. They both passnull(i.e. no behavior change), but I'd appreciate a second read on:src/browser/js/Local.zig:111— worker context, called fromScriptManagerBase,WorkerGlobalScope,Workersrc/browser/js/Context.zig:229— context deinit, drains residual platform tasks beforeMicrotaskQueuedeletionBoth should be unchanged. But worker / shutdown paths are exactly the kind of edge case where it would be easy to miss something.
Test plan
make test— 523/523 pass, including two newScheduler.rununit testsScheduler.run(null)drains all 50 queued tasksScheduler.run(1)(deadline already in the past) runs exactly 1 task, leaves the rest queued; a follow-upScheduler.run(null)drains themfetch --dumpreturns the page in 5s #2402 — every session-scoped command moves from "TIMEOUT 20 s" (currentmain) to ~2.75 s (this PR). See the verification comment below for the side-by-side probe output.Refs #2402