fix(xterm): patch IntersectionObserver retention (-367 MB/30 mode-toggles) #617

Merged
srid merged 1 commit into master from fix/xterm-observer-leak
Apr 17, 2026

@srid srid commented Apr 17, 2026

Summary

Fixes the canvas/focus-toggle memory leak that was pushing production pureintent to 1.2 GB Memory Footprint. The leak is upstream in xterm.js; this PR bumps the pnpm.overrides pointer to a fork branch that stacks one additional fix on top of fix/dispose-leaks-built.

Upstream tracking: xtermjs/xterm.js#5820 (issue), xtermjs/xterm.js#5821 (PR).

The leak

RenderService._registerIntersectionObserver creates an IntersectionObserver whose callback closes over this directly. Although _observerDisposable calls observer.disconnect() on dispose, the retained callback chain keeps this (RenderService) alive in practice — along with _coreService → _bufferService → buffers → BufferLine → Uint32Array cell data.

Heap-snapshot diff across 30 mount/unmount cycles of 7 Terminal instances:

| Class | Δ count | Δ bytes |
| --- | --- | --- |
| native:system / JSArrayBufferData | +175,594 | +220 MB |
| object:Uint32Array | +175,594 | +10 MB |
| object:ArrayBuffer | +175,594 | +9 MB |
| object:BufferLine | +175,594 | +5 MB |

Every retained Uint32Array traced through the same retainer signature: global IntersectionObserver registry → callback closure → RenderService → service graph → BufferLines. 175,594 = 30 toggles × 7 terminals × ~830 scrollback lines — basically the full buffer of every disposed Terminal pinned past terminal.dispose().
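
The arithmetic behind that estimate can be checked directly. This is only a back-of-the-envelope sketch: it assumes every toggle remounts all 7 terminals and treats 1 MB as 10^6 bytes, neither of which is stated precisely above.

```typescript
// Figures taken from the heap-snapshot diff above.
const toggles = 30;
const terminals = 7;
const retainedLines = 175594; // Δ count of BufferLine / Uint32Array
const retainedBytes = 220e6;  // ~220 MB of JSArrayBufferData (assumption: MB = 10^6 bytes)

// Assumption: each toggle disposes and recreates every terminal once.
const disposedTerminals = toggles * terminals; // 210

// Average scrollback lines pinned per disposed terminal (≈ 836).
const linesPerTerminal = retainedLines / disposedTerminals;

// Average backing-store bytes pinned per retained line (≈ 1.25 KB).
const bytesPerLine = retainedBytes / retainedLines;

console.log(disposedTerminals, linesPerTerminal.toFixed(1), bytesPerLine.toFixed(0));
```

The per-terminal figure lands slightly above the ~830 quoted, which is consistent with scrollback lengths differing a little between terminals.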

The fix wraps this in a WeakRef. Functional semantics preserved: while RenderService is alive, deref() returns it and the handler runs unchanged. Once no strong refs remain, the callback is a no-op and the BufferService graph can GC.
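
For readers unfamiliar with the pattern, here is a minimal sketch of a WeakRef-wrapped callback. It is not the actual xterm.js patch: `Derefable`, `weakCallback`, and `DemoRenderService` are hypothetical names introduced for illustration, and the observer itself is left out so the sketch runs outside a browser.

```typescript
// Structural type so the helper accepts a real WeakRef or a test stub.
interface Derefable<T> {
  deref(): T | undefined;
}

// Builds a callback that reaches its target only through a weak handle.
// Whoever retains the returned function does not retain the target.
function weakCallback<T, A>(
  target: Derefable<T>,
  handler: (self: T, arg: A) => void,
): (arg: A) => void {
  return (arg: A) => {
    const self = target.deref();
    if (self === undefined) return; // target was collected: no-op
    handler(self, arg);             // target alive: behave as before
  };
}

// Hypothetical stand-in for RenderService (not the real class).
class DemoRenderService {
  public visible = false;
  public readonly onIntersection: (isVisible: boolean) => void;

  constructor() {
    // Capture a WeakRef instead of `this`. In the real code this callback
    // would be handed to `new IntersectionObserver(...)`.
    const weakThis = new WeakRef<DemoRenderService>(this);
    this.onIntersection = weakCallback(
      weakThis,
      (self: DemoRenderService, isVisible: boolean) => {
        self.visible = isVisible;
      },
    );
  }
}

const demo = new DemoRenderService();
demo.onIntersection(true); // while `demo` is alive, the handler runs unchanged
```

The point of the indirection is that the closure's only captured value is the `WeakRef`, so a registry that keeps the callback alive pins a few bytes rather than the whole service graph.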

Ground-truth measurement

Local kolu@zest (fresh tab, 30 canvas↔focus toggles, 7 terminals restored from session):

| Metric | Before | After |
| --- | --- | --- |
| Memory Footprint Δ / 30 toggles | +367 MB | −3 MB |
| JS live Δ | +66 MB | ~0 |
| BufferLine instances retained | +175,594 | ~0 |

Also verified the fix/dispose-leaks fix (upstream #5817) still ships — the stacked fork branch contains both.

Alternatives considered

  • Chase why disconnect() isn't releasing the callback. Could be DevTools instrumentation, a Chrome extension patching window.IntersectionObserver, a native registry quirk — unclear, and likely environment-dependent. The WeakRef wrap is defensive and preserves semantics regardless of root cause.
  • Null out _coreService / _bufferService in RenderService.dispose(). Narrower scope but requires a dispose override. WeakRef is cleaner — no new dispose path, just a smaller capture surface.

Test plan

🤖 Diagnosis assisted by Claude Code

…fork

Bumps the pnpm override pointer from `fix/dispose-leaks-built` to
`fix/kolu-xterm-fixes-built`, which stacks a second fix on top of
the existing dispose-leaks patches:

- xtermjs/xterm.js#5817 (`fix/dispose-leaks`): register
  CursorBlinkStateManager + _pausedResizeTask disposables. Already
  in production.
- xtermjs/xterm.js#5821 (new, `fix/intersection-observer-weakref`):
  wrap `this` in a `WeakRef` inside `RenderService`'s IntersectionObserver
  callback. `observer.disconnect()` on our side wasn't releasing the
  callback in practice — heap snapshot showed 175,594 retained
  Uint32Arrays (~220 MB / 30 mode-toggles with 7 terminals) traced
  through the global IntersectionObserver registry → callback closure
  → RenderService → BufferService → BufferLines. WeakRef breaks that
  chain regardless of why the native registry held the callback.

Local measurement on kolu@zest (fresh tab, 30 canvas↔focus toggles,
7 terminals restored):

                                 Before fix  →  After fix
  Task Manager Memory Footprint    +367 MB   →   -3 MB
  BufferLine (Z1) instances Δ      +175,594  →   ~0

Upstream tracking: xtermjs/xterm.js#5820 (issue),
xtermjs/xterm.js#5821 (PR). Fork consumption will collapse to a plain
version bump once upstream merges + releases.
@srid srid force-pushed the fix/xterm-observer-leak branch from c9794db to 6800302 April 17, 2026 23:22
@srid srid merged commit 1b18af1 into master Apr 17, 2026
3 checks passed
@srid srid deleted the fix/xterm-observer-leak branch April 17, 2026 23:29
srid added a commit that referenced this pull request Apr 17, 2026
…osis

Adds the IntersectionObserver / BufferLine retention story to
memory-learnings.md (#617, upstream xtermjs/xterm.js#5820 + #5821).

Skill updates:

- New "Ground truth: Task Manager Memory Footprint, not proxies" section.
  performance.memory, system/Context count, closure:* count are all
  proxies that can diverge from Task Manager by 100x. #614 reduced
  Context growth 89% across six commits with zero Memory Footprint
  improvement — the cautionary tale.

- New "Quiet-session A/B" requirement. Active agent terminals grow
  xterm scrollback legitimately; measurements on a busy session look
  indistinguishable from retention. #618 closed after a quiet-session
  A/B showed the +69 MB residual was all agent-stream activity.

- New leak shape "Callback retained past dispose()" — when an
  observer's disconnect()/dispose() doesn't fully release the callback
  closure in practice (DevTools instrumentation, extensions, native
  registry quirks). Fix pattern: wrap `this` in WeakRef inside the
  callback. This is what #617 did for xterm's RenderService.

- Promoted diff-heap.mjs + find-retainers.mjs to the top of the
  analyzer list, sorted by "start here". Sort heap diffs by bytes, not
  count — a 220 MB Uint32Array leak dominates any number of 40-byte
  Context churn.

Also commits the two diagnostic scripts that had been sitting
untracked in docs/perf-investigations/scripts/.

Regenerated .claude/skills/perf-diagnose/SKILL.md via `just ai::apm`.
srid added a commit that referenced this pull request Apr 17, 2026
… skill (#619)

## Summary

Documents the Chapter 3 investigation so future agents don't burn three
days chasing proxy metrics the way we did.

**Source of truth edits** (regenerated into `.claude/` via `just
ai::apm`):

- `docs/perf-investigations/memory-learnings.md` — new Chapter 3 section
covering #614 (closed without merge, the false trail) and #617 (the
one-line WeakRef that actually moved Task Manager Memory Footprint by
−81%).
- `agents/.apm/skills/perf-diagnose/SKILL.md` (runbook additions):
  - New "Ground truth: Task Manager Memory Footprint, not proxies" rule at
    the top. `performance.memory`, `system/Context` count, and `closure:*`
    count are all proxies; they can drift 100× from Task Manager and
    mislead you into declaring a fix that does nothing.
  - New "Quiet-session A/B" step. Active agent terminals grow scrollback
    buffers legitimately; a busy-session measurement looks indistinguishable
    from retention.
  - New **third** leak shape: "Callback retained past `dispose()`". When a
    `Window.<Observer>` native registry (or a DevTools extension wrapping
    it) holds the callback closure past the explicit
    `observer.disconnect()`, the callback keeps `this` (and its whole
    service graph) reachable. Fix pattern: wrap `this` in `WeakRef` inside
    the callback.
  - Promoted `diff-heap.mjs` + `find-retainers.mjs` to the top of the
    analyzer list with "start here" language. Sort heap diffs by **bytes**,
    not count: a 220 MB `Uint32Array` leak drowns any amount of 40-byte
    Context churn.
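
To make the bytes-versus-count point concrete, here is a small sketch of a per-class diff sorted by byte delta. It is not the committed `diff-heap.mjs`: the `Summary` shape, the `diffByBytes` name, and the sample numbers are all hypothetical, and it assumes each snapshot has already been reduced to per-class totals.

```typescript
interface ClassStats { count: number; bytes: number; }
type Summary = Record<string, ClassStats>;
interface ClassDelta { name: string; dCount: number; dBytes: number; }

// Per-class delta between two summaries, largest byte growth first.
function diffByBytes(before: Summary, after: Summary): ClassDelta[] {
  // Union of class names seen in either snapshot.
  const names = Object.keys(before);
  for (const n of Object.keys(after)) {
    if (!(n in before)) names.push(n);
  }
  const zero: ClassStats = { count: 0, bytes: 0 };
  const deltas = names.map((name) => {
    const b = before[name] ?? zero;
    const a = after[name] ?? zero;
    return { name, dCount: a.count - b.count, dBytes: a.bytes - b.bytes };
  });
  return deltas.sort((x, y) => y.dBytes - x.dBytes);
}

// Made-up numbers: many small objects versus fewer large ones.
const beforeSnap: Summary = {
  "system / Context": { count: 10000, bytes: 400000 },
};
const afterSnap: Summary = {
  "system / Context": { count: 60000, bytes: 2400000 },
  "JSArrayBufferData": { count: 2000, bytes: 2500000 },
};

const ranked = diffByBytes(beforeSnap, afterSnap);
console.log(ranked.map((d) => `${d.name}: ${d.dBytes} bytes, ${d.dCount} objects`));
```

Sorted by count, the Context churn would top this list; sorted by bytes, the array backing stores do, which is the ordering that tracks Memory Footprint.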

**New committed tooling** (had been sitting untracked):

- `docs/perf-investigations/scripts/diff-heap.mjs` — per-class
byte-delta between two heap snapshots.
- `docs/perf-investigations/scripts/find-retainers.mjs` — BFS from GC
roots to every instance of a target class, grouped by path signature.
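
The retainer search can be sketched in a few lines. This is an illustration of the idea only, not the committed `find-retainers.mjs`: it assumes the snapshot has already been parsed into a simple node list, and `HeapNode`, `retainerSignatures`, and the toy graph are hypothetical.

```typescript
interface HeapNode { id: number; className: string; edges: number[]; }

// BFS from the roots; for every reachable instance of `targetClass`,
// record the class names along its shortest path and count how many
// instances share each path signature.
function retainerSignatures(
  nodes: HeapNode[],
  rootIds: number[],
  targetClass: string,
): Map<string, number> {
  const byId = new Map<number, HeapNode>();
  for (const n of nodes) byId.set(n.id, n);

  const parent = new Map<number, number | null>(); // doubles as the visited set
  const queue: number[] = [];
  for (const r of rootIds) {
    if (!parent.has(r)) { parent.set(r, null); queue.push(r); }
  }

  const counts = new Map<string, number>();
  for (let head = 0; head < queue.length; head++) {
    const id = queue[head];
    if (id === undefined) continue;
    const node = byId.get(id);
    if (node === undefined) continue;

    if (node.className === targetClass) {
      // Walk parent links back to the root to build the signature.
      const path: string[] = [];
      let cur: number | null = id;
      while (cur !== null) {
        const step = byId.get(cur);
        if (step === undefined) break;
        path.push(step.className);
        cur = parent.get(cur) ?? null;
      }
      const sig = path.reverse().join(" -> ");
      counts.set(sig, (counts.get(sig) ?? 0) + 1);
    }
    for (const next of node.edges) {
      if (!parent.has(next)) { parent.set(next, id); queue.push(next); }
    }
  }
  return counts;
}

// Toy graph loosely shaped like the retention chain described in #617.
const toyHeap: HeapNode[] = [
  { id: 0, className: "(GC roots)", edges: [1, 7] },
  { id: 1, className: "IntersectionObserver", edges: [2] },
  { id: 2, className: "closure", edges: [3] },
  { id: 3, className: "RenderService", edges: [4, 5] },
  { id: 4, className: "BufferLine", edges: [] },
  { id: 5, className: "BufferLine", edges: [] },
  { id: 6, className: "BufferLine", edges: [] },
  { id: 7, className: "Buffer", edges: [6] },
];
const signatures = retainerSignatures(toyHeap, [0], "BufferLine");
```

Grouping by signature is what turns thousands of retained instances into a handful of distinct chains to read.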

## Why

The three-day version of the story: I jumped straight to heap snapshots,
found tens of thousands of retained `system/Context` objects, and
shipped six refactoring commits on #614 that reduced that count 89%.
Task Manager didn't move. The actual load-bearing retention was a single
`IntersectionObserver` callback in `xterm`'s `RenderService` holding 220
MB of `Uint32Array` BufferLines. One heap diff sorted by bytes (not
count) named the culprit in one line of output; one `WeakRef` wrap fixed
it.

Codifying these so the next agent doesn't re-tread the path.

## Test plan

- [x] `just fmt` clean
- [x] `just ai::apm` regenerated `.claude/skills/perf-diagnose/SKILL.md`
to match APM source
- [ ] `ci/apm-sync` validates `.claude/` matches sources
- [ ] `ci/fmt` + `ci/nix` green

🤖 Generated with [Claude Code](https://claude.com/claude-code)