
fix(QUICStream): handle peers that start with zero stream credit#157

Open
lmvdz wants to merge 1 commit into MatrixAI:staging from lmvdz:fix/stream-limit-zero-initial-credit

lmvdz commented Apr 18, 2026

Problem

`QUICStream`'s constructor eagerly primes every new stream with `connection.conn.streamSend(streamId, new Uint8Array(0), false)` to keep local stream state symmetric with closing behavior (`src/QUICStream.ts`, around L280-310). When quiche returns `StreamLimit` on that prime call, the constructor throws `ErrorQUICStreamLimit`.

That throw leaves the system in a broken state:

  1. The local stream ID allocator in QUICConnection.newStream has already consumed the ID.
  2. Quiche has no record of the stream (it only records once real bytes flow).
  3. The next `newStream('uni')` call then hits `ErrorQUICUndefinedBehaviour: We should never repeat streamIds when creating streams`.

The connection is effectively dead for outbound streams, permanently.

Where this matters

This bites any peer that advertises `initial_max_streams_uni: 0` (or an already-exhausted count) and uses `MAX_STREAMS` frames to grant credit post-handshake. That's how several production servers implement rate-limited / stake-weighted QoS.

Concrete case: Solana's Agave TPU-QUIC server. It advertises 0 initial uni streams to unstaked clients and drip-feeds `MAX_STREAMS` frames under a stake-weighted rate limiter. With the current `@matrixai/quic@2.0.9`, every `newStream('uni')` against an Agave TPU fails with `StreamLimit` before any bytes can be written. This affects ~80% of Solana mainnet leader slots.
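
To make the credit mechanism concrete, here is a minimal toy model of unidirectional stream credit as RFC 9000 describes it: the peer's `initial_max_streams_uni` sets a cumulative limit, and later `MAX_STREAMS` frames can only raise it. The class name `UniStreamCredit` and its methods are hypothetical and exist only for illustration; this is not js-quic or quiche code.

```typescript
// Toy model of QUIC unidirectional stream credit (RFC 9000 semantics).
// Hypothetical class for illustration; not js-quic or quiche code.
class UniStreamCredit {
  // Number of client-initiated uni streams opened so far.
  private opened = 0;

  // maxStreams starts at the peer's advertised initial_max_streams_uni.
  constructor(private maxStreams: number) {}

  // MAX_STREAMS carries a cumulative limit; values that do not raise it are ignored.
  onMaxStreams(limit: number): void {
    if (limit > this.maxStreams) this.maxStreams = limit;
  }

  // Returns the next stream ID, or null while the peer has granted no further credit.
  tryOpen(): number | null {
    if (this.opened >= this.maxStreams) return null;
    const id = this.opened * 4 + 2; // client-initiated uni stream IDs: 2, 6, 10, ...
    this.opened += 1;
    return id;
  }
}
```

With `new UniStreamCredit(0)`, every `tryOpen()` returns `null` until `onMaxStreams(1)` arrives, which is the window in which the eager prime fails.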

Fix

Two small, narrowly-scoped changes in `QUICStream`:

  1. `createQUICStream`: if the eager-prime throws `StreamLimit`, swallow it instead of propagating. The stream object is still constructed locally — the caller gets a live stream it can write to. Quiche's internal state isn't touched by a failed zero-length prime, so the stream ID remains free to use when `writableWrite` actually sends bytes.

  2. `writableWrite`: bounded retry on `StreamLimit`. Up to 20 attempts at 50 ms intervals (≈1 s total budget). This gives the connection's receive loop time to process incoming `MAX_STREAMS` frames before we fail the caller. If credit doesn't arrive within the budget, the existing `ErrorQUICStreamInternal` path fires unchanged.

For peers that advertise non-zero initial credit, behavior is unchanged — the eager-prime succeeds on the first try and the retry loop never fires.
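
The two changes above can be sketched roughly as follows. This is not the diff from this PR: `StreamLimitError`, `primeStream`, and `writeWithRetry` are hypothetical names, and the `send` callbacks stand in for the real `streamSend` calls. The sketch also assumes there is no sleep after the final attempt, and it rethrows the last `StreamLimit` error on exhaustion, whereas the actual change falls through to the existing `ErrorQUICStreamInternal` path.

```typescript
// Hypothetical names throughout; a sketch of the technique, not the js-quic source.
class StreamLimitError extends Error {}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Change 1: a zero-length prime whose StreamLimit failure is swallowed.
// Returns true if the prime went through, false if credit is not available yet.
function primeStream(send: () => void): boolean {
  try {
    send();
    return true;
  } catch (e) {
    if (e instanceof StreamLimitError) return false; // defer to the first real write
    throw e; // any other failure stays fatal
  }
}

// Change 2: bounded retry on StreamLimit, 20 attempts spaced 50 ms apart by default.
async function writeWithRetry<T>(
  send: () => T,
  maxAttempts = 20,
  delayMs = 50,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return send();
    } catch (e) {
      if (!(e instanceof StreamLimitError)) throw e; // only StreamLimit is retried
      lastError = e;
      // Yield so the receive loop can process incoming MAX_STREAMS frames.
      if (attempt < maxAttempts) await sleep(delayMs);
    }
  }
  throw lastError; // budget exhausted without credit arriving
}
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays.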

Verification

I'm building a Solana TPU client in TypeScript (lmvdz/tpu-client). Tested both of these scenarios:

  • Local Agave (`solana-test-validator` 3.1.11): integration test submits a signed `SystemProgram::transfer` via TPU-QUIC and polls `getSignatureStatuses` until `processed`. Pre-fix: every attempt returns `StreamLimit` before any bytes leave the client. Post-fix: tx lands.
  • Live mainnet-beta: 6-node comparative probe (3 Agave 3.1.13, 3 Frankendancer 0.820.30113) writing a 100-byte stub on a client-initiated uni stream. Pre-fix: 0/3 Agave sends succeeded, 3/3 Frankendancer. Post-fix: 2/3 Agave (1 was an unrelated network timeout on a non-leader), 3/3 Frankendancer.
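
The polling step in the local test can be sketched as below. This is a hypothetical helper, not code from the test suite: `waitUntilLanded` and `FetchStatus` are made-up names, the status lookup is injected rather than tied to any RPC client, and the poll count and interval are assumptions rather than values from this PR.

```typescript
// Hypothetical helper; the status source is injected so no RPC client is assumed.
type Commitment = "processed" | "confirmed" | "finalized";
type FetchStatus = (signature: string) => Promise<Commitment | null>;

async function waitUntilLanded(
  signature: string,
  fetchStatus: FetchStatus,
  maxPolls = 40, // assumption: not taken from the PR
  intervalMs = 500, // assumption: not taken from the PR
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Commitment> {
  for (let poll = 1; poll <= maxPolls; poll++) {
    const status = await fetchStatus(signature);
    // A non-null status means the transaction was seen at "processed" or better.
    if (status !== null) return status;
    if (poll < maxPolls) await sleep(intervalMs);
  }
  throw new Error(`signature ${signature} not seen after ${maxPolls} polls`);
}
```

In a real test, `fetchStatus` would wrap a `getSignatureStatuses` call and should also check the returned status for a transaction error, which this sketch omits.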

Not included here

  • No new tests added upstream here. Happy to add one that uses `initial_max_streams_uni: 0` on the server side + a synchronized `MAX_STREAMS` write, if that'd help land this. Flag if you'd like me to.
  • I kept the eager-prime intact and just made the StreamLimit path non-fatal. Didn't want to restructure a hot path more than necessary. An alternative would be to drop the prime entirely for streams where the local peer knows it has no credit yet, but that's a bigger behavioral change.

Behavior summary

| Scenario | Pre-fix | Post-fix |
| --- | --- | --- |
| Peer advertises adequate initial stream credit | ✅ works | ✅ works (unchanged) |
| Peer advertises 0 initial credit, grants it promptly via `MAX_STREAMS` | ❌ eager-prime throws, connection dead | ✅ retry loop bridges the gap |
| Peer never grants credit | ❌ StreamLimit error | ❌ StreamLimit error after ~1 s (same terminal state, just delayed) |

Two related fixes in `QUICStream` for peers that advertise
`initial_max_streams_uni: 0` (or any already-exhausted count) and grant
stream credit post-handshake via `MAX_STREAMS` frames.

Problem
-------
The constructor eagerly primes every new stream with
`streamSend(streamId, new Uint8Array(0), false)` to make local stream
state symmetric with closing behavior. When quiche returns
`StreamLimit` on that prime call, the constructor throws
`ErrorQUICStreamLimit`. But the stream ID has already been consumed by
the local allocator, and quiche has no record of the stream — so the
next `newStream('uni')` hits
`ErrorQUICUndefinedBehaviour: We should never repeat streamIds when
creating streams`, permanently breaking outbound stream creation on
that connection.

Encountered in the wild against Solana's Agave TPU-QUIC server: Agave
advertises 0 initial uni streams to unstaked clients and drip-feeds
MAX_STREAMS frames under its stake-weighted QoS rate limiter. The
eager-prime races ahead of the first credit grant, and every stream
attempt on the connection fails from that point.

Fix
---
1. `createQUICStream`: if the eager-prime returns `StreamLimit`,
   swallow it instead of throwing. The stream object is still
   constructed locally; the caller gets a live stream it can write to.
   Quiche's internal state is untouched by the failed zero-length
   prime (it only records a stream once real bytes flow), so the
   stream ID is free to be used later when `writableWrite` retries.
2. `writableWrite`: bounded retry on `StreamLimit` — up to 20 attempts
   with 50 ms backoff (total ~1 s budget). Lets the connection's
   receive loop process incoming MAX_STREAMS frames before we fail
   the write. If no credit arrives within the budget, we fall through
   to the existing `ErrorQUICStreamInternal` path.

Behavior for peers that advertise non-zero initial credit is unchanged:
the eager-prime succeeds, the retry loop never fires.

Verified
--------
- Integration test against `solana-test-validator` (Agave 3.1.11 TPU):
  transaction successfully submitted via TPU-QUIC and landed at
  `processed` commitment, where previously every attempt returned
  `StreamLimit` before any bytes were written.
- Live mainnet-beta probe against 3 Agave 3.1.13 + 3 Frankendancer
  0.820.30113 nodes: all reachable nodes now accept a test write on
  a client-initiated uni stream. Pre-fix: 0/3 Agave sends succeeded.
  Post-fix: 2/3 Agave succeed (1 was an unrelated network timeout),
  3/3 Frankendancer succeed (unchanged — they already worked).

Downstream context
------------------
Discovered while building a Solana TPU client in TypeScript. Upstream
patch request so the downstream project can drop its `patch-package`
shim.

lmvdz added a commit to lmvdz/js-quic that referenced this pull request Apr 18, 2026
Distribution branch containing the prebuilt dist/ of @matrixai/quic@2.0.9
with two small edits to dist/QUICStream.js that let the library survive
peers advertising initial_max_streams_uni: 0 and granting stream credit
via post-handshake MAX_STREAMS frames (Solana Agave TPU-QUIC unstaked
path).

Consume via:
  "@matrixai/quic": "github:lmvdz/js-quic#release/tpu-fix"

Native binaries resolve from npm via optionalDependencies unchanged.

Upstream PR: MatrixAI#157

lmvdz added a commit to lmvdz/solana-tpu-client that referenced this pull request Apr 18, 2026
The big one. Our TPU-QUIC send path now successfully lands transactions
against Agave — verified end-to-end against both solana-test-validator
(Agave 3.1.11) locally and live mainnet-beta Agave 3.1.13 nodes.

Root cause (from research + source read of @matrixai/quic@2.0.9)
----------------------------------------------------------------
QUICStream.createQUICStream eagerly primes each new stream with
connection.conn.streamSend(streamId, new Uint8Array(0), false) to make
local state symmetric with closing behavior. When the peer advertises
initial_max_streams_uni: 0 (Agave's unstaked-client QoS advertises
exactly zero and drip-feeds MAX_STREAMS frames post-handshake), that
prime call returns StreamLimit. The library wraps it as
ErrorQUICStreamLimit and throws — leaving the local stream-ID
allocator consumed but quiche with no record of the stream. Every
subsequent newStream('uni') then hits
ErrorQUICUndefinedBehaviour: We should never repeat streamIds,
permanently breaking outbound streams on the connection.

Path A — upstream PR
--------------------
Forked MatrixAI/js-quic, applied a two-part fix to src/QUICStream.ts,
pushed, and opened:

  MatrixAI/js-quic#157

The PR does two things, narrowly scoped:
  1. createQUICStream: swallow StreamLimit from the eager-prime. The
     stream object is still constructed locally and the ID remains
     free to use (quiche only records streams when real bytes flow).
  2. writableWrite: bounded retry on StreamLimit — 20 attempts at
     50 ms intervals (~1 s budget). Lets the receive loop process
     incoming MAX_STREAMS frames before failing the write.

Peers with nonzero initial credit are unaffected: the prime succeeds
on the first try, the retry loop never fires.

Path B — patch-package in our repo
-----------------------------------
The exact same two-part diff, applied to our local
node_modules/@matrixai/quic/dist/QUICStream.js via patch-package.
Checked in as patches/@MatrixAI+quic+2.0.9.patch and applied at our
postinstall so our CI, unit tests, integration test, and smoke
scripts all exercise the fixed library.

Honest caveat: patch-package does not automatically propagate to
downstream consumers (under npm's install model, a patch applied by one
dependency's postinstall does not reach the copy of @matrixai/quic in
the consumer's tree). The patch file DOES ship in our tarball
(patches/ added to files[]) so consumers can copy it and apply it
themselves until the upstream release lands. Documented clearly in
README + CHANGELOG alpha.5.

Verification
------------
- tsc --noEmit (src + tests): clean
- eslint . --ext ts: clean
- vitest run test/unit: 83/83 passing
- vitest run test/integration (TPU_INTEGRATION=1):
    "sends and confirms a transfer via TPU" — PASSES.
    End-to-end path: mint payer, airdrop, build signed transfer,
    submit via TPU, poll getSignatureStatuses, observe landing
    at 'processed' commitment.
- smoke:firedancer (mainnet-beta live):
    Pre-patch: 0/3 Agave sends succeeded (all StreamLimit),
               3/3 Frankendancer succeeded.
    Post-patch: 2/3 Agave succeeded (1 unrelated network timeout
                on a non-leader), 3/3 Frankendancer succeeded.
                Includes successful sends to actively-leading
                Agave validators during the probe window.
- npm audit: 0 vulnerabilities
- npm pack --dry-run: 54 files, 2.0.0-alpha.5.tgz, includes
  patches/ directory so manual application is possible.

Changes
-------
- patches/@MatrixAI+quic+2.0.9.patch (new, checked in).
- package.json: patch-package + postinstall-postinstall added as
  devDeps; "postinstall": "patch-package" in scripts; patches/ added
  to files[].
- test/integration/validator.test.ts: fanoutSlots: 1 (single
  validator = per-IP rate limit triggers on 4 parallel conns); polls
  getSignatureStatuses after send instead of using
  sendAndConfirmTpuTransactionFactory (test-validator's fast slot
  advance races blockhash expiry); retries send up to 20 s to absorb
  unstaked-QoS drops.
- README Staked QoS section: honest disclosure of the bug, the fix,
  and the upstream PR status.
- CHANGELOG alpha.5: full context — root cause, both fix paths,
  honest limitations of patch-package for library authors.
- package.json version bumped to 2.0.0-alpha.5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lmvdz added a commit to lmvdz/solana-tpu-client that referenced this pull request Apr 18, 2026
"npm install tpu-client" now Just Works — unstaked or staked client,
no patch-package setup, no copied patches, no manual steps.

How
---
- @matrixai/quic dependency moved to a github: URL pointing at our
  fork's release branch:
      "@matrixai/quic": "github:lmvdz/js-quic#release/tpu-fix"
  The branch contains @matrixai/quic@2.0.9 with dist/QUICStream.js
  already patched to handle peers that advertise
  initial_max_streams_uni: 0 (Agave's unstaked TPU-QUIC path) and
  grant credit via post-handshake MAX_STREAMS frames. Version renamed
  to 2.0.9-tpu-fix.0 so `npm ls` shows the provenance.
- Fork branch also has build scripts stripped (dist/ is pre-built;
  tsc on install would fail because this branch deliberately ships
  no src/) so install is just a filesystem extract.
- npm "overrides" entry forces every transitive @matrixai/quic
  resolution onto the fork too, preventing any downstream dep from
  smuggling in the buggy registry version.
- Native binaries (@matrixai/quic-linux-x64, -darwin-arm64,
  -darwin-x64, -darwin-universal, -win32-x64) continue to resolve
  from npm via optionalDependencies. No Rust toolchain needed on the
  consumer side — our patch is to the TypeScript-side JS wrapper
  only, the Rust core is untouched.

Removed
-------
- patch-package + postinstall-postinstall devDeps.
- "postinstall": "patch-package" script.
- patches/@MatrixAI+quic+2.0.9.patch file.
- patches/ from package.json files[].

The fix now lives in the fork's dist/ directly. patch-package was
only useful for our own dev-loop anyway (npm's install model
prevented it from patching downstream consumers' trees), and the
fork approach replaces it with something that actually reaches users.

Verified (clean install from scratch)
-------------------------------------
- `rm -rf node_modules package-lock.json && npm install`
  → @matrixai/quic resolves to
    git+ssh://git@github.com/lmvdz/js-quic.git#b538c57... @ 2.0.9-tpu-fix.0
  → patch markers present in dist/QUICStream.js (grep == 2)
  → native binary @matrixai/quic-linux-x64 installed from npm
- tsc --noEmit (src + tests): clean
- eslint: clean
- vitest run test/unit: 83/83
- TPU_INTEGRATION=1 vitest run test/integration: 1/1
  (real transaction lands via TPU-QUIC on solana-test-validator)
- npm audit: 0 vulnerabilities
- npm pack --dry-run: 53 files, tpu-client-2.0.0-alpha.6.tgz

Upstream PR: MatrixAI/js-quic#157
Once merged + released, we drop the override and return to the
canonical @matrixai/quic package.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>