
perf(prefetch): release lock during fetch + bigger floor for mmap-friendly files #129

Open

XciD wants to merge 6 commits into main from perf/mmap-prefetch

Conversation


XciD (Member) commented Apr 25, 2026

Summary

Three stacked optimisations to fix the parallel-mmap stall pattern that
currently causes transformers PR #45547 to disable mmap on hf-mount entirely.

1. Release the per-file mutex during network I/O (2a0f1c1)

Previously the prefetch mutex was held across the await on the CAS
fetch, serialising all reads on the file. Now we lock briefly to plan +
take the stream, drop the lock, fetch, re-lock to store. Multiple
readers can have concurrent RangeDownloads in flight.
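A minimal sketch of the lock scoping, assuming a tokio mutex guards a simplified `PrefetchState`; the helper and field names here are illustrative, not the actual hf-mount API:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Simplified stand-in for the real per-file prefetch state.
struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
}

// Placeholder for the actual CAS RangeDownload.
async fn fetch_from_cas(_offset: u64, size: u64) -> Vec<u8> {
    vec![0u8; size as usize]
}

async fn read_range(state: Arc<Mutex<PrefetchState>>, offset: u64, size: u64) -> Vec<u8> {
    // Lock briefly: plan the fetch (and, in the real code, take the stream).
    let (fetch_offset, fetch_size) = {
        let _guard = state.lock().await;
        (offset, size)
    }; // guard dropped: the lock is NOT held across the network await

    // Network I/O with the lock released, so concurrent readers on the
    // same file can have their own RangeDownloads in flight.
    let bytes = fetch_from_cas(fetch_offset, fetch_size).await;

    // Re-lock only to store the result.
    let mut guard = state.lock().await;
    guard.buf_start = fetch_offset;
    guard.buf = bytes.clone();
    bytes
}
```

The key point is that the `await` on the CAS fetch sits entirely outside both lock scopes.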

2. Floor on RangeDownload fetch_size (2a0f1c1)

The kernel splits a userspace pread(4 MiB) into 32 separate FUSE
chunks of 128 KiB each. Without a floor, that becomes 32 separate CAS
round-trips per pread. New floors: 4 MiB by default, 32 MiB for
files that look mmap-friendly (.safetensors .bin .pt .gguf).
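A sketch of the floor rule under those sizes; only the constants and the extension list come from this description, the function names are made up:

```rust
const DEFAULT_FLOOR: u64 = 4 * 1024 * 1024;        // 4 MiB
const MMAP_FRIENDLY_FLOOR: u64 = 32 * 1024 * 1024; // 32 MiB

fn is_mmap_friendly(path: &str) -> bool {
    [".safetensors", ".bin", ".pt", ".gguf"]
        .iter()
        .any(|ext| path.ends_with(ext))
}

fn floored_fetch_size(path: &str, requested: u64) -> u64 {
    // A 4 MiB pread arrives as 32 separate 128 KiB FUSE chunks; the floor
    // turns each chunk's fetch into one large CAS range instead of 32 tiny ones.
    let floor = if is_mmap_friendly(path) { MMAP_FRIENDLY_FLOOR } else { DEFAULT_FLOOR };
    requested.max(floor)
}
```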

3. Tensor-aligned fetch_size for safetensors (839d496)

Parse the safetensors JSON header at open time, stash absolute offsets
of every tensor in PrefetchState. When the read offset falls inside a
known tensor, set fetch_size to the exact remainder of that tensor
(capped at 64 MiB, floored at the mmap-friendly value). Never
over-fetches into the next tensor; never under-fetches inside a big
one. Falls back gracefully if the header doesn't parse.
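A sketch of the sizing rule, assuming the header has already been parsed into absolute `(start, end)` ranges; how the 64 MiB cap, the mmap-friendly floor, and the fallback combine is simplified here:

```rust
const MAX_TENSOR_FETCH: u64 = 64 * 1024 * 1024; // 64 MiB cap per fetch

/// Absolute byte range of one tensor, taken from the safetensors header.
struct TensorRange {
    start: u64,
    end: u64,
}

fn tensor_aligned_fetch_size(tensors: &[TensorRange], offset: u64, floor: u64) -> u64 {
    match tensors.iter().find(|t| offset >= t.start && offset < t.end) {
        // Inside a known tensor: fetch its exact remainder (capped), so one
        // round-trip covers most of it without spilling into the next tensor.
        Some(t) => (t.end - offset).min(MAX_TENSOR_FETCH),
        // Header didn't parse, or the offset isn't inside any tensor:
        // fall back to the generic mmap-friendly floor.
        None => floor,
    }
}
```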

Header probe: a single bounded 1 MiB CAS request at open_lazy
(typical header is 50-200 KiB).
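The probe can be split along the safetensors framing (an 8-byte little-endian length prefix, then the JSON header, then tensor data); a sketch with illustrative names:

```rust
/// Split the 1 MiB open-time probe into (JSON header, leftover tensor-0 bytes).
/// Returns None if the header is larger than the probe and needs a follow-up fetch.
fn split_probe(probe: &[u8]) -> Option<(&[u8], &[u8])> {
    // safetensors framing: u64 little-endian header length, then the JSON header.
    let header_len = u64::from_le_bytes(probe.get(..8)?.try_into().ok()?) as usize;
    let data_start = 8usize.checked_add(header_len)?;
    if probe.len() < data_start {
        return None;
    }
    Some((&probe[8..data_start], &probe[data_start..]))
}
```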

Bench

m6i.xlarge, google/gemma-4-E2B-it, 8 threads × 30 random 4 MiB reads on a 9.6 GiB model.safetensors:

| workload | baseline (main) | lockless + mmap_hint | + tensor-aligned |
|----------|-----------------|----------------------|------------------|
| mmap, throughput | 7.5 MiB/s | 24.5 MiB/s | 17.4 MiB/s |
| mmap, max latency | 21 s | 6 s | 6 s |
| mmap, stalls > 15 s | 4 | 0 | 0 |
| pread, throughput | 5.1 MiB/s | 41.2 MiB/s | 56.5 MiB/s (×11) |
| pread, stalls > 15 s | 25 | 0 | 0 |

Sequential 1-thread reads still hit the forward-buffer fast path (no regression).

Notes

  • 250 unit tests pass. The three changes are independent and could be reverted individually.
  • The transformers _is_on_hf_mount workaround can stay as belt-and-suspenders, but in practice mmap loads no longer trigger pathological stalls.
  • The synthetic mmap workload is roughly flat between commit 2 and 3 (within variance) since random small reads jumping between tensors don't benefit from tensor-aligned over-fetch. The realistic loader pattern (sequential tensor scans, what pread approximates) is where commit 3 shines.
  • Reproducer scripts (stress_mmap.py, stress_pread.py) and full bench logs available on request.

XciD added 6 commits April 25, 2026 19:16

perf(prefetch): release lock during fetch + bigger floor for mmap-friendly files

Two combined optimizations to fix the parallel-mmap stall pattern that
causes transformers to disable mmap on hf-mount entirely:

1. **Release the per-file mutex during network I/O.** Previously the
   prefetch mutex was held across the await on the CAS fetch, serialising
   all reads on the file (sequential or random). Now we lock briefly
   to plan + take the stream, drop the lock, fetch, re-lock to store.
   Multiple readers can have concurrent RangeDownloads in flight.

2. **Floor on RangeDownload fetch_size** to absorb the kernel splitting
   a userspace pread into 128 KiB FUSE chunks. Without this, a 4 MiB
   pread becomes 32 separate CAS round-trips. The floor is 4 MiB by
   default and 32 MiB for files that look mmap-friendly (.safetensors
   .bin .pt .gguf), where transformers/torch loaders mmap and access
   the file with many small adjacent page reads inside each tensor.

Bench (m6i.xlarge, gemma-4-E2B-it model.safetensors, 9.6 GiB):

  | workload (8 threads, 30×4 MiB random) | before    | after      |
  |---------------------------------------|-----------|------------|
  | mmap, throughput                      | 7.5 MiB/s | 24.5 MiB/s |
  | mmap, max latency                     | 21 s      | 6 s        |
  | mmap, stalls > 15 s                   | 4         | 0          |
  | pread, throughput                     | 5.1 MiB/s | 41.2 MiB/s |
  | pread, stalls > 15 s                  | 25        | 0          |

The transformers `_is_on_hf_mount` workaround can stay as belt-and-
suspenders, but in practice mmap loads no longer trigger pathological
stalls.

Parse the safetensors JSON header at open time and stash the absolute
`(start, end)` offsets of every tensor in PrefetchState. In
`prepare_fetch`, when the read offset falls inside a known tensor, set
`fetch_size` to the exact remainder of that tensor (capped at 64 MiB,
floored at the generic mmap-friendly value). This:

  - never over-fetches across the next tensor boundary on small reads
  - never under-fetches inside a big tensor (one CAS round-trip covers
    most of it, kernel mmap pages all hit the forward buffer)
  - falls back gracefully if the header doesn't parse (random .bin
    files mistakenly named .safetensors, partial files, etc.)

Added a single bounded CAS request at `open_lazy` to read the header
(typical size 50-200 KiB; 1 MiB initial probe then a follow-up if
larger). Bench (m6i.xlarge, gemma-4-E2B-it, 8 threads × 30 random 4 MiB
preads on a 9.6 GiB safetensors):

  baseline (main)         5.1 MiB/s, 25 stalls > 15 s
  lockless + mmap_hint   41.2 MiB/s,  0 stalls
  + tensor-aligned       56.5 MiB/s,  0 stalls   (x11 vs main)

The synthetic mmap workload is roughly flat against the previous commit
(within run-to-run variance); the real win is on the realistic loader
pattern (sequential tensor scans), which the pread bench approximates.

The 1 MiB CAS request issued at open time to read the safetensors header
already pulls bytes 0..1 MiB. The first ~50-200 KiB of those are the
header itself (parsed and discarded), but the trailing portion is the
start of tensor 0 data. Instead of throwing it away, seed the prefetch
forward buffer with it.

Effect: the very first reads of a safetensors file hit the buffer with
zero added latency.

  - Header introspection (e.g. transformers reading tensor metadata
    before the actual tensor scan) is served from the buffer.
  - The first FUSE chunk of the first tensor read is also a hit.

Free win — these bytes are paid for by the layout probe regardless;
this just stops wasting them.
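A sketch of that seeding step; `ForwardBuffer` and the function name are illustrative, and `header_total_len` is assumed to include the 8-byte length prefix:

```rust
/// Simplified forward buffer: absolute start offset plus the bytes themselves.
struct ForwardBuffer {
    start: u64,
    data: Vec<u8>,
}

/// Keep whatever the header probe downloaded past the header (the start of
/// tensor 0) instead of discarding it along with the parsed header.
fn seed_from_probe(probe: Vec<u8>, header_total_len: usize) -> Option<ForwardBuffer> {
    if probe.len() <= header_total_len {
        return None; // the probe contained only (part of) the header
    }
    Some(ForwardBuffer {
        start: header_total_len as u64,           // first byte after the header
        data: probe[header_total_len..].to_vec(), // start of tensor 0's data
    })
}
```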
When a read crosses 75% of the current tensor (and the tensor is at
least 4 MiB — small tensors are usually configs, not scans), spawn a
background tokio task that pulls the start of the next tensor into a
'speculative' buffer slot on PrefetchState. The next read at that
tensor's offset matches and is promoted to the forward buffer with
zero added latency.

Heuristics (kept conservative to not waste bandwidth on random
workloads):

  - Trigger only past 75 % of the current tensor.
  - Skip if current tensor < 4 MiB (config-style, no real scan).
  - Skip if next tensor < 1 MiB.
  - Single in-flight slot per file, marked via `speculative_inflight`
    to avoid double-spawn from concurrent readers.
  - Cap each speculative fetch at MAX_TENSOR_FETCH (64 MiB).

The win is hidden by synthetic random benches (random offsets don't
match the scan heuristic) but should significantly cut tail latency
on real transformers loaders that iterate tensors in declaration
order: while the user copies/casts tensor i, tensor i+1 lands in the
buffer, and the next read returns instantly.
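A sketch of the trigger plus the single-slot claim under one lock; the thresholds come from the list above, while the state layout and the spawn body are illustrative (only `claim_speculative_prefetch` and `speculative_inflight` are names used in this branch):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

const MIN_CURRENT_TENSOR: u64 = 4 * 1024 * 1024;  // skip config-sized tensors
const MIN_NEXT_TENSOR: u64 = 1024 * 1024;         // skip tiny next tensors
const MAX_TENSOR_FETCH: u64 = 64 * 1024 * 1024;   // cap per speculative fetch

struct Tensor { start: u64, end: u64 }

struct PrefetchState {
    tensors: Vec<Tensor>,
    speculative_inflight: bool,
}

impl PrefetchState {
    /// Decide whether to speculate AND claim the single in-flight slot,
    /// all under one lock acquisition (no TOCTOU double-spawn).
    fn claim_speculative_prefetch(&mut self, read_end: u64) -> Option<(u64, u64)> {
        if self.speculative_inflight {
            return None;
        }
        let idx = self.tensors.iter().position(|t| read_end > t.start && read_end <= t.end)?;
        let cur = &self.tensors[idx];
        let next = self.tensors.get(idx + 1)?;
        let progress = read_end - cur.start;
        if cur.end - cur.start < MIN_CURRENT_TENSOR      // too small to be a real scan
            || progress * 4 < (cur.end - cur.start) * 3  // not yet past 75 %
            || next.end - next.start < MIN_NEXT_TENSOR   // next tensor too small
        {
            return None;
        }
        self.speculative_inflight = true; // claim the slot
        let len = (next.end - next.start).min(MAX_TENSOR_FETCH);
        Some((next.start, len))
    }
}

async fn maybe_speculate(state: Arc<Mutex<PrefetchState>>, read_end: u64) {
    let claim = state.lock().await.claim_speculative_prefetch(read_end);
    if let Some((offset, len)) = claim {
        tokio::spawn(async move {
            // Placeholder: the real task would fetch [offset, offset + len),
            // store it in the speculative slot and clear the inflight flag.
            let _ = (offset, len);
        });
    }
}
```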
Four bugs flagged by an external review of the previous commits in this
branch. Each is independent; minimal-diff fix per finding.

  1. Stream invalidation on non-contiguous overwrite (data corruption).
     With the lock-released-during-fetch pattern, a concurrent
     RangeDownload could overwrite the forward buffer while another
     reader's persistent stream survived in PrefetchState. The next
     ContinueStream read would return bytes from the wrong file offset.
     Fix: store_fetched() now drops the stream when offset !=
     old_buf_end. Test: store_fetched_invalidates_stream_on_non_contiguous_overwrite.

  2. TOCTOU between speculation choose and mark-inflight (double spawn,
     wasted CAS bandwidth). The previous code split 'pick next tensor'
     and 'mark inflight' across two lock acquisitions, so two concurrent
     readers could both spawn the same 64 MiB prefetch. Collapsed into a
     single locked method claim_speculative_prefetch() that atomically
     decides + claims. Test: claim_speculative_prefetch_marks_inflight.

  3. Mutating probe in speculative-promotion path (direct-io regression).
     try_serve_forward() drains consumed bytes in forward_only mode, so
     using it as a predicate before promote_speculative would burn the
     bytes that the second call wanted to return. Replaced the predicate
     with a non-mutating would_serve_speculative() check.

  4. Permissive safetensors header parser (security / correctness).
     parse_safetensors_layout() previously accepted offsets past EOF,
     overlapping tensors, and unchecked u64 arithmetic. Now: takes
     file_size, validates end <= file_size for every tensor, rejects
     overlaps, uses checked_add. Tests:
     safetensors_layout_rejects_offset_past_eof,
     safetensors_layout_rejects_overlapping_tensors,
     safetensors_layout_rejects_overflow_arithmetic,
     safetensors_layout_accepts_valid.

Total: 8 new unit tests, 258 pass. Bench unchanged within run-to-run
variance (pread 53 MiB/s, mmap 19 MiB/s, 0 stalls).
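The tightened checks from finding 4 roughly amount to the following (JSON decoding elided; the signature is an assumption, the rules follow the text above):

```rust
struct TensorRange { start: u64, end: u64 }

fn validate_layout(
    mut tensors: Vec<TensorRange>,
    data_offset: u64, // absolute offset where tensor data begins
    file_size: u64,
) -> Option<Vec<TensorRange>> {
    // Convert header-relative offsets to absolute ones with checked
    // arithmetic so a hostile header cannot overflow u64.
    for t in &mut tensors {
        t.start = t.start.checked_add(data_offset)?;
        t.end = t.end.checked_add(data_offset)?;
        // Every tensor must be well-formed and lie inside the file.
        if t.start > t.end || t.end > file_size {
            return None;
        }
    }
    // Reject overlapping tensors: after sorting by start, each tensor
    // must begin at or after the previous one's end.
    tensors.sort_by_key(|t| t.start);
    for pair in tensors.windows(2) {
        if pair[1].start < pair[0].end {
            return None;
        }
    }
    Some(tensors)
}
```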
Codex's follow-up review found my previous 'invalidate stream when
offset != old_buf_end' check incomplete: it misses the concurrent
exact-contiguous race.

Scenario: fetch A is mid-flight reading from a stream at X. Fetch B
(also picked up after a buffer drain) finishes first, stores a buffer
ending at X with its own stream installed at X. Fetch A then stores
[X..X+n) — the offset coincidentally matches old_buf_end, so the guard
skipped the invalidation, and A's stream_to_return was discarded
because state.stream was already Some. Next ContinueStream read used
B's stream at X, returning duplicate bytes.

Fix: store_fetched now takes the produced stream as a parameter and
ALWAYS installs it. Buffer and stream are paired atomically: whoever
writes last installs the stream that matches the buffer they just
produced. Range-style fetches pass None and clear any stale stream.
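A sketch of that pairing rule; the stream trait and the field layout are illustrative:

```rust
trait ByteStream: Send {}

struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
    stream: Option<Box<dyn ByteStream>>,
}

impl PrefetchState {
    fn store_fetched(&mut self, offset: u64, bytes: Vec<u8>, stream: Option<Box<dyn ByteStream>>) {
        self.buf_start = offset;
        self.buf = bytes;
        // Unconditionally replace the stream: even when `offset` happens to
        // equal the old buffer end, a stream left behind by a concurrent fetch
        // would continue from the wrong position, so buffer and stream are
        // always written as a pair (None for range-style fetches).
        self.stream = stream;
    }
}
```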

The ProbeStream-based test verifies the contiguous case actually
installs the new stream (would_serve_speculative-style identity check
via call counters, since Box<dyn Trait> can't be compared).

258 tests pass.

XciD marked this pull request as ready for review April 25, 2026 19:08