perf(prefetch): release lock during fetch + bigger floor for mmap-friendly files #129
Open
Conversation
perf(prefetch): release lock during fetch + bigger floor for mmap-friendly files

Two combined optimizations to fix the parallel-mmap stall pattern that causes transformers to disable mmap on hf-mount entirely:

1. **Release the per-file mutex during network I/O.** Previously the prefetch mutex was held across the await on the CAS fetch, serialising all reads on the file (sequential or random). Now we lock briefly to plan + take the stream, drop the lock, fetch, re-lock to store. Multiple readers can have concurrent RangeDownloads in flight.
2. **Floor on RangeDownload fetch_size** to absorb the kernel splitting a userspace pread into 128 KiB FUSE chunks. Without this, a 4 MiB pread becomes 32 separate CAS round-trips. The floor is 4 MiB by default and 32 MiB for files that look mmap-friendly (.safetensors, .bin, .pt, .gguf), where transformers/torch loaders mmap and access the file with many small adjacent page reads inside each tensor.

Bench (m6i.xlarge, gemma-4-E2B-it model.safetensors, 9.6 GiB):

| workload (8 threads, 30×4 MiB random) | before | after |
|---------------------------------------|--------|-------|
| mmap, throughput (MiB/s)              | 7.5    | 24.5  |
| mmap, max latency                     | 21 s   | 6 s   |
| mmap, stalls > 15 s                   | 4      | 0     |
| pread, throughput (MiB/s)             | 5.1    | 41.2  |
| pread, stalls > 15 s                  | 25     | 0     |

The transformers `_is_on_hf_mount` workaround can stay as belt-and-suspenders, but in practice mmap loads no longer trigger pathological stalls.
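Roughly, the two changes combine like the following minimal sketch. All names here (`plan_fetch`, `fetch_range`, the stripped-down `PrefetchState`) are hypothetical stand-ins for the real code, not its API:

```rust
use tokio::sync::Mutex;

// Hypothetical, simplified state; the real PrefetchState also tracks
// streams, tensor layout, etc.
struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
}

struct FetchPlan {
    offset: u64,
    size: u64,
}

impl PrefetchState {
    // Called under the lock: decide what to fetch, applying the floor.
    fn plan_fetch(&self, offset: u64, len: u64, path: &str) -> FetchPlan {
        const DEFAULT_FLOOR: u64 = 4 << 20; // 4 MiB
        const MMAP_FLOOR: u64 = 32 << 20;   // 32 MiB for mmap-friendly files
        let mmap_friendly = [".safetensors", ".bin", ".pt", ".gguf"]
            .iter()
            .any(|ext| path.ends_with(ext));
        let floor = if mmap_friendly { MMAP_FLOOR } else { DEFAULT_FLOOR };
        FetchPlan { offset, size: len.max(floor) }
    }

    // Called under the lock: publish the fetched bytes.
    fn store_fetched(&mut self, offset: u64, bytes: Vec<u8>) {
        self.buf_start = offset;
        self.buf = bytes;
    }
}

// Stand-in for the CAS RangeDownload.
async fn fetch_range(plan: &FetchPlan) -> Vec<u8> {
    vec![0u8; plan.size as usize]
}

async fn read_range(
    state: &Mutex<PrefetchState>,
    path: &str,
    offset: u64,
    len: u64,
) -> Vec<u8> {
    // 1. Lock briefly to plan; the guard drops at the end of this block,
    //    so the mutex is NOT held across the network await below.
    let plan = { state.lock().await.plan_fetch(offset, len, path) };
    // 2. Network I/O with no lock held: other readers plan and fetch
    //    their own ranges concurrently.
    let bytes = fetch_range(&plan).await;
    // 3. Re-lock briefly to store the result.
    state.lock().await.store_fetched(plan.offset, bytes.clone());
    bytes
}
```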
Parse the safetensors JSON header at open time and stash the absolute
`(start, end)` offsets of every tensor in PrefetchState. In
`prepare_fetch`, when the read offset falls inside a known tensor, set
`fetch_size` to the exact remainder of that tensor (capped at 64 MiB,
floored at the generic mmap-friendly value; see the sketch after this list). This:
- never over-fetches across the next tensor boundary on small reads
- never under-fetches inside a big tensor (one CAS round-trip covers
most of it, kernel mmap pages all hit the forward buffer)
- falls back gracefully if the header doesn't parse (random .bin
files mistakenly named .safetensors, partial files, etc.)
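In code, the sizing rule could look roughly like this (illustrative only: names and the exact cap/floor interplay in the real `prepare_fetch` may differ; returning `None` is the graceful fallback to the generic floor path):

```rust
/// `tensors` holds the absolute (start, end) offsets stashed in
/// PrefetchState at open time, sorted by start. Hypothetical sketch.
fn tensor_aligned_fetch_size(tensors: &[(u64, u64)], offset: u64) -> Option<u64> {
    const MAX_TENSOR_FETCH: u64 = 64 << 20; // 64 MiB cap
    // Find the tensor whose [start, end) range contains the read offset.
    let (_, end) = tensors
        .iter()
        .copied()
        .find(|&(start, end)| offset >= start && offset < end)?;
    // Fetch exactly up to the tensor's end: one round-trip covers most of
    // a big tensor, and a read near the end never crosses the boundary
    // into the next tensor.
    Some((end - offset).min(MAX_TENSOR_FETCH))
}
```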
Added a single bounded CAS request at `open_lazy` to read the header
(typical size 50-200 KiB; 1 MiB initial probe then a follow-up if
larger). Bench (m6i.xlarge, gemma-4-E2B-it, 8 threads × 30 random 4 MiB
preads on a 9.6 GiB safetensors):
| config               | pread throughput          | stalls > 15 s |
|----------------------|---------------------------|---------------|
| baseline (main)      | 5.1 MiB/s                 | 25            |
| lockless + mmap_hint | 41.2 MiB/s                | 0             |
| + tensor-aligned     | 56.5 MiB/s (≈11× vs main) | 0             |
The synthetic mmap workload is roughly flat against the previous commit
(within run-to-run variance); the real win is on the realistic loader
pattern (sequential tensor scans), which the pread bench approximates.
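The open-time probe could be sketched as below. The 8-byte little-endian length prefix followed by a JSON header is the documented safetensors layout; `probe_header` and `cas_read` are hypothetical stand-ins for the real bounded CAS request:

```rust
/// Read and return the safetensors JSON header bytes, or None if the
/// file can't be a valid safetensors file. Hypothetical sketch.
async fn probe_header(file_size: u64) -> Option<Vec<u8>> {
    const PROBE: u64 = 1 << 20; // 1 MiB initial probe
    let first = cas_read(0, PROBE.min(file_size)).await?;
    if first.len() < 8 {
        return None;
    }
    // safetensors: u64 little-endian header length, then the JSON header.
    let header_len = u64::from_le_bytes(first[..8].try_into().ok()?);
    let total = 8u64.checked_add(header_len)?;
    if total > file_size {
        return None; // header claims to extend past EOF
    }
    if total <= first.len() as u64 {
        return Some(first[8..total as usize].to_vec());
    }
    // Header larger than the probe (rare): one bounded follow-up request.
    let rest = cas_read(first.len() as u64, total - first.len() as u64).await?;
    let mut json = first[8..].to_vec();
    json.extend_from_slice(&rest);
    Some(json)
}

// Stand-in for the bounded CAS range request.
async fn cas_read(_offset: u64, len: u64) -> Option<Vec<u8>> {
    Some(vec![0u8; len as usize])
}
```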
The 1 MiB CAS request issued at open time to read the safetensors header
already pulls bytes 0..1 MiB. The first ~50-200 KiB of those are the
header itself (parsed and discarded), but the trailing portion is the
start of tensor 0 data. Instead of throwing it away, seed the prefetch
forward buffer with it.
Effect: the very first reads of a safetensors file hit the buffer with
zero added latency.
- Header introspection (e.g. transformers reading tensor metadata
before the actual tensor scan) is served from the buffer.
- The first FUSE chunk of the first tensor read is also a hit.
Free win — these bytes are paid for by the layout probe regardless;
this just stops wasting them.
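The seeding step itself is small; a sketch, reusing the simplified `PrefetchState` shape from the earlier sketch (hypothetical names, the real buffer bookkeeping is richer):

```rust
struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
}

/// After parsing the header out of the open-time probe, keep the trailing
/// bytes: they are the start of tensor 0's data and are already paid for.
fn seed_from_probe(state: &mut PrefetchState, probe: Vec<u8>, header_end: usize) {
    if probe.len() > header_end {
        // The forward buffer now starts exactly where the header ends, so
        // the first reads of the file (metadata introspection, the first
        // FUSE chunk of tensor 0) are served with zero added latency.
        state.buf_start = header_end as u64;
        state.buf = probe[header_end..].to_vec();
    }
}
```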
When a read crosses 75% of the current tensor (and the tensor is at
least 4 MiB — small tensors are usually configs, not scans), spawn a
background tokio task that pulls the start of the next tensor into a
'speculative' buffer slot on PrefetchState. The next read at that
tensor's offset matches and is promoted to the forward buffer with
zero added latency.
Heuristics (kept conservative to not waste bandwidth on random
workloads):
- Trigger only past 75% of the current tensor.
- Skip if current tensor < 4 MiB (config-style, no real scan).
- Skip if next tensor < 1 MiB.
- Single in-flight slot per file, marked via `speculative_inflight`
to avoid double-spawn from concurrent readers.
- Cap each speculative fetch at MAX_TENSOR_FETCH (64 MiB).
The win is hidden by synthetic random benches (random offsets don't
match the scan heuristic) but should significantly cut tail latency
on real transformers loaders that iterate tensors in declaration
order: while the user copies/casts tensor i, tensor i+1 lands in the
buffer, and the next read returns instantly.
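Folded into a single locked decide-and-claim, the heuristics might compose like this sketch (hypothetical simplification; the later fix-up commit names the real method `claim_speculative_prefetch`):

```rust
struct SpecState {
    speculative_inflight: bool, // single in-flight slot per file
}

/// Decide and claim in one call while the per-file lock is held, so two
/// concurrent readers cannot both spawn the same speculative fetch.
fn claim_speculative_prefetch(
    spec: &mut SpecState,
    cur: (u64, u64),  // current tensor (start, end), absolute offsets
    next: (u64, u64), // next tensor (start, end)
    read_end: u64,    // absolute end offset of the read just served
) -> bool {
    const MIN_CURRENT: u64 = 4 << 20; // skip config-style tensors
    const MIN_NEXT: u64 = 1 << 20;
    let (cur_len, next_len) = (cur.1 - cur.0, next.1 - next.0);
    // Trigger only once the read has crossed 75% of the current tensor.
    let past_trigger = read_end.saturating_sub(cur.0) * 4 >= cur_len * 3;
    if spec.speculative_inflight
        || !past_trigger
        || cur_len < MIN_CURRENT
        || next_len < MIN_NEXT
    {
        return false;
    }
    spec.speculative_inflight = true; // claimed: caller spawns the tokio task
    true
}
```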
Four bugs flagged by an external review of the previous commits in this
branch. Each is independent; minimal-diff fix per finding.
1. Stream invalidation on non-contiguous overwrite (data corruption).
With the lock-released-during-fetch pattern, a concurrent
RangeDownload could overwrite the forward buffer while another
reader's persistent stream survived in PrefetchState. The next
ContinueStream read would return bytes from the wrong file offset.
Fix: store_fetched() now drops the stream when offset !=
old_buf_end. Test: store_fetched_invalidates_stream_on_non_contiguous_overwrite.
2. TOCTOU between speculation choose and mark-inflight (double spawn,
wasted CAS bandwidth). The previous code split 'pick next tensor'
and 'mark inflight' across two lock acquisitions, so two concurrent
readers could both spawn the same 64 MiB prefetch. Collapsed into a
single locked method claim_speculative_prefetch() that atomically
decides + claims. Test: claim_speculative_prefetch_marks_inflight.
3. Mutating probe in speculative-promotion path (direct-io regression).
try_serve_forward() drains consumed bytes in forward_only mode, so
using it as a predicate before promote_speculative would burn the
bytes that the second call wanted to return. Replaced the predicate
with a non-mutating would_serve_speculative() check.
4. Permissive safetensors header parser (security / correctness).
parse_safetensors_layout() previously accepted offsets past EOF,
overlapping tensors, and unchecked u64 arithmetic. Now: takes
file_size, validates end <= file_size for every tensor, rejects
overlaps, uses checked_add (sketched after this list). Tests:
safetensors_layout_rejects_offset_past_eof,
safetensors_layout_rejects_overlapping_tensors,
safetensors_layout_rejects_overflow_arithmetic,
safetensors_layout_accepts_valid.
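A simplified sketch of the finding-4 hardening (the real parse_safetensors_layout also does the JSON parsing; the signature here is illustrative):

```rust
/// Validate header-relative tensor ranges and rebase them to absolute
/// file offsets: sorted, non-overlapping, overflow-checked, inside EOF.
fn validate_layout(
    mut tensors: Vec<(u64, u64)>, // header-relative (start, end) per tensor
    data_start: u64,              // absolute offset where tensor data begins
    file_size: u64,
) -> Option<Vec<(u64, u64)>> {
    tensors.sort_by_key(|&(start, _)| start);
    let mut prev_end = 0u64;
    for (start, end) in tensors.iter_mut() {
        if *start > *end || *start < prev_end {
            return None; // malformed range or overlapping tensors
        }
        prev_end = *end;
        // checked_add rejects u64 overflow while rebasing to absolute offsets.
        let abs_start = data_start.checked_add(*start)?;
        let abs_end = data_start.checked_add(*end)?;
        if abs_end > file_size {
            return None; // offset past EOF
        }
        (*start, *end) = (abs_start, abs_end);
    }
    Some(tensors)
}
```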
Total: 8 new unit tests, 258 pass. Bench unchanged within run-to-run
variance (pread 53 MiB/s, mmap 19 MiB/s, 0 stalls).
Codex's follow-up review found my previous 'invalidate stream when offset != old_buf_end' check incomplete: it misses the concurrent exact-contiguous race.

Scenario: fetch A is mid-flight reading from a stream at X. Fetch B (also picked up after a buffer drain) finishes first, stores a buffer ending at X with its own stream installed at X. Fetch A then stores [X..X+n): the offset coincidentally matches old_buf_end, so the guard skipped the invalidation, and A's stream_to_return was discarded because state.stream was already Some. The next ContinueStream read used B's stream at X, returning duplicate bytes.

Fix: store_fetched now takes the produced stream as a parameter and ALWAYS installs it. Buffer and stream are paired atomically: whoever writes last installs the stream that matches the buffer they just produced. Range-style fetches pass None and clear any stale stream.

The ProbeStream-based test verifies the contiguous case actually installs the new stream (a would_serve_speculative-style identity check via call counters, since Box<dyn Trait> can't be compared). 258 tests pass.
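A sketch of the hardened shape (stream type and surrounding state simplified; names mirror the commit message but the signature is illustrative):

```rust
use std::io::Read;

struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
    stream: Option<Box<dyn Read + Send>>,
}

impl PrefetchState {
    /// Buffer and stream are installed together, unconditionally: whoever
    /// stores last installs the stream matching the bytes they just
    /// produced, so a racing fetch can no longer leave another fetch's
    /// stale stream paired with a fresh buffer. Range-style fetches pass
    /// None, which clears any stale stream.
    fn store_fetched(
        &mut self,
        offset: u64,
        bytes: Vec<u8>,
        stream: Option<Box<dyn Read + Send>>,
    ) {
        self.buf_start = offset;
        self.buf = bytes;
        self.stream = stream;
    }
}
```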
Summary

Two stacked optimisations to fix the parallel-mmap stall pattern that currently causes transformers PR #45547 to disable mmap on hf-mount entirely.

1. Release the per-file mutex during network I/O (2a0f1c1)

Previously the prefetch mutex was held across the `await` on the CAS fetch, serialising all reads on the file. Now we lock briefly to plan + take the stream, drop the lock, fetch, re-lock to store. Multiple readers can have concurrent `RangeDownload`s in flight.

2. Floor on `RangeDownload` `fetch_size` (2a0f1c1)

The kernel splits a userspace `pread(4 MiB)` into 32 separate FUSE chunks of 128 KiB each. Without a floor, that becomes 32 separate CAS round-trips per pread. New floors: 4 MiB by default, 32 MiB for files that look mmap-friendly (`.safetensors`, `.bin`, `.pt`, `.gguf`).

3. Tensor-aligned `fetch_size` for safetensors (839d496)

Parse the safetensors JSON header at open time, stash absolute offsets of every tensor in `PrefetchState`. When the read offset falls inside a known tensor, set `fetch_size` to the exact remainder of that tensor (capped at 64 MiB, floored at the mmap-friendly value). Never over-fetches into the next tensor; never under-fetches inside a big one. Falls back gracefully if the header doesn't parse.

Header probe: a single bounded 1 MiB CAS request at `open_lazy` (typical header is 50-200 KiB).

Bench

m6i.xlarge, google/gemma-4-E2B-it, 8 threads × 30 random 4 MiB reads on a 9.6 GiB `model.safetensors`:

| config               | pread throughput          | stalls > 15 s |
|----------------------|---------------------------|---------------|
| baseline (main)      | 5.1 MiB/s                 | 25            |
| lockless + mmap_hint | 41.2 MiB/s                | 0             |
| + tensor-aligned     | 56.5 MiB/s (≈11× vs main) | 0             |

Sequential 1-thread reads still hit the forward-buffer fast path (no regression).

Notes

- The transformers `_is_on_hf_mount` workaround can stay as belt-and-suspenders, but in practice mmap loads no longer trigger pathological stalls.
- The mmap workload is roughly flat between commit 2 and 3 (within variance), since random small reads jumping between tensors don't benefit from tensor-aligned over-fetch. The realistic loader pattern (sequential tensor scans, what `pread` approximates) is where commit 3 shines.
- Stress scripts (`stress_mmap.py`, `stress_pread.py`) and full bench logs available on request.