
perf(prefetch): release lock during fetch + bigger floor for mmap-friendly files #129

Open

XciD wants to merge 6 commits into main from perf/mmap-prefetch

Conversation


XciD (Member) commented Apr 25, 2026

Summary

Three stacked optimisations to fix the parallel-mmap stall pattern that
currently causes transformers PR #45547 to disable mmap on hf-mount entirely.

1. Release the per-file mutex during network I/O (2a0f1c1)

Previously the prefetch mutex was held across the await on the CAS
fetch, serialising all reads on the file. Now we lock briefly to plan +
take the stream, drop the lock, fetch, re-lock to store. Multiple
readers can have concurrent RangeDownloads in flight.
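A minimal sketch of the lock scoping, assuming a tokio mutex guards a simplified `PrefetchState`; the helper and field names here are illustrative, not the actual hf-mount API:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Simplified stand-in for the real per-file prefetch state.
struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
}

// Placeholder for the actual CAS RangeDownload.
async fn fetch_from_cas(_offset: u64, size: u64) -> Vec<u8> {
    vec![0u8; size as usize]
}

async fn read_range(state: Arc<Mutex<PrefetchState>>, offset: u64, size: u64) -> Vec<u8> {
    // Lock briefly: plan the fetch (and, in the real code, take the stream).
    let (fetch_offset, fetch_size) = {
        let _guard = state.lock().await;
        (offset, size)
    }; // guard dropped: the lock is NOT held across the network await

    // Network I/O with the lock released, so concurrent readers on the
    // same file can have their own RangeDownloads in flight.
    let bytes = fetch_from_cas(fetch_offset, fetch_size).await;

    // Re-lock only to store the result.
    let mut guard = state.lock().await;
    guard.buf_start = fetch_offset;
    guard.buf = bytes.clone();
    bytes
}
```

The key point is that the `await` on the CAS fetch sits entirely outside both lock scopes.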

2. Floor on RangeDownload fetch_size (2a0f1c1)

The kernel splits a userspace pread(4 MiB) into 32 separate FUSE
chunks of 128 KiB each. Without a floor, that becomes 32 separate CAS
round-trips per pread. New floors: 4 MiB by default, 32 MiB for
files that look mmap-friendly (.safetensors .bin .pt .gguf).
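A sketch of the floor rule under those sizes; only the constants and the extension list come from this description, the function names are made up:

```rust
const DEFAULT_FLOOR: u64 = 4 * 1024 * 1024;        // 4 MiB
const MMAP_FRIENDLY_FLOOR: u64 = 32 * 1024 * 1024; // 32 MiB

fn is_mmap_friendly(path: &str) -> bool {
    [".safetensors", ".bin", ".pt", ".gguf"]
        .iter()
        .any(|ext| path.ends_with(ext))
}

fn floored_fetch_size(path: &str, requested: u64) -> u64 {
    // A 4 MiB pread arrives as 32 separate 128 KiB FUSE chunks; the floor
    // turns each chunk's fetch into one large CAS range instead of 32 tiny ones.
    let floor = if is_mmap_friendly(path) { MMAP_FRIENDLY_FLOOR } else { DEFAULT_FLOOR };
    requested.max(floor)
}
```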

3. Tensor-aligned fetch_size for safetensors (839d496)

Parse the safetensors JSON header at open time, stash absolute offsets
of every tensor in PrefetchState. When the read offset falls inside a
known tensor, set fetch_size to the exact remainder of that tensor
(capped at 64 MiB, floored at the mmap-friendly value). Never
over-fetches into the next tensor; never under-fetches inside a big
one. Falls back gracefully if the header doesn't parse.
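A sketch of the sizing rule, assuming the header has already been parsed into absolute `(start, end)` ranges; how the 64 MiB cap, the mmap-friendly floor, and the fallback combine is simplified here:

```rust
const MAX_TENSOR_FETCH: u64 = 64 * 1024 * 1024; // 64 MiB cap per fetch

/// Absolute byte range of one tensor, taken from the safetensors header.
struct TensorRange {
    start: u64,
    end: u64,
}

fn tensor_aligned_fetch_size(tensors: &[TensorRange], offset: u64, floor: u64) -> u64 {
    match tensors.iter().find(|t| offset >= t.start && offset < t.end) {
        // Inside a known tensor: fetch its exact remainder (capped), so one
        // round-trip covers most of it without spilling into the next tensor.
        Some(t) => (t.end - offset).min(MAX_TENSOR_FETCH),
        // Header didn't parse, or the offset isn't inside any tensor:
        // fall back to the generic mmap-friendly floor.
        None => floor,
    }
}
```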

Header probe: a single bounded 1 MiB CAS request at open_lazy
(typical header is 50-200 KiB).
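The probe can be split along the safetensors framing (an 8-byte little-endian length prefix, then the JSON header, then tensor data); a sketch with illustrative names:

```rust
/// Split the 1 MiB open-time probe into (JSON header, leftover tensor-0 bytes).
/// Returns None if the header is larger than the probe and needs a follow-up fetch.
fn split_probe(probe: &[u8]) -> Option<(&[u8], &[u8])> {
    // safetensors framing: u64 little-endian header length, then the JSON header.
    let header_len = u64::from_le_bytes(probe.get(..8)?.try_into().ok()?) as usize;
    let data_start = 8usize.checked_add(header_len)?;
    if probe.len() < data_start {
        return None;
    }
    Some((&probe[8..data_start], &probe[data_start..]))
}
```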

Bench

m6i.xlarge, google/gemma-4-E2B-it, 8 threads × 30 random 4 MiB reads on a 9.6 GiB model.safetensors:

| workload | baseline (main) | lockless + mmap_hint | + tensor-aligned |
|----------|-----------------|----------------------|------------------|
| mmap, throughput | 7.5 MiB/s | 24.5 MiB/s | 17.4 MiB/s |
| mmap, max latency | 21 s | 6 s | 6 s |
| mmap, stalls > 15 s | 4 | 0 | 0 |
| pread, throughput | 5.1 MiB/s | 41.2 MiB/s | 56.5 MiB/s (×11) |
| pread, stalls > 15 s | 25 | 0 | 0 |

Sequential 1-thread reads still hit the forward-buffer fast path (no regression).

Notes

  • 250 unit tests pass. The three changes are independent and could be reverted individually.
  • The transformers _is_on_hf_mount workaround can stay as belt-and-suspenders, but in practice mmap loads no longer trigger pathological stalls.
  • The synthetic mmap workload is roughly flat between commit 2 and 3 (within variance) since random small reads jumping between tensors don't benefit from tensor-aligned over-fetch. The realistic loader pattern (sequential tensor scans, what pread approximates) is where commit 3 shines.
  • Reproducer scripts (stress_mmap.py, stress_pread.py) and full bench logs available on request.

XciD added 6 commits April 25, 2026 19:16

perf(prefetch): release lock during fetch + bigger floor for mmap-friendly files

Two combined optimizations to fix the parallel-mmap stall pattern that
causes transformers to disable mmap on hf-mount entirely:

1. **Release the per-file mutex during network I/O.** Previously the
   prefetch mutex was held across the await on the CAS fetch, serialising
   all reads on the file (sequential or random). Now we lock briefly
   to plan + take the stream, drop the lock, fetch, re-lock to store.
   Multiple readers can have concurrent RangeDownloads in flight.

2. **Floor on RangeDownload fetch_size** to absorb the kernel splitting
   a userspace pread into 128 KiB FUSE chunks. Without this, a 4 MiB
   pread becomes 32 separate CAS round-trips. The floor is 4 MiB by
   default and 32 MiB for files that look mmap-friendly (.safetensors
   .bin .pt .gguf), where transformers/torch loaders mmap and access
   the file with many small adjacent page reads inside each tensor.

Bench (m6i.xlarge, gemma-4-E2B-it model.safetensors, 9.6 GiB):

  | workload (8 threads, 30×4 MiB random) | before    | after      |
  |---------------------------------------|-----------|------------|
  | mmap, throughput                      | 7.5 MiB/s | 24.5 MiB/s |
  | mmap, max latency                     | 21 s      | 6 s        |
  | mmap, stalls > 15 s                   | 4         | 0          |
  | pread, throughput                     | 5.1 MiB/s | 41.2 MiB/s |
  | pread, stalls > 15 s                  | 25        | 0          |

The transformers `_is_on_hf_mount` workaround can stay as belt-and-
suspenders, but in practice mmap loads no longer trigger pathological
stalls.

Parse the safetensors JSON header at open time and stash the absolute
`(start, end)` offsets of every tensor in PrefetchState. In
`prepare_fetch`, when the read offset falls inside a known tensor, set
`fetch_size` to the exact remainder of that tensor (capped at 64 MiB,
floored at the generic mmap-friendly value). This:

  - never over-fetches across the next tensor boundary on small reads
  - never under-fetches inside a big tensor (one CAS round-trip covers
    most of it, kernel mmap pages all hit the forward buffer)
  - falls back gracefully if the header doesn't parse (random .bin
    files mistakenly named .safetensors, partial files, etc.)

Added a single bounded CAS request at `open_lazy` to read the header
(typical size 50-200 KiB; 1 MiB initial probe then a follow-up if
larger). Bench (m6i.xlarge, gemma-4-E2B-it, 8 threads × 30 random 4 MiB
preads on a 9.6 GiB safetensors):

  baseline (main)         5.1 MiB/s, 25 stalls > 15 s
  lockless + mmap_hint   41.2 MiB/s,  0 stalls
  + tensor-aligned       56.5 MiB/s,  0 stalls   (x11 vs main)

The synthetic mmap workload is roughly flat against the previous commit
(within run-to-run variance); the real win is on the realistic loader
pattern (sequential tensor scans), which the pread bench approximates.

The 1 MiB CAS request issued at open time to read the safetensors header
already pulls bytes 0..1 MiB. The first ~50-200 KiB of those are the
header itself (parsed and discarded), but the trailing portion is the
start of tensor 0 data. Instead of throwing it away, seed the prefetch
forward buffer with it.

Effect: the very first reads of a safetensors file hit the buffer with
zero added latency.

  - Header introspection (e.g. transformers reading tensor metadata
    before the actual tensor scan) is served from the buffer.
  - The first FUSE chunk of the first tensor read is also a hit.

Free win — these bytes are paid for by the layout probe regardless;
this just stops wasting them.
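A sketch of that seeding step; `ForwardBuffer` and the function name are illustrative, and `header_total_len` is assumed to include the 8-byte length prefix:

```rust
/// Simplified forward buffer: absolute start offset plus the bytes themselves.
struct ForwardBuffer {
    start: u64,
    data: Vec<u8>,
}

/// Keep whatever the header probe downloaded past the header (the start of
/// tensor 0) instead of discarding it along with the parsed header.
fn seed_from_probe(probe: Vec<u8>, header_total_len: usize) -> Option<ForwardBuffer> {
    if probe.len() <= header_total_len {
        return None; // the probe contained only (part of) the header
    }
    Some(ForwardBuffer {
        start: header_total_len as u64,           // first byte after the header
        data: probe[header_total_len..].to_vec(), // start of tensor 0's data
    })
}
```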
When a read crosses 75% of the current tensor (and the tensor is at
least 4 MiB — small tensors are usually configs, not scans), spawn a
background tokio task that pulls the start of the next tensor into a
'speculative' buffer slot on PrefetchState. The next read at that
tensor's offset matches and is promoted to the forward buffer with
zero added latency.

Heuristics (kept conservative to not waste bandwidth on random
workloads):

  - Trigger only past 75 % of the current tensor.
  - Skip if current tensor < 4 MiB (config-style, no real scan).
  - Skip if next tensor < 1 MiB.
  - Single in-flight slot per file, marked via `speculative_inflight`
    to avoid double-spawn from concurrent readers.
  - Cap each speculative fetch at MAX_TENSOR_FETCH (64 MiB).

The win is hidden by synthetic random benches (random offsets don't
match the scan heuristic) but should significantly cut tail latency
on real transformers loaders that iterate tensors in declaration
order: while the user copies/casts tensor i, tensor i+1 lands in the
buffer, and the next read returns instantly.
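A sketch of the trigger plus the single-slot claim under one lock; the thresholds come from the list above, while the state layout and the spawn body are illustrative (only `claim_speculative_prefetch` and `speculative_inflight` are names used in this branch):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

const MIN_CURRENT_TENSOR: u64 = 4 * 1024 * 1024;  // skip config-sized tensors
const MIN_NEXT_TENSOR: u64 = 1024 * 1024;         // skip tiny next tensors
const MAX_TENSOR_FETCH: u64 = 64 * 1024 * 1024;   // cap per speculative fetch

struct Tensor { start: u64, end: u64 }

struct PrefetchState {
    tensors: Vec<Tensor>,
    speculative_inflight: bool,
}

impl PrefetchState {
    /// Decide whether to speculate AND claim the single in-flight slot,
    /// all under one lock acquisition (no TOCTOU double-spawn).
    fn claim_speculative_prefetch(&mut self, read_end: u64) -> Option<(u64, u64)> {
        if self.speculative_inflight {
            return None;
        }
        let idx = self.tensors.iter().position(|t| read_end > t.start && read_end <= t.end)?;
        let cur = &self.tensors[idx];
        let next = self.tensors.get(idx + 1)?;
        let progress = read_end - cur.start;
        if cur.end - cur.start < MIN_CURRENT_TENSOR      // too small to be a real scan
            || progress * 4 < (cur.end - cur.start) * 3  // not yet past 75 %
            || next.end - next.start < MIN_NEXT_TENSOR   // next tensor too small
        {
            return None;
        }
        self.speculative_inflight = true; // claim the slot
        let len = (next.end - next.start).min(MAX_TENSOR_FETCH);
        Some((next.start, len))
    }
}

async fn maybe_speculate(state: Arc<Mutex<PrefetchState>>, read_end: u64) {
    let claim = state.lock().await.claim_speculative_prefetch(read_end);
    if let Some((offset, len)) = claim {
        tokio::spawn(async move {
            // Placeholder: the real task would fetch [offset, offset + len),
            // store it in the speculative slot and clear the inflight flag.
            let _ = (offset, len);
        });
    }
}
```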
Four bugs flagged by an external review of the previous commits in this
branch. Each is independent; minimal-diff fix per finding.

  1. Stream invalidation on non-contiguous overwrite (data corruption).
     With the lock-released-during-fetch pattern, a concurrent
     RangeDownload could overwrite the forward buffer while another
     reader's persistent stream survived in PrefetchState. The next
     ContinueStream read would return bytes from the wrong file offset.
     Fix: store_fetched() now drops the stream when offset !=
     old_buf_end. Test: store_fetched_invalidates_stream_on_non_contiguous_overwrite.

  2. TOCTOU between speculation choose and mark-inflight (double spawn,
     wasted CAS bandwidth). The previous code split 'pick next tensor'
     and 'mark inflight' across two lock acquisitions, so two concurrent
     readers could both spawn the same 64 MiB prefetch. Collapsed into a
     single locked method claim_speculative_prefetch() that atomically
     decides + claims. Test: claim_speculative_prefetch_marks_inflight.

  3. Mutating probe in speculative-promotion path (direct-io regression).
     try_serve_forward() drains consumed bytes in forward_only mode, so
     using it as a predicate before promote_speculative would burn the
     bytes that the second call wanted to return. Replaced the predicate
     with a non-mutating would_serve_speculative() check.

  4. Permissive safetensors header parser (security / correctness).
     parse_safetensors_layout() previously accepted offsets past EOF,
     overlapping tensors, and unchecked u64 arithmetic. Now: takes
     file_size, validates end <= file_size for every tensor, rejects
     overlaps, uses checked_add. Tests:
     safetensors_layout_rejects_offset_past_eof,
     safetensors_layout_rejects_overlapping_tensors,
     safetensors_layout_rejects_overflow_arithmetic,
     safetensors_layout_accepts_valid.

Total: 8 new unit tests, 258 pass. Bench unchanged within run-to-run
variance (pread 53 MiB/s, mmap 19 MiB/s, 0 stalls).
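The tightened checks from finding 4 roughly amount to the following (JSON decoding elided; the signature is an assumption, the rules follow the text above):

```rust
struct TensorRange { start: u64, end: u64 }

fn validate_layout(
    mut tensors: Vec<TensorRange>,
    data_offset: u64, // absolute offset where tensor data begins
    file_size: u64,
) -> Option<Vec<TensorRange>> {
    // Convert header-relative offsets to absolute ones with checked
    // arithmetic so a hostile header cannot overflow u64.
    for t in &mut tensors {
        t.start = t.start.checked_add(data_offset)?;
        t.end = t.end.checked_add(data_offset)?;
        // Every tensor must be well-formed and lie inside the file.
        if t.start > t.end || t.end > file_size {
            return None;
        }
    }
    // Reject overlapping tensors: after sorting by start, each tensor
    // must begin at or after the previous one's end.
    tensors.sort_by_key(|t| t.start);
    for pair in tensors.windows(2) {
        if pair[1].start < pair[0].end {
            return None;
        }
    }
    Some(tensors)
}
```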
Codex's follow-up review found my previous 'invalidate stream when
offset != old_buf_end' check incomplete: it misses the concurrent
exact-contiguous race.

Scenario: fetch A is mid-flight reading from a stream at X. Fetch B
(also picked up after a buffer drain) finishes first, stores a buffer
ending at X with its own stream installed at X. Fetch A then stores
[X..X+n) — the offset coincidentally matches old_buf_end, so the guard
skipped the invalidation, and A's stream_to_return was discarded
because state.stream was already Some. Next ContinueStream read used
B's stream at X, returning duplicate bytes.

Fix: store_fetched now takes the produced stream as a parameter and
ALWAYS installs it. Buffer and stream are paired atomically: whoever
writes last installs the stream that matches the buffer they just
produced. Range-style fetches pass None and clear any stale stream.
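A sketch of that pairing rule; the stream trait and the field layout are illustrative:

```rust
trait ByteStream: Send {}

struct PrefetchState {
    buf_start: u64,
    buf: Vec<u8>,
    stream: Option<Box<dyn ByteStream>>,
}

impl PrefetchState {
    fn store_fetched(&mut self, offset: u64, bytes: Vec<u8>, stream: Option<Box<dyn ByteStream>>) {
        self.buf_start = offset;
        self.buf = bytes;
        // Unconditionally replace the stream: even when `offset` happens to
        // equal the old buffer end, a stream left behind by a concurrent fetch
        // would continue from the wrong position, so buffer and stream are
        // always written as a pair (None for range-style fetches).
        self.stream = stream;
    }
}
```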

The ProbeStream-based test verifies the contiguous case actually
installs the new stream (would_serve_speculative-style identity check
via call counters, since Box<dyn Trait> can't be compared).

258 tests pass.

XciD marked this pull request as ready for review April 25, 2026 19:08