Fixed mmap prefetch for GPU offloading by JohannesGaessler · Pull Request #2529 · ggml-org/llama.cpp

JohannesGaessler · 2023-08-06T08:27:26Z

I've noticed that currently, when loading models to VRAM (with mmap), the entire model first gets loaded into RAM and only then gets copied from RAM to VRAM. This is not how I intended it, since it will very likely cause problems when running models whose total size is larger than RAM. Instead, the loading should be lazy so that the model file only needs to be iterated over once (ideally with the CPU tensors being mlocked). Looking through the code, I noticed that when the prefetch size is > 0 the MAP_POPULATE flag gets set unconditionally. This flag eagerly loads the entire file into memory, and it seems I forgot to adjust the logic when I changed prefetch from bool to size_t. This PR fixes the logic so that the file is only eagerly loaded if the requested prefetch is at least as big as the file size.
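For readers who want to see the condition described above in isolation, here is a minimal sketch. It is not the actual diff from this PR: the helper names `mmap_flags_for_prefetch` and `map_model_readonly` and the constant `kPopulateFlag` are hypothetical and exist only for illustration. The sketch assumes a POSIX system; `MAP_POPULATE` is Linux-specific, so on other platforms it falls back to no extra flag.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// MAP_POPULATE only exists on Linux; elsewhere this sketch adds no extra flag.
#ifdef __linux__
constexpr int kPopulateFlag = MAP_POPULATE;
#else
constexpr int kPopulateFlag = 0;
#endif

// Hypothetical helper: choose the mmap flags for a requested prefetch size.
// Eager population is requested only if the prefetch covers the whole file.
// A smaller prefetch (including 0) leaves the mapping lazy, so pages are
// read from disk only when they are first touched.
inline int mmap_flags_for_prefetch(std::size_t prefetch, std::size_t file_size) {
    int flags = MAP_SHARED;
    if (prefetch >= file_size) {
        flags |= kPopulateFlag;
    }
    return flags;
}

// Hypothetical helper: map a model file read-only using the flags above.
// Returns MAP_FAILED on error, exactly as mmap does.
inline void * map_model_readonly(int fd, std::size_t file_size, std::size_t prefetch) {
    assert(file_size > 0); // mmap rejects zero-length mappings
    return mmap(nullptr, file_size, PROT_READ,
                mmap_flags_for_prefetch(prefetch, file_size), fd, 0);
}
```

With this shape, passing the maximum `size_t` value as the prefetch keeps the old eager behavior for CPU-only loading, while a partial prefetch (as used when part of the model goes to VRAM) no longer pulls the whole file into RAM up front.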

LostRuins

works for me

ggerganov approved these changes Aug 6, 2023

slaren reviewed Aug 6, 2023

Comment thread llama-util.h Outdated

Fixed mmap prefetch for GPU offloading

d9024df

JohannesGaessler force-pushed the cuda-fix-mmap-prefetch branch from 9648436 to d9024df Compare August 6, 2023 18:28

Nexesenex mentioned this pull request Aug 6, 2023

CUDA: faster non k-quant mul_mat_q kernels #2483

Merged

slaren approved these changes Aug 6, 2023

LostRuins approved these changes Aug 7, 2023

JohannesGaessler merged commit 3d9a551 into ggml-org:master Aug 7, 2023

Nexesenex mentioned this pull request Aug 7, 2023

Problem of excessive VRAM occupation growth as the context grows when using CUBLAS. LostRuins/koboldcpp#374

Open

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

Fixed mmap prefetch for GPU offloading (ggml-org#2529)

26786fa

phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026

Fixed mmap prefetch for GPU offloading (ggml-org#2529)

19cf204

JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-fix-mmap-prefetch

4 participants