
Fixed mmap prefetch for GPU offloading #2529

Merged

JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-fix-mmap-prefetch on Aug 7, 2023

Conversation

@JohannesGaessler (Contributor)

I've noticed that currently, when loading models to VRAM (with mmap), the entire model first gets loaded into RAM and only then copied from RAM to VRAM. This is not what I intended, since it will very likely cause problems when running models whose total size is larger than RAM. Instead, the loading should be lazy, so the model file only needs to be iterated over once (ideally with the CPU tensors being mlocked).

Looking through the code, I've noticed that when the prefetch size is > 0, the MAP_POPULATE flag gets set unconditionally; this flag eagerly loads the entire file into memory, and it seems I forgot to adjust the logic when I changed prefetch from bool to size_t. This PR fixes the logic so that the file is only eagerly loaded if the requested prefetch is at least as big as the file size.
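For illustration, here is a minimal sketch of the corrected condition, assuming a POSIX mmap-based loader roughly along the lines of the one in llama-util.h; the function `map_model` and its parameter names are hypothetical, not the PR's verbatim code:

```cpp
#include <sys/mman.h>
#include <algorithm>
#include <cstddef>

// Sketch only: map a model file read-only, eagerly populating pages
// only when the caller asked to prefetch at least the whole file.
void * map_model(int fd, size_t file_size, size_t prefetch) {
    int flags = MAP_SHARED;
#ifdef __linux__
    // Before the fix: `if (prefetch) flags |= MAP_POPULATE;` faulted in
    // the entire file whenever any nonzero prefetch was requested.
    if (prefetch >= file_size) {
        flags |= MAP_POPULATE;
    }
#endif
    void * addr = mmap(NULL, file_size, PROT_READ, flags, fd, 0);
    if (addr != MAP_FAILED && prefetch > 0) {
        // Ask the kernel to read ahead only the requested prefix.
        madvise(addr, std::min(file_size, prefetch), MADV_WILLNEED);
    }
    return addr;
}
```

In this sketch, a partial prefetch translates into MADV_WILLNEED on just the requested prefix, so pages beyond it are faulted in lazily as tensors are read, rather than the whole file being populated up front.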

Comment thread on llama-util.h (outdated)

@LostRuins (Collaborator) left a comment:

works for me

@JohannesGaessler merged commit 3d9a551 into ggml-org:master on Aug 7, 2023
