refactor mlock/mmap/directio into load-mode #20834
taronaeo wants to merge 33 commits into ggml-org:master
Conversation
I think this makes sense. The modes are mutually exclusive, so just flags creates a lot of overlap or impossible configurations. mlock means mmap+mlock, right? Opinions from other maintainers?
Yep, I've updated the PR description to showcase the tests that ensure feature parity. The latest push also includes some bugfixes I found while doing the feature parity check, and updated all the documentation to the latest with

@ggml-org/maintainers What do you think about this change?
Conceptually fine I'd say.
I think in the long-run it makes sense to implement some kind of "backend-specific" loading functionality instead of keeping the modes mutually exclusive (i.e. the CPU backend can use
I've just pushed changes to rebase; PTAL again. RE the unrelated changes in this PR, these are artifacts from running
Yep, the deprecated flags all still work as per normal.

For llama-cli / llama-completion:

```shell
$ build/bin/llama-completion -m ~/Documents/hf_models/deepseek-r1-distill-qwen-1.5b-bf16.gguf --mmap -lm none 2>&1 | grep mmap
DEPRECATED: --mmap and --no-mmap are deprecated. use --load-mode mmap instead
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
```

For llama-bench:

```shell
$ build/bin/llama-bench -m ~/Documents/hf_models/deepseek-r1-distill-qwen-1.5b-bf16.gguf --mmap 1 -lm none
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.012 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB
DEPRECATED: -mmp and --mmap are deprecated. use --load-mode mmap instead
```
| model | size | params | backend | threads | lm | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ----: | --------------: | -------------------: |
| qwen2 1.5B BF16 | 3.31 GiB | 1.78 B | MTL,BLAS | 8 | mmap | pp512 | 1217.39 ± 26.36 |
| qwen2 1.5B BF16 | 3.31 GiB | 1.78 B | MTL,BLAS | 8 | mmap | tg128 | 49.02 ± 0.91 |
| qwen2 1.5B BF16 | 3.31 GiB | 1.78 B | MTL,BLAS | 8 | none | pp512 | 1231.30 ± 3.41 |
| qwen2 1.5B BF16 | 3.31 GiB | 1.78 B | MTL,BLAS | 8 | none | tg128 | 48.87 ± 0.18 |
I agree. Conceptually I think it should go like this:
Maybe in another discussion/PR? :) Edit: Addressed the usage of deprecated flags.
We've previously tried defaulting to DirectIO and this was a bad idea. It seems to fail on a not insignificant number of configurations out there.
This is a very rare configuration. I think it makes sense to disable mmap by default, or at least if a GPU is detected, but probably not DirectIO.
Overall this looks good; I tried it out locally as well. There are some unrelated diffs which should be fixed. Also, you need to change this log line and probably the surrounding code:
Lines 2969 to 2970 in e21cdc1
I think my reply got buried in the PR history, but to quote my previous comments,
Nevertheless, I've manually gone through the documentation changes made by

I've updated it to log this way. Let me know if this is the desired log message.
What I meant was: Lines 9303 to 9305 in 006809f
```cpp
/*.load_mode                   =*/ LLAMA_LOAD_MODE_MMAP,
/*.main_gpu                    =*/ 0,
/*.tensor_split                =*/ nullptr,
/*.progress_callback           =*/ nullptr,
/*.progress_callback_user_data =*/ nullptr,
/*.kv_overrides                =*/ nullptr,
/*.vocab_only                  =*/ false,
/*.use_mmap                    =*/ true,
/*.use_direct_io               =*/ false,
/*.use_mlock                   =*/ false,
```
What I meant was that `llama_model_params` hasn't gone through this change. I think we intend to keep that as-is?
It's changed here. I don't think it's beneficial to keep the respective parameters available, as the code has been refactored to use `load_mode` instead.
JohannesGaessler left a comment
Due to the changes in llama-bench.cpp it is also necessary to change scripts/compare-llama-bench.py.
Thanks! I forgot about that. The output for `compare-llama-bench.py`:

| Model | Load mode | Test | t/s master | t/s master | Speedup |
|:----------------|:------------------|:-------|-------------:|-------------:|----------:|
| qwen2 1.5B BF16 | mmap | pp512 | 1232.05 | 1232.05 | 1.00 |
| qwen2 1.5B BF16 | mmap | tg128 | 49.37 | 49.37 | 1.00 |
| qwen2 1.5B BF16 | none | pp512 | 1233.24 | 1233.24 | 1.00 |
| qwen2 1.5B BF16 | none | tg128 | 49.09 | 49.09 | 1.00 |
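One hypothetical way such a comparison script could normalise the new column while staying compatible with results from older builds (illustrative only; field names are assumptions, not the actual `compare-llama-bench.py` code):

```python
# Hypothetical sketch: derive a "load mode" label from a benchmark row,
# falling back to the legacy boolean field for results from old builds.
def load_mode_label(row: dict) -> str:
    if "lm" in row:  # assumed name of the new llama-bench field
        return str(row["lm"])
    # old results only carried a use_mmap boolean (mmap was the default)
    return "mmap" if row.get("use_mmap", True) else "none"

print(load_mode_label({"lm": "mmap"}))       # mmap
print(load_mode_label({"use_mmap": False}))  # none
```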
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
I had considered this while writing this refactor. IMO it would not be the best option here, since mmap/mlock are mutually exclusive with direct-io. If we were to implement this as bit flags, I think we would run into the same problem we are trying to fix in this refactor, albeit now in a bit mask instead of separate feature flags. I still think an enum is more appropriate in this scenario. Let me know what you think.
I have just rebased this PR with
Ref: #20211 (comment)
Obsoletes: #20461
This PR overhauls the three separate loading modes (mlock, mmap, and direct-io) into a single `-lm`/`--load-mode` option to simplify the logic. While working on #20461, I realised that it became quite complex to maintain multiple loading modes when they are mutually exclusive of one another. This PR solves that by allowing only one loading mode to exist at a time.

**Flags**

`--mlock`, `--mmap`, `--direct-io` and their negative flags have been marked as deprecated, with help messages informing the user to use the new `--load-mode`.

**Verification**
To verify that this refactor did not break any existing codepaths, I have added the following debug statements to verify that the corresponding system calls are registered correctly.
Click to expand patch file for codepath verification
- `--mmap` or `--load-mode mmap`: `$ build/bin/llama-completion -hf ibm-granite/granite-3.3-2b-instruct-GGUF:Q4_K_M -n 15 --seed 42 --temp 0 -p "Sing me a birthday song" -no-cnv --mmap` (outputs compared between `upstream/master` and `pr/20834`)
- `--mlock` or `--load-mode mlock`: `$ build/bin/llama-completion -hf ibm-granite/granite-3.3-2b-instruct-GGUF:Q4_K_M -n 15 --seed 42 --temp 0 -p "Sing me a birthday song" -no-cnv --mlock` (outputs compared between `upstream/master` and `pr/20834`)
- `--direct-io` or `--load-mode dio`: `$ build/bin/llama-completion -hf ibm-granite/granite-3.3-2b-instruct-GGUF:Q4_K_M -n 15 --seed 42 --temp 0 -p "Sing me a birthday song" -no-cnv --direct-io` (outputs compared between `upstream/master` and `pr/20834`)

Responsible AI Disclosure: AI was used to write debugger code for `llama-mmap.cpp` to ensure feature parity between `upstream/master` and this refactor. AI was also used to identify affected lines within the refactor, but changes were made by a human.