
CUDA: use 1 thread if model is fully offloaded #2915

Merged
ggerganov merged 1 commit into ggml-org:master from JohannesGaessler:cuda-n-threads-fix on Sep 21, 2023

Conversation

@JohannesGaessler
Contributor

Currently, when using the maximum possible number of GPU layers with CUDA, there is no benefit from more than 1 thread. In fact, using more than 1 thread is detrimental due to the increased overhead. This PR changes the logic for the default number of threads so that, unless the user manually overrides it, only a single thread is used when all layers are offloaded.

I also changed the logic for llama-bench to be the same as main: -1 is interpreted as the number of logical cores.
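As a rough sketch of the combined behavior (the function and parameter names below are illustrative, not taken from the actual diff), the effective thread count could be resolved like this:

```cpp
#include <thread>

// Hedged sketch of the defaults described above; not the PR's verbatim code.
// Assumes n_gpu_layers/n_layer counters in the usual llama.cpp style.
static int resolve_n_threads(int n_threads_cli, int n_gpu_layers, int n_layer) {
    if (n_threads_cli > 0) {
        return n_threads_cli; // an explicit user override always wins
    }
    if (n_gpu_layers >= n_layer) {
        return 1; // fully offloaded: extra CPU threads only add overhead
    }
    // -1 (the default) resolves to the number of logical cores, in both
    // main and llama-bench after this change
    return (int) std::thread::hardware_concurrency();
}
```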

@KerfuffleV2
Contributor

Maybe also update the help text for main?

-t N, --threads N     number of threads to use during computation (default: 8)

It might be good to mention the ability to set it to -1 or -2 (in both places) and what those values do.

@Ph0rk0z

Ph0rk0z commented Sep 1, 2023

Using 1 thread vs 15 does make it 0.2-0.5 tokens/s faster for me.

@ggerganov (Member) left a comment


Better to implement this entirely in llama_eval_internal, similar to what has been done for BLAS:

https://github.com/ggerganov/llama.cpp/blob/8b56b4f2c396eae1f4417e5a859557fed989e0ee/llama.cpp#L2899-L2904
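For reference, the linked BLAS special case looks roughly like this (paraphrased from that revision of llama.cpp, not quoted verbatim):

```cpp
// Paraphrased from the linked lines: for large prompts with a CPU BLAS
// backend, force a single thread, since the other worker threads would
// otherwise just spin-wait on the BLAS calls.
n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;
```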

Comment thread on common/common.cpp, lines 731 to 735 (outdated)

Member

This is an implementation detail that the user should not need to know about, and we will fix this in the future anyway.

@JohannesGaessler
Contributor Author

I haven't forgotten about this; I've just prioritized other things because I don't think it's particularly high priority.

@JohannesGaessler
Contributor Author

@ggerganov thank you for the hint about llama_eval_internal; the implementation was way simpler this way.
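For illustration, the merged check plausibly sits next to the BLAS override inside llama_eval_internal. A hedged sketch of that shape (the exact condition and field names in commit 8185710 may differ):

```cpp
// Hedged sketch, not the merged commit's verbatim code: if CUDA support is
// compiled in and every layer is offloaded to the GPU, default to a single
// CPU thread, mirroring the BLAS single-thread override shown earlier.
if (ggml_cpu_has_cublas() && model.n_gpu_layers >= (int) hparams.n_layer) {
    n_threads = 1;
}
```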

ggerganov merged commit 8185710 into ggml-org:master on Sep 21, 2023