CUDA: use 1 thread if model is fully offloaded #2915
Merged
ggerganov merged 1 commit into ggml-org:master on Sep 21, 2023
Conversation
Contributor
Maybe update the help for … It might be good to mention the ability to set it to …

Using 1 vs 15 threads does make it 0.2 to 0.5 tokens/s faster for me.
ggerganov requested changes on Sep 2, 2023
Member
Better to implement this entirely in llama_eval_internal, similar to what has been done for BLAS.
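For reference, a minimal sketch of the kind of override being suggested, modelled on the existing BLAS special case in llama_eval_internal. The full-offload condition and the n_gpu_layers / n_layer parameters are assumptions for illustration, not the exact code that was merged.

```cpp
// Sketch (C++): choosing the effective thread count for one eval call.
// The first rule mirrors the pre-existing BLAS logic in llama.cpp: for
// large batches handled by BLAS, extra CPU threads only spin-wait.
// The second rule is the suggested CUDA analogue (assumed condition).
static int pick_eval_threads(int n_threads, int n_tokens,
                             bool has_blas, bool has_gpublas,
                             int n_gpu_layers, int n_layer) {
    if (n_tokens >= 32 && has_blas && !has_gpublas) {
        return 1; // BLAS does the heavy lifting; more threads just add overhead
    }
    if (n_gpu_layers >= n_layer) {
        return 1; // fully offloaded: CPU threads have no real work to share
    }
    return n_threads; // otherwise keep the caller's thread count
}
```

Inside llama_eval_internal this would amount to something like n_threads = pick_eval_threads(...), with has_blas / has_gpublas coming from ggml_cpu_has_blas() and ggml_cpu_has_gpublas().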
Comment on lines 731 to 735
Member
This is an implementation detail that the user should not need to know, and in the future we will fix this anyway.
Contributor (Author)
I haven't forgotten about this; I've only prioritized other things because I think this is not that high priority.
Force-pushed from ff65f9a to de8035a
Contributor (Author)
@ggerganov thank you for the hint with …
slaren approved these changes on Sep 18, 2023
ggerganov approved these changes on Sep 21, 2023
Currently, when using the maximum possible number of GPU layers with CUDA, there is no benefit from more than 1 thread. In fact, using more than 1 thread is detrimental due to increased overhead. This PR changes the logic for the default number of threads so that (unless the user manually overrides it) only a single thread is used if all layers are offloaded.

I also changed the logic for llama-bench to be the same as main: -1 is interpreted as the number of logical cores.
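As a rough, self-contained sketch of the described defaulting behaviour (the function and parameter names here are hypothetical, not the ones used in the PR):

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper illustrating the described behaviour: an explicit
// user value always wins, -1 maps to the number of logical cores (matching
// main), and an unset value with a fully offloaded model resolves to 1.
int resolve_n_threads(int requested, int n_gpu_layers, int n_layers) {
    const int logical_cores =
        std::max(1, (int) std::thread::hardware_concurrency());

    if (requested == -1) {
        return logical_cores;   // llama-bench: -1 means "logical core count"
    }
    if (requested > 0) {
        return requested;       // manual override: never touched
    }
    if (n_gpu_layers >= n_layers) {
        return 1;               // fully offloaded: a single thread is fastest
    }
    return logical_cores;       // partial or no offload: usual default
}
```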