CUDA: always create events for split buffers #10185
Merged
JohannesGaessler merged 1 commit into ggml-org:master on Nov 14, 2024
Conversation
Force-pushed from bde4116 to 38d11f5
Member
Qwen2.5-0.5B does not work with this change alone; it still crashes in the memcpy later.
Member
It would also be possible to prevent using a split buffer entirely if the matrix is too small, by returning false.
Force-pushed from 38d11f5 to e151321
slaren reviewed on Nov 9, 2024

Comment on lines +2981 to +2991:
```cpp
// only use row split if the weight matrix is large enough for every GPU to get data (this solves some edge cases)
// also for small matrices the overhead is very large anyways so splitting is slow
if (a->buffer && ggml_backend_buft_is_cuda_split(a->buffer->buft)) {
    ggml_backend_cuda_split_buffer_type_context * buft_ctx = (ggml_backend_cuda_split_buffer_type_context *) a->buffer->buft->context;
    int64_t active_devices = 0;
    for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) {
        int64_t row_low;
        int64_t row_high;
        get_row_split(&row_low, &row_high, a, buft_ctx->tensor_split, id);
        active_devices += row_low == row_high;
    }
    const int64_t rounding = get_row_rounding(buft_ctx->tensor_split);
    if (rounding*active_devices < a->ne[1]) {
        return false;
    }
}
```
Member
This seems too expensive to do in this function, since it is called many times during inference by ggml_backend_sched. I think it should be possible to compute the minimum tensor size in ggml_backend_cuda_split_buffer_type and store it in ggml_backend_cuda_split_buffer_type_context; this function would then only need to compare the tensor size against that value.
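A hypothetical sketch of this suggestion (illustrative names, not the actual ggml-cuda identifiers): the rounding-based minimum is computed once when the split buffer type is created, and the per-op check becomes a single comparison.

```cpp
#include <cstdint>

#define GGML_CUDA_MAX_DEVICES 16 // same value as the ggml-cuda constant

// Stand-in for the real get_row_rounding() used in the diff above, which
// derives the row alignment from the device architectures.
static int64_t get_row_rounding(const float * /*tensor_split*/) {
    return 32; // placeholder granularity
}

struct split_buffer_type_context {
    float   tensor_split[GGML_CUDA_MAX_DEVICES];
    int64_t min_rows_for_split; // precomputed once at buffer type creation
};

// Done once when the buffer type is created instead of on every call:
static void precompute_min_rows(split_buffer_type_context * ctx, int device_count) {
    // Every device receives at least one row chunk only if the matrix has
    // roughly rounding * device_count rows or more.
    ctx->min_rows_for_split = get_row_rounding(ctx->tensor_split) * device_count;
}

// The per-op check then reduces to one cheap comparison (a->ne[1] in the real code):
static bool matrix_large_enough_for_split(const split_buffer_type_context * ctx, int64_t n_rows) {
    return n_rows >= ctx->min_rows_for_split;
}
```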
Force-pushed from e151321 to 84bcad6
slaren approved these changes on Nov 14, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 17, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 18, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request on May 6, 2026
Fixes #10176.

I think the correct way to fix this is to simply create the events unconditionally. Regardless of how the data is split, the currently active device always needs events for the other devices to wait on. The number of events could be reduced by only initializing those that are actually needed, but I don't think that would be worthwhile: for the vast majority of use cases all events are already being created and used anyway.
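The actual change lives in ggml's CUDA backend; as a rough, hypothetical sketch of the idea (stand-in types and names, real CUDA runtime API, error checking omitted for brevity):

```cpp
#include <cuda_runtime.h>

// Illustrative only; constants mirror ggml-cuda's GGML_CUDA_MAX_DEVICES /
// GGML_CUDA_MAX_STREAMS, and split_tensor_extra is a simplified stand-in
// for the backend's per-tensor extra data.
#define MAX_DEVICES 16
#define MAX_STREAMS 8

struct split_tensor_extra {
    cudaEvent_t events[MAX_DEVICES][MAX_STREAMS];
};

static void init_split_tensor_events(split_tensor_extra * extra, int device_count) {
    for (int id = 0; id < device_count; ++id) {
        cudaSetDevice(id); // an event belongs to the device current at creation
        for (int is = 0; is < MAX_STREAMS; ++is) {
            // If events were only created for devices that actually received
            // rows of the split tensor, another device could later try to
            // wait on a null event and crash. Creating them unconditionally
            // is cheap (timing disabled) and always safe.
            cudaEventCreateWithFlags(&extra->events[id][is], cudaEventDisableTiming);
        }
    }
}
```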