Fix ggml_tensor_extra_gpu memory leak #2146

eugen-ajechiloae-clearml wants to merge 1 commit into ggml-org:master from
Conversation
Allocate the `extra` member of `ggml_tensor` statically
JohannesGaessler left a comment:
Would be fine with me but ggml changes will need to be approved by @ggerganov.
#2099 uses

Not a good change - the CUDA implementation should clean-up the allocated memory
Hi @ggerganov! Is there any reason why you allocate the memory dynamically? Cleaning up the memory seems very difficult. Looks like the same memory region is used by multiple instances of
The goal is to keep

Is it really so difficult? I guess we could pre-allocate a single large enough buffer in

If it is indeed too difficult we can think of some alternative approaches, but the main requirement is to not couple
AFAIK, there are two types of

Since the lifecycle of them is tied to

All temp allocations are made in

```c
typedef void (*offload_func_t)(void * offload_ctx, struct ggml_tensor * tensor);

int llama_eval_internal(...) {
    ctx0 = ggml_init(...);

#ifdef GGML_USE_CUBLAS
    void * offload_ctx = ggml_cuda_init_offload_ctx(...);
#endif

    // Now the offload_func will have to be called like this
    offload_func(offload_ctx, tensor);

#ifdef GGML_USE_CUBLAS
    ggml_cuda_free_offload_ctx(offload_ctx);
#endif

    ggml_free(ctx0);
}
```

It's a bit more verbose but we correctly track the life cycle of those temp allocations.
The
Memory management seems a bit too complicated for me in this case, so I just wanted to let you know that I will not open another PR for this issue. Thank you guys for your input!
Fixes #2145. Allocate the `extra` member of `ggml_tensor` statically