Conversation
Upgraded Linux runners to Ubuntu 24.04
Upgraded Linux AVX512 to Ubuntu 24
…port' into update_apr_2025
update gitignore and add missing xml files
Fixed the submodule, thanks for the reminder.
Which platforms did you test on? At the moment we've got an issue with the Linux CPU binaries failing for some people, which we're having trouble narrowing down. So if you had success there, that'd be an interesting datapoint.
I just tested … There aren't any config changes, just one extra field in the config struct, which is always null. If you collapse …
Thank you! I have found the problem. My modifications are not in your PR yet; they will probably get activated when you merge. This is the bug I identified before: the SplitMode bug in llama.cpp. If you offload all layers to the GPU and use Native.GPUSplitMode.None, the code will crash. I previously showed you the C++ code where this happens.
If you offload all layers (GpuLayerCount = -1), it will crash. See my remark above.
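For reference, a minimal sketch (assuming LLamaSharp's `ModelParams` API; the model path is a placeholder) of the parameter combination described above:

```csharp
using LLama.Common;
using LLama.Native;

// Hedged sketch: the configuration reported above to trigger the llama.cpp
// SplitMode crash. "model.gguf" is a placeholder path.
var parameters = new ModelParams("model.gguf")
{
    GpuLayerCount = -1,             // offload all layers to the GPU
    SplitMode = GPUSplitMode.None,  // with all layers offloaded, this combination crashes
};
```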
I just tried it with … I just double-checked the context size, since it's not specified in the config. If you trace that through KM, it defaults to … As a side note, there is some pretty suspect stuff happening here, though!
None of this has changed, so it's not the cause of your problem. But it does look like it could use a rework! Would you be interested in working on that in the near future?
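If relying on the fall-through default is a concern, one option is to set the context size explicitly. A minimal sketch, assuming LLamaSharp's `ModelParams` (the path and size are illustrative placeholders, not values from the PR):

```csharp
using LLama.Common;

// Hedged sketch: set ContextSize explicitly rather than relying on whatever
// default KM falls back to when it's absent from the config.
var parameters = new ModelParams("model.gguf")
{
    ContextSize = 2048,  // illustrative value; pick what the model/workload needs
};
```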
Yes, I think you're right. This will be the reason for the sudden peak in GPU memory use: many contexts are created and maintained everywhere. We either need to destroy them after use or ensure they are reusable. Since it is stateless, it can be leveraged in various situations. I'm particularly curious about the embedder's context, as it might require additional parameters. I'll try to make some time over the weekend to explore a more memory-efficient solution.
There is one way to do it: move …
They both seem like things that shouldn't need a context; can they be moved to the …
I think we'd better associate a context with the executor instead of the weights. I will try to propose something, and then we can have the discussion there.
It was a good idea, Martin! I quickly put together a modification that streamlines contexts everywhere, and it decreases GPU memory use by 30%! In my solution, both LLamaEmbedder and StatelessExecutor get their own special CountTokens and GetTokens functions. When these are called, we create the context on the fly, so there is only one context in memory at any time.
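A minimal sketch of that "context on the fly" idea, assuming LLamaSharp's `LLamaWeights`/`LLamaContext` API (`CountTokensOnTheFly` is an illustrative name, not the actual method added in the PR):

```csharp
using LLama;
using LLama.Common;

// Hedged sketch: create a short-lived context only for the duration of the
// call, so its GPU memory is freed as soon as tokenization finishes.
static int CountTokensOnTheFly(LLamaWeights weights, ModelParams parameters, string text)
{
    using var context = weights.CreateContext(parameters); // disposed on return
    return context.Tokenize(text).Length;                  // count tokens, then free the context
}
```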
- Commented out flaky test
Triggered a new build run, hopefully resolving the Linux CPU issues: https://github.com/martindevans/LLamaSharp/actions/runs/14956801082
Updated llama.cpp binaries to ceda28ef8e310a8dee60bf275077a3eedae8e36c, compiled with this run.
This PR includes work done by @nipeone in #1138 (adding Linux-ARM64 support) and by @AmSmart in #1130 (adding Android support).
Testing: