Added CUDA support #228

Merged
yingapple merged 21 commits into mindverse:develop from zpitroda:feature/cuda-support
Apr 25, 2025

Conversation

@zpitroda
Contributor

@zpitroda zpitroda commented Apr 14, 2025

Added:

  • CUDA support through build option
  • Memory management including layer offloading
  • Ollama model release after training
  • Switch to toggle CPU/GPU training

Current issues:

  • After training finishes and the service starts, the model takes a few minutes to load, during which the chat returns empty responses
  • No way to toggle between CPU/GPU during inference
  • Only tested in Docker with the WSL 2 backend on Windows
  • I believe the current commit still has a bug causing training to always use the 0.5B model

To do:

  • Optimize training by removing unnecessary functions and parameters
  • Update/improve documentation

Technical changes:

  1. Created a new Dockerfile, Dockerfile.backend.cuda, to handle building with CUDA support, using a base image prebuilt with NVIDIA drivers, build tools, and Python 3.12
  2. Added scripts prompt_cuda.sh and prompt_cuda.bat that ask the user if they want to build with CUDA support when running make docker-up.
  3. The prompt scripts now generate a docker-compose.override.yml file to explicitly tell Docker Compose which Dockerfile (Dockerfile.backend or Dockerfile.backend.cuda) to use based on the user's choice.
  4. The backend API routes_l2.py includes an endpoint /api/kernel2/cuda/available that uses torch.cuda.is_available() to check if CUDA is usable at runtime.
  5. The training interface TrainingConfiguration.tsx calls the backend API to check CUDA availability and enables/disables the "Enable CUDA GPU Acceleration" toggle accordingly.
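The runtime check in item 4 can be sketched as follows. This is a minimal sketch, not the actual handler in routes_l2.py; in particular, the graceful fallback when torch is missing or broken is an assumption:

```python
# Sketch of the check behind /api/kernel2/cuda/available.
# Assumption: the real endpoint may not guard the torch import like this.
def cuda_available() -> bool:
    """Return True only if torch imports cleanly and sees a CUDA device."""
    try:
        import torch  # imported lazily so a broken install can't crash the API
        return bool(torch.cuda.is_available())
    except Exception:
        return False
```

The frontend toggle in TrainingConfiguration.tsx (item 5) would then enable "Enable CUDA GPU Acceleration" only when this returns True.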

@yingapple
Contributor

I'm sorry, but there are some conflicts.

@yingapple
Contributor

Thank you especially for your outstanding contribution. I’m really looking forward to merging this PR.

@zpitroda
Contributor Author

I'll try to have those conflicts worked out asap 🫡

@zpitroda
Contributor Author

Merge conflicts should be worked out!

@zpitroda zpitroda marked this pull request as draft April 15, 2025 23:56
@zpitroda zpitroda marked this pull request as ready for review April 16, 2025 04:51
@zpitroda zpitroda changed the title from Add CUDA support to Updated llama.cpp rebuild process Apr 16, 2025
- CUDA detection
- Memory handling
- Ollama model release after training
added cuda support flag so log accurately reflected cuda toggle
Changed llama.cpp to only check if cuda support is enabled and if so rebuild during the first build rather than each run
Enabled memory pinning and optimizer state offload
rewrote llama.cpp rebuild logic, added manual y/n toggle if user wants to enable cuda support
Added make docker-restart-backend-fast to restart the backend and reflect code changes without causing a full llama.cpp rebuild

Fixed make docker-check-cuda command to correctly reflect cuda support
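The rebuild-once behavior described in these commits can be approximated with a marker file. The marker name .cuda_build_done is hypothetical; the actual scripts may record the first CUDA build differently:

```python
from pathlib import Path

# Assumption: ".cuda_build_done" is an illustrative marker name, not the
# one used by the project's build scripts.
MARKER = ".cuda_build_done"

def needs_llama_rebuild(build_dir: str, cuda_enabled: bool) -> bool:
    """Rebuild llama.cpp only on the first CUDA-enabled build, not every run."""
    if not cuda_enabled:
        return False
    return not (Path(build_dir) / MARKER).exists()

def mark_llama_rebuilt(build_dir: str) -> None:
    """Record that the CUDA rebuild has been done once."""
    (Path(build_dir) / MARKER).touch()
```

This is why make docker-restart-backend-fast can skip the full llama.cpp rebuild: once the marker exists, only code changes need to be reloaded.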
@zpitroda zpitroda force-pushed the feature/cuda-support branch from 05c10aa to ac10939 Compare April 16, 2025 05:33
@yingapple
Contributor

Let's ignore these new conflicts for now. I tested your branch, and it seems there's a bit of an issue.
Below is the log after I chose GPU mode.

[+] Building 2/2
✔ backend Built 0.0s
✔ frontend Built 0.0s
docker compose up -d
[+] Running 3/4
✔ Network second-me_second-me-network Created 0.0s
✔ Volume "second-me_llama-cpp-build" Created 0.0s
⠏ Container second-me-backend Starting 6.9s
✔ Container second-me-frontend Created 0.0s
Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]

@yingapple
Contributor

I probably haven’t installed the GPU driver for Docker. Maybe this PR should also mention that in the README.

@zpitroda
Contributor Author

zpitroda commented Apr 16, 2025

Yeah, sorry, I should've added that. It needs the NVIDIA Container Toolkit installed. Can I ask if you're running this on Windows or Linux, and if Windows, whether you're using the WSL 2 backend?

@zpitroda
Contributor Author

@yingapple I tried replicating your issue by completely resetting my Docker and WSL environments, but it seems to have all started fine. Please try installing the NVIDIA Container Toolkit with:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

Then, to test that your GPU is detected:

sudo docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi

Please let me know if this works for you and your system details. Thank you!
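Beyond nvidia-smi, another sanity check is whether Docker itself has the NVIDIA runtime registered. This sketch parses the JSON that docker info --format '{{json .Runtimes}}' emits; the sample string below is illustrative, not captured output:

```python
import json

def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Given the output of `docker info --format '{{json .Runtimes}}'`,
    report whether the NVIDIA container runtime is registered."""
    return "nvidia" in json.loads(runtimes_json)

# Illustrative sample of what a correctly configured daemon might report:
sample = (
    '{"nvidia": {"path": "nvidia-container-runtime"},'
    ' "runc": {"path": "runc"}}'
)
```

If "nvidia" is missing from the runtimes, `sudo nvidia-ctk runtime configure --runtime=docker` followed by a Docker restart (as above) is what adds it.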

@yingapple
Contributor

@yingapple I tried replicating your issue by completely resetting my docker and wsl environments but it seems to have all started fine. […] Please let me know if this works for you and your system details. Thank you!

It ran successfully on my A100 server! We are now testing it on a machine without a GPU.

@yingapple
Contributor

One error: we should add a "\n" before modifying the .env file; otherwise Docker fails to start.
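The fix might look like this sketch: before appending to .env, terminate any unfinished final line so the new entry doesn't fuse with it. The function name and exact behavior are illustrative, not the actual patch:

```python
from pathlib import Path

def append_env_line(env_path: str, line: str) -> None:
    """Append a KEY=VALUE line to .env, first adding a newline if the
    existing content doesn't end with one (the failure mode described
    above: the appended entry otherwise fuses with the last line)."""
    p = Path(env_path)
    existing = p.read_text() if p.exists() else ""
    if existing and not existing.endswith("\n"):
        existing += "\n"
    p.write_text(existing + line.rstrip("\n") + "\n")
```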

@yingapple
Contributor

@zpitroda Additionally:
I suggest having a separate docker-compose.gpu.yml.
Otherwise, machines without a GPU will throw errors.
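One way to realize this suggestion is to layer the GPU override file onto the base compose file only when a GPU was selected. This sketch of the command construction is an assumption, not the project's actual Makefile logic:

```python
def compose_command(use_gpu: bool) -> list:
    """Build the docker compose invocation, adding docker-compose.gpu.yml
    only when a GPU was selected, so CPU-only machines never see the
    nvidia device reservation that made them fail."""
    cmd = ["docker", "compose", "-f", "docker-compose.yml"]
    if use_gpu:
        cmd += ["-f", "docker-compose.gpu.yml"]
    return cmd + ["up", "-d"]
```

With `docker compose -f a.yml -f b.yml`, later files override and extend earlier ones, which is what keeps the GPU settings out of the base file entirely.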

Added docker-compose.gpu.yml to fix error on machines without nvidia gpu and made sure "\n" is added before .env modification
@zpitroda zpitroda closed this Apr 17, 2025
@zpitroda zpitroda force-pushed the feature/cuda-support branch from 0489efe to e1ae6f5 Compare April 17, 2025 18:23
@zpitroda zpitroda reopened this Apr 17, 2025
Last push accidentally broke cuda toggle
@zpitroda
Contributor Author

@yingapple thanks for the suggestions! I've incorporated both of them and synced the branch with the main one!

@zpitroda zpitroda changed the title from Updated llama.cpp rebuild process to Added CUDA support Apr 20, 2025
@zpitroda
Contributor Author

Noticed some bugs; not sure if they're related to the recent push, but I'm working on sorting them out.

Never mind, it was fixed with the last merge.

Removed unnecessary makefile command and fixed gpu logging
@yingapple
Contributor

Noticed some bugs; not sure if they're related to the recent push, but I'm working on sorting them out.

Never mind, it was fixed with the last merge.

I'm testing.

@zpitroda
Contributor Author

zpitroda commented Apr 23, 2025

Noticed some bugs; not sure if they're related to the recent push, but I'm working on sorting them out.

Never mind, it was fixed with the last merge.

I'm testing.

Yeah, you beat me to it; I was about to push a fix as well, lol.

@3050226203

@yingapple @3050226203 I believe I've implemented these fixes, please let me know how it looks and if there's anything else that should be changed!

The problem has been solved. I've found the model location. Thank you.

@zpitroda zpitroda changed the base branch from master to develop April 23, 2025 16:37
- Removed dtype setting to let torch automatically handle it
- Removed vram logging
- Removed Unnecessary/old comments
Made "make docker-use-gpu/cpu" command work with .gpu_selected flag and changed "make docker-restart-backend-fast" command to respect flag instead of always using gpu
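The .gpu_selected flag described in this commit can be modeled as a simple marker file; the flag location and helper names here are illustrative:

```python
from pathlib import Path

GPU_FLAG = ".gpu_selected"  # assumed: a flag file at the repo root

def set_gpu_selected(root: str, enabled: bool) -> None:
    """Persist the user's GPU/CPU choice so later commands can respect it."""
    flag = Path(root) / GPU_FLAG
    if enabled:
        flag.touch()
    elif flag.exists():
        flag.unlink()

def gpu_selected(root: str) -> bool:
    """True if a GPU selection was recorded (e.g. by make docker-use-gpu)."""
    return (Path(root) / GPU_FLAG).exists()
```

Commands like docker-restart-backend-fast can then consult this flag instead of assuming the GPU path unconditionally.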
@zpitroda zpitroda requested a review from yingapple April 23, 2025 19:25
@zpitroda
Contributor Author

Working on figuring out the cause, but training is currently failing with a free(): double free detected in tcache 2 error.

@3050226203

When calling local Ollama, the following problem occurs:
TypeError: Exception() takes no keyword arguments
2025-04-24 03:19:24 [ERROR] document_service.py:493 - Error processing document embedding: Exception() takes no keyword arguments
2025-04-24 03:19:24 [ERROR] trainprocess_service.py:128 - Generate document embeddings failed: Exception() takes no keyword arguments
2025-04-24 03:19:24 [ERROR] trainprocess_service.py:1089 - Step generate_document_embeddings failed
2025-04-24 03:19:24 [INFO] trainprocess_service.py:1090 - Marking step as failed: stage=generate_document_embeddings, step=generate_document_embeddings

Added custom exception class for Ollama embeddings, which seemed to be returning keyword arguments while the Python exception class only accepts positional ones
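The fix in the commit above might look like the following sketch; the class and field names are illustrative, not the ones in the actual patch:

```python
class OllamaEmbeddingError(Exception):
    """Exception that tolerates keyword arguments.

    Plain `raise Exception(message=..., status=...)` fails with
    "TypeError: Exception() takes no keyword arguments", because
    BaseException only accepts positional arguments. This subclass
    captures any keywords as extra context instead.
    """

    def __init__(self, message: str = "", **details):
        super().__init__(message)
        self.details = details  # e.g. context from the Ollama response
```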
@zpitroda
Contributor Author

zpitroda commented Apr 24, 2025

When calling local Ollama, the following problem occurs: TypeError: Exception() takes no keyword arguments […]

@3050226203 Please let me know if it works after my last push!

Fixed training defaulting to 0.5B model regardless of selection and fixed "free(): double free detected in tcache 2" error caused by cuda flag being passed incorrectly
@yingapple
Contributor

I finished my test. Let's merge. And do regression testing. Your contribution will be in our first release version!

@yingapple yingapple merged commit 0530909 into mindverse:develop Apr 25, 2025
1 check passed
@zpitroda
Contributor Author

I finished my test. Let's merge. And do regression testing. Your contribution will be in our first release version!

Love to hear it! The last thing I'm working out right now is making sure large models like the 7B work regardless of available memory; I'll hopefully be able to push that tomorrow.

kevin-mindverse pushed a commit that referenced this pull request Apr 25, 2025
* Add CUDA support

- CUDA detection
- Memory handling
- Ollama model release after training

* Fix logging issue

added cuda support flag so log accurately reflected cuda toggle

* Update llama.cpp rebuild

Changed llama.cpp to only check if cuda support is enabled and if so rebuild during the first build rather than each run

* Improved vram management

Enabled memory pinning and optimizer state offload

* Fix CUDA check

rewrote llama.cpp rebuild logic, added manual y/n toggle if user wants to enable cuda support

* Added fast restart and fixed CUDA check command

Added make docker-restart-backend-fast to restart the backend and reflect code changes without causing a full llama.cpp rebuild

Fixed make docker-check-cuda command to correctly reflect cuda support

* Added docker-compose.gpu.yml

Added docker-compose.gpu.yml to fix error on machines without nvidia gpu and made sure "\n" is added before .env modification

* Fixed cuda toggle

Last push accidentally broke cuda toggle

* Code review fixes

Fixed errors resulting from removed code:
- Added return save_path to end of save_hf_model function
- Rolled back download_file_with_progress function

* Update Makefile

Use cuda by default when using docker-restart-backend-fast

* Minor cleanup

Removed unnecessary makefile command and fixed gpu logging

* Delete .gpu_selected

* Simplified cuda training code

- Removed dtype setting to let torch automatically handle it
- Removed vram logging
- Removed Unnecessary/old comments

* Fixed gpu/cpu selection

Made "make docker-use-gpu/cpu" command work with .gpu_selected flag and changed "make docker-restart-backend-fast" command to respect flag instead of always using gpu

* Fix Ollama embedding error

Added custom exception class for Ollama embeddings, which seemed to be returning keyword arguments while the Python exception class only accepts positional ones

* Fixed model selection & memory error

Fixed training defaulting to 0.5B model regardless of selection and fixed "free(): double free detected in tcache 2" error caused by cuda flag being passed incorrectly
yexiangle pushed a commit that referenced this pull request Apr 25, 2025
* feature: use uv to setup python environment

* TrainProcessService add singleton method: get_instance

* feat: fix code

* Added CUDA support (#228)


* fix: train service singleton

---------

Co-authored-by: Zachary Pitroda <30330004+zpitroda@users.noreply.github.com>
Cybercricetus pushed a commit to Cybercricetus/Second-Me that referenced this pull request May 29, 2025
Cybercricetus pushed a commit to Cybercricetus/Second-Me that referenced this pull request May 29, 2025
EOMZON pushed a commit to EOMZON/Second-Me that referenced this pull request Feb 1, 2026
EOMZON pushed a commit to EOMZON/Second-Me that referenced this pull request Feb 1, 2026