Feature/fix training model switch bug2 by kevin-mindverse · Pull Request #281 · mindverse/Second-Me

kevin-mindverse · 2025-04-24T13:09:00Z

Fix Exception Handling in Training Process Service

Issue Description

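In the start_process method of TrainProcessService, there is a bug in the exception handling block. The handler reads the variable step, which is first bound by the for loop over the ordered steps. If an exception is raised before the loop has bound step (before the loop starts, or after a loop that ran zero iterations), step is unbound and the handler itself raises a NameError. Python keeps a loop variable bound after the loop ends, so a failure after at least one iteration does not raise NameError, but the handler would then attribute the failure to the last step that ran.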

Impact

When an exception occurs before step has been bound, the handler raises a secondary exception (NameError: name 'step' is not defined), and that secondary exception, not the original one, is what propagates to the caller.
This prevents proper error logging and status tracking in the training process.
The system cannot accurately mark the failing step status because the reference to the step is invalid.

Root Cause

The step variable is bound only by the for loop, so it does not exist until the first iteration starts:

for step in ordered_steps[start_index:]:
    # loop body
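The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: the function names start_process_buggy and start_process_fixed, the current_step variable, and the tuple return value are all hypothetical and chosen only for illustration. It assumes the handler needs to know which step failed, and shows one common way to make that safe by binding a sentinel before the try block.

```python
def start_process_buggy(ordered_steps, start_index=0):
    """Hypothetical reduction of the reported pattern."""
    try:
        # Slicing None raises TypeError before the loop binds `step`.
        for step in ordered_steps[start_index:]:
            step()
        return None
    except Exception as exc:
        # If the loop never bound `step`, this line raises
        # UnboundLocalError (a subclass of NameError).
        return (step.__name__, exc)


def start_process_fixed(ordered_steps, start_index=0):
    """Same flow, but the handler only reads a name that is always bound."""
    current_step = None  # hypothetical sentinel: no step has started yet
    try:
        for step in ordered_steps[start_index:]:
            current_step = step
            step()
        return None
    except Exception as exc:
        failed = current_step.__name__ if current_step is not None else None
        return (failed, exc)
```

With this shape, a failure before the loop is reported with no failing step instead of raising a second exception, and a failure inside the loop is still attributed to the step that was running.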

* feature: use uv to setup python environment
* TrainProcessService add singleten method: get_instance
* feat: fix code
* Added CUDA support (mindverse#228)
  * Add CUDA support
    - CUDA detection
    - Memory handling
    - Ollama model release after training
  * Fix logging issue
    added cuda support flag so log accurately reflected cuda toggle
  * Update llama.cpp rebuild
    Changed llama.cpp to only check if cuda support is enabled and if so rebuild during the first build rather than each run
  * Improved vram management
    Enabled memory pinning and optimizer state offload
  * Fix CUDA check
    rewrote llama.cpp rebuild logic, added manual y/n toggle if user wants to enable cuda support
  * Added fast restart and fixed CUDA check command
    Added make docker-restart-backend-fast to restart the backend and reflect code changes without causing a full llama.cpp rebuild
    Fixed make docker-check-cuda command to correctly reflect cuda support
  * Added docker-compose.gpu.yml
    Added docker-compose.gpu.yml to fix error on machines without nvidia gpu and made sure "\n" is added before .env modification
  * Fixed cuda toggle
    Last push accidentally broke cuda toggle
  * Code review fixes
    Fixed errors resulting from removed code:
    - Added return save_path to end of save_hf_model function
    - Rolled back download_file_with_progress function
  * Update Makefile
    Use cuda by default when using docker-restart-backend-fast
  * Minor cleanup
    Removed unnecessary makefile command and fixed gpu logging
  * Delete .gpu_selected
  * Simplified cuda training code
    - Removed dtype setting to let torch automatically handle it
    - Removed vram logging
    - Removed Unnecessary/old comments
  * Fixed gpu/cpu selection
    Made "make docker-use-gpu/cpu" command work with .gpu_selected flag and changed "make docker-restart-backend-fast" command to respect flag instead of always using gpu
  * Fix Ollama embedding error
    Added custom exception class for Ollama embeddings, which seemed to be returning keyword arguments while the Python exception class only accepts positional ones
  * Fixed model selection & memory error
    Fixed training defaulting to 0.5B model regardless of selection and fixed "free(): double free detected in tcache 2" error caused by cuda flag being passed incorrectly
* fix: train service singlten

---------

Co-authored-by: Zachary Pitroda <30330004+zpitroda@users.noreply.github.com>

kevin-mindverse added 3 commits April 24, 2025 16:35

feature: use uv to setup python environment

c1ea4cb

TrainProcessService add singleten method: get_instance

fce80b2

feat: fix code

1a49eca

kevin-mindverse changed the base branch from master to develop April 24, 2025 13:09

kevin-mindverse requested review from yexiangle and yingapple April 25, 2025 06:17

zpitroda and others added 3 commits April 25, 2025 14:19

Merge branch 'develop' into feature/fixTrainingModelSwitchBug2

aabb4e4

fix: train service singlten

3a17cb1

yexiangle approved these changes Apr 25, 2025

yexiangle merged commit 37553fb into develop Apr 25, 2025
1 check passed

This was referenced Jun 23, 2025

Feature implementation from commits 516843d..c88d236 yashuatla/Second-Me#4

Open

Feature implementation from commits 8ace28a..19adcac yashuatla/Second-Me#5

Open

yexiangle merged 6 commits into develop from feature/fixTrainingModelSwitchBug2


3 participants
