@iGavroche

Add ROCm/AMD GPU Support and Enhancements

This PR adds comprehensive ROCm/AMD GPU support to the AI Toolkit, along with significant improvements to WAN model handling, the UI, and the developer experience.

Major Features

ROCm/AMD GPU Support

  • Full ROCm GPU detection and monitoring: Added support for detecting and monitoring AMD GPUs via rocm-smi, alongside existing NVIDIA support
  • GPU stats API: Extended GPU API to return both NVIDIA and ROCm GPUs with comprehensive stats (temperature, utilization, memory, power, clocks)
  • Cross-platform support: Works on both Linux and Windows
  • GPU selection: Fixed job GPU selection to use gpu_ids from request body instead of hardcoded values
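The detection side of this lives in TypeScript (ui/src/app/api/gpu/route.ts), but the parsing idea can be sketched in Python. The column headers below are placeholders, since real rocm-smi CSV headers vary by version; looking fields up by header name rather than by index is the point of the sketch.

```python
import csv
import io

def parse_rocm_smi_csv(output: str) -> list[dict]:
    """Parse rocm-smi CSV-style output into per-GPU stat dicts.
    Column names vary across rocm-smi versions, so fields are looked
    up by header rather than by positional index."""
    rows = csv.DictReader(io.StringIO(output))
    gpus = []
    for row in rows:
        gpus.append({
            "id": row.get("device", "").strip(),
            # Header names here are illustrative, not rocm-smi's exact ones.
            "temperature_c": float(row.get("Temperature (Sensor edge) (C)", "0") or 0),
            "utilization_pct": float(row.get("GPU use (%)", "0") or 0),
        })
    return gpus

sample = (
    "device,Temperature (Sensor edge) (C),GPU use (%)\n"
    "card0,45.0,12\n"
)
print(parse_rocm_smi_csv(sample))
```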

Setup and Startup Scripts

  • Automated setup scripts: Created setup.sh (Linux) and setup.ps1 (Windows) for automated installation
  • Startup scripts: Added start_toolkit.sh (Linux) and start_toolkit.ps1 (Windows) with multiple modes:
    • setup: Install dependencies
    • train: Run training jobs
    • gradio: Launch Gradio interface
    • ui: Launch web UI
  • Auto-detection: Automatically detects virtual environment (uv .venv or standard venv) and GPU backend (ROCm or CUDA)
  • Training options: Support for --recover, --name, --log flags
  • UI options: Support for --port and --dev (development mode) flags

WAN Model Improvements

Image-to-Video (i2v) Enhancements

  • First frame caching: Implemented caching system for first frames in i2v datasets to reduce computation
  • VAE encoding optimization: Optimized VAE encoding to only encode first frame and replicate, preventing HIP errors on ROCm
  • Device mismatch fixes: Fixed VAE device placement when encoding first frames for i2v
  • Tensor shape fixes: Resolved tensor shape mismatches in WAN 2.2 i2v pipeline by properly splitting 36-channel latents
  • Control image handling: Fixed WAN 2.2 i2v sampling to work without control images by generating dummy first frames
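The i2v conditioning path above can be sketched as follows. NumPy arrays stand in for torch tensors, the helper names are illustrative rather than the PR's actual functions, and the channel layout (16 latent + 20 conditioning = 36) comes from the PR description:

```python
import numpy as np

def build_i2v_conditioning(first_frame_latent: np.ndarray, num_frames: int) -> np.ndarray:
    """Replicate a single encoded first frame across the time axis instead
    of VAE-encoding a mostly-zero video, mirroring the optimization that
    avoids HIP errors on ROCm. (C, 1, H, W) -> (C, T, H, W)."""
    return np.repeat(first_frame_latent, num_frames, axis=1)

def split_wan22_latents(latents: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split WAN 2.2 i2v packed latents into 16 denoised-latent channels
    and 20 conditioning channels; splitting explicitly avoids the shape
    mismatches the PR fixes. Channel dim is assumed first here for brevity."""
    assert latents.shape[0] == 36, "expected 36-channel packed latents"
    return latents[:16], latents[16:]

packed = np.zeros((36, 8, 16, 16), dtype=np.float32)
latent, cond = split_wan22_latents(packed)
print(latent.shape, cond.shape)  # (16, 8, 16, 16) (20, 8, 16, 16)
```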

Flash Attention Support

  • Flash Attention 2/3: Added WanAttnProcessor2_0Flash for optimized attention computation
  • ROCm compatibility: Fixed ROCm compatibility by checking for 'hip' device type
  • Fallback support: Graceful fallback to PyTorch SDP when Flash Attention not available
  • Configuration: Added use_flash_attention option to model config and sdp: true for training config
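The fallback behavior can be sketched as a backend selector: use flash-attn when it is installed and the device is CUDA or HIP (the 'hip' check being the ROCm compatibility fix the PR mentions), otherwise fall back to PyTorch SDP. Function and option names here are illustrative, not the PR's exact API:

```python
import importlib.util

def pick_attention_backend(device_type: str, use_flash_attention: bool = True) -> str:
    """Graceful fallback in the spirit of WanAttnProcessor2_0Flash:
    flash-attn only when available and on a GPU device type, else SDPA."""
    flash_available = importlib.util.find_spec("flash_attn") is not None
    if use_flash_attention and flash_available and device_type in ("cuda", "hip"):
        return "flash_attn"
    return "sdpa"  # PyTorch scaled-dot-product attention fallback

print(pick_attention_backend("cpu"))  # sdpa
```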

Device Management

  • ROCm device placement: Fixed GPU placement for WAN 2.2 14B transformers on ROCm to prevent automatic CPU placement
  • Quantization improvements: Keep quantized blocks on GPU for ROCm (only move to CPU in low_vram mode)
  • Device consistency: Improved device consistency throughout quantization process

UI Enhancements

GPU Monitoring

  • ROCm GPU display: Updated GPUMonitor component to display ROCm GPUs alongside NVIDIA
  • GPU name parsing: Improved GPU name parsing for ROCm devices, prioritizing Card SKU over hex IDs
  • Stats validation: Added validation and clamping for GPU stats to prevent invalid values
  • Edge case handling: Improved handling of edge cases in GPU utilization and memory percentage calculations
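The validation and clamping actually lives in TypeScript (`getNvidiaGpuStats`/`getRocmGpuStats`), but the idea is simple enough to sketch in Python; the range limits below are placeholders, not the UI's actual thresholds:

```python
def clamp_stat(value, lo, hi, default=0.0):
    """Coerce a parsed GPU stat into a sane range; fall back to a default
    when the value is missing, non-numeric, or NaN."""
    try:
        v = float(value)
    except (TypeError, ValueError):
        return default
    if v != v:  # NaN guard
        return default
    return max(lo, min(hi, v))

def memory_percent(used_mb, total_mb):
    """Edge-case-safe memory percentage: no division by zero, clamped to 0-100."""
    if not total_mb or total_mb <= 0:
        return 0.0
    return clamp_stat(100.0 * used_mb / total_mb, 0.0, 100.0)

print(clamp_stat("11365", 0, 5000))  # out-of-range clock value gets clamped
print(memory_percent(512, 0))        # missing total -> 0.0, not a crash
```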

Job Management

  • Environment variable handling: Fixed ROCm environment variable handling for UI mode and quantized models
  • Job freezing fix: Prevented job freezing when launched from UI by properly managing ROCm env vars
  • Quantized model support: Disabled ROCBLAS_USE_HIPBLASLT by default to prevent crashes with quantized models

Environment Variables and Configuration

ROCm Environment Variables

  • HIP error handling: Added comprehensive ROCm environment variables for better error reporting:
    • AMD_SERIALIZE_KERNEL=3 for better error reporting
    • TORCH_USE_HIP_DSA=1 for device-side assertions
    • HSA_ENABLE_SDMA=0 for APU compatibility
    • PYTORCH_ROCM_ALLOC_CONF to reduce VRAM fragmentation
    • ROCBLAS_LOG_LEVEL=0 to reduce logging overhead
  • Automatic application: ROCm variables are set in run.py before torch imports and passed when launching jobs from UI
  • UI mode handling: UI mode no longer sets ROCm env vars (let run.py handle them when jobs spawn)
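A minimal sketch of the run.py pattern: the variables must be set before `import torch`, since PyTorch reads several of them at import time. The values come from the list above; `ROCBLAS_USE_HIPBLASLT="0"` as the way of disabling HIPBLASLT is an assumption, and PYTORCH_ROCM_ALLOC_CONF is omitted because its value isn't given here. `setdefault` lets users override any of them from the shell:

```python
import os

def apply_rocm_env_defaults():
    """Apply the ROCm debugging/compat variables described in this PR.
    Must run before `import torch`."""
    defaults = {
        "AMD_SERIALIZE_KERNEL": "3",   # better HIP error reporting
        "TORCH_USE_HIP_DSA": "1",      # device-side assertions
        "HSA_ENABLE_SDMA": "0",        # APU compatibility
        "ROCBLAS_LOG_LEVEL": "0",      # reduce logging overhead
        "ROCBLAS_USE_HIPBLASLT": "0",  # assumed value: avoid crashes with quantized models
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)  # user-set values win

apply_rocm_env_defaults()
print(os.environ["AMD_SERIALIZE_KERNEL"])
```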

Documentation

  • Installation instructions: Added comprehensive ROCm/AMD GPU installation instructions using uv
  • Quick Start guide: Added Quick Start section using setup scripts
  • Usage instructions: Detailed running instructions for both Linux and Windows
  • Examples: Included examples for all common use cases
  • Architecture notes: Documented different GPU architectures and how to check them

Technical Details

Key Files Modified

  • run.py: ROCm environment variable setup
  • ui/src/app/api/gpu/route.ts: ROCm GPU detection and stats
  • ui/src/components/GPUMonitor.tsx & GPUWidget.tsx: ROCm GPU display
  • toolkit/models/wan21/wan_attn_flash.py: Flash Attention implementation
  • extensions_built_in/diffusion_models/wan22/*: WAN model improvements
  • toolkit/dataloader_mixins.py: First frame caching
  • start_toolkit.sh & start_toolkit.ps1: Startup scripts
  • setup.sh & setup.ps1: Setup scripts

Testing Considerations

  • Tested on ROCm systems with AMD GPUs
  • Verified compatibility with existing CUDA/NVIDIA workflows
  • Tested UI job launching with ROCm environment
  • Validated quantized model training on ROCm
  • Tested WAN 2.2 i2v pipeline with and without control images

Bug Fix Commits

  • Fixed GPU name display for ROCm devices (hex ID issue)
  • Fixed job freezing when launched from UI
  • Fixed VAE device mismatch when encoding first frames for i2v
  • Fixed tensor shape mismatches in WAN 2.2 i2v pipeline
  • Fixed GPU placement for WAN 2.2 14B transformers on ROCm
  • Fixed WAN 2.2 i2v sampling without control image
  • Fixed GPU selection for jobs (was hardcoded to '0' initially)

Migration Notes

  • Users with AMD GPUs should follow the new installation instructions in README.md
  • The new startup scripts (start_toolkit.sh/start_toolkit.ps1) are recommended but not required
  • Existing CUDA/NVIDIA workflows remain unchanged
  • ROCm environment variables are automatically set when using the startup scripts or run.py

iGavroche added 19 commits December 2, 2025 10:14
- Add device_map=None to prevent automatic CPU placement
- Use torch.device context to hint GPU loading
- Immediately move transformers to GPU after loading
- Keep quantized blocks on GPU for ROCm (only move to CPU in low_vram mode)
- Fix device consistency in quantization process
- Split 36-channel latents into 16-channel latents and 20-channel conditioning
- Add defensive checks to ensure consistent channel counts throughout pipeline
- Move conditioning to device once before loop to avoid repeated transfers
- Fix mask application with proper shape matching
- Remove zero conditioning fallback that produced garbage output
- Add clear error messages for missing i2v conditioning
- Add first frame caching to LatentCachingMixin when do_i2v is enabled
- Store first frames separately from latents for i2v conditioning
- Add get_first_frame_path() and get_first_frame() methods
- Add first_frame_tensor to DataLoaderBatchDTO for cached first frames
- Support both disk and memory caching for first frames
- Ensure VAE is on correct device before encoding first frames
- Temporarily move VAE to GPU when needed, even if latents are cached
- Support cached first frames when batch.tensor is None
- Add clear error messages for missing first frames
- Fix device management for both 14B and 5B models
- Add ROCm/AMD GPU detection via rocm-smi alongside NVIDIA support
- Update GPU API to detect and return both NVIDIA and ROCm GPUs
- Fix jobs route to use gpu_ids from request body instead of hardcoding '0'
- Update GPUMonitor component to display ROCm GPUs
- Add Windows support for ROCm detection
- Parse rocm-smi CSV output for GPU stats (temperature, utilization, memory, power, clocks)
- Use Card SKU as primary name source (more descriptive than hex IDs)
- Fallback to descriptive 'AMD GPU N' format if SKU is hex or missing
- Prevents unstable/incorrect GPU names in the UI
- Add dedicated section for AMD GPU installation using ROCm
- Document uv virtual environment setup (.venv)
- Include specific ROCm PyTorch installation command for gfx1151
- Add note about different GPU architectures and how to check them
- Clarify CUDA vs ROCm installation paths
- Generate dummy solid gray first frame when ctrl_img is None
- Ensures model always receives expected 36 channels (16 latent + 20 conditioning)
- Allows baseline sampling to work without requiring control image in config
- Fixes device mismatch by creating tensor on CPU first, then moving to target device
- Encode only first frame instead of full video sequence with zeros
- Replicate encoded latent to match required number of frames
- Significantly reduces memory usage and computation
- Add explicit device management and synchronization points
- Ensure VAE is in eval mode before encoding
- Move VAE back to CPU after encoding if it was originally there
- Add better error handling and synchronization for ROCm/HIP compatibility
- Implement WanAttnProcessor2_0Flash for optimized attention computation
- Support both Flash Attention 2 and 3 APIs
- Fallback to PyTorch SDP when Flash Attention not available
- Fix ROCm compatibility by checking for 'hip' device type
- Enable flash attention on all transformer blocks (including dual-transformer WAN 2.2)
- Add use_flash_attention option to model config
- Add sdp: true option to training config for PyTorch flash SDP backend
- Set AMD_SERIALIZE_KERNEL=3 for better error reporting
- Set TORCH_USE_HIP_DSA=1 for device-side assertions
- Set HSA_ENABLE_SDMA=0 for APU compatibility
- Set PYTORCH_ROCM_ALLOC_CONF to reduce VRAM fragmentation
- Apply ROCm variables in run.py before torch imports
- Pass ROCm variables when launching jobs from UI
- Improve HIP error detection and debugging capabilities
- Prioritize Card Model over Card SKU when SKU is hex ID
- Provide more descriptive GPU names for AMD GPUs
- Fallback to generic names when model/SKU not available
- Create start_toolkit.sh for Linux with modes: setup, train, gradio, ui
- Create start_toolkit.ps1 for Windows with same functionality
- Create setup.sh for automated Linux installation
- Create setup.ps1 for automated Windows installation
- Auto-detect virtual environment (uv .venv or standard venv)
- Auto-detect GPU backend (ROCm or CUDA)
- Set ROCm environment variables automatically
- Verify dependencies before running
- Support training options: --recover, --name, --log
- Support UI options: --port, --dev (development mode)
- Add Quick Start section using setup scripts
- Keep manual installation instructions for advanced users
- Add detailed running instructions for both Linux and Windows
- Document startup script modes and options
- Include examples for all common use cases
- Improve organization and readability
- Update wan22_14b_model.py with minor improvements
- Update UI package-lock.json with dependency changes
- Disable ROCBLAS_USE_HIPBLASLT by default to prevent HIPBLAS_STATUS_INTERNAL_ERROR with quantized models
- Fix UI mode to not set ROCm env vars (let run.py handle them when jobs spawn)
- Only pass through ROCm env vars in startJob.ts if already set in parent process
- Add ROCBLAS_LOG_LEVEL=0 to reduce logging overhead
- Unset problematic ROCm vars in UI mode to prevent freezing during model loading

This fixes job freezing when launched from UI and prevents crashes during quantized model training.
…clamping

- Implement validation and clamping for GPU stats in `getNvidiaGpuStats` and `getRocmGpuStats` functions to prevent invalid values.
- Update GPU utilization and memory percentage calculations in `GPUWidget` to handle edge cases and ensure proper display.
- Ensure non-negative values for power draw, clock speeds, and fan speeds, improving overall robustness of GPU data handling.
…r experience

- Enhance parsing logic in `getRocmGpuStats` to handle edge cases, ensuring valid temperature, power draw, and clock speed values.
- Update GPUWidget to display 'N/A' for invalid or zero values, improving clarity in the UI.
- Implement additional safety checks to prevent numeric values from being displayed as GPU names, ensuring more descriptive outputs.
iGavroche marked this pull request as draft December 3, 2025 19:13
- Introduce optional `hasAmdSmi` field in `GPUApiResponse` to indicate AMD GPU support.
- Implement `checkAmdSmi` function to verify the presence of AMD SMI tool for GPU detection.
- Update GPU stats retrieval logic to include AMD GPU metrics, ensuring comprehensive support alongside NVIDIA and ROCm GPUs.
- Enhance error handling and logging for improved debugging and user experience.
@wakattac commented Dec 5, 2025

FYI, the https://rocm.nightlies.amd.com/v2/ website does not have a directory for each GPU arch under its exact name; you can browse the site to see the available directories.
gfx1100 should point either to https://rocm.nightlies.amd.com/v2/gfx110X-all/ or https://rocm.nightlies.amd.com/v2/gfx110X-dgpu/ (only for certain gfx1103 mobile gpus, I believe)

Changing line 103 in setup.sh to

```shell
elif echo "$GPU_INFO" | grep -q "gfx1100"; then
    ROCM_ARCH="gfx110X-all"
```

fixed the download for the 7900 XTX.

Did not get around to testing everything else yet.

[Later Edit]: After manually installing npm and self-compiling bitsandbytes, the training functionality works fine.
There are some issues with the UI picking up the rocm-smi details: for some reason it detects my output as the old format (<18 fields) when it is in fact the new format (>25 fields), so it starts doing some weird things like showing the power draw as the GPU name. That's a very minor issue though, and I have metrics open on my own anyway.
Thank you very much @iGavroche . I wish there would be more people like you, porting tools for AMD GPU use.

@iGavroche (Author)

Thank you for the excellent feedback!

I'm not sure why, but the PR isn't picking up my later changes to the branch. I've addressed the rocm-smi GPU issues, but I want to look at a better way to address your hardware-specific issues.

Thank you again for the kind words and the time you've spent testing the effort 🙏

iGavroche marked this pull request as ready for review December 6, 2025 13:43
- Update README.md to clarify GPU architecture detection and mapping for ROCm installations.
- Implement GPU architecture mapping functions in setup.ps1, setup.sh, and start_toolkit.sh to streamline the installation process.
- Improve error handling and user prompts for GPU architecture detection, ensuring users can easily specify or auto-detect their GPU architecture.
- Add detailed notes on common GPU architectures and their corresponding ROCm directory names for better user guidance.
@iGavroche (Author)

@wakattac thank you again so much for raising these issues. I've updated the setup script to encompass your feedback: 71e9743

@asoldano

Hi, I'm trying this with an AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (gfx1151) and getting this error:

(venv) alessio@fedora ~/dati/ai-toolkit (pr-563) $ ./start_toolkit.sh ui
[INFO] AI Toolkit Startup Script
[INFO] ==========================

[INFO] Virtual environment already active: /home/alessio/dati/ai-toolkit/venv
[INFO] Verifying dependencies...
[SUCCESS] Core dependencies verified

[INFO] Starting in mode: ui

[INFO] Launching web UI on port 8675...
[INFO] Installing UI dependencies...
npm warn deprecated inflight@1.0.6: This module is not supported, and leaks memory. Do not use it. Check out lru-cache if you want a good and tested way to coalesce async requests by a key value, which is much more comprehensive and powerful.
npm warn deprecated @npmcli/move-file@1.1.2: This functionality has been moved to @npmcli/fs
npm warn deprecated npmlog@6.0.2: This package is no longer supported.
npm warn deprecated rimraf@2.7.1: Rimraf versions prior to v4 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated are-we-there-yet@3.0.1: This package is no longer supported.
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
npm warn deprecated gauge@4.0.4: This package is no longer supported.

added 490 packages, and audited 491 packages in 4s

67 packages are looking for funding
run npm fund for details

3 vulnerabilities (2 high, 1 critical)

To address issues that do not require attention, run:
npm audit fix

To address all issues, run:
npm audit fix --force

Run npm audit for details.
[INFO] Starting UI in PRODUCTION mode...
[INFO] To use dev mode with hot reload, run: ./start_toolkit.sh ui --dev

ai-toolkit-ui@0.1.0 build_and_start
npm install && npm run update_db && npm run build && npm run start

up to date, audited 491 packages in 779ms

67 packages are looking for funding
run npm fund for details

3 vulnerabilities (2 high, 1 critical)

To address issues that do not require attention, run:
npm audit fix

To address all issues, run:
npm audit fix --force

Run npm audit for details.

ai-toolkit-ui@0.1.0 update_db
npx prisma generate && npx prisma db push

Prisma schema loaded from prisma/schema.prisma

✔ Generated Prisma Client (v6.3.1) to ./node_modules/@prisma/client in 26ms

Start by importing your Prisma Client (See: https://pris.ly/d/importing-client)

Tip: Want real-time updates to your database without manual polling? Discover how with Pulse: https://pris.ly/tip-0-pulse

Prisma schema loaded from prisma/schema.prisma
Datasource "db": SQLite database "aitk_db.db" at "file:../../aitk_db.db"

SQLite database aitk_db.db created at file:../../aitk_db.db

🚀 Your database is now in sync with your Prisma schema. Done in 26ms

✔ Generated Prisma Client (v6.3.1) to ./node_modules/@prisma/client in 27ms

ai-toolkit-ui@0.1.0 build
tsc -p tsconfig.worker.json && next build

▲ Next.js 15.1.7

Creating an optimized production build ...
⚠ Compiled with warnings

./node_modules/systeminformation/lib/cpu.js
Module not found: Can't resolve 'osx-temperature-sensor' in '/home/alessio/dati/ai-toolkit/ui/node_modules/systeminformation/lib'

Import trace for requested module:
./node_modules/systeminformation/lib/cpu.js
./node_modules/systeminformation/lib/index.js
./src/app/api/cpu/route.ts

✓ Compiled successfully
Skipping validation of types
✓ Linting
✓ Collecting page data
✓ Generating static pages (22/22)
✓ Collecting build traces
✓ Finalizing page optimization

Route (app) Size First Load JS
┌ ƒ / 220 B 106 kB
├ ƒ /_not-found 986 B 107 kB
├ ƒ /api/auth 220 B 106 kB
├ ƒ /api/caption/get 220 B 106 kB
├ ƒ /api/cpu 220 B 106 kB
├ ƒ /api/datasets/create 220 B 106 kB
├ ƒ /api/datasets/delete 220 B 106 kB
├ ƒ /api/datasets/list 220 B 106 kB
├ ƒ /api/datasets/listImages 220 B 106 kB
├ ƒ /api/datasets/upload 220 B 106 kB
├ ƒ /api/files/[...filePath] 220 B 106 kB
├ ƒ /api/gpu 220 B 106 kB
├ ƒ /api/img/[...imagePath] 220 B 106 kB
├ ƒ /api/img/caption 220 B 106 kB
├ ƒ /api/img/delete 220 B 106 kB
├ ƒ /api/img/upload 220 B 106 kB
├ ƒ /api/jobs 220 B 106 kB
├ ƒ /api/jobs/[jobID]/delete 220 B 106 kB
├ ƒ /api/jobs/[jobID]/files 220 B 106 kB
├ ƒ /api/jobs/[jobID]/log 220 B 106 kB
├ ƒ /api/jobs/[jobID]/mark_stopped 220 B 106 kB
├ ƒ /api/jobs/[jobID]/samples 220 B 106 kB
├ ƒ /api/jobs/[jobID]/start 220 B 106 kB
├ ƒ /api/jobs/[jobID]/stop 220 B 106 kB
├ ƒ /api/queue 220 B 106 kB
├ ƒ /api/queue/[queueID]/start 220 B 106 kB
├ ƒ /api/queue/[queueID]/stop 220 B 106 kB
├ ƒ /api/settings 220 B 106 kB
├ ƒ /api/zip 220 B 106 kB
├ ○ /apple-icon.png 0 B 0 B
├ ƒ /dashboard 4.8 kB 188 kB
├ ƒ /datasets 3.79 kB 165 kB
├ ƒ /datasets/[datasetName] 5.25 kB 179 kB
├ ○ /icon.png 0 B 0 B
├ ○ /icon.svg 0 B 0 B
├ ƒ /jobs 2.58 kB 186 kB
├ ƒ /jobs/[jobID] 9.21 kB 228 kB
├ ƒ /jobs/new 19 kB 228 kB
├ ○ /manifest.json 0 B 0 B
└ ƒ /settings 1.89 kB 128 kB

  • First Load JS shared by all 106 kB
    ├ chunks/1517-e87d8ec80ba330cc.js 50.5 kB
    ├ chunks/4bd1b696-4a90ab8dd4830a4e.js 53.1 kB
    └ other shared chunks (total) 1.92 kB

ƒ Middleware 32.1 kB

○ (Static) prerendered as static content
ƒ (Dynamic) server-rendered on demand

ai-toolkit-ui@0.1.0 start
concurrently --restart-tries -1 --restart-after 1000 -n WORKER,UI "node dist/cron/worker.js" "next start --port 8675"

[WORKER] TOOLKIT_ROOT: /home/alessio/dati/ai-toolkit
[WORKER] Cron worker started with interval: 1000 ms
[UI] ▲ Next.js 15.1.7
[UI] - Local: http://localhost:8675
[UI] - Network: http://192.168.2.37:8675
[UI]
[UI] ✓ Starting...
[UI] ✓ Ready in 223ms
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"
[UI] [ROCm GPU 0] ERROR: Invalid graphics clock 11365MHz from field[5]="11365"

Any idea what to try? Thanks

iGavroche and others added 6 commits December 25, 2025 14:26
Addresses feedback from PR ostris#563. Removed standalone 'self.accelerator' line at line 2248 that served no purpose.
Addresses feedback from PR ostris#563. Replaced data reassignment with explicit String() conversion for better null/undefined handling when processing caption data from API responses.
Addresses feedback from PR ostris#563. Added try-catch block to properly handle database errors and return appropriate error responses instead of letting errors propagate unhandled.
Addresses feedback from PR ostris#563. Fixed several issues with ROCm GPU stats parsing:

- Fixed hardcoded field indices in error messages to use dynamic field indices
- Made field parsing errors conditional on development mode to reduce log noise
- Improved power parsing to detect and skip clock-like values (e.g., "(1472Mhz)")
- Added automatic Hz to MHz conversion for clock values that are way too high
- Better handling of edge cases in power and clock value parsing

This fixes the recurring errors:
- "Could not parse power from field[9]=\"(1472Mhz)\""
- "Invalid graphics clock 51834MHz from field[5]=\"51834\""
- Make amd-smi the primary tool for AMD GPU monitoring, with rocm-smi as fallback
- Improve field detection to handle multiple possible field names (temperature, power, etc.)
- Enhance GPU name detection to try multiple paths in static data structure
- Ensure all metrics are properly mapped: temperature, fan speed, GPU load, memory, clock speeds, and names
- Fallback to rocm-smi only if amd-smi is unavailable or returns insufficient data
@iGavroche (Author)

> Hi, I'm trying this with an AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (gfx1151) and getting this error: […] Any idea what to try? Thanks

bab8bec and 2f0eb5f hopefully fixed your issue. Thank you for pasting the log!

@asoldano

cool, thanks @iGavroche . I'll be able to test this in a few days and will let you know here for sure.

@asoldano left a comment

@iGavroche, I've tested your changes and I confirm the warnings are gone. The AMD Strix Halo is also properly detected and shown in the UI.
That said, before you fixed this properly, I worked around the problem by asking Cline+GLM4.7 to come up with a solution; I mention this because in both cases the training is still failing to start. In stderr.log I can see `terminate called after throwing an instance of 'std::bad_alloc'  what():  std::bad_alloc`.
I've dug into the problem a bit (again with Cline, so this might not be super accurate) and eventually discovered that torchaudio is crashing with the `std::bad_alloc` when imported. Getting rid of the only torchaudio import in config_modules.py (or going with something like asoldano@58514f8) lets me start and complete a LoRA training, though I'm not really sure whether this is a problem with my environment only. (BTW, I'm using Python 3.13, and I also had to use a more recent scipy version.) Hope this feedback is of some use.

@iGavroche (Author)

@asoldano thanks for the feedback. I believe the issue is with the --pre TheRock builds, probably torchaudio. These days it's best to use the non-pre-release builds, for instance:

```shell
uv pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ --upgrade --force-reinstall torch torchvision torchaudio
```
