Fix GPU single precision energy error in dav_subspace solver #6946
Merged
mohanchen merged 7 commits into deepmodeling:develop (Feb 4, 2026)
Conversation
It appears that GEMM with dimension 1 can be buggy for GPU (cuBLAS)
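The observation above can be illustrated on the CPU: a GEMM whose second dimension is 1 is mathematically identical to a GEMV, which is why routing the single-column case through gemv is a safe workaround. The sketch below is a plain reference implementation (column-major, no transpose, alpha=1/beta=0), not the cuBLAS calls themselves; the function names are illustrative.

```cpp
#include <cassert>

// Reference column-major GEMM: C (m x n) = A (m x k) * B (k x n).
void gemm_ref(int m, int n, int k, const double* A, const double* B, double* C)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
        {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += A[i + p * m] * B[p + j * k];
            C[i + j * m] = sum;
        }
}

// Reference GEMV: y (m) = A (m x k) * x (k).
// For n == 1 this performs exactly the same arithmetic as gemm_ref,
// so replacing the dimension-1 GEMM with a GEMV changes nothing
// mathematically while avoiding the problematic GPU code path.
void gemv_ref(int m, int k, const double* A, const double* x, double* y)
{
    for (int i = 0; i < m; ++i)
    {
        double sum = 0.0;
        for (int p = 0; p < k; ++p)
            sum += A[i + p * m] * x[p];
        y[i] = sum;
    }
}
```

Both paths produce identical results for a single right-hand side, which is the `notconv == 1` situation described in this PR.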
- Restore d_precondition host-to-device sync that was commented out in deepmodeling#5199 (this caused uninitialized GPU memory to be used as the preconditioner)
- Fix cuBLAS gemv calls using incx instead of incy for the Y parameter
- Fix gemv_batched using incy instead of incx for the x parameter

Fixes GPU single precision energy being ~0.027 eV off from the correct value.
Fixed 3 hipBLAS gemv calls that incorrectly used incx instead of incy for the Y vector stride parameter:
- hipblasDgemv (double)
- hipblasCgemv (complex&lt;float&gt;)
- hipblasZgemv (complex&lt;double&gt;)

This is the same bug that was fixed in the CUDA version (math_kernel_op.cu).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
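To see why swapping the stride arguments matters, here is a CPU reference implementation of the BLAS gemv stride semantics (a sketch, not the cuBLAS/hipBLAS API itself): incx steps through the input vector x, incy steps through the output vector y. Passing incx where incy belongs scatters the result with the wrong stride whenever the two differ.

```cpp
#include <cassert>

// Reference column-major GEMV honoring the BLAS stride parameters:
// y[i * incy] = sum_j A(i, j) * x[j * incx], for an m x n matrix A
// with leading dimension lda. This mirrors the stride contract of
// cublas<t>gemv / hipblas<t>gemv that the PR's fix restores.
void gemv_strided_ref(int m, int n, const double* A, int lda,
                      const double* x, int incx,
                      double* y, int incy)
{
    for (int i = 0; i < m; ++i)
    {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
        {
            sum += A[i + j * lda] * x[j * incx];
        }
        // The bug fixed here: stepping through Y with incx instead of
        // incy writes the result to the wrong memory locations.
        y[i * incy] = sum;
    }
}
```

With incx == incy the two versions happen to agree, which is why such a bug can hide until a caller passes different strides.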
Collaborator
LGTM, thanks for this fix. I will check the result on ROCm.
dyzheng approved these changes on Feb 2, 2026
Critsium-xy approved these changes on Feb 2, 2026
Cstandardlib approved these changes on Feb 2, 2026
Collaborator
And there are some code conflicts that need to be resolved, caused by #6936, which unifies the cuBLAS check macro (cublasErrcheck -> CHECK_CUBLAS).
… fix-gpu-double-cu128
…develop into fix-gpu-double-cu128
Summary
- Restore d_precondition host-to-device sync that was commented out in #5199 (Feature: update new version of dav_subspace with higher performance); this caused uninitialized GPU memory to be used as the preconditioner
- Fix cuBLAS gemv calls using incx instead of incy for the Y parameter
- Fix gemv_batched using incy instead of incx for the x parameter

Root Cause for the dav_subspace solver problem
The syncmem_var_h2d_op() call for d_precondition was commented out in commit a5c35d9 (#5199), causing the GPU preconditioner array to contain uninitialized memory. This led to incorrect preconditioning in the Davidson subspace iterations, resulting in wrong energies for GPU single precision calculations (~0.027 eV error).

Note: the problem may not always be reproducible, since uninitialized memory may contain random data.
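The failure mode can be sketched on the CPU with a mock of the host-to-device copy. Here sync_h2d stands in for ABACUS's syncmem_var_h2d_op (the real operator launches a cudaMemcpy); the -999.0 sentinel simulates the indeterminate contents of freshly allocated device memory. Skipping the copy, as the commented-out call did, leaves the "device" preconditioner holding garbage.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Mock of the host-to-device sync (stand-in for syncmem_var_h2d_op):
// copies the host preconditioner into the "device" buffer.
void sync_h2d(std::vector<double>& d_buf, const std::vector<double>& h_buf)
{
    std::copy(h_buf.begin(), h_buf.end(), d_buf.begin());
}

// Builds the device-side preconditioner. With do_sync == false this
// reproduces the bug: the buffer keeps whatever the allocation held
// (simulated by the -999.0 sentinel for indeterminate device memory).
std::vector<double> build_precondition(const std::vector<double>& h_precondition,
                                       bool do_sync)
{
    std::vector<double> d_precondition(h_precondition.size(), -999.0);
    if (do_sync)
    {
        sync_h2d(d_precondition, h_precondition);  // the call #5199 removed
    }
    return d_precondition;
}
```

Because real uninitialized memory sometimes happens to hold plausible values, the resulting energy error can come and go between runs, matching the "not always reproducible" note above.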
Reason for replacing gemm with gemv
I found that the code crashes with GPU double precision in dav and dav_subspace (not a problem with the CG solver). The crash occurs at the gemm call with notconv=1.

Test Results
Tested with examples/02_scf/pw_Si2 on CUDA 12.8 with an NVIDIA GeForce RTX 5090.

Fixes #6867