
The cuSolverMP solver failed to install on the HanHai 22 supercomputing platform when -DENABLE_CUSOLVERMP=ON #6758

@goodhandy

Initially, I successfully built and installed the GPU version of ABACUS on the HanHai 22 supercomputing platform using icpc. The environment and commands I used are as follows.
Environment:

module purge
module load cmake/3.19.0
module load tbb/2021.6.0
module load compiler-rt/2022.1.0
module load oclfpga/2022.1.0
module load compiler/2022.1.0 
module load mpi/2021.6.0 
module load mkl/2022.1.0  
module load cuda/12.8
module load elpa/2023.11.001/intelmpi/2021.6/intel-2022.1.0

Build commands:

CC=mpiicc CXX=mpiicpc FC=mpiifort \
cmake -B build -DCereal_INCLUDE_DIR=/home/liuxiaohuigroup/handy/cereal-1.3.2/include \
-DELPA_LINK_LIBRARIES=/opt/elpa/2023.11.001/intelmpi/2021.6/intel-2022.1.0/lib/libelpa.so \
-DELPA_INCLUDE_DIR=/opt/elpa/2023.11.001/intelmpi/2021.6/intel-2022.1.0/include/elpa-2023.11.001 \
-DUSE_OPENMP=ON -DENABLE_LCAO=ON -DUSE_CUDA=ON -DUSE_ELPA=ON -DDEBUG_INFO=1
    
cd build && make -j32  

Then, following the instructions on the website, I tried to enable the -DENABLE_CUSOLVERMP=ON option.
These are the environment and build settings I changed in order to enable it.
Environment:

module purge
module load cmake/3.19.0
module load tbb/2021.6.0
module load compiler-rt/2022.1.0
module load oclfpga/2022.1.0
module load compiler/2022.1.0 
module load mpi/2021.6.0 
module load mkl/2022.1.0 
module load nvhpc-byo-compiler/22.7
module load cuda/12.8
module load elpa/2023.11.001/intelmpi/2021.6/intel-2022.1.0

Build commands:

CC=mpiicc CXX=mpiicpc FC=mpiifort  \
cmake -B build \
  -DCereal_INCLUDE_DIR=/home/liuxiaohuigroup/handy/cereal-1.3.2/include \
  -DCMAKE_CUDA_COMPILER=/opt/cuda/12.8/bin/nvcc \
  -DELPA_LINK_LIBRARIES=/opt/elpa/2023.11.001/intelmpi/2021.6/intel-2022.1.0/lib/libelpa.so \
  -DELPA_INCLUDE_DIR=/opt/elpa/2023.11.001/intelmpi/2021.6/intel-2022.1.0/include/elpa-2023.11.001 \
  -DCAL_CUSOLVERMP_PATH=/opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/lib64 \
  -DUSE_OPENMP=ON \
  -DENABLE_LCAO=ON \
  -DUSE_CUDA=ON \
  -DUSE_ELPA=ON \
  -DDEBUG_INFO=1 \
  -DENABLE_CUSOLVERMP=ON

cd build/

make VERBOSE=1 -j$(nproc) > build_log.txt 2>&1  

Here are the contents of the nvhpc-byo-compiler/22.7 modulefile and of the math_libs/lib64 directory it points to.

/opt/MODULES/compiler/nvhpc-byo-compiler/22.7:

conflict        nvhpc
conflict        nvhpc-nompi
conflict        nvhpc-byo-compiler
setenv          NVHPC /opt/hpc_sdk/2022_227
setenv          NVHPC_ROOT /opt/hpc_sdk/2022_227/Linux_x86_64/22.7
prepend-path    PATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/cuda/bin
prepend-path    CPATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/cuda/include
prepend-path    CPATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/include
prepend-path    CPATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/comm_libs/nccl/include
prepend-path    CPATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/comm_libs/nvshmem/include
prepend-path    LD_LIBRARY_PATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/cuda/lib64
prepend-path    LD_LIBRARY_PATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/cuda/extras/CUPTI/lib64
prepend-path    LD_LIBRARY_PATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/lib64
prepend-path    LD_LIBRARY_PATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/comm_libs/nccl/lib
prepend-path    LD_LIBRARY_PATH /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/comm_libs/nvshmem/lib

/opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/lib64$ ls
libcal.so                  libcublas_static.a     libcufft_static_nocallback.a  libcurand_static.a           libcusolver.so.11.3.5.50  libcutensorMg.so.1.5.0  libnvblas.so
libcublasLt.so             libcufftMp.so          libcufftw.so                  libcusolver_lapack_static.a  libcusolver_static.a      libcutensorMg_static.a  libnvblas.so.11
libcublasLt.so.11          libcufftMp.so.10       libcufftw.so.10               libcusolverMg.so             libcusparse.so            libcutensor.so          libnvblas.so.11.10.1.25
libcublasLt.so.11.10.1.25  libcufftMp.so.10.8.1   libcufftw.so.10.7.2.50        libcusolverMg.so.11          libcusparse.so.11         libcutensor.so.1        stubs
libcublasLt_static.a       libcufft.so            libcufftw_static.a            libcusolverMg.so.11.3.5.50   libcusparse.so.11.7.3.50  libcutensor.so.1.5.0
libcublas.so               libcufft.so.10         libcurand.so                  libcusolverMp.so             libcusparse_static.a      libcutensor_static.a
libcublas.so.11            libcufft.so.10.7.2.50  libcurand.so.10               libcusolver.so               libcutensorMg.so          liblapack_static.a
libcublas.so.11.10.1.25    libcufft_static.a      libcurand.so.10.2.10.50       libcusolver.so.11            libcutensorMg.so.1        libmetis_static.a
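
The file names above carry version suffixes (for example libcublas.so.11 and libcusolver.so.11.3.5.50), while the loaded CUDA module is cuda/12.8. Whether that difference has anything to do with the compile error is not established here. The following is only a minimal sketch for reading the major version out of such file names. The function name `lib_major` is hypothetical and is not part of ABACUS or the CUDA toolkit.

```shell
# Hypothetical helper: print the major version encoded in a versioned
# shared-library file name, e.g. libcublas.so.11 -> 11.
# $1 = library directory, $2 = library base name (e.g. cublas)
lib_major() {
    for lm_file in "$1"/lib"$2".so.*; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -e "$lm_file" ] || continue
        lm_ver=${lm_file##*.so.}     # strip everything up to ".so."
        echo "${lm_ver%%.*}"         # keep only the first version field
        return 0
    done
    echo "unknown"
    return 1
}
```

For instance, `lib_major /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/lib64 cublas` could be compared against the release reported by `nvcc --version`. A library that only appears without a version suffix in the listing, such as libcusolverMp.so, would be reported as "unknown" by this sketch.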


My cmake phase works fine, but when I run make, I get the following error:

[ 90%] Built target cell
[ 90%] Linking CXX static library libcontainer.a
[ 90%] Built target container
[ 90%] Built target elecstate
[ 90%] Built target vdw
[ 90%] Built target io_basic
[ 90%] Built target device
[ 90%] Built target gint
make: *** [Makefile:149: all] Error 2

The build_log.txt file displays the following detailed error message:

In file included from /home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/source/module_hsolver/diago_cusolvermp.h(8),
                 from /home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/source/module_hsolver/hsolver_lcao.cpp(11):
/home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/source/module_hsolver/kernels/cuda/diag_cusolvermp.cuh(59): error: identifier "cusolverMpGrid_t" is undefined
      cusolverMpGrid_t grid = NULL;
      ^

In file included from /home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/source/module_hsolver/diago_cusolvermp.h(8),
                 from /home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/source/module_hsolver/hsolver_lcao.cpp(11):
/home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/source/module_hsolver/kernels/cuda/diag_cusolvermp.cuh(62): error: identifier "cusolverMpMatrixDescriptor_t" is undefined
      cusolverMpMatrixDescriptor_t desc_for_cusolvermp = NULL;
      ^
......

make[2]: *** [source/module_hsolver/CMakeFiles/diag_cusolver.dir/build.make:212: source/module_hsolver/CMakeFiles/diag_cusolver.dir/hsolver_lcao.cpp.o] Error 2
make[2]: *** Waiting for unfinished jobs....

...

[ 90%] Built target gint
make[1]: Leaving directory '/home/liuxiaohuigroup/handy/abacus-develop-LTSv3.10.0/build'
make: *** [Makefile:149: all] Error 2
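
Since the compiler reports `cusolverMpGrid_t` and `cusolverMpMatrixDescriptor_t` as undefined, one thing that can be checked is whether the cuSOLVERMp header picked up from CPATH mentions those identifiers at all. Below is a minimal sketch of such a check. The function name `has_cusolvermp_symbol` is hypothetical, and the header file name pattern `cusolverMp*.h` is an assumption, since the header name does not appear in the log above.

```shell
# Hypothetical helper: report whether any cusolverMp*.h header in an include
# directory mentions a given identifier.
# $1 = include directory, $2 = identifier to look for
# The header name pattern cusolverMp*.h is an assumption.
has_cusolvermp_symbol() {
    hs_dir=$1
    hs_symbol=$2
    hs_status=1
    for hs_header in "$hs_dir"/cusolverMp*.h; do
        # Skip the unexpanded glob pattern when no header exists.
        [ -f "$hs_header" ] || continue
        # -w matches the identifier as a whole word only.
        if grep -q -w "$hs_symbol" "$hs_header"; then
            echo "$hs_symbol found in $hs_header"
            hs_status=0
        fi
    done
    [ "$hs_status" -eq 0 ] || echo "$hs_symbol not found in $hs_dir"
    return "$hs_status"
}
```

With the module shown earlier, the directory to inspect would presumably be /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/include, for example `has_cusolvermp_symbol /opt/hpc_sdk/2022_227/Linux_x86_64/22.7/math_libs/include cusolverMpGrid_t`. If the identifier is not found there, the installed header may not provide the types that diag_cusolvermp.cuh expects.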

Is there a good solution to this problem?

Have you read the FAQ in the online manual? http://abacus.deepmodeling.com/en/latest/community/faq.html

  • Yes, I have read the FAQ section of the online manual.

Task list for Issue attackers (only for developers)

  • Understand the problem or question described by the user.
  • Check if the issue is a known problem or has been addressed in the documentation.
  • Test the issue or problem on a similar system or environment, if possible.
  • Identify the root cause or provide clarification on the user's question.
  • Provide a step-by-step guide, including any necessary resources, to resolve the issue or answer the question.
  • If the issue is related to documentation, update the documentation to prevent future confusion (optional).
  • If the issue is related to code, consider implementing a fix or improvement (optional).
  • Review and incorporate any relevant feedback from users or developers.
  • Ensure the user's issue is resolved or their question is answered and close the ticket.
