forked from abacusmodeling/abacus-develop
Labels
Bugs (bugs that are only solvable with sufficient knowledge of DFT), GPU & DCU & HPC (any issues related to GPU, DCU, and HPC)
Description
Describe the bug
When using abacus_dsp on DSP hardware, SCF gives wrong results if the total number of processes (-N * --ntasks-per-node) is not equal to the kpar setting.
Tested on examples/02_scf/pw_Si2:
- Wrong (DSP, MT-3000), kpar = 4
srun -p mt_module --mpi=pmix -N 2 --ntasks-per-node 4 --cpus-per-task 1
ITER ETOT/eV EDIFF/eV DRHO TIME/s
DS1 -1.53170263e+02 0.00000000e+00 3.8786e-01 0.80
DS2 -1.53368004e+02 -1.97740604e-01 3.3636e-02 0.37
DS3 -1.53705485e+02 -3.37481195e-01 1.1937e-05 0.46
DS4 -1.53887359e+02 -1.81873573e-01 1.8364e-06 0.94
DS5 -1.54173290e+02 -2.85931306e-01 4.8518e-07 0.63
DS6 -1.54788611e+02 -6.15321139e-01 2.1001e-07 0.90
DS7 -1.53889065e+02 8.99545991e-01 1.0410e-06 1.25
DS8 -1.54774765e+02 -8.85699226e-01 1.7564e-06 1.22
DS9 -1.54800543e+02 -2.57788457e-02 2.5583e-08 1.26
** Closing DSP Hardware...
** DSP closed on cluster 0 **
- Normal(DSP, MT-3000) kpar = 4
srun -p mt_module --mpi=pmix -N 1 --ntasks-per-node 4 --cpus-per-task 1
ITER ETOT/eV EDIFF/eV DRHO TIME/s
DS1 -2.15454298e+02 0.00000000e+00 6.9791e-02 1.18
DS2 -2.15503983e+02 -4.96852791e-02 1.7895e-03 0.39
DS3 -2.15505631e+02 -1.64801402e-03 2.6217e-05 0.62
DS4 -2.15505697e+02 -6.60092140e-05 3.4063e-07 0.55
DS5 -2.15505698e+02 -1.56979436e-06 1.9504e-08 0.66
** Closing DSP Hardware...
** DSP closed on cluster 0 **
The following launches, for which -N * --ntasks-per-node == kpar, give exactly the same results as above:
srun -p mt_module --mpi=pmix -N 2 --ntasks-per-node 2 --cpus-per-task 1
srun -p mt_module --mpi=pmix -N 4 --ntasks-per-node 1 --cpus-per-task 1
- Normal(CPU, FT-3000) kpar = 4
DONE(0.395965 SEC) : INIT SCF
ITER ETOT/eV EDIFF/eV DRHO TIME/s
CG1 -2.15454952e+02 0.00000000e+00 6.8544e-02 0.51
CG2 -2.15503057e+02 -4.81044430e-02 1.9140e-03 0.18
CG3 -2.15505564e+02 -2.50710417e-03 2.1959e-05 0.25
CG4 -2.15505697e+02 -1.32526984e-04 4.8307e-07 0.26
CG5 -2.15505698e+02 -1.82268224e-06 2.0521e-08 0.26
- Normal(CPU, Intel Gold 6132*4) kpar = 4
OMP_NUM_THREADS=1 mpirun -np 4 abacus
ITER ETOT/eV EDIFF/eV DRHO TIME/s
DS1 -2.15454288e+02 0.00000000e+00 6.9794e-02 0.26
DS2 -2.15503986e+02 -4.96983193e-02 1.7894e-03 0.10
DS3 -2.15505632e+02 -1.64577128e-03 2.6208e-05 0.13
DS4 -2.15505697e+02 -6.49813040e-05 3.3496e-07 0.13
DS5 -2.15505698e+02 -1.49346408e-06 1.9132e-08 0.16
Also note that CPU (Intel), CPU (FT-3000), and DSP (MT-3000) produce slightly different results from the first iteration (DS1) onward.
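To put a number on the discrepancy, the final ETOT of each run can be pulled out of the SCF table and compared. The following is a minimal sketch, assuming rows start with a DS or CG iteration tag followed by ETOT/eV, as in the logs above. `final_etot` and the regular expression are illustrative helpers, not part of ABACUS, and the sample text is only the last row of each DSP run quoted in this report.

```python
import re

# Assumed row layout (taken from the logs in this report):
#   DS9 -1.54800543e+02 -2.57788457e-02 2.5583e-08 1.26
SCF_ROW = re.compile(r"^\s*(?:DS|CG)\d+\s+(-?\d+\.\d+e[+-]\d+)\s")


def final_etot(log_text: str) -> float:
    """Return ETOT/eV from the last SCF iteration row found in log_text."""
    energies = []
    for line in log_text.splitlines():
        match = SCF_ROW.match(line)
        if match:
            energies.append(float(match.group(1)))
    if not energies:
        raise ValueError("no SCF iteration rows found")
    return energies[-1]


# Last rows copied from the two DSP runs above (8 processes vs 4 processes).
dsp_8_procs = "DS9 -1.54800543e+02 -2.57788457e-02 2.5583e-08 1.26"
dsp_4_procs = "DS5 -2.15505698e+02 -1.56979436e-06 1.9504e-08 0.66"

diff = final_etot(dsp_8_procs) - final_etot(dsp_4_procs)
print(f"ETOT difference: {diff:.6f} eV")
```

A difference of tens of eV, as between these two runs, is far outside the small spread seen between the CPU and DSP runs that use 4 processes.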
Expected behavior
DSP and CPU builds should provide the same results;
or the DSP version should print a warning and exit immediately if the number of processes != kpar.
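The second option amounts to comparing the total process count with kpar before any work starts. Below is a minimal sketch of that check as a pre-launch script in Python. `check_kpar` is a hypothetical helper, not an ABACUS function, and the real fix would live in the ABACUS source itself.

```python
def check_kpar(nodes: int, ntasks_per_node: int, kpar: int) -> int:
    """Return the total process count, or raise if it does not equal kpar.

    nodes and ntasks_per_node correspond to srun's -N and --ntasks-per-node.
    """
    nprocs = nodes * ntasks_per_node
    if nprocs != kpar:
        raise ValueError(
            f"number of processes ({nprocs}) != kpar ({kpar}); "
            "DSP results are unreliable in this configuration"
        )
    return nprocs


# The three launch layouts from this report that gave normal results.
for nodes, ntasks in [(1, 4), (2, 2), (4, 1)]:
    check_kpar(nodes, ntasks, kpar=4)

# The failing layout (-N 2 --ntasks-per-node 4) is rejected.
try:
    check_kpar(2, 4, kpar=4)
except ValueError as err:
    print(f"refusing to launch: {err}")
```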
To Reproduce
ABACUS 3.9.0.19 Release
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- Verify the issue is not a duplicate.
- Describe the bug.
- Steps to reproduce.
- Expected behavior.
- Error message.
- Environment details.
- Additional context.
- Assign a priority level (low, medium, high, urgent).
- Assign the issue to a team member.
- Label the issue with relevant tags.
- Identify possible related issues.
- Create a unit test or automated test to reproduce the bug (if applicable).
- Fix the bug.
- Test the fix.
- Update documentation (if necessary).
- Close the issue and inform the reporter (if applicable).