[Bug] False Results on DSP #6745

@Cstandardlib

Describe the bug

When running abacus_dsp on DSP hardware, the SCF calculation gives incorrect results if the total number of MPI processes (-N * --ntasks-per-node) is not equal to the kpar setting.
Tested on examples/02_scf/pw_Si2:

  1. Incorrect (DSP, MT-3000), kpar = 4
    srun -p mt_module --mpi=pmix -N 2 --ntasks-per-node 4 --cpus-per-task 1
 ITER       ETOT/eV          EDIFF/eV         DRHO     TIME/s
 DS1     -1.53170263e+02   0.00000000e+00   3.8786e-01   0.80
 DS2     -1.53368004e+02  -1.97740604e-01   3.3636e-02   0.37
 DS3     -1.53705485e+02  -3.37481195e-01   1.1937e-05   0.46
 DS4     -1.53887359e+02  -1.81873573e-01   1.8364e-06   0.94
 DS5     -1.54173290e+02  -2.85931306e-01   4.8518e-07   0.63
 DS6     -1.54788611e+02  -6.15321139e-01   2.1001e-07   0.90
 DS7     -1.53889065e+02   8.99545991e-01   1.0410e-06   1.25
 DS8     -1.54774765e+02  -8.85699226e-01   1.7564e-06   1.22
 DS9     -1.54800543e+02  -2.57788457e-02   2.5583e-08   1.26
 ** Closing DSP Hardware...
 ** DSP closed on cluster 0 **
  2. Normal (DSP, MT-3000), kpar = 4
    srun -p mt_module --mpi=pmix -N 1 --ntasks-per-node 4 --cpus-per-task 1
 ITER       ETOT/eV          EDIFF/eV         DRHO     TIME/s
 DS1     -2.15454298e+02   0.00000000e+00   6.9791e-02   1.18
 DS2     -2.15503983e+02  -4.96852791e-02   1.7895e-03   0.39
 DS3     -2.15505631e+02  -1.64801402e-03   2.6217e-05   0.62
 DS4     -2.15505697e+02  -6.60092140e-05   3.4063e-07   0.55
 DS5     -2.15505698e+02  -1.56979436e-06   1.9504e-08   0.66
 ** Closing DSP Hardware...
 ** DSP closed on cluster 0 **

The following launch lines, for which -N * --ntasks-per-node == kpar, give exactly the same results as the normal DSP run above:
srun -p mt_module --mpi=pmix -N 2 --ntasks-per-node 2 --cpus-per-task 1
srun -p mt_module --mpi=pmix -N 4 --ntasks-per-node 1 --cpus-per-task 1
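
To make the process-count arithmetic explicit, here is a minimal Python sketch that multiplies -N by --ntasks-per-node for the four DSP launch lines quoted in this report and compares the product with kpar. The helper name `total_processes` is hypothetical and is not part of ABACUS or Slurm.

```python
# Total MPI processes = nodes (-N) * tasks per node (--ntasks-per-node).
KPAR = 4  # kpar value used in every run of this report

# (nodes, tasks_per_node) for the four DSP srun lines quoted in this report
dsp_runs = [(2, 4), (1, 4), (2, 2), (4, 1)]


def total_processes(nodes, tasks_per_node):
    """Hypothetical helper: number of MPI processes that srun will start."""
    return nodes * tasks_per_node


for nodes, tasks in dsp_runs:
    nproc = total_processes(nodes, tasks)
    status = "matches kpar" if nproc == KPAR else "differs from kpar"
    print(f"-N {nodes} --ntasks-per-node {tasks}: {nproc} processes, {status}")
```

Only the first launch line (-N 2 --ntasks-per-node 4, 8 processes) differs from kpar = 4, and it is the only DSP run that converges to the wrong energy.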

  3. Normal (CPU, FT-3000), kpar = 4
 DONE(0.395965   SEC) : INIT SCF
 ITER       ETOT/eV          EDIFF/eV         DRHO     TIME/s
 CG1     -2.15454952e+02   0.00000000e+00   6.8544e-02   0.51
 CG2     -2.15503057e+02  -4.81044430e-02   1.9140e-03   0.18
 CG3     -2.15505564e+02  -2.50710417e-03   2.1959e-05   0.25
 CG4     -2.15505697e+02  -1.32526984e-04   4.8307e-07   0.26
 CG5     -2.15505698e+02  -1.82268224e-06   2.0521e-08   0.26
  4. Normal (CPU, Intel Gold 6132*4), kpar = 4
    OMP_NUM_THREADS=1 mpirun -np 4 abacus
 ITER       ETOT/eV          EDIFF/eV         DRHO     TIME/s
 DS1     -2.15454288e+02   0.00000000e+00   6.9794e-02   0.26
 DS2     -2.15503986e+02  -4.96983193e-02   1.7894e-03   0.10
 DS3     -2.15505632e+02  -1.64577128e-03   2.6208e-05   0.13
 DS4     -2.15505697e+02  -6.49813040e-05   3.3496e-07   0.13
 DS5     -2.15505698e+02  -1.49346408e-06   1.9132e-08   0.16

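Also note that CPU (Intel), CPU (FT-3000), and DSP (MT-3000) produce slightly different energies from the first SCF iteration onward (DS1/CG1), although the three normal runs converge to the same final energy (-2.15505698e+02 eV).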
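
For anyone who wants to compare runs automatically, the following Python sketch extracts the final ETOT/eV from an SCF table and checks whether two runs agree. The row pattern is inferred only from the log excerpts in this report, and the tolerance `TOL_EV` and the helper names are assumptions chosen for illustration, not an ABACUS convention.

```python
import re

# Iteration rows look like " DS5  -2.15505698e+02 ...": a two-letter label
# plus an iteration number, followed by the ETOT/eV column.
ROW = re.compile(r"^\s*[A-Z]{2}\d+\s+(-?\d+\.\d+e[+-]\d+)\s")


def final_etot(log_text):
    """Return ETOT/eV of the last SCF iteration row, or None if no row is found."""
    energies = [float(m.group(1)) for line in log_text.splitlines()
                if (m := ROW.match(line))]
    return energies[-1] if energies else None


# Last iteration rows copied from the logs in this report
dsp_8_procs = " DS9     -1.54800543e+02  -2.57788457e-02   2.5583e-08   1.26"
dsp_4_procs = " DS5     -2.15505698e+02  -1.56979436e-06   1.9504e-08   0.66"
cpu_intel = " DS5     -2.15505698e+02  -1.49346408e-06   1.9132e-08   0.16"

TOL_EV = 1e-5  # hypothetical agreement tolerance in eV


def agrees(log_a, log_b, tol=TOL_EV):
    """True if the final energies of two logs differ by at most tol."""
    return abs(final_etot(log_a) - final_etot(log_b)) <= tol


print(agrees(dsp_4_procs, cpu_intel))  # DSP with 4 processes vs. Intel CPU
print(agrees(dsp_8_procs, cpu_intel))  # DSP with 8 processes vs. Intel CPU
```

With the rows above, the 4-process DSP run agrees with the Intel CPU run, while the 8-process DSP run is off by roughly 60.7 eV.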

Expected behavior

DSP and CPU builds should provide the same results;
or the DSP version should emit a warning and quit immediately if the number of processes != kpar.
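
Until such a check exists in the DSP build, a pre-launch guard along the following lines could catch the mismatch. This is only a sketch under stated assumptions: it assumes kpar is set in the INPUT file as a whitespace-separated `kpar <value>` line with optional `#` comments and that it defaults to 1 when absent, and the function names `read_kpar` and `check_process_count` are hypothetical rather than ABACUS APIs.

```python
def read_kpar(input_text, default=1):
    """Return the kpar value found in INPUT-style text (assumed 'key value' lines)."""
    for line in input_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and fields[0].lower() == "kpar":
            return int(fields[1])
    return default  # assumption: kpar defaults to 1 when not set


def check_process_count(nproc, kpar):
    """Refuse to continue when the process count differs from kpar."""
    if nproc != kpar:
        raise ValueError(
            f"number of processes ({nproc}) != kpar ({kpar}); "
            "DSP results are reported to be wrong in this case"
        )


# Example INPUT fragment (illustrative, not copied from examples/02_scf/pw_Si2)
example_input = "INPUT_PARAMETERS\ncalculation scf\nkpar 4  # k-point pools\n"

kpar = read_kpar(example_input)
check_process_count(nproc=1 * 4, kpar=kpar)  # -N 1 --ntasks-per-node 4: passes
try:
    check_process_count(nproc=2 * 4, kpar=kpar)  # -N 2 --ntasks-per-node 4
except ValueError as err:
    print(f"refusing to launch: {err}")
```

The guard deliberately fails hard instead of warning, which matches the "quit immediately" behavior requested above.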

To Reproduce

ABACUS 3.9.0.19 Release

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

Assignees

No one assigned

    Labels

    Bugs: Bugs that only solvable with sufficient knowledge of DFT
    GPU & DCU & HPC: GPU and DCU and HPC related any issues
