
Potential issue in MPI comm for curvilinear SRPIC #58

Merged
haykh merged 31 commits into 1.1.0rc from bug/mpicommbc
Jul 28, 2024

Conversation

@haykh
Collaborator

@haykh haykh commented Jun 29, 2024

No description provided.

@haykh haykh added the bug Something isn't working label Jun 29, 2024
haykh added 3 commits July 1, 2024 05:27
fix bug in metadomain.cpp; link library stdc++fs
Tests adapted to double precision
@LudwigBoess
Collaborator

Maybe related, I also encounter an MPI error in cartesian/minkowski SRPIC with the wip/shock setup:

```
PMPI_Allgather(1000): MPI_Allgather(sbuf=0x490bc70, scount=1, MPI_FLOAT, rbuf=0x490bc70, rcount=1, MPI_FLOAT, MPI_COMM_WORLD) failed
PMPI_Allgather(945).: Buffers must not be aliased
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=201926145
:
system msg for write_line failure : Bad file descriptor
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
srun: error: midway3-0278: task 0: Exited with exit code 1
```
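For context, the "Buffers must not be aliased" failure is what MPI reports when the same pointer is passed as both the send and the receive buffer of a collective. A minimal hedged sketch (hypothetical, not the actual call site in this codebase) of the standard-conforming in-place form using MPI_IN_PLACE:

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank places its own contribution in its slot before gathering.
  std::vector<float> buf(size, 0.0f);
  buf[rank] = static_cast<float>(rank);

  // Wrong (triggers "Buffers must not be aliased", as in the log above):
  //   MPI_Allgather(buf.data(), 1, MPI_FLOAT,
  //                 buf.data(), 1, MPI_FLOAT, MPI_COMM_WORLD);
  // Standard-conforming in-place form:
  MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                buf.data(), 1, MPI_FLOAT, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```

With MPI_IN_PLACE, the send count and datatype are ignored and each rank's contribution is taken from its own position in the receive buffer.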

@haykh
Collaborator Author

haykh commented Jul 1, 2024

> Maybe related, I also encounter an MPI error in cartesian/minkowski SRPIC with the wip/shock setup:
>
> ```
> PMPI_Allgather(1000): MPI_Allgather(sbuf=0x490bc70, scount=1, MPI_FLOAT, rbuf=0x490bc70, rcount=1, MPI_FLOAT, MPI_COMM_WORLD) failed
> PMPI_Allgather(945).: Buffers must not be aliased
> [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=201926145
> :
> system msg for write_line failure : Bad file descriptor
> Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
> srun: error: midway3-0278: task 0: Exited with exit code 1
> ```

Likely unrelated; the problem addressed in this PR is simply incorrect communication that raises no errors.

@LudwigBoess is this a runtime or a compile-time error? If runtime, could you post the command you use to run (or the submit script)? If compile-time, which MPI are you using?

CUDA with MPI is a bit of a headache to configure at first on a new machine, especially since different clusters define different environment variables.

@haykh
Collaborator Author

haykh commented Jul 17, 2024

Culprit identified as a potential race condition in `src/kernels/injectors.hpp`, in `kernels::NonUniformInjector_kernel::operator()`. Switching from `Kokkos::atomic_fetch_add(&idx(), ppc)` to `Kokkos::atomic_fetch_add(&idx(), 1)` solved the issue.

@haykh haykh merged commit 54cc679 into 1.1.0rc Jul 28, 2024
@haykh haykh deleted the bug/mpicommbc branch July 28, 2024 20:59


4 participants