@schnellerhase
Contributor

@schnellerhase schnellerhase commented Aug 14, 2025

Catch2 >= 3.9.0 changes the test execution order to random by default. This needs to be disabled for MPI-parallel testing.

Member

@garth-wells garth-wells left a comment

I don't think this will work due to the hdf5-mpi package depending on OpenMPI.

Which function call is leading to the hang with OpenMPI?

@schnellerhase
Contributor Author

MPI_Dist_graph_create_adjacent causes an MPI error - without any further output.

@schnellerhase
Contributor Author

But the problem appears to come from the open-mpi update from 5.0.7 to 5.0.8. After updating locally I can reproduce it.

@garth-wells
Member

> MPI_Dist_graph_create_adjacent causes an MPI error - without any further output.

Any particular line in DOLFINx, or is it any call to MPI_Dist_graph_create_adjacent? OpenMPI and MPICH have had bugs in the past with MPI_Dist_graph_create_adjacent; we have some odd-looking std::vector::reserve calls in places to work around these.

@schnellerhase
Contributor Author

Any call.

@garth-wells
Member

I can't reproduce the failure locally on macOS with OpenMPI 5.0.8 from Homebrew.

@schnellerhase
Contributor Author

MPI_ERROR code 15 is MPI_ERR_TRUNCATE and means 'Message truncated on receive'.

@schnellerhase
Contributor Author

Maybe a race condition. Running mpiexec -np 3 ./unittests [la_vector] repeatedly crashes most of the time, but passes occasionally.

@schnellerhase schnellerhase force-pushed the schnellerhase/pin-open-mpi branch from 7690c34 to 38f760b August 14, 2025 17:43
@schnellerhase schnellerhase changed the title Pin open-mpi to 5.0.7 Skip C++ tests in parallel on macOS Aug 14, 2025
@schnellerhase schnellerhase marked this pull request as ready for review August 14, 2025 17:44
@schnellerhase schnellerhase changed the title Skip C++ tests in parallel on macOS Force non random order of execution for C++ tests Aug 15, 2025
@schnellerhase
Contributor Author

Found it: it was not the open-mpi bump, but the Catch2 update. The new version changes the default execution order to random, see https://github.com/catchorg/Catch2/releases/tag/v3.9.0. This means each process generates a different execution order for the test cases. As a result, MPI communications belonging to different test cases, in particular ones with different data types, get matched against each other, causing all kinds of incorrect data or unmatched communications.
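
To illustrate the mechanism described above, here is a minimal sketch that needs neither MPI nor Catch2. It models each rank's test order as a shuffle of the declared test list and counts the slots where two ranks would be running different test cases. The test names other than la_vector, the helper names, and the use of std::mt19937 with std::shuffle are assumptions for illustration only; this is not how Catch2 implements its ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical test-case names; only "la_vector" appears in this thread.
std::vector<std::string> declared_tests()
{
  return {"la_vector", "mesh_build", "graph_exchange", "index_map"};
}

// Order in which one simulated rank would run the tests. With
// randomize == false the declaration order is kept and the seed is ignored;
// otherwise the list is shuffled using that rank's seed.
std::vector<std::string> run_order(bool randomize, std::uint32_t seed)
{
  std::vector<std::string> order = declared_tests();
  if (randomize)
  {
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
  }
  return order;
}

// Number of slots in which two ranks would be inside different test cases.
// In a real MPI run, collectives issued in such a slot would be matched
// against calls from an unrelated test on the other rank.
std::size_t mismatched_slots(const std::vector<std::string>& a,
                             const std::vector<std::string>& b)
{
  assert(a.size() == b.size());
  std::size_t count = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
  {
    if (a[i] != b[i])
      ++count;
  }
  return count;
}
```

Calling mismatched_slots(run_order(true, s0), run_order(true, s1)) with two different seeds will usually report a nonzero count, while declaration order, or one seed shared by all ranks, always gives zero. Catch2 documents the command-line options --order decl and --rng-seed for this purpose; which mechanism the merged change uses is not stated in this thread.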

@garth-wells garth-wells enabled auto-merge August 17, 2025 09:09
@garth-wells garth-wells added this pull request to the merge queue Aug 17, 2025
Merged via the queue into main with commit d3d320b Aug 17, 2025
30 checks passed
@garth-wells garth-wells deleted the schnellerhase/pin-open-mpi branch August 17, 2025 09:59
4 participants