libfabric=1.17.0-3 on Debian causes MPI tests to fail with MPI_ERR_OTHER #370

@inducer

Description

Sample CI failure:
https://gitlab.tiker.net/inducer/meshmode/-/jobs/533461

Similar failure in grudge:
https://gitlab.tiker.net/inducer/grudge/-/jobs/533485

Sample traceback:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 208, in <module>
    main()
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 609, in <module>
    _test_mpi_boundary_swap(dim, order, num_groups)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 426, in _test_mpi_boundary_swap
    conns = bdry_setup_helper.complete_some()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/meshmode/distributed.py", line 332, in complete_some
    data = [self._internal_mpi_comm.recv(status=status)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
```

Downgrading libfabric to an earlier version (see here) appears to resolve this.

This is the code in mpi4py that ultimately fails; it's a matched receive (mrecv).
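For context: on an MPI-3 implementation, mpi4py's `comm.recv()` does a matched probe followed by a matched receive (`MPI_Mprobe`/`MPI_Mrecv`) rather than a plain `MPI_Probe`/`MPI_Recv`. Here's a minimal sketch (a hypothetical standalone script, not taken from meshmode) that exercises the same matched-receive path explicitly; it might help check whether the libfabric regression is specific to mrecv:

```python
# mrecv_check.py -- hypothetical minimal reproducer for the mrecv path.
# Run with two ranks, e.g.: mpiexec -n 2 python mrecv_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Pickled (object) send, same flavor as meshmode's use of comm.recv().
    comm.send({"payload": list(range(10))}, dest=1, tag=17)
else:
    # Roughly what comm.recv(status=status) does internally on MPI-3:
    # a matched probe (MPI_Mprobe) followed by a matched receive
    # (MPI_Mrecv) on the returned message handle.
    status = MPI.Status()
    msg = comm.mprobe(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    data = msg.recv()
    print(f"rank 1 received {data!r} from rank {status.Get_source()}")
```

If this fails with the same MPI_ERR_OTHER under libfabric 1.17.0-3 while an ordinary `probe`/`recv` pair works, that would point at the provider's matched-probe support.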

@majosm Got any ideas? (Pinging you since the two of us last touched this code.)
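One more datapoint that might help narrow this down: mpi4py can be told to skip the matched-probe path entirely and fall back to `MPI_Probe`/`MPI_Recv`. Something like the following (untested here; `recv_mprobe` is a documented `mpi4py.rc` option) would tell us whether only the mrecv path is broken:

```python
# Must run before `from mpi4py import MPI` executes anywhere; makes
# comm.recv() use MPI_Probe/MPI_Recv instead of MPI_Mprobe/MPI_Mrecv.
import mpi4py
mpi4py.rc.recv_mprobe = False

from mpi4py import MPI  # noqa: E402
```

With a sufficiently recent mpi4py the same thing can be done without code changes by exporting `MPI4PY_RC_RECV_MPROBE=false` in the CI environment. Note the fallback is not strictly equivalent (probe-then-receive can race against other receives on the same communicator, which is the problem mprobe exists to solve), so this would be a diagnostic rather than a fix.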
