[WIP] MPI communication layer #4142
Conversation
Well this is fun to see :) cc @kmpaul and @andersy005 from dask-mpi, who might find this interesting. Let's also cc @dalcinl from mpi4py to see if he has interest / time.

I think that the point around waiting to recv from many senders simultaneously is an interesting design challenge. I agree in general that polling on each one probably isn't ideal. Ideally there would be some event in MPI itself that we could use to trigger things. MPI wasn't designed for a dynamic application like Dask, but I wouldn't be surprised if there was some internal system that we could hook into here.
@ianthomas23 I'm also curious, was this done just for fun, or are you working on something in particular? I'm also curious to know if there are any performance differences. A computation that you might want to try is something like the following:

```python
from dask.distributed import Client
client = Client()  # or however you set up

import dask.array as da
import time

x = da.random.random((20000, 20000)).persist()

start = time.time()
y = (x + x.T).transpose().sum().compute()
stop = time.time()
print(stop - start)
```
@ianthomas23 This looks very cool! Thanks for sharing it. (And @mrocklin, thanks for CCing me.) I won't have time to look at the PR closely for a while, but I'll take a look and respond more next week.
@mrocklin There is no particular problem that I am trying to solve, so I guess we are in "just for fun" territory! I was looking around dask for issues I could help with and when I saw there was some interest in using MPI, for which my skillset is ideally suited, I thought I would take a stab at it. |
distributed/comm/mpi.py (outdated)
```python
while True:
    if self._cancel:
        return None, None
    if _mpi_comm.iprobe(source=self._source, tag=self._tag):
```
```python
status = MPI.Status()
if _mpi_comm.iprobe(source=self._source, tag=self._tag, status=status):
    source = status.Get_source()
    tag = status.Get_tag()
    msg = _mpi_comm.recv(source=source, tag=tag, status=status)
    return msg, status.Get_source()
```
Alternatively, if MPI.VERSION >= 3, you could use the following code, based on a matched probe, which is the thread-safe way of doing things:

```python
message = _mpi_comm.improbe(source=self._source, tag=self._tag, status=status)
if message is not None:
    msg = message.recv(status=status)
    return msg, status.Get_source()
```
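For context, here is a rough, self-contained sketch (not code from this PR) of how the matched-probe suggestion could combine with the cancellation check and the polling interval mentioned elsewhere in this thread; the function name, the `cancelled` callback, and the use of `COMM_WORLD` are assumptions made purely for illustration:

```python
import asyncio
from mpi4py import MPI

_mpi_comm = MPI.COMM_WORLD  # the PR uses a module-level communicator; COMM_WORLD is assumed here

async def poll_recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG,
                    cancelled=lambda: False, interval=0.005):
    """Poll with a matched probe until a message arrives or the caller cancels."""
    status = MPI.Status()
    while True:
        if cancelled():
            return None, None
        # improbe hands back a Message (or None), so the matching receive
        # cannot be stolen by another thread between the probe and the recv.
        message = _mpi_comm.improbe(source=source, tag=tag, status=status)
        if message is not None:
            return message.recv(), status.Get_source()
        await asyncio.sleep(interval)  # 5 ms polling interval mentioned in this thread
```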
@mrocklin Polling many senders to receive a message is built in to MPI, you just specify a wildcard source (MPI.ANY_SOURCE).
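For readers unfamiliar with MPI, a minimal sketch (not from this PR) of the built-in wildcard receive being described; run it under mpiexec with a few processes:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank != 0:
    # Every non-root rank sends one message to rank 0.
    comm.send({"from": rank}, dest=0, tag=7)
else:
    status = MPI.Status()
    # Rank 0 waits on all senders with a single wildcard receive;
    # the Status object reports which rank actually matched.
    for _ in range(comm.Get_size() - 1):
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        print("received", msg, "from rank", status.Get_source())
```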
Oh I see, great. So we're not polling on each Comm every 5ms, we're polling on all comms every 5ms. @dalcinl you mention thread safety. If mpi4py is thread-safe then another option here would be to set up a thread that just blocks waiting for a message. Once it comes in it would alert the main asyncio event loop thread, which would then respond. This might be more responsive than polling. Thoughts, anyone?
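A rough sketch of the thread-based approach being floated here, assuming the MPI library actually grants MPI_THREAD_MULTIPLE; the `mpi_listener` helper and the queue-based hand-off are invented for illustration and are not part of this PR:

```python
import asyncio
import threading
from mpi4py import MPI

comm = MPI.COMM_WORLD

def mpi_listener(loop: asyncio.AbstractEventLoop, queue: asyncio.Queue) -> None:
    """Block in MPI receives on a helper thread and hand results to asyncio."""
    status = MPI.Status()
    while True:
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        # Wake the event loop thread without any periodic polling.
        loop.call_soon_threadsafe(queue.put_nowait, (status.Get_source(), msg))

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    threading.Thread(target=mpi_listener, args=(loop, queue), daemon=True).start()
    for _ in range(comm.Get_size() - 1):  # expect one message from every other rank
        source, msg = await queue.get()
        print("rank 0 got", msg, "from rank", source)

if comm.Get_rank() == 0:
    asyncio.run(main())
else:
    comm.send("hello", dest=0, tag=0)
```

One caveat: many MPI implementations busy-wait inside blocking calls by default, so a dedicated receive thread can consume a full core even when idle.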
@mrocklin I do not know all the details of all the code in this PR. I'm just saying that in MPI you can very well block in a receive call polling many sources; it is a builtin feature of MPI.

About thread safety, mpi4py is certainly thread-safe. However, the backend MPI implementation may not be, or users may request various levels of thread support. Besides that detail, given the way MPI implementations work (at least by default), blocking on a … mpi4py provides …

@ianthomas23 Look in …
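For reference, a small sketch (not from this PR) of how one could check the level of thread support the MPI implementation actually granted, which is the caveat being raised here:

```python
from mpi4py import MPI

# The thread support level granted by the MPI library may be lower than
# the level mpi4py asked for when MPI was initialised.
provided = MPI.Query_thread()
names = {
    MPI.THREAD_SINGLE: "THREAD_SINGLE",
    MPI.THREAD_FUNNELED: "THREAD_FUNNELED",
    MPI.THREAD_SERIALIZED: "THREAD_SERIALIZED",
    MPI.THREAD_MULTIPLE: "THREAD_MULTIPLE",
}
print("MPI thread support:", names.get(provided, provided))
if provided < MPI.THREAD_MULTIPLE:
    print("Blocking MPI calls from a background thread are not safe here.")
```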
@mrocklin No, you were correct the first time: each Comm (and Listener) is polling separately. At least this means there is plenty of room for improvement.
@dalcinl Thanks for your suggestions on the improved …
@mrocklin Your example computation

```python
x = da.random.random((20000, 20000)).persist()

start = time.time()
y = (x + x.T).transpose().sum().compute()
print('Seconds:', time.time() - start)
```

using 5 MPI processes (3 workers) on my 4-core i5 laptop takes ~2.2 s using the …
(Force-pushed from 3a70f09 to 8ecb3d2.)
This is as polished as I am going to get it. Now just a single object (per Scheduler/Client/Worker) polls for incoming messages and passes them to the appropriate receiver.

On a single multicore machine with 1 thread per worker (and all the debug stuff turned off) it is about 10% slower than TCP. Under these circumstances it is using shared memory. But in a more realistic test across multiple nodes of an Infiniband cluster it is only about half the speed of TCP, so comms were dominating. I suspect that a problem could be selected to perform better than this, e.g. one that sends a smaller number of larger messages. But ultimately the small-message 'chatter' between Scheduler/Client/Workers is slow due to the extra latency introduced by the MPI <-> dask communications. MPI isn't really designed to work in dask's event-driven way. Also, all of the low-level comms code in dask is an unnecessary overhead here, as MPI deals with all of it, but removing it from the distributed MPI code path would be serious refactoring work.

I note that I haven't touched this code for a few months and there have been changes to distributed in the meantime that may make a difference (either way!). It has been an interesting experiment, but it has now dropped off the bottom of my priority list. If someone else wants to continue with it they are welcome to do so.
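To make the final design a bit more concrete, here is a rough, hypothetical sketch of a single per-process poller that dispatches incoming messages to per-tag receivers; the class name, the queue-per-tag registration, and the polling interval are all invented for illustration and are not code from this PR:

```python
import asyncio
from mpi4py import MPI

class MPIPoller:
    """One poller per process: probe once for everything, then dispatch by tag."""

    def __init__(self, comm=MPI.COMM_WORLD, interval=0.005):
        self.comm = comm
        self.interval = interval
        self.queues = {}  # tag -> asyncio.Queue of (source, message)

    def register(self, tag):
        # Each receiver registers the tag it listens on and gets its own queue.
        return self.queues.setdefault(tag, asyncio.Queue())

    async def run(self):
        status = MPI.Status()
        while True:
            # One probe loop per process instead of one per Comm.
            while self.comm.iprobe(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status):
                msg = self.comm.recv(source=status.Get_source(), tag=status.Get_tag())
                queue = self.queues.get(status.Get_tag())
                if queue is not None:
                    queue.put_nowait((status.Get_source(), msg))
            await asyncio.sleep(self.interval)
```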
This is an implementation of the communication layer using MPI, for demonstration purposes. It is far from production standard and much is suboptimal, but I wanted to make the code available for others to look at and discuss. There is a related issue dask/dask-mpi#48.

It comes with a demo to run in `demo/demo.py`. You will need both `mpi4py` and `dask-mpi`, and you can start the demo from the command line using something like …, which will create 5 processes (1 Scheduler, 1 Client and 3 Workers). If you set `want_sleep = True` in the script it will sleep before and after the calculations to give you time to open the dashboard in a web browser. Much logging is performed; each MPI process logs to a different file, e.g. `rank3.log`, which has proved invaluable for debugging.

The actual MPI code is minimal and it fits fairly well into the `Comm`/`Connector`/`Listener`/`Backend` architecture, except that it doesn't work well with `asyncio`, so the handling of asynchronous sends and receives is achieved by regular polling (sketched below), which is far from ideal.

It is too early to talk about performance, but I can't help it. It is poor. And there is a significant difference between `mpi4py` using different MPI implementations, e.g. MPICH vs OpenMPI. But I am mostly concerned about it working correctly at this stage.

Some improvements I have started to think about:

- Use `distributed` utilities to convert between messages as python objects and byte streams rather than the (suspected slow) `mpi4py` equivalents.
- Polling for incoming messages is currently done on a per-`Comm` basis. Doing this once per MPI rank and passing it to the correct `Comm` should be much better.

Certainly the latter requires an understanding of how the higher-level `Scheduler`/`Worker`/`Client` layer interacts with the communications layer, and I will no doubt have some questions about this.

Enjoy!
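Since the description above notes that asynchronous sends and receives are handled by regular polling, here is a minimal sketch of that idea for the send side; the function name, the default interval, and the use of COMM_WORLD are assumptions for illustration, not code from this PR:

```python
import asyncio
from mpi4py import MPI

async def async_send(obj, dest, tag, comm=MPI.COMM_WORLD, interval=0.005):
    """Wrap a non-blocking MPI send in an asyncio-friendly polling loop."""
    request = comm.isend(obj, dest=dest, tag=tag)
    while True:
        completed, _ = request.test()  # non-blocking completion check
        if completed:
            return
        await asyncio.sleep(interval)  # yield to the event loop between polls
```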