Motivation
We have three main components related to process group initialization:
- Global parallel context
- Device mesh
- Process group manager
Global parallel context is compatible with all common kinds of parallelism, but it has the following drawbacks:
- It's global, which makes it inflexible
- It's deeply coupled with the parallel methods, which makes it hard to extend
- Some names are confusing, e.g. local_rank
Device mesh describes how a tensor is stored. It works well for tensor parallelism, but not for other kinds of parallelism.
Process group manager is just a dict of process groups, which is too simple to handle complex ND-parallelism scenarios.
In conclusion, we need a component that is:
- Totally decoupled from the parallel methods
- Not global
- Able to handle complex ND-parallelism easily
Process group mesh
Process group mesh describes how process groups are organized. It is not coupled with any parallel method, yet it makes it easy to initialize process groups in ND-parallelism scenarios.
It is a helper (utility) class: it only initializes process groups and caches them. The concrete parallel method manages them.
An N-dimensional tuple describes a process group mesh. For example, ProcessGroupMesh(2, 2, 2) is a 3D cube of 8 processes. An N-dimensional coordinate identifies each process. For example, (0, 1, 0) is the process whose rank is 2 in the mesh above. In the classic 3D-parallelism scenario, each parallel method takes one axis: data parallelism takes axis 0, pipeline parallelism takes axis 1, and tensor parallelism takes axis 2. Process group mesh provides a method to create a group along an axis, so 3D parallelism is easy to handle.
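The coordinate-to-rank mapping and the "group along an axis" idea can be sketched as below. This is an illustrative sketch only, not the class added by this PR: MeshSketch and its method names are hypothetical, it assumes ranks are laid out in row-major order (which agrees with the (0, 1, 0) example above), and it only computes rank lists rather than creating real process groups.

```python
from typing import Dict, List, Tuple


class MeshSketch:
    """Hypothetical stand-in for a process group mesh (rank bookkeeping only)."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape
        # Cache keyed by the ranks of a group, so each group is built once.
        self._cache: Dict[Tuple[int, ...], List[int]] = {}

    def ravel(self, coord: Tuple[int, ...]) -> int:
        """Coordinate -> rank, assuming row-major layout."""
        rank = 0
        for size, idx in zip(self.shape, coord):
            rank = rank * size + idx
        return rank

    def unravel(self, rank: int) -> Tuple[int, ...]:
        """Rank -> coordinate, the inverse of ravel."""
        coord = []
        for size in reversed(self.shape):
            coord.append(rank % size)
            rank //= size
        return tuple(reversed(coord))

    def ranks_along_axis(self, rank: int, axis: int) -> List[int]:
        """Ranks that share every coordinate with `rank` except along `axis`."""
        coord = list(self.unravel(rank))
        ranks = []
        for i in range(self.shape[axis]):
            coord[axis] = i
            ranks.append(self.ravel(tuple(coord)))
        key = tuple(ranks)
        if key not in self._cache:
            # A real implementation would create and store a process group
            # here, e.g. via torch.distributed.new_group(ranks=ranks).
            self._cache[key] = ranks
        return self._cache[key]


mesh = MeshSketch(2, 2, 2)
dp_ranks = mesh.ranks_along_axis(2, axis=0)  # data parallel peers of rank 2
pp_ranks = mesh.ranks_along_axis(2, axis=1)  # pipeline parallel peers of rank 2
tp_ranks = mesh.ranks_along_axis(2, axis=2)  # tensor parallel peers of rank 2
```

With this layout, each parallel method only needs to know its axis index; the mesh itself stays unaware of what the axis is used for.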