Overview
Currently, the way Low Level Zero in ColossalAI shards the optimizer states (OS) at initialization may lead to an unbalanced load across ranks. For instance, if a model has 5 parameter tensors (each containing many elements) and we run ZeRO-DP on 8 GPUs, then with the current implementation 3 GPUs hold no optimizer states at all. Thus, a sharding method that splits all parameters evenly is needed.
Goal
With the new sharding method, each rank should hold a similar total number of parameter elements.
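One way to achieve this balance is a greedy longest-processing-time (LPT) assignment: sort parameters by element count and repeatedly give the next one to the least-loaded rank. The sketch below is illustrative only; the function name `shard_params_evenly` and its inputs are assumptions, not ColossalAI's actual API, and it assigns whole tensors rather than splitting them.

```python
import heapq

def shard_params_evenly(numel_list, world_size):
    """Map each parameter index to a rank, balancing total element
    counts across ranks (greedy LPT heuristic).

    numel_list: number of elements in each parameter tensor.
    world_size: number of ranks in the ZeRO-DP group.
    """
    # Min-heap of (total elements assigned to rank, rank id).
    heap = [(0, rank) for rank in range(world_size)]
    heapq.heapify(heap)
    assignment = [None] * len(numel_list)
    # Place larger parameters first for a tighter balance.
    for idx in sorted(range(len(numel_list)), key=lambda i: -numel_list[i]):
        load, rank = heapq.heappop(heap)
        assignment[idx] = rank
        heapq.heappush(heap, (load + numel_list[idx], rank))
    return assignment
```

With 5 parameters and 8 ranks this still leaves 3 ranks empty (whole-tensor granularity cannot fix that case); splitting each tensor into `world_size` even chunks would be the stronger variant, at the cost of extra bookkeeping when gathering updated parameters.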