First of all, thanks for the paper!
It was very intriguing to view model parallelism as an optimization problem in itself.
I wonder how such scheduling would work in a fully decentralized system.
Naively, you could run it concurrently on all nodes in hope that they find the same solution.
However, this naive option may be difficult to implement in geographically distributed networks: if nodes observe slightly different network bandwidth, or if they take network measurements at different times, they may end up with different solutions.
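To make the concern concrete, here is a toy sketch. It is not the scheduler from the paper: `plan_stages`, the node names, and the bandwidth numbers are all made up for illustration. The point is only that a fully deterministic rule, run independently on two nodes, can still yield different plans when the inputs differ by a fraction of a percent.

```python
def plan_stages(bandwidth_by_node):
    """Toy deterministic scheduler (hypothetical, not the paper's algorithm):
    rank nodes by observed bandwidth, break ties by name, and assign
    pipeline stages in that order."""
    ranked = sorted(bandwidth_by_node, key=lambda n: (-bandwidth_by_node[n], n))
    return {node: stage for stage, node in enumerate(ranked)}


# Made-up measurements (Mbit/s) of the same three peers,
# taken by two different nodes at slightly different times.
seen_by_a = {"n1": 100.0, "n2": 99.5, "n3": 40.0}
seen_by_b = {"n1": 99.0, "n2": 99.8, "n3": 41.0}

plan_a = plan_stages(seen_by_a)
plan_b = plan_stages(seen_by_b)

print(plan_a)
print(plan_b)
print("consistent:", plan_a == plan_b)
```

Here the two nodes disagree on which of `n1` and `n2` should take the first stage, even though each ran exactly the same code.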
Is there a way to guarantee that the resulting schedule is consistent across all nodes?
I mean, you can always elect a "leader" or let nodes vote on the solution, but perhaps there is a more natural way to approach this.
What would you suggest?
P.S. Another group that I'm in close contact with faced a similar issue in their paper, and they ended up with a heuristic load-balancing rule where nodes greedily switch pipeline stages. However, unlike your work, they do not prove that such a rule leads to optimal throughput.