Motivation
The current sharding method is:
- Create a new parameter
- Shard the new parameter
- Shard the old parameter
- Copy the old parameter shard to the new one
This has the following shortcomings:
- Not memory efficient, since the old and new parameters both exist until the copy is done
- Must handle tied weights again (after sharding)
- Must update the optimizer's param groups (if using lazy init and sharding after optimizer init)
Method
Thus, I think we can shard parameters in place. This is memory efficient, and because the parameter objects themselves are kept, we don't need to handle tied weights or param groups again.
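
To illustrate why keeping the parameter object matters, here is a minimal toy sketch in plain Python. It is not the code in this PR: `Param`, `shard_slice`, and `shard_inplace` are hypothetical names, `Param` only stands in for a framework parameter such as `torch.nn.Parameter`, and the even contiguous split is an assumption made for the example.

```python
class Param:
    """Hypothetical stand-in for a framework parameter (not a real library class)."""

    def __init__(self, data):
        self.data = list(data)


def shard_slice(values, rank, world_size):
    """Return the contiguous chunk of `values` owned by `rank` (assumed layout)."""
    chunk = (len(values) + world_size - 1) // world_size  # ceil division
    return values[rank * chunk:(rank + 1) * chunk]


def shard_inplace(param, rank, world_size):
    """Swap the parameter's storage for its shard; the object identity is unchanged."""
    param.data = shard_slice(param.data, rank, world_size)
    return param


# One weight tied between two modules and already registered with an optimizer.
weight = Param(range(8))
embedding = {"weight": weight}
lm_head = {"weight": weight}            # tied: both refer to the same object
param_groups = [{"params": [weight]}]   # optimizer built before sharding

shard_inplace(weight, rank=1, world_size=4)

print(weight.data)                                # [2, 3]
print(embedding["weight"] is lm_head["weight"])   # True: tie is intact
print(param_groups[0]["params"][0] is weight)     # True: param group is still valid
```

Since only `param.data` is reassigned, every existing reference (tied modules, optimizer param groups) sees the sharded storage without any extra bookkeeping, and no second full-size parameter is allocated.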