Thomas/olruwase/sync layer norms #277
Conversation
# We use rank 1 as source of truth and send the new
torch.distributed.broadcast(
    torch_rng_state,
    src=mpu.get_tensor_model_parallel_src_rank() + 1,
may I ask why rank 1 and not rank 0?
(which would fail with tp=1, but I'm aware that this is a hack for our particular case with tp=4 so should be fine)
Ranks 1, 2, and 3 were synchronized; rank 0 was out of sync. So I thought the path of least change was to force-match 0 to the rest, instead of matching everyone to 0.
gotcha - makes sense! thank you for explaining.
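For readers following along, here is a minimal, self-contained sketch of the force-sync being discussed, assuming Megatron-style mpu helpers (get_tensor_model_parallel_src_rank as in the snippet above, plus get_tensor_model_parallel_group), a NCCL backend, and that the CPU torch RNG state is what is being synced; it illustrates the approach, not the PR's exact code.

```python
import torch
import torch.distributed as dist

def sync_torch_rng_state(mpu):
    """Force every tensor-parallel rank onto the same torch RNG state.

    Mirrors the snippet above: the rank right after the tensor-parallel
    source rank (i.e. rank 1 within the group) is the source of truth, so
    the out-of-sync rank 0 gets overwritten. The group argument and the
    get/set_rng_state handling are assumptions for illustration only.
    """
    # torch.get_rng_state() returns a CPU ByteTensor; with a NCCL backend
    # the tensor has to live on the GPU to be broadcast.
    torch_rng_state = torch.get_rng_state().cuda()

    # Broadcast from global rank (tensor-parallel source rank + 1).
    dist.broadcast(
        torch_rng_state,
        src=mpu.get_tensor_model_parallel_src_rank() + 1,
        group=mpu.get_tensor_model_parallel_group(),
    )

    # All ranks, including the previously out-of-sync one, adopt the state.
    torch.set_rng_state(torch_rng_state.cpu())
```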
So is this fix related to #276?
Yes, sorry, this fix is going to be merged onto the force-syncing branch of Meg-DS, olruwase/sync_layer_norms. It implements the force-syncing strategy for the case where the current training is out of sync. It currently works only because the sampling we use is not random (otherwise we couldn't guarantee that we don't see duplicated samples).
#276 is the general fix that should be integrated into master.
…p#277)
* Introduce LayerNorm optimization from NVIDIA/apex#1715
* Fix args call
* Ad-hoc apex version check
* Remove unnecessary TransformerConfig arg
Syncing of torch random state.