Make dataloader use another random generator #276
Conversation
From https://pytorch.org/docs/stable/data.html#randomness-in-multi-process-data-loading:
How do you interpret this writeup?
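For reference, the linked section explains that, unless a dedicated generator is supplied, the DataLoader derives its workers' base seed from the default RNG of the process that creates it. A minimal sketch of that knob, using a toy dataset rather than the Megatron data pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(16, dtype=torch.float32))

# Default behaviour: when an iterator is created, the workers' base_seed is
# drawn from this process's default RNG, advancing its state as a side effect.
default_loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)

# With an explicit generator, base_seed (and the shuffling order) come from `g`
# instead, leaving the default RNG untouched.
g = torch.Generator()
g.manual_seed(42)
isolated_loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2,
                             generator=g)
```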
The issue was
Why won't the other ranks? I think by "main process" it refers to the process that spawns the dataloader workers. Won't that main process be each rank's own process with respect to its workers? i.e. there are multiple main processes in this context. Since I haven't investigated this: have you validated that this fix actually changes something? I assume you have, but you haven't described it in the OP, hence the asking.
Megatron-DeepSpeed/megatron/training.py, lines 1170 to 1174 in 87a9dba

`tp_rank = 0` if you go up the file a bit. The main process here will refer to `tp_rank=0`.
Yes, I was able to reproduce the discrepancy in
My bad, since we talked about this on Slack I didn't copy-paste the thread.
stas00 left a comment
Thank you for adding the missing sync, Thomas
* sync layer norms
* all_reduce is an in_place operation
* Make dataloader use another random generator (#276)
* do all_reduce op.AVG directly
* add eval dataloader deadlock workaround
* revert generator sync
* make auto-sync configurable; basic test; cleanup
* test with updated AMI image
* fix unrelated test

Co-authored-by: thomasw21 <24695242+thomasw21@users.noreply.github.com>
* universal-ckp: fix gpt model param names
* universal-ckp: reconfigure model parameter rng tracker
  When loading from a universal checkpoint with a different model parameter configuration, the loaded tensor parallel RNG tracker states are incorrect. In this case, we reconfigure the tensor parallel RNG tracker states with new seed values (each tp rank with a unique seed). We add an offset=iteration to the base seed to ensure that when we load multiple times from a universal checkpoint, we use a different random sequence at each run. This commit requires a counterpart change in the DeepSpeed repo.
* universal-ckp: remove embedding norm patterns
  Embedding norm patterns originate from Bloom but are not in vanilla GPT, so remove the patterns.

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
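For illustration, a hypothetical helper sketching the reseeding scheme that commit describes (the name and the exact seed combination are assumptions, not the actual DeepSpeed/Megatron API):

```python
import torch

def reseed_tp_rng(base_seed: int, iteration: int, tp_rank: int) -> torch.Generator:
    # Hypothetical sketch: each tp rank gets a unique seed, offset by the current
    # iteration so repeated loads from a universal checkpoint do not replay the
    # same random sequence. The exact combination below is assumed.
    seed = base_seed + iteration + tp_rank
    g = torch.Generator()
    g.manual_seed(seed)
    return g
```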
This is in order to start synchronizing dropout across TP ranks.
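A minimal sketch of the idea (illustrative only, not the actual Megatron-DeepSpeed change): give the dataloader its own torch.Generator, so that creating and iterating it never advances the default RNG on the rank that builds the data pipeline, and the default RNG state stays aligned across TP ranks for anything seeded from it later (e.g. dropout).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1234)  # same base seed on every TP rank

dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

# Dedicated generator for the data pipeline: shuffling and the per-iterator
# base_seed are drawn from it instead of from the default RNG.
data_generator = torch.Generator()
data_generator.manual_seed(1234)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=data_generator)

_ = next(iter(loader))  # consumes state from data_generator only

# The default RNG is untouched by the dataloader, so this draw is identical on
# every rank that executed the same code, keeping such draws in sync across TP.
print(torch.rand(1))
```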