setting sync_dist to true for validation metrics #10257
krishnacpuvvada wants to merge 1 commit into NVIDIA-NeMo:main
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
|
Looks good.
@titu1994 can you elaborate? Do you expect an uneven number of validation batches per node (and if so, why), or something else?
|
Yes, the reason to avoid sync is twofold. First, I remember random hangs years ago due to a similar flag; I don't know if it's still an issue now. Second, it causes a global sync across all ranks every single step, which causes a large slowdown once you scale up nodes.
|
This is in on_validation_epoch_end, so there is only a single sync per validation loop. That should be OK both hang-wise and speed-wise.
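As a rough illustration of the speed argument, the sketch below counts cross-rank reductions for one validation loop under the two placements discussed here. The function name and the numbers are hypothetical, and "one collective per synced metric per logging call" is a simplifying assumption that models the concern raised above. It does not inspect what Lightning actually does, which may depend on the on_step/on_epoch settings and the Lightning version.

```python
def count_syncs(num_val_batches: int, num_metrics: int, sync_in_step: bool) -> int:
    """Count cross-rank reductions in one validation loop (simplified model).

    Assumption: every synced metric costs one collective each time it is logged.
    """
    if sync_in_step:
        # Synced logging inside validation_step: one reduction per batch per metric.
        return num_val_batches * num_metrics
    # Synced logging only in on_validation_epoch_end: one reduction per metric.
    return num_metrics


# Hypothetical numbers: 500 validation batches and 2 logged metrics.
per_step = count_syncs(500, 2, sync_in_step=True)
per_epoch = count_syncs(500, 2, sync_in_step=False)
print(per_step, per_epoch)
```

Under this model the epoch-end placement costs a fixed number of reductions regardless of how many validation batches there are, which is the point made in the comment above.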
|
Makes sense |
|
  if 'log' in output_dict:
-     self.log_dict(output_dict.pop('log'), on_epoch=True)
+     self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True, reduce_fx='mean')
Just FYI, reduce_fx doesn't apply to sync_dist; it applies to on_epoch, i.e. the reduction over step values (see the docs).
Since this change is in on_validation_epoch_end(), I'm not sure the reduce_fx='mean' will do anything, but that's the default anyway.
Also FYI, the reduction function for syncing over ranks is set here
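To make the distinction in these two comments concrete, here is a minimal pure-Python sketch of the two separate reductions: one over step values on a single rank (the role of reduce_fx with on_epoch=True) and one over ranks (the role of sync_dist=True). The per-rank values and helper names are made up for illustration, and modeling the cross-rank reduction as a mean is an assumption here. No Lightning code is called.

```python
from statistics import mean

# Hypothetical per-step validation losses on two ranks (made-up values).
step_values_per_rank = {
    0: [1.0, 2.0, 3.0],
    1: [3.0, 4.0, 5.0],
}


def reduce_over_steps(values, reduce_fx=mean):
    """Epoch-level reduction on ONE rank: the role reduce_fx plays with on_epoch=True."""
    return reduce_fx(values)


def reduce_over_ranks(per_rank_values):
    """Cross-rank reduction: the role of sync_dist=True, modeled here as a mean."""
    return mean(list(per_rank_values))


# Step 1: each rank reduces its own step values independently.
epoch_value_per_rank = {
    rank: reduce_over_steps(values) for rank, values in step_values_per_rank.items()
}

# Step 2: only with a cross-rank sync do all ranks end up with the same number.
synced_value = reduce_over_ranks(epoch_value_per_rank.values())

print(epoch_value_per_rank, synced_value)
```

In this toy setup the ranks hold different epoch values (2.0 and 4.0) until the cross-rank step produces a single shared value (3.0). A value logged once in an epoch-end hook has only one "step", so a reduction over steps leaves it unchanged, which matches the doubt expressed above about reduce_fx having any effect there.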
|
Another note, from my experience, |
|
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
|
This PR was closed because it has been inactive for 7 days since being marked as stale. |
|
@krishnacpuvvada @titu1994 what's the verdict on this one? Do we need a Canary-specific mechanism to synchronize the validation metrics if it can't be done at the modelPT level? |
|
(1) Don't push this to ModelPT. Do note: @nithinraok plans to remove ModelPT-level support for multiple validation and test data loaders, which would mean every model has to handle this manually anyway.
What does this PR do ?
Sets sync_dist=True for on_validation_epoch_end in ModelPT.py.
This is in response to the recent Slack discussion on "wiping out of checkpoints mid training and starting from scratch" in Canary training.
Collection: [all] - The change is in ModelPT.py
Changelog
on_validation_epoch_end
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".