setting sync_dist to true for validation metrics #10257

Closed
krishnacpuvvada wants to merge 1 commit into NVIDIA-NeMo:main from krishnacpuvvada:sync_val_metrics

Conversation

@krishnacpuvvada (Collaborator)

What does this PR do ?

Sets sync_dist=True in on_validation_epoch_end in ModelPT.py.
This is in response to the recent Slack discussion on "wiping out of checkpoints mid training and starting from scratch" in Canary training.

Collection: [all] - The change is in ModelPT.py

Changelog

  • Set sync_dist=True in on_validation_epoch_end (see the sketch below)
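
In effect, the change amounts to one line in ModelPT's epoch-end logging; a minimal sketch based on the diff shown later in the review thread:

```python
# Sketch of the changed call in ModelPT.on_validation_epoch_end (per the diff below)
if 'log' in output_dict:
    self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True, reduce_fx='mean')
```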

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
@github-actions bot added the core (Changes to NeMo Core) label on Aug 26, 2024.
@titu1994 (Collaborator) left a comment:

This can cause hangs, but for now let's try it.

@pzelasko (Collaborator)

Looks good.

> This can cause hangs, but for now let's try it.

@titu1994 can you elaborate? Do you expect an uneven number of validation batches per node (why), or something else?

@titu1994 (Collaborator)

Yes, the reason to avoid sync is twofold. First, I think I remember random hangs years ago due to a similar flag; I don't know if it's still an issue now. Second, it causes a global sync across all ranks every single step, which causes a large slowdown once you scale up to many nodes.

@ericharper requested a review from maanug-nv on August 26, 2024 at 20:08.
@pzelasko (Collaborator)

This is on validation epoch end, so it's only a single sync per validation loop. That should be OK both hang-wise and speed-wise.
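
For intuition, a scalar logged with sync_dist=True is averaged across ranks roughly as in the sketch below (plain torch.distributed, not NeMo code; the helper name is illustrative). Logged from on_validation_epoch_end this is a single collective per validation loop; logged every step it would be one collective per batch, which is the slowdown concern above.

```python
import torch
import torch.distributed as dist

def sync_scalar_mean(value: float, device: torch.device) -> float:
    """Roughly what sync_dist=True does for a logged scalar:
    one mean all-reduce across all ranks (illustrative helper)."""
    t = torch.tensor([value], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank must reach this call, or the job hangs
    return (t / dist.get_world_size()).item()
```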

@titu1994 (Collaborator)

Makes sense


 if 'log' in output_dict:
-    self.log_dict(output_dict.pop('log'), on_epoch=True)
+    self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True, reduce_fx='mean')

A collaborator commented on the diff:

Just FYI, reduce_fx doesn't apply to sync_dist; it applies to on_epoch, i.e. the reduction over step values (see the docs).
Since this change is in on_validation_epoch_end(), I'm not sure the reduce_fx='mean' will do anything, but that's the default anyway.
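
A small sketch of the distinction, assuming standard Lightning logging semantics (the metric names and the compute_loss helper are illustrative, not ModelPT code): reduce_fx controls how per-step values are folded into the epoch value, while sync_dist controls whether the result is additionally averaged across ranks.

```python
def validation_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical loss helper
    # reduce_fx governs how these per-step values are aggregated into one
    # epoch-level value; 'mean' is already the default.
    self.log("val_loss", loss, on_epoch=True, reduce_fx="mean")

def on_validation_epoch_end(self):
    # Only one value per rank is logged here, so there is nothing left for
    # reduce_fx to reduce; sync_dist=True is what averages it across ranks.
    self.log("val_loss_epoch", self.val_loss_epoch, sync_dist=True)  # val_loss_epoch: illustrative attribute
```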

A collaborator noted:

Also FYI, the reduction function for syncing over ranks is set here

@maanug-nv (Collaborator) commented on Aug 26, 2024:

Another note, from my experience: sync_dist=True is tricky with parallelisms, especially pipeline parallelism. If a metric doesn't exist on some ranks (e.g. val_loss is only computed on the last pipeline-parallel stage, so we would have to sync it), this can cause a hang. And if the metric is logged as 0.0 on those ranks instead, the mean reduction over ranks can produce the wrong value.
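
One way around that failure mode, rather than relying on sync_dist, is to broadcast the metric from the rank that actually computes it before logging. A minimal sketch in plain torch.distributed; the source rank and pipeline group would come from whatever parallel-state utility the model uses, so both arguments are assumptions here:

```python
import torch
import torch.distributed as dist

def broadcast_val_loss(val_loss, src_rank, pp_group, device):
    """Make val_loss identical on every rank of the pipeline group by
    broadcasting it from the stage that owns it (src_rank), so later
    logging cannot hang or average in placeholder zeros from other stages."""
    t = torch.tensor([val_loss if val_loss is not None else 0.0], device=device)
    dist.broadcast(t, src=src_rank, group=pp_group)
    return t.item()
```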

@github-actions (bot)

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update it, or this will be closed in 7 days.

@github-actions bot added the stale label on Sep 10, 2024.
@github-actions (bot)

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions bot closed this on Sep 18, 2024.
@pzelasko (Collaborator)

@krishnacpuvvada @titu1994 what's the verdict on this one? Do we need a Canary-specific mechanism to synchronize the validation metrics if it can't be done at the modelPT level?

@titu1994 (Collaborator)

(1) Don't push this to ModelPT.
(2) If Canary needs it, it can override this function specifically and copy-paste the code (for now).

Do note: @nithinraok plans to remove ModelPT-level support for multiple validation and test data loaders, which basically means every model will have to handle this itself anyway.
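
A rough sketch of that per-model override, assuming the subclass reproduces ModelPT's epoch-end aggregation (the class name and the _aggregate_validation_outputs helper are placeholders for the copied code; the import path may vary by NeMo version):

```python
from nemo.core.classes import ModelPT  # import path may differ across NeMo versions


class MyCanaryModel(ModelPT):  # placeholder name; the real model has its own base class
    def on_validation_epoch_end(self):
        # Reproduce ModelPT's epoch-end aggregation here, then log with
        # sync_dist=True only for this model, leaving ModelPT untouched.
        output_dict = self._aggregate_validation_outputs()  # placeholder for the copied ModelPT logic
        if output_dict and 'log' in output_dict:
            self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True)
```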


Labels

core (Changes to NeMo Core), Run CICD, stale
