setting sync_dist to true for validation metrics #10257

Closed
krishnacpuvvada wants to merge 1 commit into NVIDIA-NeMo:main from krishnacpuvvada:sync_val_metrics

Conversation

@krishnacpuvvada (Collaborator)

What does this PR do ?

Sets sync_dist=True in on_validation_epoch_end in ModelPT.py.
This is in response to the recent Slack discussion on "wiping out of checkpoints mid training and starting from scratch" in Canary training.

Collection: [all] - The change is in ModelPT.py

Changelog

  • Set sync_dist=True in on_validation_epoch_end (see the sketch below)
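
In effect, the change amounts to one line in ModelPT's epoch-end logging; a minimal sketch based on the diff shown later in the review thread:

```python
# Sketch of the changed call in ModelPT.on_validation_epoch_end (per the diff below)
if 'log' in output_dict:
    self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True, reduce_fx='mean')
```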

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
@github-actions bot added the core (Changes to NeMo Core) label on Aug 26, 2024.
@titu1994 (Collaborator) left a comment:

This can cause hangs, but for now let's try it.

@pzelasko (Collaborator)

Looks good.

> This can cause hangs, but for now let's try it.

@titu1994 can you elaborate? Do you expect an uneven number of validation batches per node (why), or something else?

@titu1994 (Collaborator)

Yes, the reason to avoid sync is twofold. First, I think I remember random hangs years ago due to a similar flag; I don't know if it's still an issue now. Second, it causes a global sync across all ranks every single step, which causes a large slowdown once you scale up to many nodes.

@ericharper requested a review from maanug-nv on August 26, 2024 at 20:08.
@pzelasko (Collaborator)

This is on validation epoch end, so it's only a single sync per validation loop. That should be OK both hang-wise and speed-wise.
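
For intuition, a scalar logged with sync_dist=True is averaged across ranks roughly as in the sketch below (plain torch.distributed, not NeMo code; the helper name is illustrative). Logged from on_validation_epoch_end this is a single collective per validation loop; logged every step it would be one collective per batch, which is the slowdown concern above.

```python
import torch
import torch.distributed as dist

def sync_scalar_mean(value: float, device: torch.device) -> float:
    """Roughly what sync_dist=True does for a logged scalar:
    one mean all-reduce across all ranks (illustrative helper)."""
    t = torch.tensor([value], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank must reach this call, or the job hangs
    return (t / dist.get_world_size()).item()
```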

@titu1994 (Collaborator)

Makes sense


 if 'log' in output_dict:
-    self.log_dict(output_dict.pop('log'), on_epoch=True)
+    self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True, reduce_fx='mean')

A collaborator commented on the diff:

Just FYI, reduce_fx doesn't apply to sync_dist; it applies to on_epoch, i.e. the reduction over step values (see the docs).
Since this change is in on_validation_epoch_end(), I'm not sure the reduce_fx='mean' will do anything, but that's the default anyway.
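
A small sketch of the distinction, assuming standard Lightning logging semantics (the metric names and the compute_loss helper are illustrative, not ModelPT code): reduce_fx controls how per-step values are folded into the epoch value, while sync_dist controls whether the result is additionally averaged across ranks.

```python
def validation_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical loss helper
    # reduce_fx governs how these per-step values are aggregated into one
    # epoch-level value; 'mean' is already the default.
    self.log("val_loss", loss, on_epoch=True, reduce_fx="mean")

def on_validation_epoch_end(self):
    # Only one value per rank is logged here, so there is nothing left for
    # reduce_fx to reduce; sync_dist=True is what averages it across ranks.
    self.log("val_loss_epoch", self.val_loss_epoch, sync_dist=True)  # val_loss_epoch: illustrative attribute
```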

A collaborator noted:

Also FYI, the reduction function for syncing over ranks is set here

@maanug-nv (Collaborator) commented on Aug 26, 2024:

Another note, from my experience: sync_dist=True is tricky with parallelisms, especially pipeline parallelism. If a metric doesn't exist on some ranks (e.g. val_loss is only computed on the last pipeline-parallel stage, so we would have to sync it), this can cause a hang. And if the metric is logged as 0.0 on those ranks instead, the mean reduction over ranks can produce the wrong value.
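
One way around that failure mode, rather than relying on sync_dist, is to broadcast the metric from the rank that actually computes it before logging. A minimal sketch in plain torch.distributed; the source rank and pipeline group would come from whatever parallel-state utility the model uses, so both arguments are assumptions here:

```python
import torch
import torch.distributed as dist

def broadcast_val_loss(val_loss, src_rank, pp_group, device):
    """Make val_loss identical on every rank of the pipeline group by
    broadcasting it from the stage that owns it (src_rank), so later
    logging cannot hang or average in placeholder zeros from other stages."""
    t = torch.tensor([val_loss if val_loss is not None else 0.0], device=device)
    dist.broadcast(t, src=src_rank, group=pp_group)
    return t.item()
```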

@github-actions (bot)

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update it, or this will be closed in 7 days.

@github-actions bot added the stale label on Sep 10, 2024.
@github-actions (bot)

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions bot closed this on Sep 18, 2024.
@pzelasko (Collaborator)

@krishnacpuvvada @titu1994 what's the verdict on this one? Do we need a Canary-specific mechanism to synchronize the validation metrics if it can't be done at the modelPT level?

@titu1994 (Collaborator)

(1) Don't push this to ModelPT.
(2) If Canary needs it, it can override this function specifically and copy-paste the code (for now).

Do note: @nithinraok plans to remove ModelPT-level support for multiple validation and test data loaders, which basically means every model will have to handle this itself anyway.
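
A rough sketch of that per-model override, assuming the subclass reproduces ModelPT's epoch-end aggregation (the class name and the _aggregate_validation_outputs helper are placeholders for the copied code; the import path may vary by NeMo version):

```python
from nemo.core.classes import ModelPT  # import path may differ across NeMo versions


class MyCanaryModel(ModelPT):  # placeholder name; the real model has its own base class
    def on_validation_epoch_end(self):
        # Reproduce ModelPT's epoch-end aggregation here, then log with
        # sync_dist=True only for this model, leaving ModelPT untouched.
        output_dict = self._aggregate_validation_outputs()  # placeholder for the copied ModelPT logic
        if output_dict and 'log' in output_dict:
            self.log_dict(output_dict.pop('log'), on_epoch=True, sync_dist=True)
```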


Labels

core (Changes to NeMo Core), Run CICD, stale
