Eliminate multi-tensor LAMB reduction in favor of applying reduce_square_sum to each tensor #6023

Closed
suffiank wants to merge 8 commits into master from sukha/simplifylambreduce

Conversation

Contributor

@suffiank suffiank commented Dec 3, 2020

Description: Eliminate the multi-tensor LAMB reduction in favor of invoking reduce_square_sum individually on each tensor. This simplifies the existing code but comes at a ~1% performance reduction for BERT-L (sequence length 128, batch size 64, gradient accumulation 1). The cost will be smaller for gradient accumulation > 1, as in a more realistic training scenario.

Motivation and Context

  • Why is this change required? What problem does it solve?
    This simplifies the existing code and makes it deterministic. Alternatively, the existing multi-tensor LAMB reduction kernel could be made deterministic by fixing the order of the reduction across thread blocks; that approach would not reduce performance.
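The idea of the change can be sketched as follows. This is a hypothetical illustration in NumPy, not the actual ONNX Runtime CUDA kernel: each tensor gets its own squared-sum reduction with a fixed summation order, so repeated runs produce identical results. The function names `reduce_square_sum` and `lamb_norms` here only mirror the PR's description.

```python
import numpy as np

def reduce_square_sum(tensor):
    """Deterministic squared L2 norm of a single tensor.

    The summation order over the flattened tensor is fixed, unlike a
    multi-tensor reduction whose thread-block scheduling can vary run to run.
    """
    flat = np.ravel(tensor)
    return float(np.dot(flat, flat))

def lamb_norms(weights, grads):
    """Compute the per-tensor norms LAMB needs for its trust ratio by
    applying reduce_square_sum to each tensor individually, instead of
    one fused multi-tensor reduction over all tensors at once."""
    w_norms = [reduce_square_sum(w) for w in weights]
    g_norms = [reduce_square_sum(g) for g in grads]
    return w_norms, g_norms
```

The trade-off described in the PR is that launching one reduction per tensor is simpler and reproducible, at the cost of more kernel launches than a single fused multi-tensor reduction.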



@wschin wschin left a comment


Let's merge another PR which nicely fixes the randomness of Reduce in Lamb.

@suffiank suffiank closed this Dec 4, 2020
Contributor Author

suffiank commented Dec 4, 2020

> Let's merge another PR which nicely fixes the randomness of Reduce in Lamb.

Agreed.

@suffiank suffiank deleted the sukha/simplifylambreduce branch February 19, 2021 12:04