
less scary overflow notice#833

Merged
jeffra merged 4 commits into deepspeedai:master from stas00:less-scary-overflow
Mar 11, 2021

Conversation

stas00 (Collaborator) commented Mar 8, 2021

This all-caps OVERFLOW in:

[deepspeed] OVERFLOW! Skipping step. Attempted loss

is quite intimidating for users and makes them feel that something is wrong.

This PR suggests a more info-style, no-caps version:

[deepspeed] Overflow! Skipping step. Attempted loss

The bigger problem is that the "overflow" message lacks context. It would be more informative and less confusing to say specifically what has overflowed. Could it perhaps say:

[deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss

? Then users can act on it, since they know where to look for information on adjusting this if they prefer the optimizer to work from step 1.
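For reference, the knobs users would reach for live in the `fp16` section of the DeepSpeed config. A sketch (key names and defaults as I recall them from the DeepSpeed config docs, so treat the exact values as illustrative): `loss_scale: 0` selects dynamic scaling, and a lower `initial_scale_power` means fewer overflow/skipped steps at the start of training.

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```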

It would be even better, IMHO, if only one message were printed, e.g. at the point when the first step is finally taken, as in:

First 22 steps were skipped due to fp16 dynamic loss scale overflow, starting stepping from step 23.

Otherwise there is a huge flurry of these messages.
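To make the behavior being discussed concrete, here is a minimal sketch (not DeepSpeed's actual implementation) of how a dynamic fp16 loss scaler typically works: on overflow it skips the optimizer step and halves the scale; after a window of consecutive good steps it doubles the scale. It also counts skipped steps so that a single summary line, like the one suggested above, can be printed once the first real step is taken, instead of one message per skipped step. The class and method names are hypothetical.

```python
import math


class DynamicLossScaler:
    """Sketch of dynamic loss scaling with a one-time skip summary."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, scale_window=1000):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0       # consecutive steps without overflow
        self.skipped = 0          # steps skipped so far
        self.first_step_done = False

    def step(self, grads):
        # Overflow check: any inf/nan in the (scaled) gradients.
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            self.skipped += 1
            self.good_steps = 0
            # Back off the scale, never below 1.0.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            return False  # step skipped
        if not self.first_step_done and self.skipped:
            # Single summary instead of a flurry of per-step messages.
            print(f"First {self.skipped} steps were skipped due to fp16 "
                  f"dynamic loss scale overflow, starting stepping from "
                  f"step {self.skipped + 1}.")
        self.first_step_done = True
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            self.scale *= self.scale_factor  # grow back after a calm window
        return True  # optimizer step taken
```

The summary-message approach trades per-step visibility for a quieter log; the skip count is still available on the scaler if a user wants the details.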

stas00 (Collaborator, Author) commented Mar 11, 2021

OK, as discussed, I changed it to:

[deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss 

I also fixed the [deepscale] prefix; I hear that was the original name.

@jeffra jeffra merged commit 29853c3 into deepspeedai:master Mar 11, 2021
jeffra added a commit to jeffra/DeepSpeed that referenced this pull request Aug 25, 2021
* set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (deepspeedai#844)

* less scary overflow notice (deepspeedai#833)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Add optimizers and schedules to RTD and updated the corresponding part in the website (deepspeedai#799)

* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

* small tweaks (deepspeedai#839)

* Control ZeRO wall clock timers (deepspeedai#849)

* Control ZeRO wall clock timers

* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (deepspeedai#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (deepspeedai#827)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline doc typos/improvements (deepspeedai#659)

Admin merging for pure-doc PR that does not trigger build.

* Samyamr/inference hook fix (deepspeedai#851)

* Fix mis-aligned-grad

When a parameter is not divisible by world size, the partitioned gradients are misaligned due to incorrect padding handling. This PR should fix that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO Stage 2: Clear reduced gradients (deepspeedai#856)

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Squash stage3 v1 (deepspeedai#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* formatting fix (deepspeedai#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (deepspeedai#151)

* fp16 Z3 API update and bugfix

* revert debug change

* docs

* filling in allocation docs

* better assumption docs

* doc progress

* config json

* major docs edits

* auto registration works for accessed cases

* working on small models.

* debugging large-model discovery?

* fix discovery to first forward pass?

* return obj ext param

* support None parameters in auto-discovery

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>