IFU-master-2021-09-14 #40

rraminen · 2021-09-14T18:27:55Z

Integrating from upstream

Co-authored-by: Alex Muzio <Alex.Muzio@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Felipe Cruz Salinas <Andres.Cruz@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Shaden Smith <shaden.smith@microsoft.com> Co-authored-by: Young Jin Kim <youki@microsoft.com> Co-authored-by: bapatra <bapatra@microsoft.com>

* restore fp16 params if no zero ckpts available * formatting

…dai#1316) * Callable option for optimizer and scheduler * Add unit test * Formatting * Disable debug prints * Use base optimizer to construct lr scheduler * Formatting * Remove dead import

…pspeedai#1244) Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

…se (deepspeedai#1309) * add more synchronizations and barriers for resolving gpu-halt issue * removing unuseful broadcasts

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Rename PA_TO_cpu * Code cleanup * Revert accidental change

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Remove the wrong function with duplicate name * fix format. * add mpu check. fix tests.

@tjruwase

* Added drop_last to DeepSpeedDataLoader This solves issue deepspeedai#326 * Updated drop_last in engine.py added drop_last as a ds_config as mentioned by @tjruwase * Update engine.py * Update engine.py * updated config.py and constants.py * Update constants.py * added dataloader_ prefix * Update dataloader.py * corrected yapf test errors * Update test_data.py Added dataloader_drop_last unit test * Corrected yapf and formatting issues * updated simple_model.py and test_data.py * Update simple_model.py * pre-commit fix * corrected issues * Update test_data.py * Update test_data.py * Update test_data.py * Update test_data.py * removed batch_size from test_data.py * Update simple_model.py * Update test_data.py * Update test_data.py * Fix unit test issues * Use fp32 to make things work Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
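For readers unfamiliar with the option, the effect of `drop_last` can be sketched in plain Python. This is only an illustration of the batching behavior, not the DeepSpeedDataLoader code. The `batches` helper and the sample data are hypothetical. Only the key name `dataloader_drop_last` is taken from the commit notes above, and the exact config layout is an assumption.

```python
from typing import Iterator, List, Sequence


def batches(samples: Sequence[int], batch_size: int, drop_last: bool) -> Iterator[List[int]]:
    """Yield consecutive batches; optionally skip a trailing partial batch.

    Hypothetical helper for illustration only, not part of DeepSpeed.
    """
    for start in range(0, len(samples), batch_size):
        batch = list(samples[start:start + batch_size])
        if drop_last and len(batch) < batch_size:
            # Only the final batch can be short, so stopping here is safe.
            return
        yield batch


# Assumed config fragment: only the key name comes from the commit notes.
ds_config = {"dataloader_drop_last": True}

samples = list(range(10))
kept = list(batches(samples, 4, drop_last=ds_config["dataloader_drop_last"]))
full = list(batches(samples, 4, drop_last=False))
```

With ten samples and a batch size of four, `kept` holds two full batches, while `full` also carries the final two-sample batch.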


* Added 4-byte alignment on NCCL/RCCL * pre-commit formatting fixes * Fix for checkpoint loading with optimizer partitioning * Better assert print * Added unit tests for nccl/rccl 4-byte alignment * bug * Updated alignment to implicit Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
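As a rough illustration of what byte alignment of a communication buffer involves, the sketch below rounds an element count up so that the buffer size in bytes is a multiple of four. It is a generic padding calculation under stated assumptions, not the code from this commit. The function name `aligned_numel` and its parameters are hypothetical.

```python
def aligned_numel(numel: int, element_size: int, alignment_bytes: int = 4) -> int:
    """Round numel up so that numel * element_size is a multiple of alignment_bytes.

    Hypothetical helper; assumes element_size evenly divides alignment_bytes
    (for example 2-byte fp16 or 4-byte fp32 elements with 4-byte alignment).
    """
    if element_size <= 0 or alignment_bytes % element_size != 0:
        raise ValueError("element_size must be positive and divide alignment_bytes")
    factor = alignment_bytes // element_size  # elements per aligned unit
    remainder = numel % factor
    return numel if remainder == 0 else numel + (factor - remainder)


# An odd number of 2-byte elements needs one padding element.
padded_fp16 = aligned_numel(5, element_size=2)
# 4-byte elements are already aligned, so no padding is added.
padded_fp32 = aligned_numel(5, element_size=4)
```

Any padding added this way has to be stripped again when the flat buffer is converted back to per-parameter tensors.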

* pass GAS boundary state from PP -> ZeRO * formatting Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

updated classifiers


rraminen · 2021-09-15T16:37:36Z

Errors observed in gpt2 workload

unit tests summary on local system:
52 failed, 375 passed, 43 skipped

* [zero Init] fix regression * clean up the warning

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

) Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [zero_to_fp32] fix padding removal * style * fix comments Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

…ests (deepspeedai#1405) * install HF w. dev extra to get all required packages * switch ds.init to use param dict instead of json file on disk * switch back to 'testing' extra

Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.11.4 to 1.12.5. - [Release notes](https://github.com/sparklemotion/nokogiri/releases) - [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md) - [Commits](sparklemotion/nokogiri@v1.11.4...v1.12.5) --- updated-dependencies: - dependency-name: nokogiri dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

jithunnair-amd · 2021-10-06T15:38:11Z

Closing this PR since #43 subsumes it

conglongli and others added 30 commits August 16, 2021 18:57

Curriculum learning (deepspeedai#1307)

b2b34ae

Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

bump 0.5.1, DSE, moe docs

a1de767

MoE read the docs update (deepspeedai#1312)

9cb64a1

add moe to sidebar

e070a09

Updating the torch version check to numeric (deepspeedai#1314)

e804f15

add MoE press release links

058ab81

[docs] update moe features and news post

6cd5f87

Add issue templates

10b4840

[zero] restore fp16 params if no zero ckpts available (deepspeedai#1322)

aa12129

* restore fp16 params if no zero ckpts available * formatting

Support Callable type for client optimizer and lr_scheduler (deepspee…

274c375

…dai#1316) * Callable option for optimizer and scheduler * Add unit test * Formatting * Disable debug prints * Use base optimizer to construct lr scheduler * Formatting * Remove dead import

Reducing the memory-overhead of creating model for multi-GPU run (dee…

49b6a63

…pspeedai#1244) Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

bump to 0.5.2

c1b0a4e

Add more synchronizations and barriers for the multi-gpu inference ca…

0ec11da

…se (deepspeedai#1309) * add more synchronizations and barriers for resolving gpu-halt issue * removing unuseful broadcasts

use scalar cpu-adam in case of exception in builder (deepspeedai#1259)

9645e7b

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Activation checkpointing improvements (deepspeedai#1254)

85acf14

* Rename PA_TO_cpu * Code cleanup * Revert accidental change

Use clone to avoid checkpoint bloat (deepspeedai#1326)

336dd08

update for cuda-11.4 (deepspeedai#1329)

b9ece25

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Remove duplicate clip grad function in deepspeed (deepspeedai#1333)

ddffbae

* Remove the wrong function with duplicate name * fix format. * add mpu check. fix tests.

Support client lr schedulers that are not subclass of torch _LRSchedu…

8e301b6

…ler (deepspeedai#1337)

Use mpu in zero.Init() (deepspeedai#1325)

e08c239

Update main.yml (deepspeedai#1338)

74f058b

[actions] split formatting and unit tests into two jobs (deepspeedai#…

600db09

…1339)

[actions] update branch triggers

8f299be

[actions] add master to formatting trigger

86b948f

[actions] add torch version runner label

0a32c3e

[actions] revert unit-test build name

3e7d06a

Update matmul.py (deepspeedai#1349)

eb97a42

jeffra and others added 8 commits September 9, 2021 20:46

update hiring link

168fce3

Update website hiring link

bff6126

Correctness fix PP+ZeRO for gradient accumulation (deepspeedai#1264)

b712bab

* pass GAS boundary state from PP -> ZeRO * formatting Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Remove dropout as client code can do it independently. (deepspeedai#1354

9f5939d

) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

fix unit test failure when merging PP ckpt files (deepspeedai#1359)

9b915fe

Update setup.py (deepspeedai#1361)

8e577c9

updated classifiers

Merge remote-tracking branch 'upstream/master' into IFU-master-2021-0…

df6d3ab

…9-14

bump to 0.5.3

a708b18

Sean Naren and others added 13 commits September 15, 2021 16:13

Introduce a device rank when setting device (deepspeedai#1370)

90398a7

[zero Init] fix regression (deepspeedai#1373)

cf22a69

* [zero Init] fix regression * clean up the warning

[zero_to_fp32] adapt to 4-bytes alignment in z2 (deepspeedai#1372)

30537e7

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

bump to 0.5.4

45a498d

fix: support three digit layer numbers (deepspeedai#1377)

4ad8019

Sparse attn triton v1.0 support + torch1.8 test runner (deepspeedai#1374

6996bb0

) Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

add tutorial on pytorch profiler usage (deepspeedai#1350)

51a2e91

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

[zero_to_fp32] fix padding removal (deepspeedai#1380)

364994a

* [zero_to_fp32] fix padding removal * style * fix comments Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Fix from Felipe and Young for loading checkpoints. (deepspeedai#1389)

86dd6a6

[CI] Add HF transformers tests (deepspeedai#958)

c1829c4

Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

[CI] install fix for HF tests and use dict instead of json for some t…

9e5c0c5

…ests (deepspeedai#1405) * install HF w. dev extra to get all required packages * switch ds.init to use param dict instead of json file on disk * switch back to 'testing' extra

Merged from upstream

7bb7c8b

jithunnair-amd closed this Oct 6, 2021
