IFU-master-2021-09-29 #43

rraminen · 2021-09-29T23:15:59Z

IFU

The below conflicts have been resolved:
CONFLICT (content): Merge conflict in tests/unit/test_config.py
CONFLICT (content): Merge conflict in deepspeed/runtime/zero/stage2.py

test_config.log
stage2.log

@jithunnair-amd

Co-authored-by: Alex Muzio <Alex.Muzio@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Felipe Cruz Salinas <Andres.Cruz@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Shaden Smith <shaden.smith@microsoft.com> Co-authored-by: Young Jin Kim <youki@microsoft.com> Co-authored-by: bapatra <bapatra@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Shaden Smith <shaden.smith@microsoft.com> Co-authored-by: Young Jin Kim <youki@microsoft.com>

@tjruwase

* Added drop_last to DeepSpeedDataLoader This solves issue deepspeedai#326 * Updated drop_last in engine.py added drop_last as a ds_config as mentioned by @tjruwase * Update engine.py * Update engine.py * updated config.py and constants.py * Update constants.py * added dataloader_ prefix * Update dataloader.py * corrected yapf test errors * Update test_data.py Added dataloader_drop_last unit test * Corrected yapf and formatting issues * updated simple_model.py and test_data.py * Update simple_model.py * pre-commit fix * corrected issues * Update test_data.py * Update test_data.py * Update test_data.py * Update test_data.py * removed batch_size from test_data.py * Update simple_model.py * Update test_data.py * Update test_data.py * Fix unit test issues * Use fp32 to make things work Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Added 4-byte alignment on NCCL/RCCL * pre-commit formatting fixes * Fix for checkpoint loading with optimizer partitioning * Better assert print * Added unit tests for nccl/rccl 4-byte alignment * bug * Updated alignment to implicit Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.11.4 to 1.12.5. - [Release notes](https://github.com/sparklemotion/nokogiri/releases) - [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md) - [Commits](sparklemotion/nokogiri@v1.11.4...v1.12.5) --- updated-dependencies: - dependency-name: nokogiri dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

rraminen · 2021-09-30T19:26:37Z

Local tests:

Unit tests summary:
=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
====== 2 failed, 371 passed, 98 skipped, 1 warning in 3199.32s (0:53:19) =======

Bing BERT - No issues

Megatron LM v1.1.5 345 M param model - No issues

jithunnair-amd · 2021-10-01T17:33:48Z

The CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/26/artifact/DeepSpeed/unit_tests_py3.6.log showed the same errors as the local run:

FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer

A later CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/27/artifact/DeepSpeed/unit_tests_py3.6.log aborted with a timeout.
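
For reference, a minimal sketch of how the local and CI failure lists could be compared automatically. It assumes both logs contain pytest's short test summary lines (`FAILED <test id>`), as quoted above. The function name `failed_tests` and the inline log strings are illustrative only and are not part of this PR or of the CI scripts.

```python
import re
from typing import Set

# Matches pytest short-summary lines such as
# "FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer"
FAILED_LINE = re.compile(r"^FAILED\s+(\S+)", re.MULTILINE)


def failed_tests(log_text: str) -> Set[str]:
    """Return the set of test IDs that appear on 'FAILED ...' summary lines."""
    return set(FAILED_LINE.findall(log_text))


# Excerpts copied from the comments in this thread (hypothetical inputs;
# in practice these would be read from the log files).
local_log = """\
=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
====== 2 failed, 371 passed, 98 skipped, 1 warning in 3199.32s (0:53:19) =======
"""

ci_log = """\
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
"""

only_local = failed_tests(local_log) - failed_tests(ci_log)
only_ci = failed_tests(ci_log) - failed_tests(local_log)
print("failures only in local run:", sorted(only_local))
print("failures only in CI run:", sorted(only_ci))
```

Comparing sets rather than raw text keeps the check independent of the order in which pytest reports the failures.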

jithunnair-amd · 2021-10-01T19:02:53Z

The GPT2 CI build is giving the wrong signal (it reports passing when it is actually failing). Can we rectify this?
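
The thread does not say why the GPT2 job reports a pass. One common cause, assumed here purely for illustration, is that the training command's exit status is lost or that the command exits with 0 even though its output contains an error. Below is a minimal sketch of a wrapper a CI step could use under that assumption. The name `run_and_check` and the default failure marker are hypothetical and are not taken from the actual CI scripts.

```python
import subprocess
import sys
from typing import Sequence


def run_and_check(
    cmd: Sequence[str],
    failure_markers: Sequence[str] = ("Traceback (most recent call last)",),
) -> int:
    """Run cmd and return a non-zero status if it failed.

    A run counts as failed when the process exits non-zero, or when it
    exits 0 but its combined stdout/stderr contains a failure marker.
    """
    proc = subprocess.run(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    # Echo the captured output so it still shows up in the CI log.
    sys.stdout.write(proc.stdout)
    if proc.returncode != 0:
        return proc.returncode
    if any(marker in proc.stdout for marker in failure_markers):
        return 1
    return 0


# Example: a command that prints an error marker but still exits with 0.
status = run_and_check(
    [sys.executable, "-c", "print('Traceback (most recent call last)')"]
)
print("wrapper status:", status)
```

A CI step would then end with `sys.exit(status)` so that the job result follows the wrapper's status rather than that of the last shell command.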

rraminen · 2021-10-08T15:57:24Z

Reasons for keeping this PR open:

* Evaluating the CIs
* Implementing the 8.3B param model of Megatron-LM v1.1.5 gpt2 and updating the script in the pytorch-deepspeed-pr-build-gpt2 CI

jithunnair-amd · 2021-11-18T19:05:14Z

PR-to-CI issues are still unresolved, but @rraminen will continue to work on them.
As for the 8.3B param GPT2 model, a script for running it with Megatron 1.1.5 and ZeRO-3 has been added in ROCm/DeepSpeedExamples#13. We'll need to update the DeepSpeedExamples commit and then use this new script in the CI.

conglongli and others added 30 commits August 16, 2021 18:57

Curriculum learning (deepspeedai#1307)

b2b34ae

Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

bump 0.5.1, DSE, moe docs

a1de767

MoE read the docs update (deepspeedai#1312)

9cb64a1

add moe to sidebar

e070a09

Updating the torch version check to numeric (deepspeedai#1314)

e804f15

add MoE press release links

058ab81

[docs] update moe features and news post

6cd5f87

Add issue templates

10b4840

[zero] restore fp16 params if no zero ckpts available (deepspeedai#1322)

aa12129

* restore fp16 params if no zero ckpts available * formatting

Support Callable type for client optimizer and lr_scheduler (deepspee…

274c375

…dai#1316) * Callable option for optimizer and scheduler * Add unit test * Formatting * Disable debug prints * Use base optimizer to construct lr scheduler * Formatting * Remove dead import

Reducing the memory-overhead of creating model for multi-GPU run (dee…

49b6a63

…pspeedai#1244) Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

bump to 0.5.2

c1b0a4e

Add more synchronizations and barriers for the multi-gpu inference ca…

0ec11da

…se (deepspeedai#1309) * add more synchronizations and barriers for resolving gpu-halt issue * removing unuseful broadcasts

use scalar cpu-adam in case of exception in builder (deepspeedai#1259)

9645e7b

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Activation checkpointing improvements (deepspeedai#1254)

85acf14

* Rename PA_TO_cpu * Code cleanup * Revert accidental change

Use clone to avoid checkpoint bloat (deepspeedai#1326)

336dd08

update for cuda-11.4 (deepspeedai#1329)

b9ece25

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Remove duplicate clip grad function in deepspeed (deepspeedai#1333)

ddffbae

* Remove the wrong function with duplicate name * fix format. * add mpu check. fix tests.

Support client lr schedulers that are not subclass of torch _LRSchedu…

8e301b6

…ler (deepspeedai#1337)

Use mpu in zero.Init() (deepspeedai#1325)

e08c239

Update main.yml (deepspeedai#1338)

74f058b

[actions] split formatting and unit tests into two jobs (deepspeedai#…

600db09

…1339)

[actions] update branch triggers

8f299be

[actions] add master to formatting trigger

86b948f

[actions] add torch version runner label

0a32c3e

[actions] revert unit-test build name

3e7d06a

Update matmul.py (deepspeedai#1349)

eb97a42

Sean Naren and others added 14 commits September 15, 2021 16:13

Introduce a device rank when setting device (deepspeedai#1370)

90398a7

[zero Init] fix regression (deepspeedai#1373)

cf22a69

* [zero Init] fix regression * clean up the warning

[zero_to_fp32] adapt to 4-bytes alignment in z2 (deepspeedai#1372)

30537e7

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

bump to 0.5.4

45a498d

fix: support three digit layer numbers (deepspeedai#1377)

4ad8019

Sparse attn triton v1.0 support + torch1.8 test runner (deepspeedai#1374

6996bb0

) Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

add tutorial on pytorch profiler usage (deepspeedai#1350)

51a2e91

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

[zero_to_fp32] fix padding removal (deepspeedai#1380)

364994a

* [zero_to_fp32] fix padding removal * style * fix comments Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Fix from Felipe and Young for loading checkpoints. (deepspeedai#1389)

86dd6a6

[CI] Add HF transformers tests (deepspeedai#958)

c1829c4

Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

[CI] install fix for HF tests and use dict instead of json for some t…

9e5c0c5

…ests (deepspeedai#1405) * install HF w. dev extra to get all required packages * switch ds.init to use param dict instead of json file on disk * switch back to 'testing' extra

IFU-master-2021-09-29

141ed70

Trigger Build

389cb5c

rraminen added 4 commits October 1, 2021 15:24

Trigger Build

9cf0419

Trigger Build

d4f0402

Trigger Build

5ab64f3

Trigger Build

03d9c9c

jithunnair-amd mentioned this pull request Oct 6, 2021

IFU-master-2021-09-14 #40

Closed

rraminen added 5 commits October 8, 2021 20:33

Trigger Build

7926893

Trigger CI

181a6b7

Trigger CI

18940cb

Trigger CI

c9be5b8

Trigger CI

2db31ab

jithunnair-amd merged commit 2bc2f49 into ROCm:master Nov 18, 2021

16 participants
