@rraminen

IFU

The following merge conflicts have been resolved:
CONFLICT (content): Merge conflict in tests/unit/test_config.py
CONFLICT (content): Merge conflict in deepspeed/runtime/zero/stage2.py

test_config.log
stage2.log

@jithunnair-amd

conglongli and others added 30 commits August 16, 2021 18:57
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Alex Muzio <Alex.Muzio@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Felipe Cruz Salinas <Andres.Cruz@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Young Jin Kim <youki@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Young Jin Kim <youki@microsoft.com>
* restore fp16 params if no zero ckpts available

* formatting
…dai#1316)

* Callable option for optimizer and scheduler

* Add unit test

* Formatting

* Disable debug prints

* Use base optimizer to construct lr scheduler

* Formatting

* Remove dead import
…se (deepspeedai#1309)

* add more synchronizations and barriers for resolving gpu-halt issue

* removing unuseful broadcasts
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Rename PA_TO_cpu

* Code cleanup

* Revert accidental change
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Remove the wrong function with duplicate name

* fix format.

* add mpu check. fix tests.
* Added drop_last to DeepSpeedDataLoader

This solves issue deepspeedai#326

* Updated drop_last in engine.py

added drop_last as a ds_config as mentioned by @tjruwase

* Update engine.py

* Update engine.py

* updated config.py and constants.py

* Update constants.py

* added dataloader_ prefix

* Update dataloader.py

* corrected yapf test errors

* Update test_data.py

Added dataloader_drop_last unit test

* Corrected yapf and formatting issues

* updated simple_model.py and test_data.py

* Update simple_model.py

* pre-commit fix

* corrected issues

* Update test_data.py

* Update test_data.py

* Update test_data.py

* Update test_data.py

* removed batch_size from test_data.py

* Update simple_model.py

* Update test_data.py

* Update test_data.py

* Fix unit test issues

* Use fp32 to make things work

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Added 4-byte alignment on NCCL/RCCL

* pre-commit formatting fixes

* Fix for checkpoint loading with optimizer partitioning

* Better assert print

* Added unit tests for nccl/rccl 4-byte alignment

* bug

* Updated alignment to implicit

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Sean Naren and others added 14 commits September 15, 2021 16:13
* [zero Init] fix regression

* clean up the warning
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
)

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* [zero_to_fp32] fix padding removal

* style

* fix comments

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
…ests (deepspeedai#1405)

* install HF w. dev extra to get all required packages

* switch ds.init to use param dict instead of json file on disk

* switch back to 'testing' extra
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.11.4 to 1.12.5.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](sparklemotion/nokogiri@v1.11.4...v1.12.5)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
@rraminen
Author

Local tests:

Unit tests summary:
=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
====== 2 failed, 371 passed, 98 skipped, 1 warning in 3199.32s (0:53:19) =======

Bing BERT - No issues

Megatron-LM v1.1.5 345M-parameter model - No issues

@jithunnair-amd
Collaborator

The CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/26/artifact/DeepSpeed/unit_tests_py3.6.log showed the same failures as the local run:

FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer

A later CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/27/artifact/DeepSpeed/unit_tests_py3.6.log aborted with a timeout.
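
For reference, the local-versus-CI comparison above can be automated. This is a minimal sketch, assuming both logs contain pytest's short test summary with lines of the form `FAILED <test id>`. The function names `failed_tests` and `compare_failures` are hypothetical helpers, not part of DeepSpeed or the CI scripts.

```python
import re
from typing import Dict, Set

# Matches pytest short-summary lines such as
# "FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer".
FAILED_RE = re.compile(r"^FAILED\s+(\S+)", re.MULTILINE)


def failed_tests(log_text: str) -> Set[str]:
    """Return the set of test IDs reported as FAILED in a pytest log."""
    return set(FAILED_RE.findall(log_text))


def compare_failures(local_log: str, ci_log: str) -> Dict[str, Set[str]]:
    """Split failures into those seen in both runs, only locally, or only in CI."""
    local, ci = failed_tests(local_log), failed_tests(ci_log)
    return {"both": local & ci, "local_only": local - ci, "ci_only": ci - local}


# Example using the two failures quoted in this thread.
summary = (
    "FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer\n"
    "FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer\n"
)
result = compare_failures(summary, summary)
print(sorted(result["both"]))
```

If `local_only` and `ci_only` are both empty, the two runs failed on the same tests, which is the situation described for build 26.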

@jithunnair-amd
Collaborator

The GPT2 CI build is giving the wrong signal (it reports passing when the run is actually failing). Can we look into fixing that?
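
The cause of the false pass is not stated in this thread. One common cause, assumed here purely for illustration, is that the job's exit status does not reflect the training command's status. The sketch below is a hypothetical wrapper (`run_and_check` is not an existing CI script): it fails the job when the command exits non-zero or when its output contains a failure marker.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_and_check(
    cmd: Sequence[str],
    failure_markers: Sequence[str] = ("Traceback (most recent call last)",),
) -> Tuple[bool, str]:
    """Run cmd and return (ok, combined_output).

    ok is False if the process exits non-zero OR if any failure marker
    appears in its output; the markers are illustrative assumptions.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    marker_hit = any(marker in proc.stdout for marker in failure_markers)
    return proc.returncode == 0 and not marker_hit, proc.stdout


# Demo: a command that exits with status 3 must be reported as failing.
ok, _ = run_and_check([sys.executable, "-c", "import sys; sys.exit(3)"])
print("job passed" if ok else "job failed")
```

A CI step would then call `sys.exit(0 if ok else 1)` so that the build status follows the real outcome of the run.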

@rraminen
Author

rraminen commented Oct 8, 2021

Reasons for keeping this PR open:

  1. Evaluating the CIs
  2. Implementing the 8.3B-parameter GPT2 model of Megatron-LM v1.1.5 and updating the script in the pytorch-deepspeed-pr-build-gpt2 CI

@jithunnair-amd
Collaborator

PR-to-CI issues are still unresolved, but @rraminen will continue to work on them.
As for the 8.3B-parameter GPT2 model, a script for running it with Megatron-LM v1.1.5 and ZeRO stage 3 has been added in ROCm/DeepSpeedExamples#13. We'll need to update the DeepSpeedExamples commit and then use this new script in the CI.

@jithunnair-amd jithunnair-amd merged commit 2bc2f49 into ROCm:master Nov 18, 2021