IFU-master-2021-09-29 #43
Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Alex Muzio <Alex.Muzio@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Felipe Cruz Salinas <Andres.Cruz@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Shaden Smith <shaden.smith@microsoft.com> Co-authored-by: Young Jin Kim <youki@microsoft.com> Co-authored-by: bapatra <bapatra@microsoft.com>
* restore fp16 params if no zero ckpts available * formatting
…dai#1316) * Callable option for optimizer and scheduler * Add unit test * Formatting * Disable debug prints * Use base optimizer to construct lr scheduler * Formatting * Remove dead import
…pspeedai#1244) Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
…se (deepspeedai#1309) * add more synchronizations and barriers for resolving gpu-halt issue * removing unuseful broadcasts
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Rename PA_TO_cpu * Code cleanup * Revert accidental change
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Remove the wrong function with duplicate name * fix format. * add mpu check. fix tests.
* Added drop_last to DeepSpeedDataLoader This solves issue deepspeedai#326 * Updated drop_last in engine.py added drop_last as a ds_config as mentioned by @tjruwase * Update engine.py * Update engine.py * updated config.py and constants.py * Update constants.py * added dataloader_ prefix * Update dataloader.py * corrected yapf test errors * Update test_data.py Added dataloader_drop_last unit test * Corrected yapf and formatting issues * updated simple_model.py and test_data.py * Update simple_model.py * pre-commit fix * corrected issues * Update test_data.py * Update test_data.py * Update test_data.py * Update test_data.py * removed batch_size from test_data.py * Update simple_model.py * Update test_data.py * Update test_data.py * Fix unit test issues * Use fp32 to make things work Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Added 4-byte alignment on NCCL/RCCL * pre-commit formatting fixes * Fix for checkpoint loading with optimizer partitioning * Better assert print * Added unit tests for nccl/rccl 4-byte alignment * bug * Updated alignment to implicit Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* [zero Init] fix regression * clean up the warning
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* [zero_to_fp32] fix padding removal * style * fix comments Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
…ests (deepspeedai#1405) * install HF w. dev extra to get all required packages * switch ds.init to use param dict instead of json file on disk * switch back to 'testing' extra
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.11.4 to 1.12.5. - [Release notes](https://github.com/sparklemotion/nokogiri/releases) - [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md) - [Commits](sparklemotion/nokogiri@v1.11.4...v1.12.5) --- updated-dependencies: - dependency-name: nokogiri dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Local tests:
Unit tests summary:
Bing BERT - No issues
Megatron LM v1.1.5 345M param model - No issues
The CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/26/artifact/DeepSpeed/unit_tests_py3.6.log showed the same errors as the local run.
A later CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/27/artifact/DeepSpeed/unit_tests_py3.6.log aborted with a timeout.
The GPT2 CI build is giving the wrong signal (it reports passing when it is actually failing). Can we see if we can rectify it?
The reasons for keeping this PR open
PR-to-CI issues are still unresolved, but @rraminen will continue to work on them.
IFU
The following conflicts have been resolved:
CONFLICT (content): Merge conflict in tests/unit/test_config.py
CONFLICT (content): Merge conflict in deepspeed/runtime/zero/stage2.py
test_config.log
stage2.log
@jithunnair-amd