Merge base by ghosthamlet · Pull Request #1 · ghosthamlet/DeepSpeed

ghosthamlet · 2021-03-15T11:15:13Z

No description provided.

Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.10.10 to 1.11.0.

- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/master/CHANGELOG.md)
- [Commits](sparklemotion/nokogiri@v1.10.10...v1.11.0)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Add Linear warmup+decay lr schedule; update lr schedule unit tests
* LR scheduler unit tests for LR Range Test and 1Cycle
* Disable yapf to preserve parameterization
* Disable test_pipe.py for CI debugging
* Disable test_lr_scheduler for CI debugging
* Disable test_lr_scheduler for CI debugging
* Enable all unit tests for CI debugging

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Hi, I took a look at the code of column_sum_reduce and have two questions:

1. The goal of column_sum_reduce is to compute the column sums of the inp matrix with shape [rows, width], so the result should have shape [width], right? The condition checked on pos does not seem suitable for that.
2. The CUDA kernel assumes that threads with the same threadIdx.y are grouped into one thread_block_tile, with blockDim set to (32, 32). According to the NVIDIA presentation https://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf, a thread block tile is a subset of the threads of a thread block, divided into tiles in row-major order. Doesn't that mean threads with the same threadIdx.x are grouped into a thread_block_tile?

Thanks!

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
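The tile-grouping question can be explored with a small host-side simulation. This is only a sketch, not DeepSpeed's kernel code: it assumes the linear thread rank that the CUDA programming guide describes for a thread block (threadIdx.x varies fastest) and that a tile of size 32 takes 32 consecutive ranks. The helper names `thread_rank` and `tile_index` are hypothetical and exist only for this illustration.

```python
# Host-side simulation of how a (32, 32) thread block is split into tiles of 32.
# Assumption: rank = threadIdx.y * blockDim.x + threadIdx.x (x varies fastest),
# and a tile of size 32 holds 32 consecutive ranks.

BLOCK_DIM_X = 32
BLOCK_DIM_Y = 32
TILE_SIZE = 32


def thread_rank(tx: int, ty: int) -> int:
    """Linear rank of thread (tx, ty) inside a 2D block (hypothetical helper)."""
    return ty * BLOCK_DIM_X + tx


def tile_index(tx: int, ty: int) -> int:
    """Index of the size-32 tile that thread (tx, ty) falls into."""
    return thread_rank(tx, ty) // TILE_SIZE


# Collect the (tx, ty) members of every tile.
tiles = {}
for ty in range(BLOCK_DIM_Y):
    for tx in range(BLOCK_DIM_X):
        tiles.setdefault(tile_index(tx, ty), []).append((tx, ty))

# Check that, within each tile, threadIdx.y is constant and threadIdx.x covers 0..31.
same_y = all(len({ty for _, ty in members}) == 1 for members in tiles.values())
all_x = all(
    sorted(tx for tx, _ in members) == list(range(TILE_SIZE))
    for members in tiles.values()
)
print(same_y, all_x)
```

Under these assumptions, "row-major" means that x is the fastest-varying index, so with blockDim.x equal to 32 the rank divided by 32 equals threadIdx.y, and each tile holds the threads that share a threadIdx.y value.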

* Squash stage3 v1 (#146)

  Co-authored-by: Samyam <samyamr@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
  Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
  Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

…t in the website (#799)

* add optimizers and schedules to rtd
* update ds website and fix links
* add optimizers and schedules to rtd
* update ds website and fix links
* add flops profiler to rtd
* fix

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

jeffra and others added 30 commits December 11, 2020 10:05

add manual workflow to run tests with precompiled ops

0518252

[build] fix computer capability arch flags, add PTX, handle PTX (#591)

8a184b6

* fix arch flags, add PTX
* bug fix

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

add DeepSpeedZeroConfig repr method (#596)

66268bd

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Supported customizing kwargs for lr_scheduler (#584)

a4763f5

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Update launcher to set local rank environ variable (#597)

c5a449f

* Update launch.py
* formatting

implement missing get_last_lr (#595)

9f8e8f3

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

[doc] xref to hostfile discussion (#604)

007466e

* [doc] xref to hostfile discussion: it wasn't clear where to find what was meant by `hostfile`, so adding a link to where it's discussed.
* remove whitespace

Fixes for RTD build errors (#606)

6380ee3

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

Transformer-kernel - supporting any arbitrary sequence-length (#587)

fd2f970

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Ability to initialize distributed backend outside deepspeed runtime (#608)

7435b2f

Elastic training support (#602)

81aeea3

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>

update SA comp check to fix torch-cpu issue (#631)

24e0739

Support initialization with dict configuration (#632)

e6ac731

Allow DeepSpeed models to be initialized with optimizer=None (#469)

a9a83a6

Allow DeepSpeed models to be initialized with optimizer=None

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

change dist to torch.distributed to fix bug in assert. (#638)

d38ad6a

docs: minor spelling tweaks (#623)

46d2e28

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Fix docstring format (#640)

5ab1279

Module replacement support (#586)

44bd538

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Update builder.py (#642)

64461da

Add deepspeed.init_distributed to RTD page (#645)

4e2dc4e

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

document deepspeed.initialize() (#644)

828d75b

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

add additional validation checks in elastic config (#646)

bc046dc

Remove a very verbose print statement. (#649)

af212f6

* Remove a very verbose print statement.
* Update engine.py

version bump to 0.3.10

c14b839

Handle activation checkpointing args that are None or non-tensors (#660)

adcfd26

Special thanks to @g-karthik for tracking this issue down.

squash latest flops profiling changes (#1) (#664)

e2fbe4d

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Move workspace memory-allocation to PyTorch (#661)

981bc7d

* move workspace memory-allocation to PyTorch
* refine the code based on the comments
* remove unnecessary options
* remove bsz from set_seq_len function

Validate consistent ckpt tags across ranks (#667)

f032e56

jeffra and others added 29 commits February 18, 2021 16:20

Update engine.py (#767)

29fa4b2

[doc] fix incorrect param name (#773)

e60e92e

Invalid param name. Thanks.

Fixing the module-inject Api (#786)

48065c0

Fix the bias-add and add the layer-norm-eps parameter (#791)

e2dfcad

* fix the bias-add precision and indexing, and add the layer-norm-eps as a configurable parameter for transformer
* add ACC_HALF config
* use defined to check if ACC_HALF is defined

Delete out2 (#798)

62396b7

fixing the compiling issue for the AMD architecture (#796)

490e6f7

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

document the requirement to call for all ranks (#801)

7eb083c

fixed typo (#802)

db987cf

Fixing gelu_checkpointing memory issue (#812)

8295d7a

* fixing buffers in transformer kernel when gelu-checkpoint is enabled
* fixing the test issue for other memory optimization flags
* fixing a bug for when attn_dropout_checkpoint is enabled

Update ZeRO-Offload tutorials (#824)

ba33e86

update tutorial/doc links for zero3 (#835)

d7de916

Fix zero3 tutorial link

75ffdaf

bump DSE to include ZeRO-3

9c5eee3

Fix for RTD

af54897

Model scale changing 5x to 3x

6adc19a

replace home env with ~

4949636

Fix regression in runner (#843)

2e6692c

bumping DSE pointer (#847)

564eb4b

set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (#844)

dd03cff

less scary overflow notice (#833)

29853c3

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

small tweaks (#839)

7925d0c

Control ZeRO wall clock timers (#849)

311795d

* Control ZeRO wall clock timers
* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

[WarmupDecayLR] fix log(0) & 1/log(1) bugs (#772)

18a26f3

* fix log(0) & 1/log(1) bugs
* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
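To illustrate the kind of failure the commit title names, here is a minimal sketch of a log-shaped warmup followed by linear decay. It is an assumption-laden illustration, not DeepSpeed's WarmupDecayLR code: the function names `log_warmup_factor` and `warmup_decay_lr`, their parameters, and the specific guards are all hypothetical. It shows one way log(0) (at the very first step) and 1/log(1) (when the warmup length is 1) could be avoided.

```python
import math


def log_warmup_factor(step: int, warmup_num_steps: int) -> float:
    """Hypothetical helper: log-shaped warmup factor in [0, 1]."""
    # Guard against 1/log(1): log(1) == 0, so a warmup length of 1 would
    # divide by zero. Clamp the warmup length to at least 2.
    warmup_num_steps = max(2, warmup_num_steps)
    # Guard against log(0): a negative step (e.g. -1 before the first update)
    # would make step + 1 equal to 0.
    if step < 0:
        return 0.0
    if step >= warmup_num_steps:
        return 1.0
    return min(1.0, math.log(step + 1) / math.log(warmup_num_steps))


def warmup_decay_lr(step: int, max_lr: float,
                    warmup_num_steps: int, total_num_steps: int) -> float:
    """Hypothetical schedule: log warmup to max_lr, then linear decay to 0."""
    if step < warmup_num_steps:
        return max_lr * log_warmup_factor(step, warmup_num_steps)
    # Avoid a zero-length decay span when warmup covers the whole run.
    decay_span = max(1, total_num_steps - warmup_num_steps)
    remaining = max(0.0, (total_num_steps - step) / decay_span)
    return max_lr * remaining


if __name__ == "__main__":
    for s in (-1, 0, 5, 10, 55, 100):
        print(s, warmup_decay_lr(s, max_lr=0.1, warmup_num_steps=10, total_num_steps=100))
```

Clamping the warmup length and returning early for negative steps keeps both logarithms well defined without changing the schedule for ordinary inputs.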

bump to v0.3.12

35fd7cc

Bug fix: Remove client optimizer param_group list item that does not have 'params' (#827)

458ff02

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

[doc] pipeline doc typos/improvements (#659)

73d762c

Admin merging for pure-doc PR that does not trigger build.

ghosthamlet merged commit 517357e into ghosthamlet:master Mar 15, 2021

ghosthamlet merged 98 commits into ghosthamlet:master from deepspeedai:master

20 participants
