zero.Init() should pin params in GPU memory as requested #2953

tjruwase · 2023-03-06T13:42:48Z

Currently, zero.Init() offloads all parameters even if the user requested that some be pinned in GPU memory (using the *_persistence_threshold configs). Those params are only pinned in GPU memory after the first forward pass. As a result, zero.Init() may fail with a CPU OOM, which could be avoided by using both GPU and CPU memory as the user requested.
This PR avoids this problem and improves the flexibility of combining GPU and CPU memory.
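
For illustration, here is a minimal sketch of the kind of config this affects and of how persistence thresholds could select params to keep on GPU. It assumes the ZeRO-3 keys `stage3_param_persistence_threshold` and `stage3_model_persistence_threshold` are the `*_persistence_threshold` configs meant above. The values, the parameter names, and the `select_persistent` helper are hypothetical and are not the code in this PR.

```python
# Illustrative ZeRO-3 config fragment; the threshold values are arbitrary examples.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # Assumed meaning: params with at most this many elements stay on GPU.
        "stage3_param_persistence_threshold": 100_000,
        # Assumed meaning: upper bound on the total elements kept on GPU.
        "stage3_model_persistence_threshold": 1_000_000,
    }
}


def select_persistent(param_sizes, param_threshold, model_threshold):
    """Hypothetical helper: pick the params that would be persisted on GPU.

    A param is selected if it is small enough on its own and selecting it
    keeps the running total within the model-wide budget.
    """
    persistent, total = [], 0
    for name, numel in param_sizes.items():
        if total + numel > model_threshold:
            continue  # would exceed the model-wide budget
        if numel <= param_threshold:
            persistent.append(name)
            total += numel
    return persistent


zero_cfg = ds_config["zero_optimization"]
# Made-up parameter names and element counts.
sizes = {"layernorm.weight": 4096, "embed.weight": 50_000_000, "bias": 1024}
kept = select_persistent(
    sizes,
    zero_cfg["stage3_param_persistence_threshold"],
    zero_cfg["stage3_model_persistence_threshold"],
)
print(kept)
```

With this PR, params selected this way would stay in GPU memory during zero.Init() instead of first being offloaded to CPU.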

tjruwase added 3 commits March 1, 2023 07:59

Persist params in zero.Init

4a5eb11

Disable debug prints

99d613c

Merge branch 'master' of github.com:microsoft/DeepSpeed into olruwase…

dc1669c

…/zero_infer_partial_offload

tjruwase requested review from jeffra, jomayeri and samyam March 6, 2023 13:42

tjruwase requested a review from mrwyattii as a code owner March 6, 2023 13:42

Formatting

39d0b09

tjruwase mentioned this pull request Mar 6, 2023

Finetune T5 11B and the process is killed . exits with return code = -9 [BUG] #2946

Closed

tjruwase added 14 commits March 6, 2023 21:19

Merge branch 'master' into olruwase/zero_infer_partial_offload

aa19bd5

Merge branch 'master' into olruwase/zero_infer_partial_offload

a7fde24

Avoid offloading persisted params

9ac2a7c

Merge branch 'olruwase/zero_infer_partial_offload' of github.com:micr…

254ea3c

…osoft/DeepSpeed into olruwase/zero_infer_partial_offload

Merge branch 'master' into olruwase/zero_infer_partial_offload

d4652bd

Merge branch 'olruwase/zero_infer_partial_offload' of github.com:micr…

b367b9c

…osoft/DeepSpeed into olruwase/zero_infer_partial_offload

Simplify world_size=1

598a96b

Formatting

d22ccce

Remove pdb

a87f921

Merge branch 'master' into olruwase/zero_infer_partial_offload

42a22e4

Merge branch 'master' into olruwase/zero_infer_partial_offload

23f49fb

Merge branch 'master' into olruwase/zero_infer_partial_offload

4448e58

Restructure

e119c24

Merge branch 'olruwase/zero_infer_partial_offload' of github.com:micr…

a68ce1d

…osoft/DeepSpeed into olruwase/zero_infer_partial_offload

tjruwase mentioned this pull request Mar 20, 2023

[BUG] Zero3 Offload does not fully utilize GPU memory and fails to overlap I/O and computation #3054

Closed

tjruwase added 6 commits March 22, 2023 06:45

Merge branch 'master' into olruwase/zero_infer_partial_offload

964bf49

Merge branch 'master' into olruwase/zero_infer_partial_offload

618caa5

Merge branch 'master' into olruwase/zero_infer_partial_offload

881afff

Formating

8baf2a5

Formatting

83a84f0

Merge branch 'olruwase/zero_infer_partial_offload' of github.com:micr…

80767e5

…osoft/DeepSpeed into olruwase/zero_infer_partial_offload

tjruwase and others added 5 commits March 28, 2023 17:04

Merge branch 'master' into olruwase/zero_infer_partial_offload

19b5047

Merge branch 'olruwase/zero_infer_partial_offload' of github.com:micr…

da3e03d

…osoft/DeepSpeed into olruwase/zero_infer_partial_offload

Apply persistence only if ds_config available

8aaa42c

Fix typo

1a4ac0e

add util function for getting pydantic config default values

b343c85

mrwyattii approved these changes Mar 29, 2023

tjruwase added 6 commits March 30, 2023 07:37

Merge branch 'master' into olruwase/zero_infer_partial_offload

f028b6d

Merge branch 'master' into olruwase/zero_infer_partial_offload

8fbeb93

Merge branch 'master' into olruwase/zero_infer_partial_offload

d4d1eab

Merge branch 'master' into olruwase/zero_infer_partial_offload

a61a79e

Merge branch 'master' into olruwase/zero_infer_partial_offload

daa1321

Merge branch 'master' into olruwase/zero_infer_partial_offload

e5869d2

tjruwase merged commit 4d27225 into master Apr 7, 2023

tjruwase mentioned this pull request Apr 17, 2023

Make deepspeed.zero.Init() idempotent #3203

Closed

mrwyattii deleted the olruwase/zero_infer_partial_offload branch July 7, 2023 02:36
