[checkpoint] use gather_tensor in checkpoint and update its unit test #1339
Merged
1SAA merged 1 commit into hpcaitech:main on Jul 19, 2022
1SAA (Contributor) commented on Jul 19, 2022
- gathers tensors only to rank 0 to reduce memory usage
- corrects and polishes the checkpointing unit test for colo_tensor
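To make the memory argument in the first bullet concrete, here is a minimal pure-Python simulation (not ColossalAI code). It assumes each parameter is evenly sharded along one dimension across ranks; the helper names `gather_to_rank0` and `numel` are hypothetical and only illustrate that rank 0 ends up with full tensors while the other ranks keep just their shards.

```python
from typing import Dict, List

State = Dict[str, List[float]]


def gather_to_rank0(shards_per_rank: List[State]) -> List[State]:
    """Simulate collecting a sharded state dict so that only rank 0 holds full tensors."""
    world_size = len(shards_per_rank)
    result: List[State] = [{} for _ in range(world_size)]
    for name in shards_per_rank[0]:
        # Stand-in for the collective: concatenate the shards in rank order.
        full = [x for shards in shards_per_rank for x in shards[name]]
        for rank in range(world_size):
            # Rank 0 keeps the full tensor; every other rank keeps only its own shard.
            result[rank][name] = full if rank == 0 else shards_per_rank[rank][name]
    return result


def numel(state: State) -> int:
    """Count the elements a rank holds across all entries of its state dict."""
    return sum(len(values) for values in state.values())


# Hypothetical setup: 4 ranks, a "weight" of 8 elements and a "bias" of 4 elements.
world_size = 4
shards = [
    {"weight": [float(2 * r), float(2 * r + 1)], "bias": [float(r)]}
    for r in range(world_size)
]
after = gather_to_rank0(shards)
per_rank = [numel(state) for state in after]
```

In this simulation only rank 0 pays for the full 12 elements, while the other ranks stay at 3 elements each, which is the memory saving the description refers to.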
feifeibear reviewed on Jul 19, 2022
assert v.is_replicate()
delattr(v, 'save_ready')
# model saving
save_state = {'epoch': epoch, 'model': model_state}
feifeibear (Contributor) commented:
In the next PR, you can merge the model and optimizer states into a single file, like this:
torch.save({
    'epoch': EPOCH,
    'model_state_dict': net.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': LOSS,
}, PATH)
See https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html
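As a sketch of how the suggested single-file checkpoint might be saved and then restored, the round trip below follows the linked PyTorch recipe. The tiny `nn.Linear` model, the SGD optimizer, the epoch value, and the in-memory `io.BytesIO` buffer (standing in for `PATH`) are illustrative assumptions, not part of this PR.

```python
import io

import torch
from torch import nn

# Hypothetical model and optimizer, used only for illustration.
net = nn.Linear(4, 2)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

# Take one training step so the optimizer has momentum buffers worth saving.
loss = net(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Save model and optimizer states together; the buffer stands in for PATH.
buffer = io.BytesIO()
torch.save({
    'epoch': 3,
    'model_state_dict': net.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),
}, buffer)

# Restore both states into freshly constructed objects.
buffer.seek(0)
checkpoint = torch.load(buffer)
restored_net = nn.Linear(4, 2)
restored_opt = torch.optim.SGD(restored_net.parameters(), lr=0.1, momentum=0.9)
restored_net.load_state_dict(checkpoint['model_state_dict'])
restored_opt.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # resume from the next epoch
```

Keeping both states in one file means a resumed run cannot accidentally pair a model from one epoch with an optimizer from another.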
feifeibear reviewed on Jul 19, 2022
old_dist_spec = colo_tensor.dist_spec
colo_tensor.to_replicate_()
if dist.get_rank() != 0:
    colo_tensor.set_dist_spec(old_dist_spec)
feifeibear (Contributor) commented:
This line triggers collective communication.
Will there be potential blocking if rank 0 is excluded?
1SAA (Contributor, Author) replied:
There is no communication, since old_dist_spec must be SHARD and we have a replicated tensor here.
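The reply above can be illustrated with a small pure-Python sketch. It is a simulation, not ColossalAI's `set_dist_spec` implementation, and it assumes an even 1D split; `shard_1d` is a hypothetical helper. Once a rank holds the full (replicated) tensor, recovering its own shard is a local slice that needs no data from any other rank, so each rank can do it independently.

```python
from typing import List


def shard_1d(full: List[int], rank: int, world_size: int) -> List[int]:
    """Return the slice of a replicated 1D tensor that belongs to `rank`."""
    assert len(full) % world_size == 0, "this sketch assumes an even split"
    chunk = len(full) // world_size
    # Purely local indexing: no other rank's data is read or awaited.
    return full[rank * chunk:(rank + 1) * chunk]


# Hypothetical setup: 4 ranks, each already holding the same replicated tensor.
world_size = 4
replicated = list(range(8))
local_shards = [shard_1d(replicated, rank, world_size) for rank in range(world_size)]
```

Because each call reads only the local copy, skipping the call on rank 0 does not leave the other ranks waiting on a collective in this model of the operation.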
feifeibear reviewed on Jul 19, 2022
colossalai.launch(config={}, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl')
pg = ProcessGroup(tp_degree=world_size)
for model_name in ['simple_net', 'bert']:
    # TODO(haichen) add BERT in the test
feifeibear (Contributor) commented:
Is the input replicated inside a DP group?
1SAA (Contributor, Author) replied:
It depends on which model is being used. We do not have a unified standard yet.
feifeibear approved these changes on Jul 19, 2022