[fx] fix test and algorithm bugs in activation checkpointing. #1451
Cypher30 merged 19 commits into hpcaitech:main from super-dainiu:feature/more_ckpt
Conversation
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
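For context on the Chen solver commits, here is a rough sketch of the greedy sqrt(n)-style budget search described in the paper. The cost model, the budget schedule, and the return values are simplifications for illustration and are not the actual `ckpt_solver_chen` implementation.

```python
import math
from typing import List, Set, Tuple

def chen_greedy(node_costs: List[int]) -> Tuple[Set[int], int]:
    """Greedy checkpoint selection in the spirit of Algorithm 3 of
    https://arxiv.org/abs/1604.06174 (sqrt(n) checkpointing).

    `node_costs` holds each node's activation size in execution order.
    Returns the indices of nodes to checkpoint and an estimated peak memory.
    """
    total = sum(node_costs)
    best_ckpt: Set[int] = set()
    best_peak = total                  # baseline: keep every activation
    b = max(int(math.sqrt(total)), 1)  # start the budget search near sqrt(total)
    while b < total:
        ckpt, stored, segment = set(), 0, 0
        for i, cost in enumerate(node_costs):
            segment += cost
            if segment > b:
                ckpt.add(i)            # store this node's output as a checkpoint
                stored += cost         # memory held by stored checkpoints
                segment = 0            # next segment is recomputed from here
        peak = stored + b              # approximation: checkpoints + one budget-bounded segment
        if peak < best_peak:
            best_ckpt, best_peak = ckpt, peak
        b *= 2                         # coarse budget schedule; the real solver is finer
    return best_ckpt, best_peak
```

For instance, `chen_greedy([4, 4, 4, 4])` returns `({1, 3}, 12)` under this simplified cost model.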
I think that's it, take a look at `activation_checkpoint.py` lines 11-17. I think in this case `requires_grad` is set to True for the input tensor, and since the input tensors are leaf nodes in `run_function`, PyTorch will not let this kind of operation happen, see this. It seems our tracer could not identify those in-place operations?
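For reference, a minimal repro of that autograd restriction, independent of the tracer:

```python
import torch

x = torch.randn(2, 3, requires_grad=True)  # a leaf tensor that requires grad
try:
    x.relu_()  # in-place op directly on the leaf
except RuntimeError as e:
    # "a leaf Variable that requires grad is being used in an in-place operation"
    print(e)
```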
Cypher30 left a comment
Just hold the PR; I think we need further discussion of those in-place operations.
Cypher30 left a comment
We approve this change and merge it, but skip the test while waiting for the new version of the colossalai checkpoint.
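One common way to skip it for now with pytest (the test name and reason string below are only placeholders):

```python
import pytest

@pytest.mark.skip(reason="waiting for the new version of the colossalai checkpoint")
def test_ckpt_codegen():
    ...
```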
What's new?
I regretted using `torch.randn(2, 3, 224, 224)` previously in #1446 for testing because it consumes too much time on CI. Also, I made some modifications to the search algorithm (mostly conditions for annotations) to avoid crashes in `ActivationCheckpointCodeGen`.
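As a rough illustration of both changes, a hypothetical sketch follows. The torchvision model, the plain `torch.fx.symbolic_trace` (the PR uses ColossalAI's own tracer and `ActivationCheckpointCodeGen`), the `activation_checkpoint` attribute name, and the input shape are all assumptions.

```python
import torch
import torchvision.models as tm
from torch.fx import symbolic_trace

def annotate_compute_nodes(gm: torch.fx.GraphModule, region: int = 0):
    """Only annotate real compute nodes; skipping placeholder/output/get_attr
    nodes is the kind of condition that keeps a checkpoint codegen from trying
    to wrap non-computational nodes in a checkpoint region."""
    for node in gm.graph.nodes:
        if node.op in ('placeholder', 'output', 'get_attr'):
            continue
        setattr(node, 'activation_checkpoint', region)  # attribute name assumed
    return gm

model = tm.resnet18().eval()
gm = annotate_compute_nodes(symbolic_trace(model))  # the PR uses ColossalAI's tracer instead
data = torch.rand(1, 3, 224, 224)                   # smaller than torch.randn(2, 3, 224, 224)
# Plain torch.fx ignores the annotation, so the traced module should match eager execution.
torch.testing.assert_close(gm(data), model(data))
```

The point of the guard is that a checkpoint region is only ever opened around compute nodes, never around the graph's inputs or outputs.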
What's wrong?
However, I did not figure out why tracing on `densenet121` got an error. The generated `nn.Module` is as follows; problems occurred in `checkpoint_2`. Is that because `colossalai.utils.activation_checkpoint.checkpoint` does not support an in-place operation right after the input node? Should we hijack this potential problem during `CodeGen` or modify our checkpoint logic?
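To make the suspected failure mode concrete, here is a minimal sketch that uses `torch.utils.checkpoint` as a stand-in for `colossalai.utils.activation_checkpoint.checkpoint` (both re-run the region during backward). The region below starts with an in-place op on its input, and the error shows up during recomputation, when that input is a detached leaf that requires grad.

```python
import torch
from torch.utils.checkpoint import checkpoint  # stand-in for colossalai's checkpoint

def region(x):
    return torch.relu_(x) + 1   # in-place op right after the input node

inp = torch.randn(4, 4, requires_grad=True)
out = checkpoint(region, inp)   # forward runs under no_grad, so this passes
try:
    out.sum().backward()        # backward replays region() on detached leaf inputs
except RuntimeError as e:
    print(e)  # e.g. "a leaf Variable that requires grad is being used in an in-place operation"
```

If that is indeed the cause, cloning the input before the region or rewriting the first op as its out-of-place variant during codegen would be possible workarounds to discuss.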