Add the flops profiler, which measures the time, number of estimated flops, and parameters of each module in a model #544
Conversation
jeffra
left a comment
I see DeepSpeedExamples (DSE) has lots of changes here. Is this because we want to point to an updated version of DSE that uses the profiler, or just to a different commit than what the current master branch points to?
@@ -0,0 +1,43 @@
from functools import partial
I wonder if we should move these examples to DSE? Or maybe keep them in the DeepSpeed repo but not in the deepspeed package? I am not sure we want them to be installed with the library itself (e.g., hidden away as a library like /usr/local/lib/python3.6/dist-packages/deepspeed) where they are hard to find and run directly.
I moved the examples to the README and tests/unit/test_flops_profiler.py.
@@ -0,0 +1,43 @@
from functools import partial
I love that we have these examples. We might want to move them to the DSE repo? And we could also include them in a page on readthedocs if you think that would be good.
Removed the examples. I will add them to DSE and readthedocs.
tests/small_model_debugging/run.sh
@@ -0,0 +1,12 @@
#! /bin/bash
Can we move this into our unit tests?
tests/small_model_debugging/run.sh
export WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# python test_model.py
python test_model.py --flops_profiler true --profile_start_step=5 --profile_end_step=8
It would be great to have some unit tests with models that we know the expected flop counts for. Then we could run the flops profiler to ensure our counts stay consistent?
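For illustration, a minimal sketch of such a test, assuming a FlopsProfiler object with start/stop/get-total methods along the lines discussed later in this thread (the import path and method names are placeholders, not necessarily the merged API):

```python
import torch
import torch.nn as nn

from deepspeed.profiling.flops_profiler import FlopsProfiler  # assumed import path


def test_linear_macs():
    # A single Linear layer has a known analytical count:
    # batch_size * in_features * out_features multiply-accumulates (MACs).
    batch, in_f, out_f = 4, 128, 256
    model = nn.Linear(in_f, out_f)
    expected_macs = batch * in_f * out_f

    prof = FlopsProfiler(model)
    prof.start_profile()
    model(torch.randn(batch, in_f))
    measured = prof.get_total_flops()  # this PR counts MACs (see the MAC discussion below)
    prof.stop_profile()

    assert measured == expected_macs
```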
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument('--zero', type=int, default=0)
parser.add_argument('--flops_profiler', type=bool, default=False)
Same comment, it would be awesome to move this into our unit tests (tests/unit/).
@@ -0,0 +1 @@
from .profiler import get_model_profile, add_profile_methods, print_model_profile, print_model_aggregated_profile, flops_to_string, duration_to_string, params_to_string
With this many exports, we could also just from .profiler import * and then in profiler be sure to follow public/private naming conventions. Methods starting with _ won't be included.
I put all the functions in a FlopsProfiler class, which will be the only thing to import.
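For illustration, a rough sketch of that shape, with helpers kept private via a leading underscore so they stay out of the public surface (the method names here are placeholders, not necessarily the merged API):

```python
# deepspeed/profiling/flops_profiler/profiler.py (sketch)


class FlopsProfiler:
    """Single public entry point: wraps a model and measures flops, params, and time."""
    def __init__(self, model):
        self.model = model

    def start_profile(self):
        ...  # patch the torch.nn.functional entry points and register module hooks

    def stop_profile(self):
        ...  # restore the original functionals and remove the hooks

    def get_total_flops(self):
        ...

    def print_model_profile(self):
        ...


def _flops_to_string(flops):
    ...  # helper; the leading underscore keeps it out of `from .profiler import *`
```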
# FC
F.linear = wrapFunc(F.linear, linear_flops_compute)
Will the below wraps occur each time the module is imported? Can things get double-wrapped?
I moved wrapping the functionals into profiler.start() and added unwrapping of the functionals in profiler.end().
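For illustration, a minimal sketch of that pattern: the originals are saved when profiling starts and restored when it ends, which also guards against double-wrapping (wrapFunc and linear_flops_compute here are simplified stand-ins for the versions in this PR):

```python
import torch.nn.functional as F

_flops = []                 # simple accumulator for this sketch
_original_functionals = {}  # functional name -> original function


def linear_flops_compute(input, weight, bias=None):
    # MAC count for F.linear, matching the formula discussed elsewhere in this PR.
    return input.numel() * weight.shape[0]


def wrapFunc(func, flops_compute):
    # Record the count for this call, then fall through to the real functional.
    def wrapped(*args, **kwargs):
        _flops.append(flops_compute(*args, **kwargs))
        return func(*args, **kwargs)
    return wrapped


def start_profile():
    # Wrap only when profiling starts, and refuse to wrap twice.
    if _original_functionals:
        return
    _original_functionals['linear'] = F.linear
    F.linear = wrapFunc(F.linear, linear_flops_compute)


def end_profile():
    # Restore the originals so nothing stays patched once profiling is done.
    for name, fn in _original_functionals.items():
        setattr(F, name, fn)
    _original_functionals.clear()
```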
F.linear = wrapFunc(F.linear, linear_flops_compute)

# convolutions
F.conv1d = wrapFunc(F.conv1d, conv_flops_compute)
Do we also unwrap these functionals after profiling is complete? The polite thing might be to only wrap while profiling is active, if that's possible. I see we do that with modules in end_profile?
F.avg_pool2d = wrapFunc(F.avg_pool2d, pool_flops_compute)
F.avg_pool3d = wrapFunc(F.avg_pool3d, pool_flops_compute)
F.max_pool1d = wrapFunc(F.max_pool1d, pool_flops_compute)
F.max_pool2d = wrapFunc(F.max_pool2d, pool_flops_compute)
It would be cool to have a data structure that maps a list of functionals to the compute wrapper, and then at wrap time the engine code just traverses that data structure to do this work.
Monkey patching does not work if the target function is referenced as an item in a loop.
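For illustration, a sketch of a table-driven version that sidesteps that problem by keying on the functional's name and patching through setattr, rather than reassigning a loop variable (it assumes wrapFunc and the *_flops_compute helpers from this PR are in scope):

```python
import torch.nn.functional as F

# Map functional names to the flops-compute function that handles them.
_PATCH_TABLE = {
    'linear': linear_flops_compute,
    'conv1d': conv_flops_compute,
    'conv2d': conv_flops_compute,
    'avg_pool2d': pool_flops_compute,
    'max_pool2d': pool_flops_compute,
}


def patch_functionals():
    for name, compute in _PATCH_TABLE.items():
        setattr(F, name, wrapFunc(getattr(F, name), compute))


# The pitfall the reply refers to: rebinding a loop variable only changes the
# local name, not the attribute on torch.nn.functional, so nothing is patched.
#
# for fn in [F.conv1d, F.conv2d]:
#     fn = wrapFunc(fn, conv_flops_compute)
```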
    return logits, probs


if __name__ == "__main__":
This looks left over from development.
deepspeed/runtime/engine.py
from ..ops.adam import DeepSpeedCPUAdam
from ..ops.adam import FusedAdam

from deepspeed.profiling.flops_profiler.profiler import add_profile_methods, print_model_profile, print_model_aggregated_profile, flops_to_string, params_to_string
Since we manage exports within each subpackage, we can just do from deepspeed.profiling import * to expose the high-level API, I think. That's a bit nicer than having a long list like this.
Since we do have this pattern of exporting a lot of methods, what about wrapping them up in a FlopsProfiler class?
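For illustration, a sketch of how the subpackage-level exports could chain together so engine.py only needs a short import (the file paths and names are assumed from the discussion, not necessarily the merged layout):

```python
# deepspeed/profiling/flops_profiler/__init__.py (sketch)
from .profiler import FlopsProfiler

__all__ = ['FlopsProfiler']


# deepspeed/profiling/__init__.py (sketch)
from .flops_profiler import *
```

engine.py could then do from deepspeed.profiling import FlopsProfiler (or import *) instead of the long list above.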
samyam
left a comment
I want to just leave a high level review here based on some of our earlier discussion. I think it would be very cool to do the profiling using profiler.start() and profiler.stop() instead of using get_model_profile(...). This is aligned more closely with general training workflows, and additionally it would not require the user to create an input constructor. Overall, it would be a more natural and seamless way to use the profiler. What do you think @cli99?
This would also go towards addressing the import side effect we chatted about @cli99. It might be a good opportunity to use contextlib?
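For illustration, a minimal sketch of a contextlib-based wrapper around the start/stop style, assuming a FlopsProfiler with the methods discussed above:

```python
from contextlib import contextmanager


@contextmanager
def flops_profile(model):
    # Wrap the functionals / attach hooks only for the duration of the block,
    # so there is no import-time side effect and no leftover patching.
    prof = FlopsProfiler(model)
    prof.start_profile()
    try:
        yield prof
    finally:
        prof.stop_profile()


# Usage (hypothetical):
# with flops_profile(model) as prof:
#     loss = model(batch)
# prof.print_model_profile()
```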
    return p


def linear_flops_compute(input, weight, bias=None):
Do we wanna account for the bias flops when it is not None?
What would be the flops for computing the bias? I thought it was zero.
You can count the dimension of the bias (i.e., out_features), which is the number of additions. But this will probably be very small, so it's okay not to account for it.
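For illustration, if one did want to count it, the change would be small: one add per output element on top of the MACs (a sketch, not the code in this PR):

```python
import torch


def linear_flops_compute(input, weight, bias=None):
    in_features, out_features = weight.shape[1], weight.shape[0]
    macs = torch.numel(input) * out_features
    if bias is not None:
        # One add per output element; usually negligible next to the MACs.
        bias_adds = (torch.numel(input) // in_features) * out_features
        return macs + bias_adds
    return macs
```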
def linear_flops_compute(input, weight, bias=None):
    out_features = weight.shape[0]
    return torch.numel(input) * out_features
Shouldn't we multiply this by 2 to account for multiply-add?
The flops profiler outputs MACs, and the user can multiply that by 2 or by whatever factor they want. The throughput of a module is reported in TFLOPS, which I compute as MACs * 2 / time.
I see, but I was thinking that this is not always a MAC. I am not sure if we want the profiler to take point-wise operations like bias-add or normalization into account.
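For reference, the throughput computation described in the reply above amounts to something like this (an illustrative helper, not a function in this PR):

```python
def macs_to_tflops(macs, duration_seconds):
    # Each MAC is counted as 2 floating point operations (one multiply, one add),
    # then divided by the measured duration of the module's forward pass.
    return 2 * macs / duration_seconds / 1e12
```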
for step, batch in enumerate(data_loader):
    # start profiling at training step "profile_step"
    if step == profile_start_step:
        model.start_profile()
I'm wondering if it's more transparent to not overwrite the model into a profiler, and to keep the model intact? Something like
model_profiler = prof.add_profile_method(model)
model_profiler.start()
y = model(x)
model_profiler.stop()
Also for getting the info we can then do
model_profiler.print_all()
flops = model_profiler.get_flops()
I actually don't know which one is better. Maybe we should have a discussion on this?
The README shown here is out of date; the latest code does this. Happy to chat more about it.
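For illustration, the object-style usage being discussed would look roughly like this inside the training loop (names assumed from the discussion above, not necessarily the final API):

```python
prof = FlopsProfiler(model)

for step, batch in enumerate(data_loader):
    # start profiling at training step "profile_start_step"
    if step == profile_start_step:
        prof.start_profile()

    loss = model(batch)

    # stop after a few steps and report, leaving the model object itself untouched
    if step == profile_end_step:
        prof.stop_profile()
        prof.print_model_profile()
```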
Closing this one after rebasing/squashing, moved to #664