[Collage] PruneCandidates and demo_collage_partition.py #12105

Conversation
See https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md. This completes the check-in of our Collage 'sketch' branch into main. Special thanks to Matthew Barrett for his help getting this over the line.

The only C++ functionality added here is for 'pruning' candidates. This is a somewhat speculative algorithm (and I've called that out in the comments) which tries to elide candidate partitions which will 'obviously' not contribute to the final optimal partitioning. For largish models such as GPT2 this can significantly reduce the number of candidates we need to actually measure latency on. I beefed up the MockCostEstimator to make it possible to assert pruning occurred from within the test_pass_collage_partition.py unit test.

The rest of this PR adds the demo_collage_partition.py driver file we've been using to test and measure performance differences against various baselines (though only for the CUDA ecosystem). To eliminate loading time, the models of interest are expressed directly in Relay text form in menangerie.py.
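For intuition, here is a minimal Python sketch of the dominance-style pruning idea. The real PruneCandidates pass is implemented in C++ and its exact rules differ; `Candidate` and `prune_candidates` below are hypothetical names for illustration, not the TVM API.

```python
# Illustrative sketch only: the actual PruneCandidates pass is C++ and
# considers more than shown here. Names and fields are assumptions.
from typing import FrozenSet, List, NamedTuple

class Candidate(NamedTuple):
    nodes: FrozenSet[int]   # dataflow node ids the partition would cover
    estimated_cost: float   # estimated latency, e.g. in microseconds

def prune_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Elide candidates which 'obviously' cannot appear in an optimal
    partitioning: some other candidate covers exactly the same nodes at an
    estimated cost that is no worse."""
    kept = []
    for i, cand in enumerate(candidates):
        dominated = any(
            j != i
            and other.nodes == cand.nodes
            # Tuple comparison: lower cost dominates; ties are broken by
            # index so exactly one of two equal candidates survives.
            and (other.estimated_cost, j) < (cand.estimated_cost, i)
            for j, other in enumerate(candidates)
        )
        if not dominated:
            kept.append(cand)
    return kept
```

Because pruning only consults estimates, a unit test can drive it with a mock estimator (as test_pass_collage_partition.py does with the beefed-up MockCostEstimator) and assert that dominated candidates were dropped before any real latency measurement.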
# CAUTION: Requires some changes in python/tvm/autotvm/task/dispatcher.py
# so that AutoTVM tuning records can be cached between runs and between
# models. See https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-hacks.
Just noting for posterity: these hacks are needed because autotvm isn't properly caching results? Does that lead to much longer tuning times than necessary, or some other breakage?
It's so that the autotvm tuning helpers in demo_collage_partition.py can use the existing tuning records as a cache which can be shared over all models. I.e. a poor man's TRS.
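For readers without access to the linked branch: stock AutoTVM does not share records across models out of the box, but the effect is roughly what the sketch below achieves with standard autotvm APIs, keeping one shared log file that is consulted before tuning and appended to afterwards. The log path and the `tune_if_needed` helper are assumptions for illustration, not what the branch actually does.

```python
# Hedged sketch of cross-run/cross-model tuning-record reuse. The helper
# name and shared log location are hypothetical.
import os
import tvm
from tvm import autotvm

SHARED_LOG = os.path.expanduser("~/.tvm/collage_autotvm.log")  # assumed path

def tune_if_needed(tasks, measure_option):
    """Append new tuning records to a shared log, skipping tasks that
    already have records from an earlier run or another model."""
    tuned_keys = set()
    if os.path.exists(SHARED_LOG):
        for inp, _ in autotvm.record.load_from_file(SHARED_LOG):
            tuned_keys.add((inp.task.name, str(inp.task.args)))
    for task in tasks:
        if (task.name, str(task.args)) in tuned_keys:
            continue  # cache hit: reuse prior records instead of re-tuning
        tuner = autotvm.tuner.XGBTuner(task)
        tuner.tune(
            n_trial=64,
            measure_option=measure_option,
            callbacks=[autotvm.callback.log_to_file(SHARED_LOG)],
        )

# Later, compile using the best records from the shared log:
# with autotvm.apply_history_best(SHARED_LOG):
#     lib = tvm.relay.build(mod, target="cuda")
```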