Conversation

@mbs-octoml (Contributor) commented:

See https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md.

This completes the check-in of our Collage 'sketch' branch into main. Special thanks
to Matthew Barrett for his help getting this over the line.

The only C++ functionality added here is for 'pruning' candidates. This is a somewhat
speculative algorithm (and I've called that out in the comments) which tries to
elide candidate partitions which will 'obviously' not contribute to the final optimal
partitioning. For largish models such as GPT2 this can significantly reduce the number of
candidates on which we need to actually measure latency. I beefed up the MockCostEstimator to
make it possible to assert pruning occurred from within the test_pass_collage_partition.py
unit test.
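As a rough illustration of the idea (a minimal, self-contained sketch; `Candidate`, `prune_candidates`, and the backend names are illustrative only, not the C++ API added in this PR), one simple way to elide 'obviously' non-optimal candidates is to drop any candidate whose covered nodes are a strict subset of a same-backend candidate's:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    backend: str
    nodes: frozenset  # dataflow nodes covered by this candidate partition


def prune_candidates(candidates):
    """Drop candidates strictly subsumed by a same-backend candidate.

    Speculative heuristic: if another candidate for the same backend covers a
    strict superset of this candidate's nodes, assume the smaller candidate
    cannot appear in the optimal partitioning and skip measuring it.
    """
    return [
        c
        for c in candidates
        if not any(
            o.backend == c.backend and o.nodes > c.nodes  # strict superset
            for o in candidates
        )
    ]


cands = [
    Candidate("tensorrt", frozenset({"conv", "relu"})),
    Candidate("tensorrt", frozenset({"conv"})),  # subsumed, so pruned
    Candidate("cublas", frozenset({"dense"})),
]
print(len(prune_candidates(cands)))  # → 2
```

The payoff is that every pruned candidate is one fewer latency measurement, which is where the savings on large models like GPT2 come from.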

The rest of this PR adds the demo_collage_partition.py driver file we've been using
to test and measure performance differences against various baselines (though only
for the CUDA ecosystem). To eliminate loading time, the models of interest are
expressed directly in Relay text form in menangerie.py.
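The kind of assertion the beefed-up MockCostEstimator enables can be sketched as follows (`RecordingCostEstimator` and `partition` are hypothetical stand-ins, not the classes in this PR): the estimator returns canned latencies and records every candidate it is asked to measure, so a unit test can assert that pruned candidates never reached measurement.

```python
class RecordingCostEstimator:
    """Hypothetical stand-in for MockCostEstimator: returns canned latencies
    and records which candidates were actually measured."""

    def __init__(self, costs):
        self.costs = costs   # candidate name -> fake latency (ms)
        self.measured = []   # candidates that reached measurement

    def estimate(self, name):
        self.measured.append(name)
        return self.costs[name]


def partition(candidates, estimator):
    """Toy driver: prune strictly subsumed candidates, measure survivors."""
    survivors = [
        (name, nodes)
        for name, nodes in candidates
        if not any(nodes < other for _, other in candidates)  # strict subset
    ]
    best = min(survivors, key=lambda c: estimator.estimate(c[0]))
    return best[0]


candidates = [("conv+relu", {"conv", "relu"}), ("conv", {"conv"})]
est = RecordingCostEstimator({"conv+relu": 1.0, "conv": 2.0})
best = partition(candidates, est)  # "conv" is pruned before measurement
```

Because the mock logs measurements instead of running anything on hardware, the test can check pruning behaviour deterministically and without a GPU.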

# CAUTION: Requires some changes in python/tvm/autotvm/task/dispatcher.py
# so that AutoTVM tuning records can be cached between runs and between
# models. See https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-hacks.
A contributor commented:

Just noting for posterity: these hacks are needed because autotvm isn't properly caching results? Does that lead to much longer tuning times than necessary, or some other breakage?

@mbs-octoml (Contributor, Author) replied:

It's so that the autotvm tuning helpers in demo_collage_partition.py can use the existing tuning records as a cache which can be shared over all models. I.e., a poor man's TRS.
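The caching idea can be illustrated with a self-contained sketch (pure Python, illustrative only; this is not autotvm's actual dispatcher API, and `TuningRecordCache` is a hypothetical name): a single on-disk record store is consulted before tuning, so a workload shared between models, or between runs, is tuned at most once.

```python
import json
import os
import tempfile


class TuningRecordCache:
    """Illustration only: one record store shared across models and runs."""

    def __init__(self, path):
        self.path = path
        self.records = {}
        if os.path.exists(path):
            with open(path) as f:
                self.records = json.load(f)  # reuse prior runs' records

    def tune(self, workload_key, tune_fn):
        # Cache hit: reuse the record some earlier model/run produced.
        if workload_key not in self.records:
            self.records[workload_key] = tune_fn(workload_key)
            with open(self.path, "w") as f:
                json.dump(self.records, f)  # persist for later runs
        return self.records[workload_key]


# Two "models" sharing a conv2d workload: the second never re-tunes it.
path = os.path.join(tempfile.mkdtemp(), "records.json")
calls = []


def expensive_tune(key):
    calls.append(key)  # stands in for a long AutoTVM tuning session
    return {"latency_ms": 1.23}


cache = TuningRecordCache(path)
cache.tune("conv2d_nchw", expensive_tune)  # model A: tunes and records
cache = TuningRecordCache(path)            # fresh run / model B
cache.tune("conv2d_nchw", expensive_tune)  # cache hit: no re-tune
```

The hacked dispatcher referenced in the comment plays this role for real AutoTVM records; the sketch only shows why sharing one record file across models avoids redundant tuning.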

@jwfromm jwfromm merged commit d436501 into apache:main Jul 15, 2022
@mbs-octoml mbs-octoml deleted the mbs-collage-sketch branch July 15, 2022 18:18
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [Collage] PruneCandidates and demo_collage_partition.py

* - lint
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023
* [Collage] PruneCandidates and demo_collage_partition.py

* - lint