[MetaSchedule] Improve inlining and VerifyGPUCode for quantized model workload
#13334
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Force-pushed from 6e62f2a to 337c1c1 (compare)
Hey, thanks for the contribution! I was a bit uncertain if we really want to do name checking to determine constants from the compile engine, because it relies on the assumption that Relay exists and that Relay always uses this naming convention.

There is an alternative I could come up with, and please let me know if it makes sense: add a block annotation in tvm/src/relay/backend/te_compiler_cache.cc (line 275 at fbe174b), i.e.

T.block_attr({"schedule_rule": "compute_inline"})

Then register a PackedFunc that implements this rule. Let me know if it makes sense!
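To illustrate the PackedFunc half of this suggestion, here is a minimal Python sketch. The registered name "meta_schedule.compute_inline" and the exact way the "schedule_rule" annotation is dispatched to a registered function are assumptions for illustration, not something spelled out in this thread:

```python
import tvm
from tvm import tir

# Hypothetical registration: the name is assumed to be what the "schedule_rule"
# block annotation would resolve to; adjust to the actual dispatch convention.
@tvm.register_func("meta_schedule.compute_inline")
def inline_annotated_constant(sch: tir.Schedule, block: tir.schedule.BlockRV):
    # Inline the annotated constant block into its consumers and return the
    # resulting schedule as the only design-space candidate.
    sch.compute_inline(block)
    return [sch]
```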
@junrushao I like your idea, I'll rework this.
@junrushao I realized that an easier way would be to check the content of the block to determine if it is a constant block, rather than relying on the block name.
Force-pushed from 337c1c1 to f398453 (compare)
Removed the identification of constant blocks by name, and replaced it with a more robust method based on the block structure.
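For illustration, a minimal Python sketch of the kind of structural check meant here. The actual rule in this PR is implemented in C++, so the exact conditions may differ:

```python
from tvm import tir

def is_scalar_constant_block(block: tir.Block) -> bool:
    # A "constant block" reads nothing, writes a single buffer, and its body is
    # just a store of a literal scalar value.
    if len(block.reads) != 0 or len(block.writes) != 1:
        return False
    body = block.body
    return isinstance(body, tir.BufferStore) and isinstance(
        body.value, (tir.IntImm, tir.FloatImm)
    )
```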
cc @vinx13 @junrushao please take a look. |
junrushao left a comment
LGTM!
…el workload (apache#13334)
* [MetaSchedule] Add a new schedule rule to inline all scalar constants
* add doc
* reorg
* identify constant block by its structure, not by name
Quantized conv2d workloads contain scalar constant blocks, named "compile_engine_const" by the compile engine. These blocks can be inlined by the existing AutoInline rule, but depending on the order in which spatial blocks are processed by AutoInline, they can get in the way of ReverseComputeInline on other blocks, since the constant blocks are also counted as producer blocks. PostOrderApply currently processes the constant blocks at the very end, so ReverseComputeInline on blocks that consume such constants always fails to inline. So in practice, we are not generating a fused kernel for quantized conv2d today.

I added a simple inlining rule that inlines only such constant blocks. This rule is supposed to run before AutoInline, to unblock ReverseComputeInline; a sketch of the intended ordering follows below. This lets us generate a fused kernel. On the int8 resnet50 model from PyTorch, the e2e perf improved from 6.8 to 5.2 msec, using batch size 16 and the same number of trials.
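For illustration only, a minimal sketch of how such a rule could be ordered ahead of AutoInline in a custom MetaSchedule rule list. The rule name InlineConstantScalars and the AutoInline parameters reflect the public meta_schedule Python API as I understand it and may differ across TVM versions:

```python
from tvm import meta_schedule as ms

# Hedged sketch: place the constant-scalar inlining rule before AutoInline so that
# scalar constant blocks are already gone when ReverseComputeInline is attempted.
schedule_rules = [
    ms.schedule_rule.InlineConstantScalars(),  # inline "compile_engine_const"-style blocks first
    ms.schedule_rule.AutoInline(
        into_producer=False,
        into_consumer=True,
        inline_const_tensor=True,
        disallow_if_then_else=False,
        require_injective=False,
        require_ordered=False,
    ),
    # ... tiling / thread-binding rules would follow here
]
```

A list like this could then be handed to a PostOrderApply space generator (for example via its sch_rules argument), so the rest of the rules only ever see the fused workload.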
VerifyGPUCode only checks the vector width used in BufferLoad and BufferStore. But quantized models use specialized intrinsics like q_multiply_shift_per_axis, which uses 64-bit arithmetic internally. To accurately account for the data types used in a block, we need to lower those intrinsics before invoking TIR VerifyGPUCode and check the dtype of CastNode.

@vinx13 @junrushao @zxybazh
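For reference, a hedged sketch of what "lower the intrinsics before verifying" could look like from Python. The constraint values and the exact pass sequence used by the MetaSchedule postprocessor are assumptions here, not the implementation in this PR:

```python
import tvm
from tvm import tir

def verify_gpu_code_after_lowering(func: tir.PrimFunc, target: tvm.target.Target) -> bool:
    # Lower target intrinsics (e.g. q_multiply_shift_per_axis) first so that the
    # 64-bit intermediates they introduce become visible as Cast nodes that the
    # verifier can inspect.
    mod = tvm.IRModule({"main": func.with_attr("target", target)})
    mod = tvm.tir.transform.LowerIntrin()(mod)
    constraints = {  # illustrative limits, roughly matching a typical CUDA target
        "max_shared_memory_per_block": 49152,
        "max_threads_per_block": 1024,
        "max_vthread": 8,
        "max_vector_bytes": 16,
    }
    return tir.analysis.verify_gpu_code(mod["main"], constraints)
```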