[Relay][Hexagon] Add per-channel FixedPointMultiply operation #13080
@tvm-bot rerun
> right shift. This is because we are rounding twice instead of only once, i.e.:
>   * original q_multiply_shift: round(x*y*2^-s)
>   * hexagon q_multiply_shift: round(round(x*y)*2^-s)
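The double-rounding effect described above can be demonstrated with a small sketch (the function names here are illustrative, not TVM APIs): rounding a product once at the final precision can give a different integer than rounding it at an intermediate precision first and again after the remaining shift.

```python
import math

def round_half_up(v):
    # round-to-nearest, ties rounded up (toward +inf)
    return math.floor(v + 0.5)

def single_round(prod, s):
    # original q_multiply_shift behavior: round(x*y * 2^-s), one rounding step
    return round_half_up(prod / 2 ** s)

def double_round(prod, s, k):
    # hexagon-style behavior: round the product at an intermediate shift k
    # first, then round again after the remaining shift (s - k)
    inner = round_half_up(prod / 2 ** k)
    return round_half_up(inner / 2 ** (s - k))

# prod = x*y = 5, total shift s = 2, intermediate shift k = 1
print(single_round(5, 2))     # -> 1
print(double_round(5, 2, 1))  # -> 2, off by one: the accuracy drop under discussion
```

The inner rounding turns 2.5 into 3, and 3/2 then rounds up to 2, while the exact value 1.25 rounds to 1 in a single step.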
cc @kparzysz-quic @jverma-quic on this HVX implementation.
I will add a test to demonstrate the accuracy issue.
I have fixed the issue with the accuracy drop.
For the case when both right and left shifts are needed at the same time, I use the "old" approach and lower this operation to the sequence left_shift/multiply/add/right_shift (64-bit arithmetic). Right now I have no idea how to implement this case through vector HVX instructions without an accuracy drop.
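The 64-bit fallback sequence mentioned above (left shift, multiply, add a rounding nudge, right shift) can be sketched like this; the function name and Q-format parameter are illustrative, not the TVM implementation:

```python
def q_multiply_shift_64(x: int, y: int, q: int, left_shift: int, right_shift: int) -> int:
    # 64-bit fallback: left_shift / multiply / add / right_shift in one pass,
    # so only a single rounding step is performed.
    # y is a fixed-point multiplier in Q`q` format (e.g. q = 31).
    val = (x << left_shift) * y       # widened multiply
    total = q + right_shift           # combine the Q-format and requantize shifts
    val += 1 << (total - 1)           # rounding nudge ("add")
    return val >> total               # arithmetic right shift

# 100 scaled by 0.5 (y = 2^30 in Q31) with an extra right shift of 1 -> 25
print(q_multiply_shift_64(100, 1 << 30, 31, 0, 1))
```

Because the nudge is added once before a single combined shift, this sequence avoids the double rounding of the two-step HVX path.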
@tvm-bot rerun
The main goal of this commit is to improve performance for the Hexagon target while preserving performance/accuracy for x86, GPU, and other targets. The "qnn.requantize" operation is lowered into a sequence of multiply, add, and shift during the QNN canonicalization pass when the scale quantization parameter is a vector of scalars. This commit adds a new Relay per-channel/per-axis FixedPointMultiply operation, which is used in the "qnn.requantize" lowering. The per-channel/per-axis FixedPointMultiply is implemented through the tir.q_multiply_shift_per_axis intrinsic. For the Hexagon target it overrides the default implementation and generates HVX vmpye/vmpyo instructions (see _q_multiply_shift_per_axis_hexagon). For all other targets it uses the default implementation (64-bit arithmetic). Performance/accuracy measurements: on the CPU (x86) target, accuracy and performance are unchanged; other targets should behave the same (otherwise it is a bug). On the Hexagon target, qnn.requantize is 7x-9x faster (Snapdragon 888, 3.08 ms -> 0.39 ms).
src/target/intrin_rule.cc (outdated diff)
PrimExpr right_shift = call->args[3];
PrimExpr q = call->args[4];
PrimExpr is_lshift_required = call->args[5];
// Note, 7th argument is "is_rshift_required" flag, but we do need that here.
You mean "don't need"?
Oh... yes, exactly. My bad, this is a typo in the comment.
@kparzysz-quic This PR improves performance on int8 resnet50 from the PR #12911 while preserving accuracy.
Manual schedules (no tuning): 146 msec (before) -> 92 msec.
Tuned schedules (vrmpy auto tensorization): 105 msec -> 58 msec.
Very cool!
The main goal of this commit is to improve performance for the Hexagon target while preserving performance/accuracy for x86, GPU, and other targets.
The "qnn.requantize" operation is lowered into a sequence of multiply, add, and shift during the QNN canonicalization pass when the scale quantization parameter is a vector of scalars. This commit adds a new Relay per-channel/per-axis FixedPointMultiply operation, which is used in the "qnn.requantize" lowering.
The per-channel/per-axis FixedPointMultiply is implemented through the tir.q_multiply_shift_per_axis intrinsic. For the Hexagon target it overrides the default implementation and generates HVX vmpye/vmpyo instructions (see _q_multiply_shift_per_axis_hexagon). For all other targets it uses the default implementation (64-bit arithmetic).
Performance/accuracy measurement:
CPU (x86) target: accuracy and performance are the same. Other targets should be unaffected (otherwise it is a bug).
Hexagon target: qnn.requantize is 7x-9x faster (Snapdragon 888, 4.4 ms -> 0.5 ms).
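The per-channel fixed-point multiply the PR describes can be modeled in NumPy as a reference semantics sketch. This is a hypothetical illustration, not the TVM intrinsic itself: each channel carries its own Q31 multiplier and shift amounts, and a single rounding nudge is applied before one combined right shift.

```python
import numpy as np

def q_multiply_shift_per_axis(x, y, left_shift, right_shift, q=31):
    # Reference (non-TVM) model of a per-channel fixed-point multiply:
    # out = round(x * 2^left_shift * y * 2^-q * 2^-right_shift)
    # x: int32 tensor; y, left_shift, right_shift: per-channel int32 values.
    x64 = x.astype(np.int64) << left_shift.astype(np.int64)
    prod = x64 * y.astype(np.int64)            # widened 64-bit product
    total = q + right_shift.astype(np.int64)   # single combined shift amount
    nudge = np.int64(1) << (total - 1)         # rounding nudge, one rounding step
    return ((prod + nudge) >> total).astype(np.int32)

# Per-channel requantize of a (channels, elems) tensor:
x = np.array([[100, -100], [48, -48]], dtype=np.int32)
y = np.array([1 << 30, 1 << 29], dtype=np.int32)   # Q31 multipliers: 0.5 and 0.25
ls = np.array([0, 0], dtype=np.int32)
rs = np.array([1, 0], dtype=np.int32)              # channel 0: extra /2
print(q_multiply_shift_per_axis(x, y[:, None], ls[:, None], rs[:, None]))
# -> [[ 25 -25]
#     [ 12 -12]]
```

Channel 0 ends up scaled by 0.25 (Q31 multiplier 0.5 plus a right shift of 1) and channel 1 by 0.25 directly, showing how each output channel gets its own multiplier/shift pair.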