
Conversation

@ibsidorenko (Contributor) commented Oct 14, 2022

The main goal of this commit is to improve performance on the Hexagon target while preserving performance/accuracy on x86, GPU, and other targets.

The "qnn.requantize" operation is lowered into a sequence of multiply, add, and shift operations during the QNN canonicalization pass when the scale quantization parameter is a vector of scalars. This commit adds a new Relay per-channel/per-axis FixedPointMultiply operation, which is used in "qnn.requantize" lowering.

The per-channel/per-axis FixedPointMultiply is implemented through the tir.q_multiply_shift_per_axis intrinsic. For the Hexagon target it overrides the default implementation and generates HVX vmpye/vmpyo instructions (see _q_multiply_shift_per_axis_hexagon). All other targets use the default implementation (64-bit arithmetic).
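As a rough illustration (not the actual TVM implementation), the default 64-bit path can be sketched as follows, with Python integers standing in for 64-bit arithmetic and round-half-up rounding assumed:

```python
# Illustrative sketch only: assumed round-half-up semantics for
# out = round(x * y * 2^-s), with y a fixed point multiplier and
# s > 0 a right shift. Python ints stand in for 64-bit arithmetic.
def q_multiply_shift(x, y, s):
    # Adding 2^(s-1) before the arithmetic right shift turns
    # truncation into round-half-up.
    return (x * y + (1 << (s - 1))) >> s

def q_multiply_shift_per_axis(xs, ys, shifts):
    # Per-channel variant: each channel gets its own multiplier and
    # shift, which is what qnn.requantize needs for vector scales.
    return [q_multiply_shift(x, y, s) for x, y, s in zip(xs, ys, shifts)]
```

For example, `q_multiply_shift(10, 3, 2)` computes round(30 / 4) = 8, and the per-axis variant simply applies an independent multiplier/shift pair per channel.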

Performance/accuracy measurement:

  • CPU (x86) target: accuracy and performance are unchanged. Other targets should be unaffected as well (anything else would be a bug).

  • Hexagon target: qnn.requantize is 7x-9x faster (Snapdragon 888, 4.4 ms -> 0.5 ms)

@tvm-bot (Collaborator) commented Oct 14, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@ibsidorenko ibsidorenko changed the title [Relay][Hexagon] Added per-channel FixedPointMultiply operation. [Relay][Hexagon] Add per-channel FixedPointMultiply operation Oct 14, 2022
@masahi (Member) commented Oct 14, 2022

@tvm-bot rerun

right shift. This is because we are rounding twice instead of only once, i.e.:

* original q_multiply_shift: round(x*y*2^-s)
* hexagon q_multiply_shift: round(round(x*y)*2^-s)
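A tiny numeric check (illustrative values, not from the PR) makes the difference between the two schemes concrete, assuming round-half-up rounding:

```python
import math

def rnd(v):
    # round-half-up, an assumed stand-in for the hardware rounding mode
    return math.floor(v + 0.5)

# Suppose the fractional product x*y is 2.5 and the right shift s is 1.
prod, s = 2.5, 1
single = rnd(prod * 2 ** -s)       # round(1.25) = 1
double = rnd(rnd(prod) * 2 ** -s)  # round(round(2.5) * 0.5) = round(1.5) = 2
```

Rounding once gives 1, rounding twice gives 2, so the double-rounded result can be off by one unit in the last place.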
Member

cc @kparzysz-quic @jverma-quic on this HVX implementation.

Contributor Author

I will add a test to demonstrate the issue with accuracy.

Contributor Author

I have fixed the issue with the accuracy drop.
For the case when both right and left shifts are needed at the same time, I use the "old" approach and lower this operation to the sequence left_shift/multiply/add/right_shift (64-bit arithmetic). Right now I have no idea how to implement this case through vector HVX instructions without an accuracy drop.
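The fallback path described here can be sketched roughly as follows (a hypothetical Python model, not the actual TIR lowering; Python integers stand in for 64-bit values and round-half-up is assumed):

```python
def fixed_point_multiply_fallback(x, y, left_shift, right_shift):
    # Hypothetical model of the 64-bit sequence mentioned above:
    # left_shift / multiply / add / right_shift.
    val = x << left_shift                 # pre-scale the input
    val = val * y                         # widening multiply
    val = val + (1 << (right_shift - 1))  # rounding term (half of 2^s)
    return val >> right_shift             # arithmetic right shift
```

Because the rounding term is added only once, just before the final right shift, this sequence avoids the double-rounding accuracy drop discussed above, at the cost of scalar 64-bit arithmetic instead of vector HVX instructions.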

@areusch added and then removed the needs-triage label (PRs or issues that need to be investigated by maintainers to find the right assignees) Oct 19, 2022
@ibsidorenko (Contributor Author)

@tvm-bot rerun

PrimExpr right_shift = call->args[3];
PrimExpr q = call->args[4];
PrimExpr is_lshift_required = call->args[5];
// Note, 7th argument is "is_rshift_required" flag, but we do need that here.
Member

You mean "don't need"?

Contributor Author

Oh... yes, exactly. My bad, this is a typo in the comment.

@masahi (Member) left a comment

@kparzysz-quic This PR improves performance on int8 resnet50 over PR #12911 while preserving accuracy.

Manual schedules (no tuning): 146 msec (before) -> 92 msec.
Tuned schedules (vrmpy auto tensorization): 105 msec -> 58 msec.

Very cool!

cc @tmoreau89 @csullivan

@masahi masahi merged commit 645a5ea into apache:main Oct 27, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 10, 2022
…#13080)

* [Relay][Hexagon] Add per-channel FixedPointMultiply operation

Main goal of this commit is to improve performance for Hexagon target and
preserve performance/accuracy for x86, GPU and etc. targets.

"qnn.requantize" operation is lowered into the sequence of multiply, add, shift
during QNN canonicalization pass if scale quantization parameter is the vector
of scalars. This commit adds new Relay per-channel/per-axis FixedPointMultiply
operation and is used in "qnn.requantize" operation lowering.

per-channel/per-axis FixedPointMultiply is implemented through
tir.q_multiply_shift_per_axis intrinsic. For Hexagon target it overrides default
implementation and generates HVX vmpye/vmpyo instruction (see
_q_multiply_shift_per_axis_hexagon). For all other targets it uses default
implementation (64 bits arithmetic).

Performance/accuracy measurement:

CPU(x86) target: accuracy and performance are the same. For other targets should
be the same (otherwise it is bug).

Hexagon target: speedup of qnn.requantize 7x-9x times (Snapdragon 888, 3.08 ms -> 0.39 ms)

* Address code review comments
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
…#13080)

* [Relay][Hexagon] Add per-channel FixedPointMultiply operation

* Address code review comments
@ibsidorenko ibsidorenko deleted the fpm-per-channel branch March 29, 2023 06:26