[Hexagon] Add HVX quant conv2d implementation #13256
Changes from all commits
```diff
@@ -20,6 +20,7 @@
 #include <tvm/runtime/c_runtime_api.h>
 #include <tvm/runtime/device_api.h>

 #include <algorithm>
+#include <cassert>

 #ifndef TVM_RUNTIME_HEXAGON_OPS_CONV2D_H_
```
```diff
@@ -28,6 +29,7 @@
 namespace tvm {
 namespace runtime {
 namespace hexagon {
+namespace conv_utils {
 static constexpr auto hexagon_device = DLDevice{static_cast<DLDeviceType>(kDLHexagon), 0};

 // Standalone DLTensor: the standalone-ness means that this object owns the shape
```
```diff
@@ -75,26 +77,49 @@ inline void* to_ptr(uintptr_t v) { return reinterpret_cast<void*>(v); }

 inline uintptr_t to_uint(void* ptr) { return reinterpret_cast<uintptr_t>(ptr); }

-constexpr int xyc_to_sm_16b(int y, int x, int c) {
+constexpr int yxc_to_sm_16b(int y, int x, int c) {
   // Map y,x,c coordinates within a block to the offset (in 16-bit elements)
   // from the beginning of the block in spatial-major layout.
   // 10-bit spatial mask: yyyxcccccx
+  assert(y >= 0 && x >= 0 && c >= 0);
+  assert(y < 8 && x < 4 && c < 32);
   return y << 7 | (x & 2) << 5 | c << 1 | (x & 1);
 }
```
```diff
+constexpr int yxc_to_sm_8b(int y, int x, int c) {
+  // Map y,x,c coordinates within a block to the offset (in 8-bit elements)
+  // from the beginning of the block in spatial-major layout.
+  // 11-bit spatial mask: yyyxxxccccc
```
Contributor:
Add a check to make sure only the bits we expect are set in the inputs: for y and x only the lowest 3 bits, and for c only 5 bits.

Author:
I consciously avoided the checks here because these functions are used for indexing within the innermost loops and need to be really fast. I actually was planning to remove the check from the above … I thought I had was to add the …

Contributor:
I'm pretty sure we can rely on …

Contributor:
Another option is to check the loop bounds in the caller to make sure y, x and c can't get bigger than can be expressed. (And put a comment here to that effect: that it is the caller's responsibility to check on release builds.)

Author:
I've added the asserts directly inside the index functions; they are disabled with Release builds. I thought about adding it in the outer loops as you suggested, but that is anyway guaranteed with the current code, as …

Contributor:
Agreed, this is much safer!
```diff
+  assert(y >= 0 && x >= 0 && c >= 0);
+  assert(y < 8 && x < 8 && c < 32);
+  return y << 8 | x << 5 | c;
+}
+
+constexpr int hwio_to_sm_8b(int width, int y, int x, int i, int o) {
+  // Map y,x,i,o coordinates within a chunk (assuming the origin at the
+  // top-left spatial corner) to the offset (in 8-bit elements) from the
+  // beginning of the chunk in spatial-major layout.
+  // Spatial mask: p..piiioooooii, where p..p are position bits.
+  assert(width >= 1);
+  assert(y >= 0 && x >= 0 && i >= 0 && o >= 0);
+  assert(i < 32 && o < 32);
+  int p = y * width + (width - 1 - x);
+  return p << 10 | (i & 0x1c) << 5 | o << 2 | (i & 3);
```
Contributor:
Suggest similar bounds checking here.

Author:
Same comment as above. I can probably add asserts if we can disable them for release builds later.
```diff
+}

 constexpr int hwio_to_sm_16b(int width, int y, int x, int i, int o) {
   // Map y,x,i,o coordinates within a chunk (assuming the origin at the
   // top-left spatial corner) to the offset (in 16-bit elements) from the
   // beginning of the chunk in spatial-major layout.
   // Spatial mask: p..piiiioooooi, where p..p are position bits.
+  assert(width >= 1);
+  assert(y >= 0 && x >= 0 && i >= 0 && o >= 0);
+  assert(i < 32 && o < 32);
   int p = y * width + (width - 1 - x);
   return p << 10 | (i & 0x1e) << 5 | o << 1 | (i & 1);
 }
```
```diff
-inline constexpr int round_up(int v, int p2) { return (v + p2 - 1) & -p2; }
+constexpr int round_up(int v, int p2) { return (v + p2 - 1) & -p2; }

 // Returns the block address at the given index
 // Assumptions
```
```diff
@@ -123,6 +148,10 @@ inline uintptr_t hwio_at(const DLTensor& f, int y, int x, int i, int o) {
  * The input is mapped into the below mentioned layout (notation similar to index map used for
  * transform layout):
  *
+ * For uint8_t type
+ * lambda n, h, w, c: n, h//8, w//8, c//32, AXIS_SEPARATOR, h%8, w%8, c%32
+ *
+ * For uint16_t type
  * lambda n, h, w, c: n, h//8, w//4, c//32, AXIS_SEPARATOR, h%8, (w%4)//2, c%32, w%2
  *
  * where AXIS_SEPARATOR represents split up in the physical layout
```
```diff
@@ -133,7 +162,48 @@ inline uintptr_t hwio_at(const DLTensor& f, int y, int x, int i, int o) {
  * @param width
  * @param depth
  */
-void blockize_hwc_16b(void* out, void* inp_flat, int height, int width, int depth);
+template <typename T, int block_height, int block_width, int block_depth>
+void blockize_hwc(void* out, void* inp_flat, int height, int width, int depth) {
```
Contributor:
Would it make sense for … ? This is probably a bit of a stylistic choice; I just figured I'd ask.

Author:
I agree with both, I'll add the asserts and the …
```diff
+  int (*index_func)(int, int, int);
+  if constexpr (std::is_same_v<T, uint8_t>)
+    index_func = yxc_to_sm_8b;
+  else if constexpr (std::is_same_v<T, uint16_t>)
+    index_func = yxc_to_sm_16b;
+  else
+    LOG_ERROR << "blockize_hwc is only supported for uint8_t and uint16_t types";
+
+  auto inp_data = static_cast<T*>(inp_flat);
+  auto out_data = static_cast<uintptr_t*>(out);
+  const int stride_x = depth;
+  const int stride_y = stride_x * width;
+
+  for (int cy = 0; cy < height; cy += block_height) {
+    for (int cx = 0; cx < width; cx += block_width) {
+      for (int cc = 0; cc < depth; cc += block_depth) {
+        auto block = reinterpret_cast<T*>(*out_data++);
+        int max_y = std::min(block_height, height - cy);
+        int max_x = std::min(block_width, width - cx);
+        int max_c = std::min(block_depth, depth - cc);
+        for (int y = 0; y < max_y; ++y) {
+          for (int x = 0; x < max_x; ++x) {
+            for (int c = 0; c < max_c; ++c) {
+              block[index_func(y, x, c)] =
+                  inp_data[(cy + y) * stride_y + (cx + x) * stride_x + (cc + c)];
+            }
+            for (int c = max_c; c < block_depth; ++c) block[index_func(y, x, c)] = 0;
+          }
+          for (int x = max_x; x < block_width; ++x) {
+            for (int c = 0; c < block_depth; ++c) block[index_func(y, x, c)] = 0;
+          }
+        }
+
+        for (int y = max_y; y < block_height; ++y)
+          for (int x = 0; x < block_width; ++x)
+            for (int c = 0; c < block_depth; ++c) block[index_func(y, x, c)] = 0;
+      }  // cc
+    }  // cx
+  }  // cy
+}
```
```diff

 /**
  * @brief Convert back from non-contiguous layout to a flat layout
@@ -144,7 +214,42 @@ void blockize_hwc_16b(void* out, void* inp_flat, int height, int width, int depth)
  * @param width
  * @param depth
  */
```
```diff
-void deblockize_hwc_16b(void* out_flat, void* inp, int height, int width, int depth);
+template <typename T, int block_height, int block_width, int block_depth>
+void deblockize_hwc(void* out_flat, void* inp, int height, int width, int depth) {
```
Contributor:
Would it make sense for the type of … ?

Author:
I'll add the …
```diff
+  int (*index_func)(int, int, int);
+  if constexpr (std::is_same_v<T, uint8_t>)
+    index_func = yxc_to_sm_8b;
+  else if constexpr (std::is_same_v<T, uint16_t>)
+    index_func = yxc_to_sm_16b;
+  else
+    LOG_ERROR << "deblockize_hwc is only supported for uint8_t and uint16_t types";
+
+  uintptr_t* inp_data = static_cast<uintptr_t*>(inp);
+  T* out_data = static_cast<T*>(out_flat);
+  const int stride_x = depth;
+  const int stride_y = stride_x * width;
+
+  for (int cy = 0; cy < height; cy += block_height) {
+    for (int cx = 0; cx < width; cx += block_width) {
+      for (int cc = 0; cc < depth; cc += block_depth) {
+        auto block = reinterpret_cast<T*>(*inp_data);
+        int max_y = std::min(block_height, height - cy);
+        int max_x = std::min(block_width, width - cx);
+        int max_c = std::min(block_depth, depth - cc);
+        for (int y = 0; y < max_y; ++y) {
+          for (int x = 0; x < max_x; ++x) {
+            for (int c = 0; c < max_c; ++c) {
+              out_data[(cy + y) * stride_y + (cx + x) * stride_x + (cc + c)] =
+                  block[index_func(y, x, c)];
+            }
+          }
+        }
+
+        inp_data++;
+      }
+    }
+  }
+}
```
```diff
 /**
  * @brief Convert the layout of weights from flat to "chunked". The term chunked is explained below:

@@ -175,22 +280,50 @@ void deblockize_hwc_16b(void* out_flat, void* inp, int height, int width, int depth)
  */
 void chunkify_hwio_16b(void** out_ptr, int out_ptr_size, void* out, void* inp, int height,
                        int width, int idepth, int odepth);
+void chunkify_hwio_8b(void** out_ptr, int out_ptr_size, void* out, void* inp, int height, int width,
+                      int idepth, int odepth);

+template <typename T, int block_height, int block_width, int block_depth>
 SDLTensor<4> prepare_nhwc(tvm::runtime::DeviceAPI* device_api, const DLTensor* nhwc_flat,
-                          bool copy_data);
+                          bool copy_data) {
+  tvm::runtime::String vtcm_scope = "global.vtcm";
+
+  // Allocate blocks for activations. We will use the block pointers
+  // directly from the allocated area.
+  int n = nhwc_flat->shape[0];
+  int h = round_up(nhwc_flat->shape[1], block_height);
+  int w = round_up(nhwc_flat->shape[2], block_width);
+  int c = round_up(nhwc_flat->shape[3], block_depth);
+  int64_t shape_2d[2] = {(n * h * w * c) / (block_height * block_width * block_depth),
+                         block_height * block_width * block_depth};
+  void* nhwc_vtcm =
+      device_api->AllocDataSpace(hexagon_device, 2, shape_2d, nhwc_flat->dtype, vtcm_scope);
+  if (copy_data) {
+    blockize_hwc<T, block_height, block_width, block_depth>(
+        nhwc_vtcm, nhwc_flat->data, nhwc_flat->shape[1], nhwc_flat->shape[2], nhwc_flat->shape[3]);
+  }
+
-int calculate_num_weight_chunks(int64_t* shape_hwio);
+  return SDLTensor<4>(nhwc_vtcm, nhwc_flat->dtype, nhwc_vtcm,
+                      {n, h / block_height, w / block_width, c / block_depth});
+}
```
```diff

+int calculate_num_weight_chunks(int64_t* shape_hwio, int chunk_height, int chunk_width,
+                                int chunk_in_channel, int chunk_out_channel);

 SDLTensor<4> prepare_hwio(tvm::runtime::DeviceAPI* device_api, const DLTensor* hwio_flat,
                           int num_chunks, void** ptr_table);

+SDLTensor<4> prepare_hwio_8b(tvm::runtime::DeviceAPI* device_api, const DLTensor* hwio_flat,
+                             int num_chunks, void** ptr_table, int wgt_zp = 0);

 template <size_t N>
 void release(tvm::runtime::DeviceAPI* device_api, const SDLTensor<N>& tensor) {
   if (auto* data_space = tensor.GetDataSpace()) {
     device_api->FreeDataSpace(hexagon_device, data_space);
   }
 }

+}  // namespace conv_utils
 }  // namespace hexagon
 }  // namespace runtime
 }  // namespace tvm
```
Contributor:
Are we confident that -mhvx is supported by all of the compilers that might build this code? I'm assuming that typically the clang provided by the Hexagon Toolchain will be used, but I'm a little fuzzy about the intended level of support for other compilers, e.g. a user-supplied build of Clang/LLVM.

Contributor:
Would it make sense to update src/runtime/hexagon/README.md to clarify the version(s) of LLVM that support flags like -mhvx? Or alternatively, use CMake's CheckCXXCompilerFlag function to see if -mhvx is supported, and only use that flag if it is?

Author:
Thanks for the review @cconvey. I can add the details in the README or add a CMake check, but the -mhvx flag was added to clang all the way back in 2017, in the LLVM 6.0 release if not earlier, which predates the entire TVM project, so we can probably safely assume that the -mhvx flag will be available for practically anyone building TVM now. If you think it might still be better to add the check or the README change, please let me know which one you think makes more sense and I can make that change. Thanks.

Contributor:
That makes total sense, I didn't realize -mhvx support went back that far. I agree that there's no need for any additional documentation or checking.