[MXNET-411] Add ROI Align #10852
Conversation
|
Please add tests and documentation |
|
Hi, can you enable OpenMP in the CPU implementation of ROIAlign? That can achieve better performance. You can reference my PR #9958. |
|
Caffe2 has a C++ test. I think Caffe2's ROI Align op was written by Kaiming He. |
|
Hi. I wrote a ROIAlign forward/backward test; it may be useful. I also found that the CPU and GPU implementations differ, so I think it's better to add a CPU/GPU consistency test. |
|
@piiswrong @zhreshold Could you please review this PR, which adapts ROI Align from Caffe2. Thanks! |
| i += blockDim.x * gridDim.x) | ||
|
|
||
| // The number of cuda threads to use. 512 is used for backward compatibility | ||
| constexpr int ROI_CUDA_NUM_THREADS = 512; |
Using mshadow::cuda::kMaxThreadsPerBlock might provide better perf on newer GPUs?
Use mshadow::cuda::CheckLaunchParam to help check the launch limits
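To illustrate the launch-config discussion above: the usual pattern is to ceil-divide the element count by the threads-per-block choice (512 here, versus `mshadow::cuda::kMaxThreadsPerBlock`) and cap the grid; the grid-stride loop quoted earlier (`i += blockDim.x * gridDim.x`) then covers any elements beyond the capped grid. This is a minimal sketch; the names below are illustrative, not mshadow's actual API.

```cpp
#include <algorithm>
#include <cassert>

// Value used in this PR for backward compatibility.
constexpr int kRoiCudaNumThreads = 512;
// Classic limit on a 1-D grid dimension; a grid-stride loop handles overflow.
constexpr int kMaxGridNum = 65535;

// Ceil-divide the work size by threads-per-block, capped at the grid limit.
inline int CudaGetNumBlocks(int n, int num_threads = kRoiCudaNumThreads) {
  return std::min(kMaxGridNum, (n + num_threads - 1) / num_threads);
}
```

With this shape, `CheckLaunchParam`-style validation reduces to asserting the block and grid counts are within device limits before launching.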
| } | ||
|
|
||
| /* | ||
| template <typename T> |
| check_numeric_gradient(sym=test, location=[x1, x2], | ||
| grad_nodes={'data':'add', 'rois':'null'}, | ||
| numeric_eps=1e-4, rtol=1e-1, atol=1E-4) | ||
|
|
need a forward result check in addition to gradient check
Yeah, will add a forward result check soon 👍
|
@piiswrong @zhreshold I have added the unit tests. Please see the updates. Thanks! |
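The `check_numeric_gradient` call quoted above compares a finite-difference estimate against the op's analytic gradient within `rtol`/`atol`. A minimal sketch of that idea, with `f(x) = x*x` standing in for the operator (the helper names are illustrative, not MXNet's test API):

```cpp
#include <cassert>
#include <cmath>

// Central-difference estimate of df/dx at x, mirroring numeric_eps above.
inline double NumericGrad(double (*f)(double), double x, double eps = 1e-4) {
  return (f(x + eps) - f(x - eps)) / (2.0 * eps);
}

inline double Square(double x) { return x * x; }

// Tolerance check in the same shape as rtol/atol in the quoted test.
inline bool GradClose(double numeric, double analytic,
                      double rtol = 1e-1, double atol = 1e-4) {
  return std::fabs(numeric - analytic) <= atol + rtol * std::fabs(analytic);
}
```

A forward result check, by contrast, compares the op's output values directly against a reference (e.g. a NumPy reimplementation), which catches errors a gradient check alone can miss.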
|
|
||
| #define START_IND(a, b, c) static_cast<int>(floor(static_cast<float>(a * c) / b)) | ||
| #define END_IND(a, b, c) static_cast<int>(ceil(static_cast<float>((a + 1) * c) / b)) | ||
| #define START_IND(a, b, c) static_cast<int>(floor(static_cast<real>(a * c) / b)) |
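The `START_IND`/`END_IND` macros quoted above compute adaptive-pooling bin boundaries: output cell `a` of `b` cells over an input extent `c` covers `[floor(a*c/b), ceil((a+1)*c/b))`, so bins tile the input and may overlap by one element. Shown here as functions for clarity (a sketch, same arithmetic as the macros):

```cpp
#include <cassert>
#include <cmath>

// Start of adaptive bin a out of b bins over input extent c.
inline int StartInd(int a, int b, int c) {
  return static_cast<int>(std::floor(static_cast<float>(a * c) / b));
}

// One-past-the-end of adaptive bin a out of b bins over input extent c.
inline int EndInd(int a, int b, int c) {
  return static_cast<int>(std::ceil(static_cast<float>((a + 1) * c) / b));
}
```

For example, pooling an extent of 10 into 4 bins yields the ranges [0,3), [2,5), [5,8), [7,10): every input element is covered even though 10 is not divisible by 4.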
| T roi_start_h = offset_bottom_rois[1] * spatial_scale; | ||
| T roi_end_w = offset_bottom_rois[2] * spatial_scale; | ||
| T roi_end_h = offset_bottom_rois[3] * spatial_scale; | ||
| // T roi_start_w = round(offset_bottom_rois[0] * spatial_scale); |
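The commented-out `round(...)` line is the key difference from ROIPooling: ROI Align keeps the scaled ROI coordinates fractional and samples the feature map with bilinear interpolation instead of snapping to integer cells. A minimal sketch of that sampling over a row-major grid, following the usual Caffe2-style boundary handling:

```cpp
#include <cassert>

// Bilinearly interpolate a value at fractional position (y, x) from a
// height x width feature map stored row-major in `data`.
inline float BilinearInterpolate(const float* data, int height, int width,
                                 float y, float x) {
  if (y < -1.0f || y > height || x < -1.0f || x > width) return 0.f;
  if (y <= 0) y = 0;
  if (x <= 0) x = 0;
  int y_low = static_cast<int>(y);
  int x_low = static_cast<int>(x);
  int y_high, x_high;
  if (y_low >= height - 1) {
    y_high = y_low = height - 1;
    y = static_cast<float>(y_low);
  } else {
    y_high = y_low + 1;
  }
  if (x_low >= width - 1) {
    x_high = x_low = width - 1;
    x = static_cast<float>(x_low);
  } else {
    x_high = x_low + 1;
  }
  const float ly = y - y_low, lx = x - x_low;
  const float hy = 1.f - ly, hx = 1.f - lx;
  // Weighted sum of the four surrounding cells.
  return hy * hx * data[y_low * width + x_low] +
         hy * lx * data[y_low * width + x_high] +
         ly * hx * data[y_high * width + x_low] +
         ly * lx * data[y_high * width + x_high];
}
```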
| int rois_cols) { | ||
| DCHECK(rois_cols == 4 || rois_cols == 5); | ||
|
|
||
| for (int index = 0; index < nthreads; index++) { |
We have to use single threading in the backward pass, since there is no atomic add on CPU. We can assume no one would use the CPU to train the model :)
| // (n, c, ph, pw) is an element in the pooled output | ||
| // can be parallelized using omp | ||
| // #pragma omp parallel for num_threads(32) | ||
| for (int n = 0; n < n_rois; n++) { |
| const DType *bottom_rois = in_data[0].dptr<DType>(); | ||
| DType *grad_in = outputs[0].dptr<DType>(); | ||
|
|
||
| if (kAddTo == req[roialign::kData] || kWriteTo == req[roialign::kData]) { |
Return if NullOp before the switch?
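The suggestion above is the standard early-exit pattern: when the gradient request is `kNullOp` there is nothing to write, so bail out before dispatching. A sketch (the enum mirrors MXNet's `OpReqType` values; the helper is illustrative):

```cpp
#include <cassert>

// Mirrors MXNet's OpReqType: kNullOp means "do not write this output".
enum OpReq { kNullOp, kWriteTo, kWriteInplace, kAddTo };

// Return early on kNullOp instead of checking inside each switch case.
inline bool ShouldComputeGrad(OpReq req) {
  if (req == kNullOp) return false;
  return true;  // kWriteTo / kWriteInplace / kAddTo all need computation
}
```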
| DMLC_DECLARE_PARAMETER(ROIAlignParam) { | ||
| DMLC_DECLARE_FIELD(pooled_size) | ||
| .set_expect_ndim(2).enforce_nonzero() | ||
| .describe("fix pooled size: (h, w)"); |
These are the output ROI feature sizes. The name is compatible with ROIPooling.
| }; | ||
|
|
||
|
|
||
| struct ROIAlignGrad { |
No need for this struct. Use lambda directly at set_attr
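The point of the review comment: a named functor struct adds boilerplate when a lambda passed straight to the attribute setter does the same job. Sketched with a stand-in registry (`set_attr` below is not NNVM's real signature, just the shape of the idea):

```cpp
#include <cassert>
#include <functional>
#include <string>

// Stand-in for an op registry entry that accepts a gradient callback.
struct FakeOp {
  std::function<std::string(const std::string&)> grad_fn;
  FakeOp& set_attr(std::function<std::string(const std::string&)> fn) {
    grad_fn = std::move(fn);
    return *this;
  }
};

// Register the gradient with a lambda directly, no ROIAlignGrad struct.
inline FakeOp MakeOp() {
  FakeOp op;
  op.set_attr([](const std::string& n) { return "_backward_" + n; });
  return op;
}
```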
| // (n, c, ph, pw) is an element in the pooled output | ||
| // can be parallelized using omp | ||
| int n; | ||
| #pragma omp parallel for private(n) \ |
Thanks for adding OpenMP. Regarding NUM_OF_ROIS and CHANNELS, I found the latter is usually much larger than the former in ROIPooling, so I applied OpenMP on channels to achieve better performance. Can you help benchmark the performance with typical roi_align sizes?
Are you suggesting removing this omp?
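The trade-off under discussion: parallelizing the channel loop usually gives more parallel work than the ROI loop, since CHANNELS is typically much larger than the ROI count, and each thread owning one channel means no write races. A minimal sketch (shapes and names are illustrative; the pragma is a no-op when compiled without OpenMP, so the result is identical either way):

```cpp
#include <cassert>
#include <vector>

// Reduce each channel of a (channels x spatial) row-major buffer.
// Each iteration writes only sums[c], so the loop is race-free.
inline std::vector<float> ChannelSums(const std::vector<float>& data,
                                      int channels, int spatial) {
  std::vector<float> sums(channels, 0.f);
  #pragma omp parallel for
  for (int c = 0; c < channels; ++c) {
    for (int s = 0; s < spatial; ++s) sums[c] += data[c * spatial + s];
  }
  return sums;
}
```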
| &pre_calc); | ||
|
|
||
| int c; | ||
| #pragma omp parallel for private(c) \ |
Keeping this?
|
Removed OpenMP in the backward pass, since there is no atomic add on CPU. |
|
Gathering status: does anyone have unresolved issues? |
| DMLC_DECLARE_PARAMETER(ROIAlignParam) { | ||
| DMLC_DECLARE_FIELD(pooled_size) | ||
| .set_expect_ndim(2).enforce_nonzero() | ||
| .describe("ROI Align output roi featuremap height and width: (h, w)"); |
featuremap -> feature map
|
|
||
| NNVM_REGISTER_OP(_contrib_ROIAlign) | ||
| .describe(R"code( | ||
| ROI Align Layer |
Superfluous. Remove this line
|
I should have addressed most of the reviews. Please let me know if there are any further comments. Thanks! Related Gluon-CV PR dmlc/gluon-cv#140 |
|
Finally it passes the CI :) @piiswrong |
| width, | ||
| pooled_height, | ||
| pooled_width, | ||
| -1, |
sampling_rate missing in ROIAlignParam
Currently it uses an adaptive size. I will make it an option.
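What "adaptive size" means here: when the sampling ratio is non-positive, the number of bilinear samples per bin is derived from the bin size, following the Caffe2 convention; exposing it as a param lets callers pin it (e.g. to 2). A sketch, with illustrative names:

```cpp
#include <cassert>
#include <cmath>

// Samples per bin along one axis: use the explicit ratio when positive,
// otherwise adapt to the bin size ceil(roi_extent / pooled_extent).
inline int BinGridSize(int sampling_ratio, float roi_extent,
                       int pooled_extent) {
  return sampling_ratio > 0
             ? sampling_ratio
             : static_cast<int>(std::ceil(roi_extent / pooled_extent));
}
```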
* add roi align
* lint
* cpu gpu forward consistent
* roi align from caffe2
* rois and unit-test
* for cpplint
* use pointer instead of reference for lint
* fix
* add docs
* fix vector
* more unit test
* using mshadow
* omp
* omp on channels
* remove omp due to no cpu atomic add
* use lambda func for grads
* knullop return