Fix CUDA Compute Function For get_valid_counts and nms
#6108
Conversation
tests/python/relay/test_op_level5.py
Outdated
```python
if target in ['cuda', 'opencl']:
    return
# get_valid_count for opencl doesn't do data rearrangement
if target in ['opencl']:
```
OpenCL shares the cuda implementation, so you can enable this test too. The CI doesn't run opencl so please test it manually.
Test for opencl has been enabled.
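Since the CI doesn't exercise opencl, a small NumPy reference of what `get_valid_count` computes can serve as a manual cross-target oracle. This is a sketch, not the TVM implementation; the 0.4 score threshold is an assumed example value:

```python
import numpy as np

# Example input used elsewhere in this thread: each box is
# [class_id, score, x1, y1, x2, y2].
np_data = np.array([[[0, 0.8, 1, 20, 25, 45],
                     [1, 0.7, 30, 60, 50, 80],
                     [0, 0.4, 4, 21, 19, 40],
                     [2, 0.9, 35, 61, 52, 79],
                     [1, 0.5, 100, 60, 70, 110]]]).astype("float32")

# Assumed example threshold; kept in float32 so the 0.4-score box
# compares equal to (not greater than) the threshold.
score_threshold = np.float32(0.4)

# A box counts as valid when its score is strictly above the
# threshold and its class id is non-negative.
valid = (np_data[..., 1] > score_threshold) & (np_data[..., 0] >= 0)
valid_count = valid.sum(axis=1).astype("int32")  # -> [4]
```

Any backend's `get_valid_count` should report the same per-batch count, regardless of how it lays out the surviving boxes.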
topi/python/topi/cuda/nms.py
Outdated
```python
with ib.if_scope(
        tvm.tir.all(data[tid * elem_length + score_index] > score_threshold,
                    tvm.tir.any(id_index < 0, data[tid * elem_length + id_index] >= 0))):
    atomic_add_return[0] = atomic_add(tvm.tir.call_intrin("handle", "tir.address_of",
```
Since we are no longer using atomic_add, should we remove those intrinsic definitions?
Unused atomic_add definitions have been removed.
IIRC, data rearrangement was removed from get_valid_counts to improve performance, because the rearrangement would be done by NMS anyway. Does this PR maintain that performance? @Laurawly
@lsy643 Regarding the thread change, could you please benchmark the performance before and after your change and share the numbers?
```python
max_threads = int(tvm.target.Target.current(
    allow_none=False).max_num_threads)
nthread_tx = max_threads
nthread_bx = batch_size * num_anchors // max_threads + 1
```
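The launch configuration above is a ceiling division over the flattened element count. A pure-Python sketch (the 1024 `max_threads` is a hypothetical value; in TVM it comes from `tvm.target.Target.current().max_num_threads`) shows how the grid covers every element:

```python
# Sketch of the launch-dimension arithmetic from the snippet above.
max_threads = 1024  # hypothetical device limit

def launch_dims(batch_size, num_anchors):
    nthread_tx = max_threads
    # One block per max_threads elements, plus one. Note this rounds
    # up, and launches one spare block when the size divides evenly.
    nthread_bx = batch_size * num_anchors // max_threads + 1
    return nthread_tx, nthread_bx

# e.g. a single batch with 20000 anchors, as in the MaskRCNN-style
# workload discussed in this thread:
tx, bx = launch_dims(1, 20000)
assert tx * bx >= 1 * 20000  # the grid covers every element
```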
Would like to know more about the reason behind the change; perhaps share some benchmark numbers? In some scenarios, like TF MaskRCNN, a large number of boxes (num_anchors > 20000) are in one batch, and multiple threads here might provide a performance improvement.
It seems that the gpu get_valid_counts does not need to move valid boxes to the top of the input data, because of the argsort in the nms, so using the original get_valid_counts_ir ought to be better. And do you think rearrange_indices_out_ir is the right way to do it?
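The argsort point can be illustrated with a small NumPy sketch (the scores and the -1 padding convention here are illustrative): a descending sort by score already pushes invalid boxes behind all valid ones, so a prior compaction pass adds no information:

```python
import numpy as np

# Illustrative scores: valid boxes keep their score; invalid slots
# hold -1, following the padding convention used by get_valid_counts.
scores = np.array([0.8, -1.0, 0.4, -1.0, 0.9], dtype="float32")

# NMS sorts boxes by score before suppression. A descending argsort
# already places every valid box ahead of every invalid (-1) one,
# so compacting the data beforehand adds nothing.
order = np.argsort(-scores)
valid = scores[order] >= 0

assert valid[:int(valid.sum())].all()  # all valid entries come first
```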
Only using one thread will regress the performance a lot. More benchmarks should be shown, based on the workloads from previous PRs.
```python
max_threads = int(tvm.target.Target.current(
    allow_none=False).max_num_threads)
nthread_tx = max_threads
nthread_bx = batch_size * num_anchors // max_threads + 1
```
Only using one thread will regress the performance a lot. More benchmarks should be shown, based on the workloads from previous PRs.
```python
    return ib.get()


def rearrange_indices_out(data):
```
This will regress performance a lot; I don't recommend it.
I'm wondering what's the purpose of this PR. Currently, there's no correctness issue with …
@trevor-m @yongwww @Laurawly For the …
Could you show some benchmark numbers regarding the changes? @yongwww could have a better comment on the TensorFlow-related changes. Also, it seems that there's an illegal memory access error based on CI.
@lsy643 the …
@yongwww @Laurawly If I understand correctly, a … Therefore, it seems to make more sense if we use different test data for the …

The original data before get_valid_counts:

```python
np_data = np.array([[[0, 0.8, 1, 20, 25, 45],
                     [1, 0.7, 30, 60, 50, 80],
                     [0, 0.4, 4, 21, 19, 40],
                     [2, 0.9, 35, 61, 52, 79],
                     [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
```

cpu test data after get_valid_counts:

```python
np_data = np.array([[[0, 0.8, 1, 20, 25, 45],
                     [1, 0.7, 1, 20, 25, 45],
                     [2, 0.9, 35, 61, 52, 79],
                     [1, 0.5, 100, 60, 70, 110],
                     [-1, -1, -1, -1, -1, -1]]]).astype("float32")
np_indices = np.array([[0, 1, 3, 4, -1]]).astype("int32")
```

cuda test data after get_valid_counts:

```python
np_data = np.array([[[0, 0.8, 1, 20, 25, 45],
                     [1, 0.7, 2, 21, 26, 45],
                     [-1, -1, -1, -1, -1, -1],
                     [2, 0.9, 35, 61, 52, 79],
                     [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
np_indices = np.array([[0, 1, -1, 3, 4]]).astype("int32")
```
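To make the relationship between the two layouts concrete, here is a small NumPy sketch (a reference helper I'm calling `compact`, not a TVM API) that turns a cuda-style indices row, with -1 holes, into the cpu-style top-aligned form:

```python
import numpy as np

# cuda-style get_valid_counts output: valid indices stay at their
# original positions; invalid slots are -1.
cuda_indices = np.array([[0, 1, -1, 3, 4]], dtype="int32")

def compact(indices):
    """Reference helper (not a TVM API): move the non-negative
    indices of each row to the front and pad the tail with -1,
    which is the cpu-style layout."""
    out = np.full_like(indices, -1)
    for b in range(indices.shape[0]):
        valid = indices[b][indices[b] >= 0]
        out[b, :len(valid)] = valid
    return out

cpu_style = compact(cuda_indices)  # [[0, 1, 3, 4, -1]]
```

The compacted row matches the cpu-side indices, so the two versions disagree only on placement, not on which boxes are valid.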
@lsy643 you are right, the auxiliary ops get_valid_count and strided_slice are utilized to help handle TensorFlow's dynamic NonMaximumSuppression. As a todo task, the cpu and gpu versions of the op are expected to behave consistently.
@yongwww Since there is no …

2. For test data with shape …

The inference time for llvm with the large dataset is too large. The test data I use:

```python
data_length = 20000
np_valid_count = np.array([20000]).astype("int32")
v = []
for i in range(20000):
    v.append(i)
np_indices = np.array([v]).astype("int32")
np_data = np.array([[[0, 0.8, 1, 20, 25, 45], [1, 0.7, 30, 60, 50, 80],
                     [0, 0.4, 4, 21, 19, 40], [2, 0.9, 35, 61, 52, 79],
                     [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
np_data = np_data.repeat(20000/5, axis=1)
```

The compute and schedule functions I use …
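As a sanity check on the construction above, a minimal NumPy sketch confirms the resulting tensor shape (note `np.repeat` expects an integer count, so `20000 // 5` is the safer spelling of the repeat factor):

```python
import numpy as np

data_length = 20000
np_data = np.array([[[0, 0.8, 1, 20, 25, 45], [1, 0.7, 30, 60, 50, 80],
                     [0, 0.4, 4, 21, 19, 40], [2, 0.9, 35, 61, 52, 79],
                     [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
# Tile the 5 sample boxes along the anchor axis up to data_length;
# integer division keeps the repeat count an int.
np_data = np_data.repeat(data_length // 5, axis=1)

assert np_data.shape == (1, data_length, 6)
```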
@lsy643 thanks for sharing the results. What I am wondering is the latency of your change vs the previous nms gpu version (even if the output is not identical), and probably the perf numbers of your change vs the TensorFlow baseline. As Leyuan mentioned above, the thread-related change might cause a performance regression; performance matters a lot for us, so we would like to see some perf numbers about this. If a performance regression does exist, then it should be fixed.
@yongwww The … For test data with shape [1, 20000, 6]:

When …

When …
Laurawly left a comment
LGTM. Please resolve conflicts with main branch.
In this PR, the CUDA compute functions of `get_valid_counts` and `nms` are changed to make them work as expected.

1. For `get_valid_counts`, only one thread is used for one image. I am not sure whether this is a good way.
2. For `nms`, there are two changes:
   2.1 make `box_indices` map back to the original data indices
   2.2 create `rearrange_indices_out` for `nms` when `return_indices == True`
3. Tests for `get_valid_counts` and `nms` are enabled now.
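One listed change makes `box_indices` map back to the original data indices. A minimal NumPy sketch of that index composition (all concrete values here are hypothetical, not taken from the PR):

```python
import numpy as np

# Hypothetical mapping produced by get_valid_counts: position in the
# compacted data -> position in the original input.
valid_indices = np.array([0, 1, 3, 4], dtype="int32")

# Suppose nms keeps compacted boxes 2 and 0; these index the
# rearranged data, not the original input.
box_indices = np.array([2, 0], dtype="int32")

# Composing the two maps recovers indices into the original input,
# which is what callers relying on return_indices expect.
original_indices = valid_indices[box_indices]  # [3, 0]
```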