
NMS op causing poor performance in tf2 models #8561

@srinidhigoud

Description

This is a follow-up to PR #8174 by @masahi. Some of the TF2-TVM object detection (OD) models perform worse than the same models run directly in the TF2 framework. The numbers observed are below; the TF-TVM latency is measured on models whose post-processing uses CombinedNonMaxSuppression. A standalone snippet that reproduces these NMS input shapes follows the table.

| Model | TF-TVM latency (ms) | TF latency (ms) | Boxes shape | Scores shape |
| --- | --- | --- | --- | --- |
| ssd_mobilenet_v1_fpn_640x640_1_nms | 377.20 | 203.31 | (1, 51150, 1, 4) | (1, 51150, 90) |
| efficientdet_d0_1_nms | 333.01 | 140.92 | (1, 49104, 1, 4) | (1, 49104, 90) |
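
For reference, a minimal sketch that reproduces the NMS workload standalone with the ssd_mobilenet_v1_fpn shapes above, calling tf.image.combined_non_max_suppression (the op used by the OD post-processing) on random data. The output sizes and thresholds are placeholders, not the values from the pipeline config:

```python
# Hypothetical standalone reproduction of the CombinedNMS workload with the
# shapes from the table above; max sizes and thresholds are assumed values.
import numpy as np
import tensorflow as tf

batch, num_boxes, num_classes = 1, 51150, 90

# Boxes use the shared-box layout (q=1): [batch, num_boxes, 1, 4]
boxes = tf.constant(np.random.rand(batch, num_boxes, 1, 4).astype(np.float32))
scores = tf.constant(np.random.rand(batch, num_boxes, num_classes).astype(np.float32))

nmsed_boxes, nmsed_scores, nmsed_classes, valid = tf.image.combined_non_max_suppression(
    boxes,
    scores,
    max_output_size_per_class=100,  # assumed value
    max_total_size=100,             # assumed value
    iou_threshold=0.5,              # assumed value
    score_threshold=0.3,            # assumed value
)
```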

GPU activities for ssd_mobilenet_v1_fpn_640x640_1_nms (profiler summary; columns appear to be time share in %, total time in s, number of calls, avg/min/max per-call time in ms, and kernel name):

"GPU activities",71.961571,2.649234,10,264.923361,256.370119,274.001845,"fused_vision_all_class_non_max_suppression_kernel2"
"GPU activities",9.830572,0.361908,160,2.261926,0.870473,7.882956,"volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1"
"GPU activities",2.893616,0.106527,70,1.521816,0.068355,7.010179,"trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1"

GPU activities for efficientdet_d0_1_nms (same column layout):

"GPU activities",79.611974,2.520980,10,252.097991,246.456348,261.662513,"fused_vision_all_class_non_max_suppression_kernel2"
"GPU activities",3.295807,0.104364,160,0.652278,0.038498,5.779056,"void cuReduceLayer::nonTailReduceExcludeW<int=32, nvinfer1::ReduceOp, float, float>(float*, cuReduceLayer::nonTailReduceExcludeW<int=32, nvinfer1::ReduceOp, float, float> const *, cuReduceLayer::LaunchParams)"
"GPU activities",1.695395,0.053686,40,1.342152,1.012731,1.530201,"void thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>(thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700 const *, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>*, bool=0 const *, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**, bool=1*, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**, int, int, thrust::cuda_cub::cub::GridEvenShare<thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**>)"

Have you encountered these input shapes before and noticed the same performance issues? Do you think there is anything we can do differently to optimize for this input-shape case? @masahi
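
In case it helps isolate the hotspot from the rest of the model, here is a rough sketch that times the TVM all-class NMS op alone with these shapes. This is not the graph the TF2 frontend actually emits; it assumes the relay.vision.all_class_non_max_suppression API, and the scores layout and op parameters are placeholders on my side:

```python
# Rough sketch: benchmark TVM all-class NMS in isolation with ~51k boxes, 90 classes.
# The scores layout (batch, num_classes, num_boxes) and the parameters below are
# assumptions, not values extracted from the TF2 OD pipeline.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

batch, num_boxes, num_classes = 1, 51150, 90

boxes = relay.var("boxes", shape=(batch, num_boxes, 4), dtype="float32")
scores = relay.var("scores", shape=(batch, num_classes, num_boxes), dtype="float32")

nms = relay.vision.all_class_non_max_suppression(
    boxes,
    scores,
    max_output_boxes_per_class=100,  # placeholder
    iou_threshold=0.5,               # placeholder
    score_threshold=0.3,             # placeholder
)
mod = tvm.IRModule.from_expr(nms.astuple())

dev = tvm.cuda(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")

m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("boxes", np.random.rand(batch, num_boxes, 4).astype("float32"))
m.set_input("scores", np.random.rand(batch, num_classes, num_boxes).astype("float32"))

# time_evaluator reports the latency of just this subgraph
print(m.module.time_evaluator("run", dev, number=10)())
```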
