
NMS op causing poor performance in tf2 models #8561

@srinidhigoud

Description

This is a follow-up to PR #8174 by @masahi. Some of the TF2-TVM object detection (OD) models perform worse than the same models run directly in the TF2 framework. The numbers observed are below; the TF-TVM latency is measured on models whose post-processing uses CombinedNonMaxSuppression. A standalone snippet that reproduces these NMS input shapes follows the table.

| Model | TF-TVM latency (ms) | TF latency (ms) | Boxes shape | Scores shape |
| --- | --- | --- | --- | --- |
| ssd_mobilenet_v1_fpn_640x640_1_nms | 377.20 | 203.31 | (1, 51150, 1, 4) | (1, 51150, 90) |
| efficientdet_d0_1_nms | 333.01 | 140.92 | (1, 49104, 1, 4) | (1, 49104, 90) |
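
For reference, a minimal sketch that reproduces the NMS workload standalone with the ssd_mobilenet_v1_fpn shapes above, calling tf.image.combined_non_max_suppression (the op used by the OD post-processing) on random data. The output sizes and thresholds are placeholders, not the values from the pipeline config:

```python
# Hypothetical standalone reproduction of the CombinedNMS workload with the
# shapes from the table above; max sizes and thresholds are assumed values.
import numpy as np
import tensorflow as tf

batch, num_boxes, num_classes = 1, 51150, 90

# Boxes use the shared-box layout (q=1): [batch, num_boxes, 1, 4]
boxes = tf.constant(np.random.rand(batch, num_boxes, 1, 4).astype(np.float32))
scores = tf.constant(np.random.rand(batch, num_boxes, num_classes).astype(np.float32))

nmsed_boxes, nmsed_scores, nmsed_classes, valid = tf.image.combined_non_max_suppression(
    boxes,
    scores,
    max_output_size_per_class=100,  # assumed value
    max_total_size=100,             # assumed value
    iou_threshold=0.5,              # assumed value
    score_threshold=0.3,            # assumed value
)
```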

GPU activities for ssd_mobilenet_v1_fpn_640x640_1_nms (profiler summary; columns appear to be time share in %, total time in s, number of calls, avg/min/max per-call time in ms, and kernel name):

"GPU activities",71.961571,2.649234,10,264.923361,256.370119,274.001845,"fused_vision_all_class_non_max_suppression_kernel2"
"GPU activities",9.830572,0.361908,160,2.261926,0.870473,7.882956,"volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1"
"GPU activities",2.893616,0.106527,70,1.521816,0.068355,7.010179,"trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1"

GPU activities for efficientdet_d0_1_nms (same column layout):

"GPU activities",79.611974,2.520980,10,252.097991,246.456348,261.662513,"fused_vision_all_class_non_max_suppression_kernel2"
"GPU activities",3.295807,0.104364,160,0.652278,0.038498,5.779056,"void cuReduceLayer::nonTailReduceExcludeW<int=32, nvinfer1::ReduceOp, float, float>(float*, cuReduceLayer::nonTailReduceExcludeW<int=32, nvinfer1::ReduceOp, float, float> const *, cuReduceLayer::LaunchParams)"
"GPU activities",1.695395,0.053686,40,1.342152,1.012731,1.530201,"void thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>(thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700 const *, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>*, bool=0 const *, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**, bool=1*, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**, int, int, thrust::cuda_cub::cub::GridEvenShare<thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**>)"

Have you encountered these input shapes before and noticed the same performance issues? Do you think there is anything we can do differently to optimize for this input-shape case? @masahi
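
In case it helps isolate the hotspot from the rest of the model, here is a rough sketch that times the TVM all-class NMS op alone with these shapes. This is not the graph the TF2 frontend actually emits; it assumes the relay.vision.all_class_non_max_suppression API, and the scores layout and op parameters are placeholders on my side:

```python
# Rough sketch: benchmark TVM all-class NMS in isolation with ~51k boxes, 90 classes.
# The scores layout (batch, num_classes, num_boxes) and the parameters below are
# assumptions, not values extracted from the TF2 OD pipeline.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

batch, num_boxes, num_classes = 1, 51150, 90

boxes = relay.var("boxes", shape=(batch, num_boxes, 4), dtype="float32")
scores = relay.var("scores", shape=(batch, num_classes, num_boxes), dtype="float32")

nms = relay.vision.all_class_non_max_suppression(
    boxes,
    scores,
    max_output_boxes_per_class=100,  # placeholder
    iou_threshold=0.5,               # placeholder
    score_threshold=0.3,             # placeholder
)
mod = tvm.IRModule.from_expr(nms.astuple())

dev = tvm.cuda(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")

m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("boxes", np.random.rand(batch, num_boxes, 4).astype("float32"))
m.set_input("scores", np.random.rand(batch, num_classes, num_boxes).astype("float32"))

# time_evaluator reports the latency of just this subgraph
print(m.module.time_evaluator("run", dev, number=10)())
```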
