This is a follow-up to PR #8174 by @masahi. Some of the TF2-to-TVM object detection (OD) models perform worse through TVM than in the TF2 framework alone. The numbers observed are below; the TF-TVM latency is measured on models whose post-processing uses combined NMS (the combined-NMS call at these shapes is sketched just after the table).
| Model | TF-TVM latency (ms) | TF latency (ms) | Boxes shape | Scores shape |
|---|---|---|---|---|
| ssd_mobilenet_v1_fpn_640x640_1_nms | 377.20 | 203.31 | (1, 51150, 1, 4) | (1, 51150, 90) |
| efficientdet_d0_1_nms | 333.01 | 140.92 | (1, 49104, 1, 4) | (1, 49104, 90) |
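For reference, this is roughly what the combined-NMS post-processing step looks like at the `ssd_mobilenet_v1_fpn_640x640` shapes listed above. It is only an illustrative sketch with random inputs; the thresholds and output sizes are placeholder values, not the models' real settings.

```python
import numpy as np
import tensorflow as tf

# Random inputs at the shapes from the table above:
# boxes are (batch, num_boxes, q=1, 4), scores are (batch, num_boxes, num_classes).
boxes = tf.constant(np.random.rand(1, 51150, 1, 4), dtype=tf.float32)
scores = tf.constant(np.random.rand(1, 51150, 90), dtype=tf.float32)

# Placeholder thresholds / output sizes (not the models' actual configuration).
nmsed_boxes, nmsed_scores, nmsed_classes, valid_detections = (
    tf.image.combined_non_max_suppression(
        boxes,
        scores,
        max_output_size_per_class=100,
        max_total_size=100,
        iou_threshold=0.6,
        score_threshold=0.3,
    )
)
print(nmsed_boxes.shape, nmsed_scores.shape, valid_detections.shape)
```

In the TVM runs this post-processing shows up as the `fused_vision_all_class_non_max_suppression` kernel in the profiles below.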
**GPU activities for `ssd_mobilenet_v1_fpn_640x640_1_nms`** (top kernels by GPU time):

| Time (%) | Total time (s) | Calls | Avg (ms) | Min (ms) | Max (ms) | Kernel |
|---|---|---|---|---|---|---|
| 71.961571 | 2.649234 | 10 | 264.923361 | 256.370119 | 274.001845 | `fused_vision_all_class_non_max_suppression_kernel2` |
| 9.830572 | 0.361908 | 160 | 2.261926 | 0.870473 | 7.882956 | `volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1` |
| 2.893616 | 0.106527 | 70 | 1.521816 | 0.068355 | 7.010179 | `trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1` |
**GPU activities for `efficientdet_d0_1_nms`** (top kernels by GPU time):

| Time (%) | Total time (s) | Calls | Avg (ms) | Min (ms) | Max (ms) | Kernel |
|---|---|---|---|---|---|---|
| 79.611974 | 2.520980 | 10 | 252.097991 | 246.456348 | 261.662513 | `fused_vision_all_class_non_max_suppression_kernel2` |
| 3.295807 | 0.104364 | 160 | 0.652278 | 0.038498 | 5.779056 | `void cuReduceLayer::nonTailReduceExcludeW<int=32, nvinfer1::ReduceOp, float, float>(float*, cuReduceLayer::nonTailReduceExcludeW<int=32, nvinfer1::ReduceOp, float, float> const *, cuReduceLayer::LaunchParams)` |
| 1.695395 | 0.053686 | 40 | 1.342152 | 1.012731 | 1.530201 | `void thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>(thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700 const *, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>*, bool=0 const *, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**, bool=1*, thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**, int, int, thrust::cuda_cub::cub::GridEvenShare<thrust::cuda_cub::cub::DeviceRadixSortDownsweepKernel<thrust::cuda_cub::cub::DeviceRadixSortPolicy<float, long, int>::Policy700, bool=0, bool=1, float, long, int>**>)` |
Have you encountered these input shapes before and noticed the same performance issue? Do you think there is anything different we can do to optimize for this input-shape case? @masahi