System information

- **What is the top-level directory of the model you are using**: Standalone.
- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**: No.
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Android 6.0.
- **TensorFlow installed from (source or binary)**: Source.
- **TensorFlow version (use command below)**: master, commit 1f82b7a.
- **Bazel version (if compiling from source)**: 0.4.5.
- **CUDA/cuDNN version**: N/A.
- **GPU model and memory**: CPU: HiSilicon Kirin 935 with 3 GB RAM; GPU: ARM Mali-T624.
Exact command to reproduce:

Following https://stackoverflow.com/a/43627334, first enable the bool kernel registrations in `tensorflow/core/framework/register_types.h` by changing

```
#define TF_CALL_bool(m)
```

to

```
#define TF_CALL_bool(m) m(bool)
```

Build the benchmark tool and push it to the device:

```shell
bazel build -c opt \
  --crosstool_top=//external:android/crosstool \
  --cpu=armeabi-v7a \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  tensorflow/tools/benchmark:benchmark_model
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
```

Build `transform_graph` and transform the frozen inference graph:

```shell
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=tensorflow/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017/frozen_inference_graph.pb \
  --out_graph=tensorflow/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017/transformed_inference_graph.pb \
  --inputs='image_tensor' \
  --outputs='detection_boxes,detection_scores,detection_classes,num_detections' \
  --transforms='
    add_default_attributes
    strip_unused_nodes(type=float)
    remove_nodes(op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    fuse_resize_pad_and_conv
    fuse_pad_and_conv
    fuse_resize_and_conv
    quantize_weights
    quantize_nodes
    strip_unused_nodes
    sort_by_execution_order'
```

Push both graphs to the device and benchmark each one:

```shell
adb push frozen_inference_graph.pb /data/local/tmp
adb push transformed_inference_graph.pb /data/local/tmp
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/frozen_inference_graph.pb \
  --input_layer=image_tensor:0 \
  --input_layer_shape=1,224,224,3 \
  --input_layer_type=uint8 \
  --output_layer=detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0 \
  > frozen_inference_graph.benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/transformed_inference_graph.pb \
  --input_layer=image_tensor:0 \
  --input_layer_shape=1,224,224,3 \
  --input_layer_type=uint8 \
  --output_layer=detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0 \
  > transformed_inference_graph.benchmark
```
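Once both `.benchmark` files are on the host, the per-op percentages can be pulled out of each `Summary by node type` table with a short script. This is only a sketch, assuming the column layout shown in the logs below (node type, count, avg ms, avg %, cdf %, mem KB, times called), which may differ across TensorFlow versions:

```python
import re

# Matches one row of benchmark_model's "Summary by node type" table, e.g.:
#   NonMaxSuppression 90 927.182 48.239% 48.239% 36.000 90
ROW = re.compile(
    r"^\s*(\w+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)\s+(\d+)\s*$"
)

def parse_summary(text):
    """Return {node_type: avg_percent} for every summary row found in a log."""
    result = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            result[m.group(1)] = float(m.group(4))  # the "avg %" column
    return result

# Two rows copied from the transformed-graph log below, as a smoke test.
log = """
NonMaxSuppression 90 637.594 40.856% 40.856% 36.000 90
QuantizedConv2D 33 215.470 13.807% 54.664% 20647.061 33
"""
print(parse_summary(log))
```

Running it over both logs makes the NonMaxSuppression share easy to diff between the two graphs.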
Describe the problem

When benchmarking the inference graphs, NonMaxSuppression dominates: it takes nearly twice as long as Conv2D in the frozen graph (48.2% vs. 25.4% of average time) and roughly three times as long as QuantizedConv2D in the transformed graph (40.9% vs. 13.8%).
| Inference Graph | Node Type | Average Time % |
| --- | --- | --- |
| frozen_inference_graph.pb | NonMaxSuppression | 48.239 |
| frozen_inference_graph.pb | Conv2D | 25.395 |
| transformed_inference_graph.pb | NonMaxSuppression | 40.856 |
| transformed_inference_graph.pb | QuantizedConv2D | 13.807 |
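For context on why NonMaxSuppression dominates: the summaries below show it being called 90 times per run (presumably once per COCO class in the SSD postprocessing loop), and greedy NMS does pairwise IoU tests, so each call is quadratic in the number of surviving boxes in the worst case. The plain-Python sketch below illustrates the algorithm only; it is not the actual TensorFlow kernel:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5, max_output=100):
    """Greedy non-max suppression: O(n^2) pairwise IoU tests in the worst case."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) >= max_output:
            break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

Note also that none of the `quantize_*` transforms touch NonMaxSuppression, which would explain why its share barely drops between the two graphs.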
Source code / logs
Benchmark output for frozen_inference_graph.pb:
native : benchmark_model.cc:382 Graph: [/data/local/tmp/frozen_inference_graph.pb]
native : benchmark_model.cc:383 Input layers: [image_tensor:0]
native : benchmark_model.cc:384 Input shapes: [1,224,224,3]
native : benchmark_model.cc:385 Input types: [uint8]
native : benchmark_model.cc:386 Output layers: [detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0]
native : benchmark_model.cc:387 Num runs: [50]
native : benchmark_model.cc:388 Inter-run delay (seconds): [-1.0]
native : benchmark_model.cc:389 Num threads: [-1]
native : benchmark_model.cc:390 Benchmark name: []
native : benchmark_model.cc:391 Output prefix: []
native : benchmark_model.cc:392 Show sizes: [0]
native : benchmark_model.cc:393 Warmup runs: [2]
native : benchmark_model.cc:53 Loading TensorFlow.
native : benchmark_model.cc:60 Got config, 0 devices
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : benchmark_model.cc:258 Running benchmark for 2 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=2 first=3273186 curr=1668712 min=1668712 max=3273186 avg=2.47095e+06 std=802237
native : benchmark_model.cc:258 Running benchmark for 50 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=50 first=1687558 curr=1682345 min=1615775 max=1802978 avg=1.69049e+06 std=41851
native : benchmark_model.cc:258 Running benchmark for 50 iterations with detailed stat logging:
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
Conv2D 623.929 57.614 60.054 3.120% 3.120% 409.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_13_pointwise/convolution
Conv2D 580.451 31.382 34.441 1.789% 4.909% 409.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_12_pointwise/convolution
Conv2D 49.120 33.424 33.211 1.725% 6.634% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/convolution
Conv2D 502.062 24.714 28.450 1.478% 8.112% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_10_pointwise/convolution
Conv2D 463.825 23.862 27.747 1.441% 9.554% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_9_pointwise/convolution
Conv2D 540.994 23.782 27.632 1.435% 10.989% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_11_pointwise/convolution
Conv2D 388.112 22.603 27.427 1.425% 12.414% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_7_pointwise/convolution
Conv2D 426.231 25.427 27.163 1.411% 13.825% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_8_pointwise/convolution
Conv2D 247.387 23.893 26.293 1.366% 15.191% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/convolution
DepthwiseConv2dNative 95.498 26.500 24.778 1.287% 16.478% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
============================== Top by Memory Use ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
Conv2D 132.991 20.703 20.848 1.083% 1.083% 5760.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/convolution
DepthwiseConv2dNative 219.853 16.056 16.815 0.874% 1.957% 3225.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise
DepthwiseConv2dNative 95.498 26.500 24.778 1.287% 3.244% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
Conv2D 247.387 23.893 26.293 1.366% 4.610% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/convolution
Conv2D 193.214 14.730 16.011 0.832% 5.442% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_pointwise/convolution
Conv2D 49.120 33.424 33.211 1.725% 7.167% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/convolution
DepthwiseConv2dNative 314.590 7.253 7.887 0.410% 7.577% 1653.760 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_5_depthwise/depthwise
DepthwiseConv2dNative 174.672 11.146 12.409 0.645% 8.221% 1483.776 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_depthwise/depthwise
Conv2D 327.950 21.425 23.446 1.218% 9.439% 1478.656 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_5_pointwise/convolution
Conv2D 293.922 14.347 15.109 0.785% 10.224% 1478.656 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_4_pointwise/convolution
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
NonMaxSuppression 90 927.182 48.239% 48.239% 36.000 90
Conv2D 34 488.102 25.395% 73.633% 23526.797 34
DepthwiseConv2dNative 13 104.880 5.457% 79.090% 16711.424 13
Mul 130 96.828 5.038% 84.127% 0.000 130
Slice 91 50.248 2.614% 86.742% 1380.240 91
Split 180 38.533 2.005% 88.746% 5464.640 180
Add 131 35.987 1.872% 90.619% 0.004 131
ConcatV2 107 33.857 1.761% 92.380% 3934.164 107
Gather 546 27.859 1.449% 93.830% 7229.200 546
Const 1979 23.447 1.220% 95.050% 0.000 1979
Relu6 35 21.407 1.114% 96.163% 0.000 35
Minimum 451 9.308 0.484% 96.648% 0.000 451
Where 180 8.533 0.444% 97.092% 2733.760 180
Maximum 360 7.390 0.384% 97.476% 0.000 360
Greater 183 5.793 0.301% 97.777% 343.303 183
Sub 192 5.494 0.286% 98.063% 0.020 192
Cast 182 5.345 0.278% 98.341% 1968.276 182
ResizeBilinear 1 4.663 0.243% 98.584% 1080.000 1
Reshape 282 4.459 0.232% 98.816% 0.000 282
StridedSlice 102 3.532 0.184% 99.000% 0.392 102
TensorArrayGatherV3 1 2.704 0.141% 99.140% 1080.000 1
BiasAdd 12 2.260 0.118% 99.258% 0.000 12
Squeeze 97 2.234 0.116% 99.374% 0.000 97
Sigmoid 1 1.754 0.091% 99.465% 0.000 1
ZerosLike 90 1.734 0.090% 99.556% 36.000 90
Shape 99 1.513 0.079% 99.634% 0.784 99
Unpack 5 1.495 0.078% 99.712% 751.464 5
TopKV2 1 1.259 0.066% 99.778% 72.000 1
NoOp 1 0.689 0.036% 99.813% 0.000 1
TensorArrayScatterV3 1 0.631 0.033% 99.846% 602.112 1
Transpose 2 0.438 0.023% 99.869% 61.344 2
RealDiv 8 0.296 0.015% 99.884% 15.336 8
Switch 20 0.288 0.015% 99.899% 0.000 22
Merge 8 0.217 0.011% 99.911% 0.032 10
Assert 5 0.210 0.011% 99.922% 0.000 5
Identity 15 0.189 0.010% 99.932% 0.000 15
Enter 6 0.179 0.009% 99.941% 0.000 6
Pack 6 0.157 0.008% 99.949% 30.672 6
Exp 2 0.132 0.007% 99.956% 0.000 2
ExpandDims 7 0.131 0.007% 99.963% 0.000 7
Range 5 0.113 0.006% 99.969% 0.424 5
TensorArrayV3 2 0.112 0.006% 99.974% 0.104 2
TensorArrayWriteV3 1 0.056 0.003% 99.977% 0.000 1
Less 1 0.056 0.003% 99.980% 0.001 2
_Arg 1 0.053 0.003% 99.983% 0.000 1
NextIteration 2 0.049 0.003% 99.986% 0.000 2
Fill 3 0.049 0.003% 99.988% 0.000 3
TensorArrayReadV3 1 0.048 0.002% 99.991% 0.000 1
Rank 2 0.041 0.002% 99.993% 0.008 2
_Retval 4 0.040 0.002% 99.995% 0.000 4
LoopCond 1 0.024 0.001% 99.996% 0.000 2
TensorArraySizeV3 1 0.022 0.001% 99.997% 0.004 1
Equal 1 0.022 0.001% 99.998% 0.001 1
Size 1 0.016 0.001% 99.999% 0.004 1
Exit 1 0.016 0.001% 100.000% 0.000 1
Timings (microseconds): count=50 first=1745329 curr=1670092 min=1670092 max=2170221 avg=1.92491e+06 std=189978
Memory (bytes): count=50 curr=67058518(all same)
5683 nodes observed
Benchmark output for transformed_inference_graph.pb:
native : benchmark_model.cc:382 Graph: [/data/local/tmp/transformed_inference_graph.pb]
native : benchmark_model.cc:383 Input layers: [image_tensor:0]
native : benchmark_model.cc:384 Input shapes: [1,224,224,3]
native : benchmark_model.cc:385 Input types: [uint8]
native : benchmark_model.cc:386 Output layers: [detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0]
native : benchmark_model.cc:387 Num runs: [50]
native : benchmark_model.cc:388 Inter-run delay (seconds): [-1.0]
native : benchmark_model.cc:389 Num threads: [-1]
native : benchmark_model.cc:390 Benchmark name: []
native : benchmark_model.cc:391 Output prefix: []
native : benchmark_model.cc:392 Show sizes: [0]
native : benchmark_model.cc:393 Warmup runs: [2]
native : benchmark_model.cc:53 Loading TensorFlow.
native : benchmark_model.cc:60 Got config, 0 devices
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : benchmark_model.cc:258 Running benchmark for 2 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=2 first=2688688 curr=1373990 min=1373990 max=2688688 avg=2.03134e+06 std=657349
native : benchmark_model.cc:258 Running benchmark for 50 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=50 first=1405345 curr=1400253 min=1246255 max=1466356 avg=1.37303e+06 std=33711
native : benchmark_model.cc:258 Running benchmark for 50 iterations with detailed stat logging:
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
Conv2D 33.710 36.773 30.203 1.931% 1.931% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/mul_1
DepthwiseConv2dNative 85.930 26.702 23.704 1.516% 3.447% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
QuantizedConv2D 140.843 21.529 20.738 1.326% 4.773% 5760.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 656.673 19.257 19.940 1.275% 6.048% 409.608 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_13_pointwise/BatchNorm/batchnorm/mul_1/eightbit
DepthwiseConv2dNative 274.615 20.146 17.493 1.119% 7.167% 3225.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise
QuantizedConv2D 323.094 14.819 16.560 1.059% 8.226% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 537.240 13.757 14.516 0.928% 9.154% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_9_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 473.869 13.103 14.275 0.913% 10.067% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_7_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 569.591 13.382 14.089 0.901% 10.968% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_10_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 505.947 13.133 13.899 0.889% 11.856% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_8_pointwise/BatchNorm/batchnorm/mul_1/eightbit
============================== Top by Memory Use ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
QuantizedAdd 175.293 12.215 11.851 0.758% 0.758% 5760.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/add_1/eightbit
QuantizedConv2D 140.843 21.529 20.738 1.326% 2.084% 5760.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/mul_1/eightbit
Dequantize 203.684 5.234 5.386 0.344% 2.428% 5760.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6
DepthwiseConv2dNative 274.615 20.146 17.493 1.119% 3.547% 3225.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise
DepthwiseConv2dNative 85.930 26.702 23.704 1.516% 5.063% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
QuantizedAdd 345.853 5.552 4.671 0.299% 5.361% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/batchnorm/add_1/eightbit
QuantizedConv2D 323.094 14.819 16.560 1.059% 6.420% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedAdd 309.606 5.842 5.277 0.337% 6.758% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/BatchNorm/batchnorm/add_1/eightbit
QuantizedMul 298.146 4.004 4.408 0.282% 7.039% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedAdd 257.952 5.945 5.208 0.333% 7.372% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_pointwise/BatchNorm/batchnorm/add_1/eightbit
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
NonMaxSuppression 90 637.594 40.856% 40.856% 36.000 90
QuantizedConv2D 33 215.470 13.807% 54.664% 20647.061 33
DepthwiseConv2dNative 13 106.792 6.843% 61.507% 16711.424 13
RequantizationRange 283 93.028 5.961% 67.468% 2.264 283
Requantize 283 79.127 5.070% 72.538% 18539.414 283
QuantizedAdd 130 71.310 4.569% 77.108% 36965.137 130
Slice 91 36.322 2.327% 79.435% 1380.240 91
Split 180 31.280 2.004% 81.440% 4799.104 180
Conv2D 1 30.202 1.935% 83.375% 2880.000 1
ConcatV2 107 27.631 1.771% 85.146% 3601.396 107
QuantizeV2 386 27.177 1.741% 86.887% 5043.348 386
Dequantize 307 26.689 1.710% 88.597% 25504.588 307
Gather 546 25.502 1.634% 90.231% 6120.800 546
QuantizedMul 108 23.188 1.486% 91.717% 15810.112 108
QuantizedRelu6 35 17.087 1.095% 92.812% 9224.536 35
Min 386 15.680 1.005% 93.817% 1.544 386
Max 386 15.260 0.978% 94.795% 1.544 386
Const 629 10.047 0.644% 95.439% 0.000 629
Where 180 8.200 0.525% 95.964% 2290.400 180
Minimum 451 8.197 0.525% 96.489% 0.000 451
Reshape 566 7.537 0.483% 96.972% 0.000 566
Maximum 360 6.321 0.405% 97.377% 0.000 360
Cast 182 5.764 0.369% 97.747% 1746.596 182
Greater 183 5.025 0.322% 98.069% 322.505 183
Sub 192 4.745 0.304% 98.373% 0.016 192
ResizeBilinear 1 4.461 0.286% 98.659% 1080.000 1
StridedSlice 100 3.158 0.202% 98.861% 0.384 100
QuantizedReshape 102 2.275 0.146% 99.007% 0.816 102
TensorArrayGatherV3 1 2.033 0.130% 99.137% 1080.000 1
Squeeze 97 1.766 0.113% 99.250% 0.000 97
ZerosLike 90 1.544 0.099% 99.349% 36.000 90
Sigmoid 1 1.410 0.090% 99.439% 0.000 1
QuantizedBiasAdd 12 1.406 0.090% 99.530% 728.556 12
Shape 99 1.394 0.089% 99.619% 0.784 99
Unpack 5 1.381 0.088% 99.707% 751.464 5
TensorArrayScatterV3 1 1.003 0.064% 99.772% 602.112 1
TopKV2 1 0.968 0.062% 99.834% 72.000 1
Transpose 2 0.344 0.022% 99.856% 61.344 2
Switch 20 0.272 0.017% 99.873% 0.000 22
Merge 8 0.203 0.013% 99.886% 0.032 10
Enter 6 0.191 0.012% 99.898% 0.000 6
NoOp 1 0.184 0.012% 99.910% 0.000 1
Identity 15 0.175 0.011% 99.921% 0.000 15
RealDiv 6 0.143 0.009% 99.931% 0.000 6
Pack 6 0.133 0.009% 99.939% 30.672 6
TensorArrayV3 2 0.114 0.007% 99.946% 0.104 2
ExpandDims 7 0.111 0.007% 99.953% 0.000 7
Range 5 0.097 0.006% 99.960% 0.424 5
Exp 2 0.090 0.006% 99.965% 0.000 2
Assert 4 0.075 0.005% 99.970% 0.000 4
TensorArrayWriteV3 1 0.054 0.003% 99.974% 0.000 1
Less 1 0.051 0.003% 99.977% 0.001 2
NextIteration 2 0.045 0.003% 99.980% 0.000 2
TensorArrayReadV3 1 0.044 0.003% 99.983% 0.000 1
Fill 3 0.044 0.003% 99.986% 0.000 3
_Arg 1 0.042 0.003% 99.988% 0.000 1
_Retval 4 0.037 0.002% 99.991% 0.000 4
Rank 2 0.028 0.002% 99.992% 0.008 2
Equal 1 0.026 0.002% 99.994% 0.001 1
LoopCond 1 0.023 0.001% 99.996% 0.000 2
Add 1 0.021 0.001% 99.997% 0.004 1
TensorArraySizeV3 1 0.020 0.001% 99.998% 0.004 1
Exit 1 0.015 0.001% 99.999% 0.000 1
Size 1 0.014 0.001% 100.000% 0.004 1
Timings (microseconds): count=50 first=1463739 curr=1459978 min=1392397 max=1906618 avg=1.56387e+06 std=182975
Memory (bytes): count=50 curr=176072750(all same)
6723 nodes observed
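For completeness, the end-to-end gain from the transforms can be read off the `avg` fields of the two 50-iteration timing lines above; a quick sketch (both stats lines are copied verbatim from the logs):

```python
import re

def avg_micros(line):
    """Extract the avg= field (microseconds) from a benchmark_model stats line."""
    m = re.search(r"avg=([\d.e+]+)", line)
    return float(m.group(1))

frozen = avg_micros(
    "count=50 first=1687558 curr=1682345 min=1615775 max=1802978 avg=1.69049e+06 std=41851"
)
transformed = avg_micros(
    "count=50 first=1405345 curr=1400253 min=1246255 max=1466356 avg=1.37303e+06 std=33711"
)
print(f"speedup: {(1 - transformed / frozen) * 100:.1f}%")  # prints "speedup: 18.8%"
```

So the transforms cut end-to-end latency by roughly 19%, but the untouched NonMaxSuppression nodes cap how much more they can help.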