System information

- **What is the top-level directory of the model you are using**: Standalone.
- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**: No.
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Android 6.0.
- **TensorFlow installed from (source or binary)**: Source.
- **TensorFlow version (use command below)**: master, commit 1f82b7a.
- **Bazel version (if compiling from source)**: 0.4.5.
- **CUDA/cuDNN version**: N/A.
- **GPU model and memory**: CPU: HiSilicon Kirin 935 with 3 GB RAM; GPU: ARM Mali-T624.
Exact command to reproduce:

Following https://stackoverflow.com/a/43627334, first enable the bool kernel registrations in `tensorflow/core/framework/register_types.h` by changing

```
#define TF_CALL_bool(m)
```

to

```
#define TF_CALL_bool(m) m(bool)
```

Build the benchmark tool and push it to the device:

```shell
bazel build -c opt \
  --crosstool_top=//external:android/crosstool \
  --cpu=armeabi-v7a \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  tensorflow/tools/benchmark:benchmark_model
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
```

Build `transform_graph` and transform the frozen inference graph:

```shell
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=tensorflow/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017/frozen_inference_graph.pb \
  --out_graph=tensorflow/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017/transformed_inference_graph.pb \
  --inputs='image_tensor' \
  --outputs='detection_boxes,detection_scores,detection_classes,num_detections' \
  --transforms='
    add_default_attributes
    strip_unused_nodes(type=float)
    remove_nodes(op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    fuse_resize_pad_and_conv
    fuse_pad_and_conv
    fuse_resize_and_conv
    quantize_weights
    quantize_nodes
    strip_unused_nodes
    sort_by_execution_order'
```

Push both graphs to the device and benchmark each one:

```shell
adb push frozen_inference_graph.pb /data/local/tmp
adb push transformed_inference_graph.pb /data/local/tmp
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/frozen_inference_graph.pb \
  --input_layer=image_tensor:0 \
  --input_layer_shape=1,224,224,3 \
  --input_layer_type=uint8 \
  --output_layer=detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0 \
  > frozen_inference_graph.benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/transformed_inference_graph.pb \
  --input_layer=image_tensor:0 \
  --input_layer_shape=1,224,224,3 \
  --input_layer_type=uint8 \
  --output_layer=detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0 \
  > transformed_inference_graph.benchmark
```
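Once both `.benchmark` files are on the host, the per-op percentages can be pulled out of each `Summary by node type` table with a short script. This is only a sketch, assuming the column layout shown in the logs below (node type, count, avg ms, avg %, cdf %, mem KB, times called), which may differ across TensorFlow versions:

```python
import re

# Matches one row of benchmark_model's "Summary by node type" table, e.g.:
#   NonMaxSuppression 90 927.182 48.239% 48.239% 36.000 90
ROW = re.compile(
    r"^\s*(\w+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)\s+(\d+)\s*$"
)

def parse_summary(text):
    """Return {node_type: avg_percent} for every summary row found in a log."""
    result = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            result[m.group(1)] = float(m.group(4))  # the "avg %" column
    return result

# Two rows copied from the transformed-graph log below, as a smoke test.
log = """
NonMaxSuppression 90 637.594 40.856% 40.856% 36.000 90
QuantizedConv2D 33 215.470 13.807% 54.664% 20647.061 33
"""
print(parse_summary(log))
```

Running it over both logs makes the NonMaxSuppression share easy to diff between the two graphs.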
Describe the problem

When benchmarking the inference graphs, NonMaxSuppression dominates: it takes nearly twice as long as Conv2D in the frozen graph (48.2% vs. 25.4% of average time) and roughly three times as long as QuantizedConv2D in the transformed graph (40.9% vs. 13.8%).
| Inference Graph | Node Type | Average Time % |
| --- | --- | --- |
| frozen_inference_graph.pb | NonMaxSuppression | 48.239 |
| frozen_inference_graph.pb | Conv2D | 25.395 |
| transformed_inference_graph.pb | NonMaxSuppression | 40.856 |
| transformed_inference_graph.pb | QuantizedConv2D | 13.807 |
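For context on why NonMaxSuppression dominates: the summaries below show it being called 90 times per run (presumably once per COCO class in the SSD postprocessing loop), and greedy NMS does pairwise IoU tests, so each call is quadratic in the number of surviving boxes in the worst case. The plain-Python sketch below illustrates the algorithm only; it is not the actual TensorFlow kernel:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5, max_output=100):
    """Greedy non-max suppression: O(n^2) pairwise IoU tests in the worst case."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) >= max_output:
            break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

Note also that none of the `quantize_*` transforms touch NonMaxSuppression, which would explain why its share barely drops between the two graphs.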
Source code / logs
Benchmark output for frozen_inference_graph.pb:
native : benchmark_model.cc:382 Graph: [/data/local/tmp/frozen_inference_graph.pb]
native : benchmark_model.cc:383 Input layers: [image_tensor:0]
native : benchmark_model.cc:384 Input shapes: [1,224,224,3]
native : benchmark_model.cc:385 Input types: [uint8]
native : benchmark_model.cc:386 Output layers: [detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0]
native : benchmark_model.cc:387 Num runs: [50]
native : benchmark_model.cc:388 Inter-run delay (seconds): [-1.0]
native : benchmark_model.cc:389 Num threads: [-1]
native : benchmark_model.cc:390 Benchmark name: []
native : benchmark_model.cc:391 Output prefix: []
native : benchmark_model.cc:392 Show sizes: [0]
native : benchmark_model.cc:393 Warmup runs: [2]
native : benchmark_model.cc:53 Loading TensorFlow.
native : benchmark_model.cc:60 Got config, 0 devices
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : benchmark_model.cc:258 Running benchmark for 2 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=2 first=3273186 curr=1668712 min=1668712 max=3273186 avg=2.47095e+06 std=802237
native : benchmark_model.cc:258 Running benchmark for 50 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=50 first=1687558 curr=1682345 min=1615775 max=1802978 avg=1.69049e+06 std=41851
native : benchmark_model.cc:258 Running benchmark for 50 iterations with detailed stat logging:
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
Conv2D 623.929 57.614 60.054 3.120% 3.120% 409.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_13_pointwise/convolution
Conv2D 580.451 31.382 34.441 1.789% 4.909% 409.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_12_pointwise/convolution
Conv2D 49.120 33.424 33.211 1.725% 6.634% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/convolution
Conv2D 502.062 24.714 28.450 1.478% 8.112% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_10_pointwise/convolution
Conv2D 463.825 23.862 27.747 1.441% 9.554% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_9_pointwise/convolution
Conv2D 540.994 23.782 27.632 1.435% 10.989% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_11_pointwise/convolution
Conv2D 388.112 22.603 27.427 1.425% 12.414% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_7_pointwise/convolution
Conv2D 426.231 25.427 27.163 1.411% 13.825% 739.328 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_8_pointwise/convolution
Conv2D 247.387 23.893 26.293 1.366% 15.191% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/convolution
DepthwiseConv2dNative 95.498 26.500 24.778 1.287% 16.478% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
============================== Top by Memory Use ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
Conv2D 132.991 20.703 20.848 1.083% 1.083% 5760.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/convolution
DepthwiseConv2dNative 219.853 16.056 16.815 0.874% 1.957% 3225.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise
DepthwiseConv2dNative 95.498 26.500 24.778 1.287% 3.244% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
Conv2D 247.387 23.893 26.293 1.366% 4.610% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/convolution
Conv2D 193.214 14.730 16.011 0.832% 5.442% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_pointwise/convolution
Conv2D 49.120 33.424 33.211 1.725% 7.167% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/convolution
DepthwiseConv2dNative 314.590 7.253 7.887 0.410% 7.577% 1653.760 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_5_depthwise/depthwise
DepthwiseConv2dNative 174.672 11.146 12.409 0.645% 8.221% 1483.776 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_depthwise/depthwise
Conv2D 327.950 21.425 23.446 1.218% 9.439% 1478.656 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_5_pointwise/convolution
Conv2D 293.922 14.347 15.109 0.785% 10.224% 1478.656 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_4_pointwise/convolution
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
NonMaxSuppression 90 927.182 48.239% 48.239% 36.000 90
Conv2D 34 488.102 25.395% 73.633% 23526.797 34
DepthwiseConv2dNative 13 104.880 5.457% 79.090% 16711.424 13
Mul 130 96.828 5.038% 84.127% 0.000 130
Slice 91 50.248 2.614% 86.742% 1380.240 91
Split 180 38.533 2.005% 88.746% 5464.640 180
Add 131 35.987 1.872% 90.619% 0.004 131
ConcatV2 107 33.857 1.761% 92.380% 3934.164 107
Gather 546 27.859 1.449% 93.830% 7229.200 546
Const 1979 23.447 1.220% 95.050% 0.000 1979
Relu6 35 21.407 1.114% 96.163% 0.000 35
Minimum 451 9.308 0.484% 96.648% 0.000 451
Where 180 8.533 0.444% 97.092% 2733.760 180
Maximum 360 7.390 0.384% 97.476% 0.000 360
Greater 183 5.793 0.301% 97.777% 343.303 183
Sub 192 5.494 0.286% 98.063% 0.020 192
Cast 182 5.345 0.278% 98.341% 1968.276 182
ResizeBilinear 1 4.663 0.243% 98.584% 1080.000 1
Reshape 282 4.459 0.232% 98.816% 0.000 282
StridedSlice 102 3.532 0.184% 99.000% 0.392 102
TensorArrayGatherV3 1 2.704 0.141% 99.140% 1080.000 1
BiasAdd 12 2.260 0.118% 99.258% 0.000 12
Squeeze 97 2.234 0.116% 99.374% 0.000 97
Sigmoid 1 1.754 0.091% 99.465% 0.000 1
ZerosLike 90 1.734 0.090% 99.556% 36.000 90
Shape 99 1.513 0.079% 99.634% 0.784 99
Unpack 5 1.495 0.078% 99.712% 751.464 5
TopKV2 1 1.259 0.066% 99.778% 72.000 1
NoOp 1 0.689 0.036% 99.813% 0.000 1
TensorArrayScatterV3 1 0.631 0.033% 99.846% 602.112 1
Transpose 2 0.438 0.023% 99.869% 61.344 2
RealDiv 8 0.296 0.015% 99.884% 15.336 8
Switch 20 0.288 0.015% 99.899% 0.000 22
Merge 8 0.217 0.011% 99.911% 0.032 10
Assert 5 0.210 0.011% 99.922% 0.000 5
Identity 15 0.189 0.010% 99.932% 0.000 15
Enter 6 0.179 0.009% 99.941% 0.000 6
Pack 6 0.157 0.008% 99.949% 30.672 6
Exp 2 0.132 0.007% 99.956% 0.000 2
ExpandDims 7 0.131 0.007% 99.963% 0.000 7
Range 5 0.113 0.006% 99.969% 0.424 5
TensorArrayV3 2 0.112 0.006% 99.974% 0.104 2
TensorArrayWriteV3 1 0.056 0.003% 99.977% 0.000 1
Less 1 0.056 0.003% 99.980% 0.001 2
_Arg 1 0.053 0.003% 99.983% 0.000 1
NextIteration 2 0.049 0.003% 99.986% 0.000 2
Fill 3 0.049 0.003% 99.988% 0.000 3
TensorArrayReadV3 1 0.048 0.002% 99.991% 0.000 1
Rank 2 0.041 0.002% 99.993% 0.008 2
_Retval 4 0.040 0.002% 99.995% 0.000 4
LoopCond 1 0.024 0.001% 99.996% 0.000 2
TensorArraySizeV3 1 0.022 0.001% 99.997% 0.004 1
Equal 1 0.022 0.001% 99.998% 0.001 1
Size 1 0.016 0.001% 99.999% 0.004 1
Exit 1 0.016 0.001% 100.000% 0.000 1
Timings (microseconds): count=50 first=1745329 curr=1670092 min=1670092 max=2170221 avg=1.92491e+06 std=189978
Memory (bytes): count=50 curr=67058518(all same)
5683 nodes observed
Benchmark output for transformed_inference_graph.pb:
native : benchmark_model.cc:382 Graph: [/data/local/tmp/transformed_inference_graph.pb]
native : benchmark_model.cc:383 Input layers: [image_tensor:0]
native : benchmark_model.cc:384 Input shapes: [1,224,224,3]
native : benchmark_model.cc:385 Input types: [uint8]
native : benchmark_model.cc:386 Output layers: [detection_boxes:0,detection_scores:0,detection_classes:0,num_detections:0]
native : benchmark_model.cc:387 Num runs: [50]
native : benchmark_model.cc:388 Inter-run delay (seconds): [-1.0]
native : benchmark_model.cc:389 Num threads: [-1]
native : benchmark_model.cc:390 Benchmark name: []
native : benchmark_model.cc:391 Output prefix: []
native : benchmark_model.cc:392 Show sizes: [0]
native : benchmark_model.cc:393 Warmup runs: [2]
native : benchmark_model.cc:53 Loading TensorFlow.
native : benchmark_model.cc:60 Got config, 0 devices
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : benchmark_model.cc:258 Running benchmark for 2 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=2 first=2688688 curr=1373990 min=1373990 max=2688688 avg=2.03134e+06 std=657349
native : benchmark_model.cc:258 Running benchmark for 50 iterations without detailed stat logging:
native : benchmark_model.cc:286 count=50 first=1405345 curr=1400253 min=1246255 max=1466356 avg=1.37303e+06 std=33711
native : benchmark_model.cc:258 Running benchmark for 50 iterations with detailed stat logging:
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
Conv2D 33.710 36.773 30.203 1.931% 1.931% 2880.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/batchnorm/mul_1
DepthwiseConv2dNative 85.930 26.702 23.704 1.516% 3.447% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
QuantizedConv2D 140.843 21.529 20.738 1.326% 4.773% 5760.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 656.673 19.257 19.940 1.275% 6.048% 409.608 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_13_pointwise/BatchNorm/batchnorm/mul_1/eightbit
DepthwiseConv2dNative 274.615 20.146 17.493 1.119% 7.167% 3225.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise
QuantizedConv2D 323.094 14.819 16.560 1.059% 8.226% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 537.240 13.757 14.516 0.928% 9.154% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_9_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 473.869 13.103 14.275 0.913% 10.067% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_7_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 569.591 13.382 14.089 0.901% 10.968% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_10_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedConv2D 505.947 13.133 13.899 0.889% 11.856% 739.336 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_8_pointwise/BatchNorm/batchnorm/mul_1/eightbit
============================== Top by Memory Use ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
QuantizedAdd 175.293 12.215 11.851 0.758% 0.758% 5760.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/add_1/eightbit
QuantizedConv2D 140.843 21.529 20.738 1.326% 2.084% 5760.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/mul_1/eightbit
Dequantize 203.684 5.234 5.386 0.344% 2.428% 5760.000 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6
DepthwiseConv2dNative 274.615 20.146 17.493 1.119% 3.547% 3225.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise
DepthwiseConv2dNative 85.930 26.702 23.704 1.516% 5.063% 2937.600 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise
QuantizedAdd 345.853 5.552 4.671 0.299% 5.361% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/batchnorm/add_1/eightbit
QuantizedConv2D 323.094 14.819 16.560 1.059% 6.420% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedAdd 309.606 5.842 5.277 0.337% 6.758% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/BatchNorm/batchnorm/add_1/eightbit
QuantizedMul 298.146 4.004 4.408 0.282% 7.039% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/BatchNorm/batchnorm/mul_1/eightbit
QuantizedAdd 257.952 5.945 5.208 0.333% 7.372% 2880.008 1 FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_2_pointwise/BatchNorm/batchnorm/add_1/eightbit
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
NonMaxSuppression 90 637.594 40.856% 40.856% 36.000 90
QuantizedConv2D 33 215.470 13.807% 54.664% 20647.061 33
DepthwiseConv2dNative 13 106.792 6.843% 61.507% 16711.424 13
RequantizationRange 283 93.028 5.961% 67.468% 2.264 283
Requantize 283 79.127 5.070% 72.538% 18539.414 283
QuantizedAdd 130 71.310 4.569% 77.108% 36965.137 130
Slice 91 36.322 2.327% 79.435% 1380.240 91
Split 180 31.280 2.004% 81.440% 4799.104 180
Conv2D 1 30.202 1.935% 83.375% 2880.000 1
ConcatV2 107 27.631 1.771% 85.146% 3601.396 107
QuantizeV2 386 27.177 1.741% 86.887% 5043.348 386
Dequantize 307 26.689 1.710% 88.597% 25504.588 307
Gather 546 25.502 1.634% 90.231% 6120.800 546
QuantizedMul 108 23.188 1.486% 91.717% 15810.112 108
QuantizedRelu6 35 17.087 1.095% 92.812% 9224.536 35
Min 386 15.680 1.005% 93.817% 1.544 386
Max 386 15.260 0.978% 94.795% 1.544 386
Const 629 10.047 0.644% 95.439% 0.000 629
Where 180 8.200 0.525% 95.964% 2290.400 180
Minimum 451 8.197 0.525% 96.489% 0.000 451
Reshape 566 7.537 0.483% 96.972% 0.000 566
Maximum 360 6.321 0.405% 97.377% 0.000 360
Cast 182 5.764 0.369% 97.747% 1746.596 182
Greater 183 5.025 0.322% 98.069% 322.505 183
Sub 192 4.745 0.304% 98.373% 0.016 192
ResizeBilinear 1 4.461 0.286% 98.659% 1080.000 1
StridedSlice 100 3.158 0.202% 98.861% 0.384 100
QuantizedReshape 102 2.275 0.146% 99.007% 0.816 102
TensorArrayGatherV3 1 2.033 0.130% 99.137% 1080.000 1
Squeeze 97 1.766 0.113% 99.250% 0.000 97
ZerosLike 90 1.544 0.099% 99.349% 36.000 90
Sigmoid 1 1.410 0.090% 99.439% 0.000 1
QuantizedBiasAdd 12 1.406 0.090% 99.530% 728.556 12
Shape 99 1.394 0.089% 99.619% 0.784 99
Unpack 5 1.381 0.088% 99.707% 751.464 5
TensorArrayScatterV3 1 1.003 0.064% 99.772% 602.112 1
TopKV2 1 0.968 0.062% 99.834% 72.000 1
Transpose 2 0.344 0.022% 99.856% 61.344 2
Switch 20 0.272 0.017% 99.873% 0.000 22
Merge 8 0.203 0.013% 99.886% 0.032 10
Enter 6 0.191 0.012% 99.898% 0.000 6
NoOp 1 0.184 0.012% 99.910% 0.000 1
Identity 15 0.175 0.011% 99.921% 0.000 15
RealDiv 6 0.143 0.009% 99.931% 0.000 6
Pack 6 0.133 0.009% 99.939% 30.672 6
TensorArrayV3 2 0.114 0.007% 99.946% 0.104 2
ExpandDims 7 0.111 0.007% 99.953% 0.000 7
Range 5 0.097 0.006% 99.960% 0.424 5
Exp 2 0.090 0.006% 99.965% 0.000 2
Assert 4 0.075 0.005% 99.970% 0.000 4
TensorArrayWriteV3 1 0.054 0.003% 99.974% 0.000 1
Less 1 0.051 0.003% 99.977% 0.001 2
NextIteration 2 0.045 0.003% 99.980% 0.000 2
TensorArrayReadV3 1 0.044 0.003% 99.983% 0.000 1
Fill 3 0.044 0.003% 99.986% 0.000 3
_Arg 1 0.042 0.003% 99.988% 0.000 1
_Retval 4 0.037 0.002% 99.991% 0.000 4
Rank 2 0.028 0.002% 99.992% 0.008 2
Equal 1 0.026 0.002% 99.994% 0.001 1
LoopCond 1 0.023 0.001% 99.996% 0.000 2
Add 1 0.021 0.001% 99.997% 0.004 1
TensorArraySizeV3 1 0.020 0.001% 99.998% 0.004 1
Exit 1 0.015 0.001% 99.999% 0.000 1
Size 1 0.014 0.001% 100.000% 0.004 1
Timings (microseconds): count=50 first=1463739 curr=1459978 min=1392397 max=1906618 avg=1.56387e+06 std=182975
Memory (bytes): count=50 curr=176072750(all same)
6723 nodes observed
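For completeness, the end-to-end gain from the transforms can be read off the `avg` fields of the two 50-iteration timing lines above; a quick sketch (both stats lines are copied verbatim from the logs):

```python
import re

def avg_micros(line):
    """Extract the avg= field (microseconds) from a benchmark_model stats line."""
    m = re.search(r"avg=([\d.e+]+)", line)
    return float(m.group(1))

frozen = avg_micros(
    "count=50 first=1687558 curr=1682345 min=1615775 max=1802978 avg=1.69049e+06 std=41851"
)
transformed = avg_micros(
    "count=50 first=1405345 curr=1400253 min=1246255 max=1466356 avg=1.37303e+06 std=33711"
)
print(f"speedup: {(1 - transformed / frozen) * 100:.1f}%")  # prints "speedup: 18.8%"
```

So the transforms cut end-to-end latency by roughly 19%, but the untouched NonMaxSuppression nodes cap how much more they can help.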