This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Description
- The MXNet nightly benchmark, which runs CV and NLP models on the MXNet nightly pip wheel and reports metrics, showed a performance regression in GPU memory usage.
- After bisecting the PRs, the 120-130 MB increase in GPU memory was traced to PR #17767. I ran SSD training with and without the PR commit and observed an increase of around 120 MB in GPU memory. I have not run all of the listed models personally, but the reproduction script is minimal and applies to all of them.
- The affected models are listed below. The GPU memory numbers come from our internal benchmarking system:
| Model | GPU memory (from → to) |
| --- | --- |
| VGG16_training | 9.63k → 9.78k |
| SSD_training | 4.75k → 4.91k |
| LSTM_inference: Gluon and Module | G: 1.55k → 1.71k; M: 1.47k → 1.64k |
| Caffenet_inference: Gluon and Module | G: 2.17k → 2.31k; M: 2.0k → 2.16k |
| YoloV3_GPU: Gluon and Module | G: 1.87k → 2.03k; M: 1.84k → 2.0k |
| Resnet50_v2_FP16_inference_GPU: Gluon and Module | G: 1.58k → 1.74k; M: 1.55k → 1.71k |
| Resnet50_v2_inference_GPU: Gluon and Module | G: 1.55k → 1.71k; M: 1.48k → 1.64k |
| Resnet152_v2_inference_GPU: Gluon and Module | G: 1.92k → 2.08k; M: 1.86k → 2.02k |
| Inception_Inference_GPU: Gluon and Module | G: 1.43k → 1.59k; M: 1.40k → 1.56k |
| SSD_inference_GPU: Gluon and Module | G: 1.65k → 1.81k; M: 1.54k → 1.70k |
| A3C_inference_GPU: Gluon and Module | G: 1.32k → 1.48k; M: 1.31k → 1.47k |
| word_language_model_hybrid_p3.16_training | 2.25k → 2.43k |
| Mobile_pose_training | 2.38k → 2.54k |
To Reproduce
- Run the lines of code below and monitor GPU usage with the `nvidia-smi` command:

```python
import mxnet as mx
a = mx.nd.zeros((1,), ctx=mx.gpu())
```
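To capture the memory delta while the script above runs, one option is to poll `nvidia-smi` programmatically instead of watching it by hand. The helpers below are an illustrative sketch, not part of MXNet or this issue's tooling; `query_gpu_memory` assumes `nvidia-smi` is on the PATH, and `parse_memory_usage` simply extracts the figures from an output line in the `1279MiB / 16160MiB` style shown below:

```python
import re
import subprocess

def parse_memory_usage(line):
    """Extract (used_mib, total_mib) from an nvidia-smi style string,
    e.g. 'gpu(0): 1279MiB / 16160MiB' -> (1279, 16160)."""
    m = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    if m is None:
        raise ValueError(f"no memory figures found in: {line!r}")
    return int(m.group(1)), int(m.group(2))

def query_gpu_memory():
    """Ask nvidia-smi for per-GPU (used, total) memory in MiB.
    Requires an NVIDIA driver and nvidia-smi on the PATH."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [tuple(int(v) for v in row.split(","))
            for row in out.strip().splitlines()]
```

Running `query_gpu_memory()` before and after the `mx.nd.zeros` call and subtracting the `used` figures gives the per-commit footprint reported below.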
Output (MXNet built from source, instance: p3.16xlarge):

```
Without PR #17767 (commit f882de0c7ecd6ff1f0fdba492865afc6d7e29271):
GPU usage: gpu(0): 1279MiB / 16160MiB

With PR #17767 (commit 5542d03695b4a2589afb88acf128d4ba8ac94d0d):
GPU usage: gpu(0): 1407MiB / 16160MiB
```

That is a delta of 128 MiB, consistent with the 120-130 MB regression reported above.
- Tagging the PR author here: @ptrendx
- Thanks @ptrendx for sharing the simple reproduction script.