Description
While trying to train VGG16 on the CIFAR-10 dataset, I noticed that the initial learning rate has to be set to a very small value for training to converge. According to the implementation of fit.py under example/image-classification/common, Xavier with the Gaussian random type is the default weight-initialization strategy.
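For concreteness, the two settings can be constructed as follows (a minimal sketch; the gaussian line reflects my reading of fit.py, and the uniform line uses mx.init.Xavier's documented defaults):

```python
import mxnet as mx

# Default in example/image-classification/common/fit.py (as I read it):
# Xavier with Gaussian random type, scaled by fan-in.
init_gaussian = mx.init.Xavier(rnd_type='gaussian', factor_type='in', magnitude=2)

# The variant that converges at LR 0.01 in the tests below:
# Xavier with uniform random type (MXNet's default Xavier parameters).
init_uniform = mx.init.Xavier(rnd_type='uniform', factor_type='avg', magnitude=3)
```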
On an NVIDIA P40 GPU, with CUDA 9.1 and OpenBLAS, I ran the following comparison tests:
Test 1: Initial LR = 0.01, Xavier with Gaussian random type
Test 2: Initial LR = 0.00025, Xavier with Gaussian random type
Test 3: Initial LR = 0.01, Xavier with uniform random type
The following two figures compare the trends of training top-1 accuracy and cross-entropy loss. The blue, green, and purple curves correspond to tests 1, 2, and 3, respectively.
It can be observed that, with the same initial LR (0.01), training with Xavier-Gaussian does not converge, while training with Xavier-uniform converges quickly.
A similar phenomenon can be observed when the dataset is changed to CIFAR-100 (retrieved from http://data.mxnet.io/data/cifar100.zip). When training on CIFAR-100, the initial learning rate has to be set as low as 0.00001 if Xavier-Gaussian is used to initialize the weights.
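A possible explanation for the gap (my own back-of-the-envelope check, not something confirmed beyond the Xavier formula in the MXNet sources) is that the two settings yield different effective weight scales, and the difference compounds over VGG's depth. A sketch, assuming MXNet's Xavier draws N(0, scale) or U(-scale, scale) with scale = sqrt(magnitude / factor):

```python
import math

def xavier_std(fan_in, fan_out, rnd_type, factor_type, magnitude):
    # Assumption: MXNet's Xavier uses scale = sqrt(magnitude / factor), with
    # factor = fan_in ('in'), fan_out ('out'), or their mean ('avg');
    # a U(-scale, scale) draw has stddev scale / sqrt(3).
    factor = {'in': fan_in, 'out': fan_out,
              'avg': (fan_in + fan_out) / 2.0}[factor_type]
    scale = math.sqrt(magnitude / factor)
    return scale if rnd_type == 'gaussian' else scale / math.sqrt(3)

# A 3x3, 512-to-512 VGG conv layer: fan_in = fan_out = 3 * 3 * 512 = 4608
fan = 3 * 3 * 512
print(xavier_std(fan, fan, 'gaussian', 'in', 2))   # ~0.0208 (the fit.py default)
print(xavier_std(fan, fan, 'uniform', 'avg', 3))   # ~0.0147

# The Gaussian setting is ~sqrt(2) larger per layer; compounded over
# VGG16's depth without batch norm, that is plausibly enough for the
# activations to blow up at LR 0.01, matching the divergence observed.
```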


Environment info (Required)
HW: NVIDIA P40
SW: MXNet master branch, with CUDA 9.1, cuDNN 7.1, and OpenBLAS 2.1
What to do:
- Download the training script from https://github.com/juliusshufan/mxnet
- Run the three Python scripts corresponding to tests 1, 2, and 3 (see the example invocations below)
(Note: the test scripts are modified from the original train_cifar10.py that ships with the MXNet examples.)
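If the scripts keep the command-line flags of the stock image-classification example, the runs would look roughly like this (a sketch: the script names are placeholders, and I have only verified the flag names against the upstream train_cifar10.py, not the linked repo):

```
# Test 1: Xavier-Gaussian, LR 0.01 (placeholder script names)
python train_cifar10_test1.py --network vgg --num-layers 16 --gpus 0 --lr 0.01
# Test 2: Xavier-Gaussian, LR 0.00025
python train_cifar10_test2.py --network vgg --num-layers 16 --gpus 0 --lr 0.00025
# Test 3: Xavier-uniform, LR 0.01
python train_cifar10_test3.py --network vgg --num-layers 16 --gpus 0 --lr 0.01
```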
Build info (Required if built from source)
Compiler: gcc 4.8.5
MXNet commit hash: cea8d2f
Build config:
make -j$(nproc) USE_OPENCV=1 USE_CUDA=1 USE_CUDNN=1 USE_BLAS=openblas USE_CUDA_PATH=/usr/local/cuda-9.1
What have you tried to solve it?
Using Xavier with the uniform random type instead of the Gaussian random type.
I have submitted PR #9867, in which the weight-initialization strategy is set to Xavier-uniform when the network is VGG, and the initializer is no longer explicitly set in the training script.
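Conceptually, the change is along these lines (a minimal sketch of the idea, not the literal diff in PR #9867; the function name and the network-string check are assumptions):

```python
import mxnet as mx

def pick_initializer(network):
    # Sketch: use MXNet's Xavier-uniform defaults for VGG instead of the
    # Xavier-Gaussian setting previously hard-coded for all networks.
    if 'vgg' in network:
        return mx.init.Xavier(rnd_type='uniform', factor_type='avg', magnitude=3)
    return mx.init.Xavier(rnd_type='gaussian', factor_type='in', magnitude=2)
```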