
[RFC/WIP] Bilinear initializer #34

Closed

vchuravy wants to merge 3 commits into dmlc:master from vchuravy:vc/bilinear

Conversation

@vchuravy
Collaborator

This PR adds a Bilinear initializer, similar to BVLC/caffe#2213, which is useful for upsampling with deconvolution. Additionally, it allows setting different initializers for different layers.
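
For reference, a minimal sketch of the kind of kernel a bilinear initializer typically produces (following the BVLC/caffe#2213 filler); the names here are illustrative, not the code in this PR:

function bilinear_kernel(kernel_size)
  f = ceil(Int, kernel_size / 2)          # upsampling factor implied by the kernel
  c = (2 * f - 1 - f % 2) / (2 * f)       # center of the kernel
  w = zeros(kernel_size, kernel_size)
  for i in 0:kernel_size-1, j in 0:kernel_size-1
    w[i + 1, j + 1] = (1 - abs(i / f - c)) * (1 - abs(j / f - c))
  end
  return w
end

bilinear_kernel(4)  # 4x4 kernel for 2x upsampling: outer product of [0.25, 0.75, 0.75, 0.25]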

Todo

  • Documentation for BilinearInitializer
  • Tests and example
  • Documentation for initializer per layer
  • Setting the learning rate per layer (since we preinitialize to a given function, we need to turn learning off)
  • Initializing the filter correctly for multiple channels

Setting initializer per layer

using MXNet

data = mx.Variable(:data)
pool1 = mx.Pooling(data = data, kernel = (2,2), pool_type = :max, stride = (2,2))
deconv = mx.Deconvolution(data = pool1, num_filter = 2, kernel = (2,2), stride = (2,2), no_bias = true)

# name of the deconvolution weight argument, used as a key below
deconv_w = mx.list_arguments(deconv)[2]

data = zeros(64, 64, 1, 10)
dp = mx.ArrayDataProvider(data, batch_size = 1)

model = mx.FeedForward(deconv)
# per-argument initializers: :default covers all arguments not listed explicitly
mx.init_model(model, Dict(:default => mx.UniformInitializer(), deconv_w => mx.BilinearInitializer()), data = size(data))
pred = mx.predict(model, dp)

Proper upsampling

using MXNet

# scaling factor
factor = 2
@show kernel = 2factor - factor % 2
stride = factor
@show pad = ceil(Int64, (factor - 1) / 2)

data = mx.Variable(:data)
deconv = mx.Deconvolution(data = data, num_filter = 1, kernel = (kernel, kernel), stride = (stride, stride), pad = (pad, pad), no_bias=true)

deconv_w = mx.list_arguments(deconv)[2]

data = zeros(3, 3, 1, 1)
for i in 1:3
  for j in 1:3
    data[i, j, 1, 1] = i*j
  end
end

@show data

dp = mx.ArrayDataProvider(data, batch_size = 1)

model = mx.FeedForward(deconv)
mx.init_model(model, Dict(:default => mx.UniformInitializer(), deconv_w => mx.BilinearInitializer()), data=size(data))
pred = mx.predict(model, dp)

@show pred

Ref: #31

@codecov-io

Current coverage is 75.47%

Merging #34 into master will decrease coverage by 0.92% as of 85b9fd8

@@            master     #34   diff @@
======================================
  Files           20      20       
  Stmts         1449    1468    +19
  Branches         0       0       
  Methods          0       0       
======================================
+ Hit           1107    1108     +1
  Partial          0       0       
- Missed         342     360    +18

Review entire Coverage Diff as of 85b9fd8

Powered by Codecov. Updated on successful CI builds.

@vchuravy
Collaborator Author

@pluskid How would I best turn off learning for a layer?

@pluskid
Member

pluskid commented Nov 23, 2015

The API looks good to me!

By "turn of" do you mean "turn off"? There is a hacky way of turning off learning by using BlockGrad operator. It blocks gradient back propagation. With several drawbacks:

  • One need to modify the symbolic structure and insert the BlockGrad symbol.
  • Layers below it will not get gradients and therefore not get trained.
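
A minimal sketch of the workaround, assuming the BlockGrad operator is exposed in MXNet.jl as mx.BlockGrad and reusing the deconv symbol from the example above:

# BlockGrad is an identity on the forward pass but stops gradients from
# flowing back through it, so deconv's weights (and everything below) stay frozen.
blocked = mx.BlockGrad(data = deconv)
model   = mx.FeedForward(blocked)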

One nice thing we could have (as in Caffe) is a per-layer (per-operator) learning rate. Choices include:

  1. Modify the operator definition and add a new argument grad_scale (currently the Loss layers have this property).
  2. Utilize the newly added attribute interface ([SYMBOL] enable attributes in graph node, apache/mxnet#685) to attach a per-operator learning rate.
  3. Like what you did here, pass a dictionary to the fit function, optionally specifying a per-operator learning rate.

I think the 3rd option sounds best, as it requires minimal changes to the backend codebase, and it actually makes more sense since grad_scale is a property of the trainer only. The only (slight) inconvenience is that the user has to specify the learning rate separately, which means you will need to use something like what you did in your IJulia notebook

deconv_w = mx.list_arguments(deconv)[2]

to get the key to be used in the dictionary. Regarding this, the 2nd option might be a good compromise.
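
For concreteness, the user-facing side of the 3rd option might look roughly like the following; lr_multipliers is a hypothetical keyword (it does not exist in MXNet.jl today) and train_provider stands in for whatever data provider is used for training:

deconv_w = mx.list_arguments(deconv)[2]
# hypothetical keyword sketching the proposed per-operator learning rates;
# a multiplier of 0.0 would freeze the pre-initialized bilinear weights
mx.fit(model, mx.SGD(lr = 0.1), train_provider,
       lr_multipliers = Dict(deconv_w => 0.0))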

@vchuravy
Collaborator Author

Yeah, passing a dictionary in would be the least hacky, but also the most inconvenient. Maybe one could alleviate that by adding mx.weight_name so that users don't have to worry about getting the correct symbol.
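
A rough sketch of what such a helper could look like, using only the existing mx.list_arguments; the name weight_name and the reliance on the *_weight naming convention are illustrative:

# hypothetical helper: return the first argument of `sym` whose name ends in
# "_weight", so users don't need to know its position in list_arguments
function weight_name(sym)
  for arg in mx.list_arguments(sym)
    endswith(string(arg), "_weight") && return arg
  end
  return nothing
end

deconv_w = weight_name(deconv)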

@vchuravy
Collaborator Author

@pluskid So I was looking into using attributes to set the lr per layer, but the calls to the updater function https://github.com/dmlc/MXNet.jl/blob/master/src/optimizer.jl#L179-L183 only receive the NDArrays. Is there any way to get the associated symbol?

@pluskid
Member

pluskid commented Nov 24, 2015

Yes, let me think about it. When you construct a symbolic graph, the operators kind of get smashed into a single symbolic node at the end. Without looking at the libmxnet source code, I'm not even sure whether some graph re-writing happens to optimize runtime efficiency.

@tqchen Is there an easy API to inspect the original symbolic hierarchy? (Other than dumping it to JSON)

@vchuravy
Collaborator Author

Superseded by apache/mxnet#746.

@vchuravy vchuravy closed this Nov 29, 2015
vchuravy added a commit that referenced this pull request Apr 13, 2017
fixes bilinear initializer following approach in #34