Adding table input support for batched SparseLinear, implementing gradInput correctly, fixing other bugs #698
Conversation
|
cc: @zhangxiangxiao @myhrev
|
cc: @MichaelMathieu
|
@pengsun the current updateGradInput is completely wrong
|
@soumith Oh, sorry, didn't realize that... deleting my previous post...
|
The current updateGradInput returns something sparse when it should be dense. The stupid way to do this would just be to return something like {{1, v1}, {2, v2}, {3, v3}, ...}, which would be easy to implement and also correct. I'd imagine the reason we don't have it in LookupTable is the blowup in gradient size, since it's no longer sparse.

Also, I'm pretty sure the zeroGradParameters and updateParameters functions are wrong as well, since they do a sparse zero/update based on the last input. If we do multiple forward/backward passes before calling either of these functions, the result will be garbled.
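To make the dense-gradient point concrete, here is a minimal sketch in plain Python (the function name is hypothetical and the real Torch code would use tensor ops; this is only an illustration, not the module's implementation) of what a correct dense updateGradInput computes for a linear layer: gradInput = Wᵀ · gradOutput, with an entry for every input feature rather than only the nonzero ones:

```python
def dense_grad_input(weight, grad_output):
    """Dense gradInput for a linear layer y = W*x + b.

    weight: out_dim x in_dim matrix (list of rows).
    grad_output: length-out_dim vector.
    Returns gradInput = W^T * grad_output, a length-in_dim vector.
    The result is dense even when the forward input was sparse:
    every column of W contributes, not only the columns indexed
    by the nonzero input entries.
    """
    out_dim, in_dim = len(weight), len(weight[0])
    grad_input = [0.0] * in_dim
    for o in range(out_dim):
        for i in range(in_dim):
            grad_input[i] += weight[o][i] * grad_output[o]
    return grad_input
```

This is exactly the "blowup" mentioned above: the returned vector has in_dim entries no matter how sparse the input was.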
Force-pushed f5ee898 to 9ed660a: …dInput correctly, fixing other bugs
|
I changed this pull request to basically fix SparseLinear; see the above edit for all the changes.
```lua
else
   parent.zeroGradParameters(self)
end
self.sparseUpdate = 0
```

Review suggestion: `self.sparseUpdate = NO_LAST_INPUT`
|
Thanks Zeming!
|
Sorry, a bit late to the discussion. The new SparseLinear module breaks projects in non-batch mode that depend on the fast implementation of updateParameters (and perhaps other functions too) that were using self.lastInput. This module has been used in various implementations of memory networks, ranking models and word vector models. One crucial point was that it could support fast forward propagation, backward propagation and parameter updates on indexed values via self.lastInput. Parallelization is taken care of outside the module itself via higher-level optimization techniques such as HogWILD, rather than depending on batch mode. Parameters are updated for each sample because it is faster this way due to sparsity.

The new implementation breaks this by setting self.legacyMode = false when the input is a dimension-2 tensor (i.e., not a batch input). This breaks many projects where using self.lastInput in non-batch mode was the intention for self:updateParameters (and perhaps other functions such as updateOutput and updateGradInput too -- I did not check). Can you improve the backward compatibility of this?
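The fast path being described can be sketched as follows -- a plain-Python illustration with hypothetical names (not the module's actual API) of a per-sample sparse parameter update that touches only the weight columns indexed by the last sparse input, which is what self.lastInput enables in non-batch mode:

```python
def sparse_update_params(weight, grad_weight, last_input_indices, lr):
    """Per-sample sparse SGD step for a linear layer.

    weight, grad_weight: out_dim x in_dim matrices (lists of rows).
    last_input_indices: columns that were nonzero in the last input;
    only those columns can carry gradient, so only they are updated.
    Cost is O(out_dim * nnz) instead of O(out_dim * in_dim), which is
    the whole point when in_dim is e.g. a vocabulary of 65536+ words.
    """
    for row, grad_row in zip(weight, grad_weight):
        for i in last_input_indices:
            row[i] -= lr * grad_row[i]
```

In a HogWILD-style setup, each worker thread would run this update independently per sample, relying on sparsity to keep collisions rare.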
|
@zhangxiangxiao can you give a quick test case so that I can hotfix the module rather than reverting all of it? Now that there's a CUDA version that I merged in yesterday, I'll have to revert a bunch of things.
|
@soumith This is a performance issue rather than a correctness issue. I can write a piece of code showing where this is necessary.
|
When input is of dimension 2, if we set self.legacyMode to true, that should fix the perf issues. That's doable... I don't see why not.
|
@soumith Yeah, that's what I thought too. I am not sure about anything except updateParameters, because I did not look at the other functions. Writing the test code...
|
The only functions that do not do sparse updates are updateParameters and zeroGradParameters, like you said. I did not realize the performance hit would be so big, but in addition to the hotfix I can also do a patch for doing sparse updates in general on lastInput. @soumith's suggestion would disable CUDA for single-batch updates.
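A sparse zeroGradParameters follows the same pattern -- a plain-Python sketch with hypothetical names, not the module's actual code -- zeroing only the gradient columns the last input touched instead of the whole buffer:

```python
def sparse_zero_grad(grad_weight, last_input_indices):
    """Zero only the gradient columns indexed by the last sparse
    input, instead of zeroing the full out_dim x in_dim buffer.

    Caveat (this is the correctness concern raised earlier in the
    thread): this is only valid if every backward pass since the
    previous zero wrote to these same recorded indices; otherwise
    stale gradient survives in untracked columns.
    """
    for grad_row in grad_weight:
        for i in last_input_indices:
            grad_row[i] = 0.0
```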
|
Okay, here is a piece of testing code:

```lua
local nn = require('nn')
local sys = require('sys')

local model = nn.SparseLinear(65536, 256)
local input = torch.rand(5, 2)
input:select(2, 1):mul(65536):ceil()

sys.tic()
local output = model:forward(input)
sys.toc(true)

local gradOutput = torch.rand(output:size())
sys.tic()
local gradInput = model:backward(input, gradOutput)
sys.toc(true)

sys.tic()
model:updateParameters(1e-3)
sys.toc(true)

sys.tic()
model:zeroGradParameters()
sys.toc(true)
```

And here is the output: it seems both updateParameters and zeroGradParameters were unnecessarily slow. I also changed legacyMode to true for when input:dim() == 2, but there is some bug somewhere preventing this from working.

Edit: I added the test for zeroGradParameters() as well. Same problem as with updateParameters().

P.S. The example above could be thought of as running a 5-gram sample on a word2vec model with 65536 words and an embedding dimension of 256. In practice the number of words could be much larger than 65536.
|
@ebetica Oops, yes, it is very important for zeroGradParameters to be fast too (as it is also called once every update). Updated the test case above to reflect this.
|
@zhangxiangxiao Is this level of sparsity a common use case? The CUDA module is not optimized for single-batch, extremely sparse inputs.
|
It is a common use case for natural language processing, where this module is mostly used. A single core on my machine can go over 5000 samples per second, and using a 10-thread HogWILD trainer my throughput is at the level of 50,000 samples per second. At this level of sparsity the CPU is much faster than the GPU. But I think the two use cases (batch vs. non-batch) are so distinct that it is fine to combine the two in this module. We can live with a slow GPU use case when the CPU is faster.
updateGradInput produces a dense gradInput, so I'm removing it to match LookupTable. We keep a resize for compatibility (with, for example, an nngraph.Identity() node on an input). Fixed the test so it no longer checks gradInput.
Edit: After some discussion I'm redoing SparseLinear:
The original batchnum x nnz x 2 format is still supported for now...
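For reference, in the legacy per-sample layout each sample is an nnz x 2 tensor of (index, value) pairs, and the batched layout stacks these as batchnum x nnz x 2. A rough plain-Python sketch (hypothetical helper, 1-based indices as in Lua/Torch; an illustration of the convention, not the module's code) of the forward pass over one such sample:

```python
def forward_sparse(weight, bias, sparse_input):
    """Forward pass for one sample in the nnz x 2 format, where
    each row of sparse_input is an (index, value) pair with
    1-based indices, as in Lua/Torch.

    weight: out_dim x in_dim matrix (list of rows); bias: out_dim.
    Cost is O(out_dim * nnz) rather than O(out_dim * in_dim).
    """
    out = list(bias)
    for idx, val in sparse_input:
        col = int(idx) - 1  # 1-based Torch index -> 0-based
        for o in range(len(weight)):
            out[o] += weight[o][col] * val
    return out
```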