
Adding table input support for batched SparseLinear, implementing gradInput correctly, fixing other bugs#698

Merged
soumith merged 1 commit into torch:master from ebetica:sparse_linear on Mar 17, 2016

Conversation

@ebetica (Contributor) commented Mar 7, 2016

updateGradInput produced a dense gradInput, so I'm removing that to match LookupTable. We keep a resize for compatibility (with, for example, an nngraph.Identity() node on an input). I also fixed the test so it no longer checks gradInput.

Edit: After some discussion I'm redoing SparseLinear:

  1. Now supports table inputs of the format {batch1, batch2, batch3, ...}, where each batch is an ninput x 2 sparse vector. This lets us input batches of unequal vector sizes.
  2. updateGradInput is now coded correctly to return a dense vector coded in sparse format.
  3. updateParameters and zeroGradParameters were wrong if accGradParameters was called more than once, because they did sparse updates based only on the last input.

The original batchnum x nnz x 2 format is still supported for now...
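The difference between the two layouts can be sketched as follows (Python used purely for illustration, with nested lists standing in for torch Tensors; this is not the module's Lua API):

```python
# New table format: {batch1, batch2, ...}, each batch a list of
# (index, value) pairs -- samples may have different numbers of nonzeros.
table_input = [
    [(1, 0.5), (7, 1.2)],           # sample 1: 2 nonzeros
    [(3, 0.9)],                     # sample 2: 1 nonzero
    [(2, 0.1), (5, 0.4), (8, 2.0)], # sample 3: 3 nonzeros
]

# Legacy batchnum x nnz x 2 format: one 3D tensor, so every sample must
# be padded to the same number of nonzeros.
legacy_input = [
    [(1, 0.5), (7, 1.2), (0, 0.0)],
    [(3, 0.9), (0, 0.0), (0, 0.0)],
    [(2, 0.1), (5, 0.4), (8, 2.0)],
]

# The table format admits unequal per-sample sizes; the legacy one cannot.
assert len({len(s) for s in table_input}) > 1
assert len({len(s) for s in legacy_input}) == 1
```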

@soumith (Member) commented Mar 7, 2016

cc: @zhangxiangxiao @myhrev

@soumith (Member) commented Mar 7, 2016

cc: @MichaelMathieu

@soumith (Member) commented Mar 7, 2016

@pengsun the current updateGradInput is completely wrong

@pengsun (Contributor) commented Mar 7, 2016

@soumith Oh, sorry, didn't realize that... deleting my previous post...

@ebetica (Contributor, Author) commented Mar 7, 2016

The current updateGradInput returns something sparse when it should be dense. The stupid way to do this would just be to return something like:

{{1, v1}, {2, v2}, {3, v3},...}

which would be easy to implement and also correct. I'd imagine the reason we don't have it in LookupTable is because of the blowup in gradient size, since it's no longer sparse.

Also, I'm pretty sure the zeroGradParameters and updateParameters functions are also wrong, since they do a sparse zero/update based on the last input. If we do multiple forward/backward passes before calling any of these functions the result will be garbled.
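The "dense vector coded in sparse format" idea above can be made concrete with a small sketch: gradInput = Wᵀ · gradOutput, returned as (index, value) pairs covering every input dimension (plain Python for illustration, not the actual Torch kernel; `sparse_linear_grad_input` is a hypothetical name):

```python
def sparse_linear_grad_input(weight, grad_output):
    # weight: noutput x ninput matrix; grad_output: length-noutput list.
    ninput = len(weight[0])
    grad = [sum(weight[j][i] * grad_output[j] for j in range(len(weight)))
            for i in range(ninput)]
    # Every index 1..ninput appears, so the result is dense even though
    # it is coded as {{1, v1}, {2, v2}, {3, v3}, ...}.
    return [(i + 1, grad[i]) for i in range(ninput)]  # 1-based, like Lua

W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
g = [1.0, 1.0]
print(sparse_linear_grad_input(W, g))  # [(1, 1.0), (2, 3.0), (3, 3.0)]
```

This shows why the blowup matters: for a vocabulary-sized ninput, the returned gradient is as large as the input space even when the input had only a handful of nonzeros.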

@ebetica force-pushed the sparse_linear branch 2 times, most recently from f5ee898 to 9ed660a on March 8, 2016
@ebetica changed the title from "Removed updateGradInput from SparseLinear, removed unnecessary buffers" to "Adding table input support for batched SparseLinear, implementing gradInput correctly, fixing other bugs" on Mar 11, 2016
@ebetica (Contributor, Author) commented Mar 11, 2016

I changed this pull request to basically be a fix of SparseLinear; see the edit above for the full list of changes.

Comment thread on SparseLinear.lua:

    else
       parent.zeroGradParameters(self)
    end
    self.sparseUpdate = 0

@soumith (Member) suggested:

    self.sparseUpdate = NO_LAST_INPUT

soumith added a commit that referenced this pull request Mar 17, 2016
 Adding table input support for batched SparseLinear, implementing gradInput correctly, fixing other bugs
@soumith soumith merged commit ead8af9 into torch:master Mar 17, 2016
@soumith (Member) commented Mar 17, 2016

Thanks Zeming!

@zhangxiangxiao (Contributor) commented:
Sorry, a bit late to the discussion. The new SparseLinear module breaks projects in non-batch mode that depend on the fast implementation of updateParameters (and perhaps other functions too) that used self.lastInput.

This module has been used in various implementations of memory networks, ranking models and word vector models. One crucial point is that it supports fast forward propagation, backward propagation and parameter updates on indexed values via self.lastInput. Parallelization is handled outside the module itself via higher-level optimization techniques such as HOGWILD, rather than by depending on batch mode. Parameters are updated for each sample because sparsity makes that faster.

The new implementation breaks this by setting self.legacyMode = false when the input is a 2-dimensional tensor (i.e., not a batch input). This breaks many projects where self.lastInput in non-batch mode was the intended path for self:updateParameters (and perhaps other functions such as updateOutput and updateGradInput too -- I did not check). Can you improve the backward compatibility of this?
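The lastInput-based fast path being discussed can be sketched roughly as follows (illustrative Python with 0-based indices, not the actual Lua module code; `sparse_update` is a hypothetical name): only the weight columns touched by the last input are updated and zeroed, which is what makes the non-batch path cheap when inputs are very sparse.

```python
def sparse_update(weight, grad_weight, last_input, lr):
    # last_input: (index, value) pairs remembered from the previous
    # forward pass; only those columns are visited.
    for idx, _val in last_input:
        for row in range(len(weight)):
            weight[row][idx] -= lr * grad_weight[row][idx]  # updateParameters
            grad_weight[row][idx] = 0.0                     # zeroGradParameters

W  = [[0.0] * 4 for _ in range(2)]
gW = [[1.0] * 4 for _ in range(2)]
sparse_update(W, gW, last_input=[(0, 0.5), (2, 0.3)], lr=0.1)
assert W[0]  == [-0.1, 0.0, -0.1, 0.0]  # only columns 0 and 2 touched
assert gW[0] == [0.0, 1.0, 0.0, 1.0]
```

This also illustrates the bug described earlier in the thread: if accGradParameters runs more than once before the update, indices touched only by earlier inputs are skipped here.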

@soumith (Member) commented Mar 19, 2016

@zhangxiangxiao can you give a quick test case so that I can hotfix the module rather than reverting all of it? Now that there's a CUDA version that I merged in yesterday, I'd have to revert a bunch of things.
As soon as you give a repro, I can hotfix it within 24 hours.

@zhangxiangxiao (Contributor) commented:
@soumith This is a performance issue rather than a correctness issue. I can write a piece of code showing where this is necessary.

@soumith (Member) commented Mar 19, 2016

When the input is of dimension 2, if we set self.legacyMode to true, that should fix the perf issues. That's doable... I don't see why not.

@zhangxiangxiao (Contributor) commented:
@soumith Yeah, that's what I thought too. I am not sure about anything other than updateParameters, because I did not look at the other functions. Writing the test code...

@ebetica (Contributor, Author) commented Mar 19, 2016

The only functions that do not do sparse updates are updateParameters and zeroGradParameters, as you said. I did not realize the performance hit would be so big; in addition to the hotfix I can also do a patch for doing sparse updates in general on lastInput.

@soumith's suggestion would disable CUDA for single-batch updates.

@zhangxiangxiao (Contributor) commented:
Okay, here is a piece of test code:

local nn = require('nn')
local sys = require('sys')

local model = nn.SparseLinear(65536, 256)

local input = torch.rand(5, 2)
input:select(2, 1):mul(65536):ceil()

sys.tic()
local output = model:forward(input)
sys.toc(true)

local gradOutput = torch.rand(output:size())

sys.tic()
local gradInput = model:backward(input, gradOutput)
sys.toc(true)

sys.tic()
model:updateParameters(1e-3)
sys.toc(true)

sys.tic()
model:zeroGradParameters()
sys.toc(true)

And here is the output:

$ th test.lua
0.00019693374633789
0.00010895729064941
0.022776126861572
0.014401197433472

It seems both updateParameters and zeroGradParameters are unnecessarily slow. I also changed legacyMode to true for when input:dim() == 2, but there is some bug somewhere preventing this from working.

Edit: I added the test for zeroGradParameters() as well. Same problem with updateParameters().

P.S. The example above can be thought of as running a 5-gram sample through a word2vec model with 65536 words and a 256-dimensional embedding. In practice the number of words could be much larger than 65536.

@zhangxiangxiao (Contributor) commented:
@ebetica Oops, yes, zeroGradParameters needs to be fast too (it is also called once per update). I updated the test case above to reflect this.

@ebetica (Contributor, Author) commented Mar 19, 2016

@zhangxiangxiao Is this level of sparsity a common use case? The CUDA module is not optimized for single-batch, extremely sparse inputs.

@zhangxiangxiao (Contributor) commented:
It is a common use case for natural language processing, where this module is mostly used. A single core on my machine can go through 5000 samples per second, and with a 10-thread HOGWILD trainer my throughput is around 50,000 samples per second.

At this level of sparsity CPU is much faster than GPU. But I think the two use cases (batch vs non-batch) are so distinct that it is good to combine the two when writing this module. We can live with a slow GPU use case when CPU is faster.
