Adding table input support for batched SparseLinear, implementing gradInput correctly, fixing other bugs #698
Conversation
|
cc: @zhangxiangxiao @myhrev
|
cc: @MichaelMathieu
|
@pengsun the current updateGradInput is completely wrong
|
@soumith Oh, sorry, didn't realize that... deleting my previous post...
|
The current updateGradInput returns something sparse when it should be dense. The stupid way to do this would just be to return something like {{1, v1}, {2, v2}, {3, v3}, ...}, which would be easy to implement and also correct. I'd imagine the reason we don't have it in LookupTable is the blowup in gradient size, since it's no longer sparse.

Also, I'm pretty sure the zeroGradParameters and updateParameters functions are wrong as well, since they do a sparse zero/update based on the last input. If we do multiple forward/backward passes before calling either of these functions, the result will be garbled.
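To make the dense-gradient point concrete, here is a minimal sketch in plain Python (the function name is hypothetical and the real Torch code would use tensor ops; this is only an illustration, not the module's implementation) of what a correct dense updateGradInput computes for a linear layer: gradInput = Wᵀ · gradOutput, with an entry for every input feature rather than only the nonzero ones:

```python
def dense_grad_input(weight, grad_output):
    """Dense gradInput for a linear layer y = W*x + b.

    weight: out_dim x in_dim matrix (list of rows).
    grad_output: length-out_dim vector.
    Returns gradInput = W^T * grad_output, a length-in_dim vector.
    The result is dense even when the forward input was sparse:
    every column of W contributes, not only the columns indexed
    by the nonzero input entries.
    """
    out_dim, in_dim = len(weight), len(weight[0])
    grad_input = [0.0] * in_dim
    for o in range(out_dim):
        for i in range(in_dim):
            grad_input[i] += weight[o][i] * grad_output[o]
    return grad_input
```

This is exactly the "blowup" mentioned above: the returned vector has in_dim entries no matter how sparse the input was.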
Force-pushed f5ee898 to 9ed660a: …dInput correctly, fixing other bugs
|
I changed this pull request to basically fix SparseLinear; see the above edit for all the changes.
```lua
else
   parent.zeroGradParameters(self)
end
self.sparseUpdate = 0
```

Review suggestion: `self.sparseUpdate = NO_LAST_INPUT`
|
Thanks Zeming!
|
Sorry, a bit late to the discussion. The new SparseLinear module breaks projects in non-batch mode that depend on the fast implementation of updateParameters (and perhaps other functions too) that were using self.lastInput. This module has been used in various implementations of memory networks, ranking models and word vector models. One crucial point was that it could support fast forward propagation, backward propagation and parameter updates on indexed values via self.lastInput. Parallelization is taken care of outside the module itself via higher-level optimization techniques such as HogWILD, rather than depending on batch mode. Parameters are updated for each sample because it is faster this way due to sparsity.

The new implementation breaks this by setting self.legacyMode = false when the input is a dimension-2 tensor (i.e., not a batch input). This breaks many projects where using self.lastInput in non-batch mode was the intention for self:updateParameters (and perhaps other functions such as updateOutput and updateGradInput too -- I did not check). Can you improve the backward compatibility of this?
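The fast path being described can be sketched as follows -- a plain-Python illustration with hypothetical names (not the module's actual API) of a per-sample sparse parameter update that touches only the weight columns indexed by the last sparse input, which is what self.lastInput enables in non-batch mode:

```python
def sparse_update_params(weight, grad_weight, last_input_indices, lr):
    """Per-sample sparse SGD step for a linear layer.

    weight, grad_weight: out_dim x in_dim matrices (lists of rows).
    last_input_indices: columns that were nonzero in the last input;
    only those columns can carry gradient, so only they are updated.
    Cost is O(out_dim * nnz) instead of O(out_dim * in_dim), which is
    the whole point when in_dim is e.g. a vocabulary of 65536+ words.
    """
    for row, grad_row in zip(weight, grad_weight):
        for i in last_input_indices:
            row[i] -= lr * grad_row[i]
```

In a HogWILD-style setup, each worker thread would run this update independently per sample, relying on sparsity to keep collisions rare.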
|
@zhangxiangxiao can you give a quick test case so that I can hotfix the module rather than reverting all of it? Now that there's a CUDA version that I merged in yesterday, I'll have to revert a bunch of things.
|
@soumith This is a performance issue rather than a correctness issue. I can write a piece of code showing where this is necessary.
|
When input is of dimension 2, if we set self.legacyMode to true, that should fix the perf issues. That's doable... I don't see why not.
|
@soumith Yeah, that's what I thought too. I am not sure about anything except updateParameters, because I did not look at the other functions. Writing the test code...
|
The only functions that do not do sparse updates are updateParameters and zeroGradParameters, like you said. I did not realize the performance hit would be so big, but in addition to the hotfix I can also do a patch for doing sparse updates in general on lastInput. @soumith's suggestion would disable CUDA for single-batch updates.
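A sparse zeroGradParameters follows the same pattern -- a plain-Python sketch with hypothetical names, not the module's actual code -- zeroing only the gradient columns the last input touched instead of the whole buffer:

```python
def sparse_zero_grad(grad_weight, last_input_indices):
    """Zero only the gradient columns indexed by the last sparse
    input, instead of zeroing the full out_dim x in_dim buffer.

    Caveat (this is the correctness concern raised earlier in the
    thread): this is only valid if every backward pass since the
    previous zero wrote to these same recorded indices; otherwise
    stale gradient survives in untracked columns.
    """
    for grad_row in grad_weight:
        for i in last_input_indices:
            grad_row[i] = 0.0
```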
|
Okay, here is a piece of testing code:

```lua
local nn = require('nn')
local sys = require('sys')

local model = nn.SparseLinear(65536, 256)
local input = torch.rand(5, 2)
input:select(2, 1):mul(65536):ceil()

sys.tic()
local output = model:forward(input)
sys.toc(true)

local gradOutput = torch.rand(output:size())
sys.tic()
local gradInput = model:backward(input, gradOutput)
sys.toc(true)

sys.tic()
model:updateParameters(1e-3)
sys.toc(true)

sys.tic()
model:zeroGradParameters()
sys.toc(true)
```

And here is the output: it seems both updateParameters and zeroGradParameters were unnecessarily slow. I also changed legacyMode to true for when input:dim() == 2, but there is some bug somewhere preventing this from working.

Edit: I added the test for zeroGradParameters() as well. Same problem as with updateParameters().

P.S. The example above could be thought of as running a 5-gram sample on a word2vec model with 65536 words and an embedding dimension of 256. In practice the number of words could be much larger than 65536.
|
@ebetica Oops, yes, it is very important for zeroGradParameters to be fast too (as it is also called once every update). Updated the test case above to reflect this.
|
@zhangxiangxiao Is this level of sparsity a common use case? The CUDA module is not optimized for single-batch, extremely sparse inputs.
|
It is a common use case for natural language processing, where this module is mostly used. A single core on my machine can go over 5000 samples per second, and using a 10-thread HogWILD trainer my throughput is at the level of 50,000 samples per second. At this level of sparsity the CPU is much faster than the GPU. But I think the two use cases (batch vs. non-batch) are so distinct that it is fine to combine the two in this module. We can live with a slow GPU use case when the CPU is faster.
updateGradInput produces a dense gradInput, so I'm removing it to match LookupTable. We keep a resize for compatibility (with, for example, an nngraph.Identity() node on an input). Fixed the test so it no longer checks gradInput.
Edit: After some discussion I'm redoing SparseLinear:
The original batchnum x nnz x 2 format is still supported for now...
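For reference, in the legacy per-sample layout each sample is an nnz x 2 tensor of (index, value) pairs, and the batched layout stacks these as batchnum x nnz x 2. A rough plain-Python sketch (hypothetical helper, 1-based indices as in Lua/Torch; an illustration of the convention, not the module's code) of the forward pass over one such sample:

```python
def forward_sparse(weight, bias, sparse_input):
    """Forward pass for one sample in the nnz x 2 format, where
    each row of sparse_input is an (index, value) pair with
    1-based indices, as in Lua/Torch.

    weight: out_dim x in_dim matrix (list of rows); bias: out_dim.
    Cost is O(out_dim * nnz) rather than O(out_dim * in_dim).
    """
    out = list(bias)
    for idx, val in sparse_input:
        col = int(idx) - 1  # 1-based Torch index -> 0-based
        for o in range(len(weight)):
            out[o] += weight[o][col] * val
    return out
```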