        nnz,
        &pBufferSize
    );
    cudaMalloc((void**)&pBuffer, pBufferSize);
Could you avoid cudaMalloc and cudaFree here and instead preallocate a buffer that is passed in? cudaFree causes a device synchronization, which stops us from overlapping work across multiple GPUs.
The buffer size is not known ahead of time... I'm not sure how I would preallocate it. Would using a THCudaStorage work?
I think, since this is part of an nn layer, you can initialize the buffer in nn, keep it around, and pass it in. You can call THCudaTensor_resize(), which only reallocates if a bigger buffer is needed.
Similar lines of code:
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L51
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L109
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L180
and follow the "columns" variable in here:
https://github.com/torch/cunn/blob/master/lib/THCUNN/SpatialConvolutionMM.cu
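The grow-only workspace pattern suggested above can be sketched on the host side. The names here (`Workspace`, `workspace_ensure`) are hypothetical, standing in for a THCudaTensor that the nn layer keeps around and passes into the C function; the point is that shrinking requests never reallocate, so on the GPU there is never a free-then-malloc (and hence no device synchronization) on the steady-state path:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical grow-only workspace, mirroring how an nn layer would keep a
 * THCudaTensor alive across calls and resize it only when a larger buffer
 * is actually needed. */
typedef struct {
    void  *data;
    size_t capacity;  /* bytes currently allocated */
} Workspace;

/* Ensure the workspace holds at least `bytes`. Grow-only: requests that fit
 * in the current allocation return the existing pointer untouched, so no
 * free (and, on the device, no implicit synchronization) ever happens once
 * the buffer has reached its high-water mark. */
static void *workspace_ensure(Workspace *w, size_t bytes) {
    if (bytes > w->capacity) {
        w->data = realloc(w->data, bytes);
        w->capacity = bytes;
    }
    return w->data;
}
```

This matches the `columns` buffer in SpatialConvolutionMM.cu: the Lua side owns the tensor and every forward call resizes it to the size it needs, which is a no-op when the buffer is already big enough.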
(force-pushed 53e46ee to 37e29f6)
Should be fixed; you merged the addition in nn half an hour ago, Soumith. Thanks!
(force-pushed 3239d64 to 1a09fac)
This is now updated with the batch version of sparse linear given in this commit.
    csr_int = THCudaIntTensor_newWithSize1d(state, batchnum+1);
    init_cusparse();
    for (h = 0; h < batchnum+1; h++) {
        THCudaIntTensor_set1d(state, csr_int, h, 1 + nnz * h);
make this for loop a simple-stupid CUDA kernel.
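For reference, each thread of such a kernel would write exactly one row-pointer entry. Below is a host-side sketch of the per-element computation (the function name `fill_csr_rowptr` is illustrative, not from the PR); the `1 +` offset matches one-based CSR indexing, and each of the `batchnum` rows holds exactly `nnz` nonzeros:

```c
#include <assert.h>

/* Host-side sketch of what a one-thread-per-element CUDA kernel would
 * compute: csr[i] = 1 + nnz * i, i.e. one-based CSR row pointers for a
 * matrix whose every row has exactly nnz nonzeros. On the GPU the loop
 * body becomes the kernel, with i = blockIdx.x * blockDim.x + threadIdx.x
 * guarded by i <= batchnum. */
static void fill_csr_rowptr(int *csr, int batchnum, int nnz) {
    for (int i = 0; i < batchnum + 1; i++)
        csr[i] = 1 + nnz * i;
}
```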
This has been updated to work with the PR at torch/nn#698.
        thrust::copy(ptr, ptr + THCudaTensor_nElement(state, tensor), std::ostream_iterator<float>(std::cout, "\t"));
        printf("\n");
    }
    void printCuda(THCState *state, THCudaIntTensor *tensor, char* str) {
This function seems to be declared twice here.
Fixed nits.
Thanks, Zeming!
Adding SparseLinear with CUDA. Most of the functions are converted directly from SparseLinear.c. Depending on how well the THCudaBlas operations are pipelined, it may be more efficient to write custom kernels for most of them. updateOutput uses cusparse.
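For context, the cusparse call in updateOutput is essentially a sparse-times-dense matrix product over CSR data. A minimal CPU sketch of a CSR matrix-vector product shows the access pattern being delegated to cusparse (zero-based indexing here for simplicity; `csr_matvec` and all names are illustrative, not from the PR):

```c
#include <assert.h>

/* Minimal CSR sparse matrix-vector product, y = A * x: the CPU analogue of
 * the cusparse routine that updateOutput delegates to. rowptr has rows+1
 * entries; row r's nonzeros live in vals[rowptr[r] .. rowptr[r+1]-1], with
 * their column indices in colind at the same positions. */
static void csr_matvec(int rows, const int *rowptr, const int *colind,
                       const float *vals, const float *x, float *y) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int j = rowptr[r]; j < rowptr[r + 1]; j++)
            acc += vals[j] * x[colind[j]];
        y[r] = acc;
    }
}
```

On the GPU the outer loop parallelizes naturally (one row per thread in the simplest scheme), which is why cusparse is a reasonable first choice before hand-writing kernels.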