Implement SpatialPyramidPoolingLayer with the Split, Pooling, Flatten & Concat layers #560
Conversation
|
I appreciate how quickly this contribution has appeared, but this should almost certainly be done by composition and not copy-paste. For example, consider the within-channel LRN https://github.com/BVLC/caffe/blob/master/src/caffe/layers/lrn_layer.cpp#L31 |
|
Tests passed but the gradient checks were slow. |
|
@bhack, would you like to review if the implementation is consistent with the algorithm described in the section 2.3 of the SPP-net paper? |
|
@kloudkl I hope that I can do it this evening or tomorrow. |
|
(Sorry posted this before seeing recent changes, please ignore my previous post - deleted) |
|
@kloudkl I've not compiled the code to try it in depth, but it seems that you have simply handled the concatenation of the pooling layers' outputs and the accumulation of the loss. |
|
I don't think multi-size training is blocked by the transformation layers. In the paper, the authors simulated multi-size training with multiple fixed-size networks. As the output vectors of the conv5 layers are pooled into fixed-length features by the SpatialPyramidPooling layer, the networks of different sizes can share the same fully-connected layers as their last layers. I prefer to follow the path of @moskewcz's #308 DenseNet feature pyramid computation, but their code seems too heavyweight to integrate with the SPP. More likely, I will implement the Caffe version of Torch7's PyramidPacker and PyramidUnpacker to extract features for multiple scales of an image as discussed in #189. |
|
@kloudkl Right |
While this is written in Kaiming's paper, I guess there will be some problems with this pooling approach. For example, if image_side_length == 17 and spatial_bin == 6, then you have kernel_size == 3 and stride == 2, so you actually get 8×8 bins instead of 6×6 bins. @kloudkl Could you tell me whether I am right?
Hi, @kloudkl I emailed Dr. Kaiming He for details, and he told me that this is how they perform spatial pyramid pooling:
Denote the width and height of the conv5 feature maps (which can be the full image or a window) as w and h. For a pyramid level with n×n bins, the (i,j)-th bin spans the range [floor((i-1)*w/n), ceil(i*w/n)] × [floor((j-1)*h/n), ceil(j*h/n)].
I copied this PR and am currently trying to implement a PyramidLevelLayer that performs this pooling behavior, based on the rectangular pooling in #614.
Yes, you are right. I realized the problem when I wrote the test cases.
Thank you for contacting the authors for clarification!
And I think the range above includes the left border but excludes the right border, i.e. [0, 3] contains 0, 1, 2 but not 3.
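To make the 17 / 6 example above concrete, here is a minimal standalone sketch (not part of the PR code) that contrasts the fixed kernel/stride grid with the per-bin floor/ceil ranges from Dr. He's clarification, using the half-open interpretation noted in the previous comment:

```cpp
// Sketch only: compares the fixed kernel/stride pooling grid with the
// per-bin formula above. Assumes a square feature map of side `a` and a
// pyramid level with n x n bins; bin ranges are treated as half-open,
// i.e. [start, end).
#include <cmath>
#include <cstdio>

int main() {
  const int a = 17;  // feature map side length (example from the discussion)
  const int n = 6;   // number of bins along each side

  // Fixed scheme: kernel = ceil(a/n), stride = floor(a/n).
  const int kernel = static_cast<int>(std::ceil(static_cast<double>(a) / n));
  const int stride = a / n;  // integer division == floor for positive values
  const int pooled = (a - kernel) / stride + 1;
  std::printf("fixed kernel=%d stride=%d -> %dx%d bins (wanted %dx%d)\n",
              kernel, stride, pooled, pooled, n, n);

  // Per-bin ranges from the authors' clarification (1-based bin index i):
  // [floor((i-1)*a/n), ceil(i*a/n)) along each axis -> exactly n bins.
  for (int i = 1; i <= n; ++i) {
    const int start = static_cast<int>(std::floor((i - 1.0) * a / n));
    const int end = static_cast<int>(std::ceil(1.0 * i * a / n));
    std::printf("bin %d: [%d, %d)\n", i, start, end);
  }
  return 0;
}
```

For a = 17, n = 6 the fixed scheme reports 8×8 bins, while the per-bin ranges cover [0, 17) with exactly six (overlapping) bins per axis, matching the discussion above.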
|
To be more faithful to the implementation of the authors of the SPP network paper, the pooling layer is extended to support floating point height and width of the kernel and stride. The 36 test cases of the pooling layer all pass. The spatial pyramid pooling layer is also tested on both the CPU and the GPU. |
|
Classification accuracy on the VOC 2012 dataset:
The spatial pyramid pooling layer consists of four pyramid levels, each of which splits the images evenly into 1, 2, 3, and 6 patches along both the vertical and horizontal directions. |
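As a side note, here is a small sketch of the fixed output length this {1, 2, 3, 6} pyramid produces, assuming max pooling and 256 conv5 feature maps as in the reference ImageNet model (an assumption, not taken from this PR); this fixed length is what allows the fully connected layers to be shared across input sizes, as mentioned earlier in the thread:

```cpp
// Sketch: the output length of a {1, 2, 3, 6} pyramid is fixed regardless of
// the input size. Assumes n x n bins per level and `channels` conv5 maps.
#include <cstdio>

int main() {
  const int levels[] = {1, 2, 3, 6};
  const int channels = 256;  // assumed: conv5 of the reference ImageNet model
  int bins = 0;
  for (int n : levels) {
    bins += n * n;  // each level contributes n*n bins per channel
  }
  std::printf("%d bins per channel, %d-dimensional SPP output\n",
              bins, bins * channels);  // 50 bins, 12800 dimensions
  return 0;
}
```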
|
The SPP-net performed worse because the fully connected layer after the last convolution layer has larger dimensions than in the reference ImageNet model. Its parameters were randomly initialized, which caused over-fitting on the relatively small VOC 2012 dataset. If it were first fine-tuned on a much larger dataset, its performance would certainly be superior, as described in the paper. |
Force-pushed from 4278286 to c01f07a
I am wondering: the VOC 2012 classification task has multiple labels per image, so how should the leveldb be prepared?
|
What's going on with this? Can I help? |
|
This algorithm involves some very complicated corner cases. For example, a candidate region in the original image may be mapped into a very small region with the width or height equal to or smaller than 1 pixel. It's very hard to detect objects whose sizes are small relative to the image. GoogLeNet combined with RCNN is a much more robust but much slower solution. In practice, you may find the object detectors included in the latest OpenCV quite handy for most use cases if you are required to quickly complete a project. |
|
@kloudkl I'm interested in helping. Maybe we can chat about what is holding up this PR. How can we do that? |
|
Closing since this PR is abandoned and the code is non-compositional. This is better achieved through layer composition. There is an expected replacement: spatial pyramid pooling has been given to a student as Caffe practice. |
|
@shelhamer Can you please update us on the current status of this? |
|
See #2177 for spatial pyramid pooling. |
The spatial pyramid pooling layer [1] mentioned in #548 is the combination of the existing PoolingLayer and ConcatLayer. It automatically computes the sliding window sizes and strides for the multiple pyramid levels, applies the PoolingLayer on each level, and finally concatenates the outputs of all the levels into fixed-size vectors to feed into classifiers or fully connected layers.
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The 13th European Conference on Computer Vision (ECCV), 2014
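For illustration, a minimal sketch of the per-level window and stride computation described above, assuming the simple kernel = ceil(side/n), stride = floor(side/n) scheme from the paper (see the review discussion earlier in the thread for its caveat on certain feature map sizes); the 13×13 conv5 size is an assumed example, not a value taken from this PR:

```cpp
// Sketch of the per-level kernel/stride computation for an SPP pyramid.
// Each level with n x n bins pools the h x w conv5 maps with
// kernel = ceil(side / n) and stride = floor(side / n) along each axis.
#include <cmath>
#include <cstdio>

int main() {
  const int levels[] = {1, 2, 3, 6};
  const int height = 13, width = 13;  // assumed conv5 size for a 224x224 input
  for (int n : levels) {
    const int kernel_h = static_cast<int>(std::ceil(1.0 * height / n));
    const int kernel_w = static_cast<int>(std::ceil(1.0 * width / n));
    const int stride_h = height / n;  // integer division == floor
    const int stride_w = width / n;
    std::printf("level %dx%d: kernel %dx%d, stride %dx%d\n",
                n, n, kernel_h, kernel_w, stride_h, stride_w);
  }
  return 0;
}
```

With a 13×13 map this scheme yields exactly 1×1, 2×2, 3×3, and 6×6 pooled outputs per level, whose flattened results the ConcatLayer joins into the fixed-size vector described above.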