Fix RNG segfault related to #297 #336

Merged
shelhamer merged 3 commits into BVLC:dev from jeffdonahue:fix-rng-segfault
Apr 19, 2014

Conversation

@jeffdonahue
Contributor

This PR removes the caffe_set_rng behavior, which was causing a segfault (#335). Instead, the variate generator is now constructed with a reference to Caffe's RNG engine (something I found out was possible after hours of googling).
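To illustrate why holding a reference matters, here is a minimal sketch using only the standard library. It is not the code from this PR and not Boost's class: `VariateGenerator`, `shared_engine_advances`, `copied_engine_advances`, and `demo_engine_sharing` are hypothetical names, and the seed is arbitrary. The assumption is that a generator templated on a reference engine type draws from the caller's engine, while one templated on a value type draws from a private copy.

```cpp
#include <cassert>
#include <random>

// Hypothetical stand-in for a variate generator (not Boost's class):
// it pairs an engine with a distribution. When Engine is a reference
// type such as std::mt19937&, engine_ aliases the caller's engine;
// when Engine is a value type, engine_ is a private copy.
template <typename Engine, typename Distribution>
class VariateGenerator {
 public:
  VariateGenerator(Engine engine, Distribution dist)
      : engine_(engine), dist_(dist) {}

  // Draw one sample, advancing whichever engine engine_ refers to.
  typename Distribution::result_type operator()() { return dist_(engine_); }

 private:
  Engine engine_;
  Distribution dist_;
};

typedef std::uniform_real_distribution<double> Uniform;

// True if drawing through a by-reference generator advances the shared engine.
bool shared_engine_advances() {
  std::mt19937 shared(1701);
  const std::mt19937 before = shared;
  VariateGenerator<std::mt19937&, Uniform> gen(shared, Uniform(0.0, 1.0));
  gen();
  return shared != before;
}

// True if drawing through a by-value generator advances the caller's engine
// (it should not: the generator works on its own copy).
bool copied_engine_advances() {
  std::mt19937 original(1701);
  const std::mt19937 before = original;
  VariateGenerator<std::mt19937, Uniform> gen(original, Uniform(0.0, 1.0));
  gen();
  return original != before;
}

// Checks both behaviours side by side.
void demo_engine_sharing() {
  assert(shared_engine_advances());
  assert(!copied_engine_advances());
}
```

With the reference form, every generator built from the same engine consumes one shared random stream, so there is no need to copy engine state into and back out of each generator.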

For some reason, I still get a segfault about 10% of the time I try to run ImageNet training, but this time it happens when filling the biases in conv1 (specifically when SyncedMemory calls malloc to initialize them)...

I0418 10:50:52.465656 16095 net.cpp:75] Creating Layer conv1
I0418 10:50:52.465667 16095 net.cpp:85] conv1 <- data
I0418 10:50:52.465687 16095 net.cpp:111] conv1 -> conv1
[New Thread 0x7fffe2db1700 (LWP 16097)]

Program received signal SIGSEGV, Segmentation fault.
0x00007fffef7ee5ae in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffef7ee5ae in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffef7f0f95 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000461c72 in caffe::CaffeMallocHost (ptr=0x9791a0, size=384) at ./include/caffe/syncedmem.hpp:27
#3  0x0000000000461ce7 in caffe::SyncedMemory::to_cpu (this=0x9791a0) at src/caffe/syncedmem.cpp:25
#4  0x0000000000461bb8 in caffe::SyncedMemory::mutable_cpu_data (this=0x9791a0) at src/caffe/syncedmem.cpp:73
#5  0x000000000045f058 in caffe::Blob<float>::mutable_cpu_data (this=0xcfa480) at src/caffe/blob.cpp:67
#6  0x00000000004cc167 in caffe::ConstantFiller<float>::Fill (this=0x936790, blob=0xcfa480)
    at ./include/caffe/filler.hpp:37
#7  0x00000000004d2614 in caffe::ConvolutionLayer<float>::SetUp (this=0x936e60, bottom=..., top=0x8fbb38)
    at src/caffe/layers/conv_layer.cpp:65
#8  0x000000000046419f in caffe::Net<float>::Init (this=0x8fd300, in_param=...) at src/caffe/net.cpp:124
#9  0x0000000000462dc1 in caffe::Net<float>::Net (this=0x8fd300, param_file=...) at src/caffe/net.cpp:31
#10 0x0000000000454c01 in caffe::Solver<float>::Init (this=0x7fffffffda80, param=...) at src/caffe/solver.cpp:39
#11 0x00000000004549e0 in caffe::Solver<float>::Solver (this=0x7fffffffda80, param=...) at src/caffe/solver.cpp:23
#12 0x0000000000415482 in caffe::SGDSolver<float>::SGDSolver (this=0x7fffffffda80, param=...)
    at ./include/caffe/solver.hpp:57
#13 0x0000000000415026 in main (argc=2, argv=0x7fffffffdce8) at tools/train_net.cpp:27

I'm not exactly sure what to make of this, but it seems it must be unrelated to the RNG, since the ConstantFiller does not use it.

@shelhamer
Member

Boost could have documented much better that a reference can be passed. Thanks for hunting this down: besides working, the code is simpler this way.

I'll merge once the tests finish running and I've done a couple of train_net runs.

shelhamer added a commit that referenced this pull request Apr 19, 2014
@shelhamer shelhamer merged commit f8c751f into BVLC:dev Apr 19, 2014
@shelhamer
Member

Although this is only a partial fix, it's better to merge this step in the right direction and then address the newly-revealed bias crash by further debugging and a follow-up PR. This seems to be a stranger crash than we originally thought, but the RNG code is better off with this change anyway.

@shelhamer shelhamer mentioned this pull request Apr 21, 2014
@jeffdonahue jeffdonahue deleted the fix-rng-segfault branch April 21, 2014 23:44
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014