Description
I’m going crazy trying to track down this bug. I’ve been trying to find something to go on, but I don’t have much. I can’t find any error messages anywhere.
I’ve been noticing that the DIGITS “production” server (digits-server) crashes sometimes, but the “development” server (digits-devserver) never does.
Caffe v0.12.2 is fine. So is v0.13.0. But v0.13.1 and v0.13.2 crash the production server. There’s something in these changes that doesn’t play nice with the production server.
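| Version | cuDNN | CNMeM | digits-devserver | digits-server | Crash time | Last message |
|---------|-------|-------|------------------|---------------|------------|--------------|
| 0.12.2 | v2 | | OK | OK | | |
| 0.13.0 | v3 | | OK | OK | | |
| 0.13.0 | | | OK | OK | | |
| 0.13.1 | v3 | 1.0.0 | OK | CRASH | net.forward() | None |
| 0.13.1 | v3 | | OK | CRASH | caffe.Net() | cudnn_conv_layer.cpp:256] Reallocating workspace storage: 100 |
| 0.13.1 | | 1.0.0 | OK | CRASH | net.forward() | None |
| 0.13.1 | | | OK | CRASH | net.forward() | None |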
The production server uses gunicorn for the webserver framework, and the development server uses Flask. I’m looking into what the differences could be (path setup, memory usage, environment variables, etc.) but I haven’t come up with anything so far. Any ideas about what I should look for?
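One way to compare the two servers systematically would be to dump the process environment from each one and diff the results. The snippet below is only a minimal sketch of that idea: the function names (`snapshot_environment`, `diff_snapshots`, `save_snapshot`) are hypothetical helpers, not part of DIGITS, Caffe, or gunicorn, and it assumes the snapshot can be taken from inside each server process just before the Caffe call.

```python
import json
import os
import sys


def snapshot_environment():
    """Collect process details that commonly differ between two server setups."""
    return {
        "executable": sys.executable,
        "cwd": os.getcwd(),
        "sys_path": list(sys.path),
        "environ": dict(os.environ),
    }


def diff_snapshots(a, b):
    """Return {key: (value_in_a, value_in_b)} for every entry that differs."""
    differences = {}
    for section in ("executable", "cwd", "sys_path"):
        if a[section] != b[section]:
            differences[section] = (a[section], b[section])
    env_a, env_b = a["environ"], b["environ"]
    for name in sorted(set(env_a) | set(env_b)):
        if env_a.get(name) != env_b.get(name):
            differences["environ." + name] = (env_a.get(name), env_b.get(name))
    return differences


def save_snapshot(path):
    """Write the current process snapshot to a JSON file (hypothetical helper)."""
    with open(path, "w") as f:
        json.dump(snapshot_environment(), f, indent=2, sort_keys=True)
```

Calling `save_snapshot()` once under each server and feeding the two loaded JSON files to `diff_snapshots()` should surface differences in things like `LD_LIBRARY_PATH`, `PYTHONPATH`, or the working directory. It would not catch differences in process model (forking, threads), which would need a separate check.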
Things I’ve investigated:
- Make vs. CMake
  - Makes no difference (no pun intended)
- Out-of-memory
  - I’m using LeNet on a 6GB card. Should be no problem.
  - Plus, there are no out-of-memory errors
- Timeout
  - Nope. When it works, this finishes in ~0.002 seconds. And when it fails, it fails pretty much instantly, too.
/cc @slayton @drnikolaev