Version >=0.13.1 doesn't work with digits-server #38

@lukeyeager

I’m going crazy trying to track down this bug. I’ve been trying to find something to go on, but I don’t have much. I can’t find any error messages anywhere.

I’ve been noticing that the DIGITS “production” server (digits-server) crashes sometimes, but the “development” server (digits-devserver) never does.

Caffe v0.12.2 is fine. So is v0.13.0. But v0.13.1 and v0.13.2 crash the production server. There’s something in these changes that doesn’t play nice with the production server.

Version  cuDNN  CNMeM  digits-devserver  digits-server  Crash time     Last message
-------  -----  -----  ----------------  -------------  -------------  ------------
0.12.2   v2     -      OK                OK             -              -
0.13.0   v3     -      OK                OK             -              -
0.13.0   -      -      OK                OK             -              -
0.13.1   v3     1.0.0  OK                CRASH          net.forward()  None
0.13.1   v3     -      OK                CRASH          caffe.Net()    cudnn_conv_layer.cpp:256] Reallocating workspace storage: 100
0.13.1   -      1.0.0  OK                CRASH          net.forward()  None
0.13.1   -      -      OK                CRASH          net.forward()  None

The production server uses gunicorn for the webserver framework, and the development server uses Flask. I’m looking into what the differences could be (path setup, memory usage, environment variables, etc.) but I haven’t come up with anything so far. Any ideas about what I should look for?
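
One way to compare path setup and environment variables between the two servers is to dump a snapshot from inside each process and diff the results. The following is only a rough sketch, assuming Python 3 and the standard library; the helper names (`snapshot_environment`, `save_snapshot`, `diff_snapshots`) are hypothetical and not part of DIGITS or Caffe.

```python
import json
import os
import sys


def snapshot_environment():
    """Collect process settings that might differ between the two servers."""
    return {
        "executable": sys.executable,
        "cwd": os.getcwd(),
        "sys_path": list(sys.path),
        "environ": dict(os.environ),
    }


def save_snapshot(path):
    """Write the current process snapshot to a JSON file (hypothetical helper)."""
    with open(path, "w") as handle:
        json.dump(snapshot_environment(), handle, indent=2, sort_keys=True)


def _flatten(snapshot):
    """Turn nested dicts into flat keys such as 'environ.LD_LIBRARY_PATH'."""
    flat = {}
    for key, value in snapshot.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def diff_snapshots(first, second):
    """Return {key: (value_in_first, value_in_second)} for every differing entry."""
    a, b = _flatten(first), _flatten(second)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

In practice, `save_snapshot()` could be called in each server right before the `caffe.Net()` call, and the two JSON files loaded with `json.load` and passed to `diff_snapshots`. Any entry that shows up only under digits-server (library paths, GPU-related variables, working directory) would be a candidate to investigate.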

Things I’ve investigated:

  • Make vs. CMake
    • makes no difference (no pun intended)
  • Out-of-memory
    • I’m using LeNet on a 6GB card. Should be no problem.
    • Plus, there are no out-of-memory errors.
  • Timeout
    • Nope. When it works, this finishes in ~0.002 seconds. And when it fails, it fails pretty much instantly, too.

/cc @slayton @drnikolaev
