When Caffe is run on multiple GPUs (>2), softmax loss becomes 87.3365. #4435

@tianzhi0549

Hello,

When I ran the MNIST demo shipped with official Caffe on 4 K40 GPUs, the softmax loss became 87.3365 after a few hundred iterations. With one or two GPUs, training produces the correct result. I am using the latest Caffe (f28f5ae).

Has anyone encountered the same problem?
Thanks.

The following is the log from the run with 4 GPUs:

I0709 10:34:20.722365 16617 solver.cpp:337] Iteration 0, Testing net (#0)
I0709 10:34:24.325662 16617 solver.cpp:404]     Test net output #0: accuracy = 0.0761
I0709 10:34:24.325714 16617 solver.cpp:404]     Test net output #1: loss = 2.34995 (* 1 = 2.34995 loss)
I0709 10:34:24.384223 16617 solver.cpp:228] Iteration 0, loss = 2.2995
I0709 10:34:24.384263 16617 solver.cpp:244]     Train net output #0: loss = 2.2995 (* 1 = 2.2995 loss)
I0709 10:34:24.384330 16617 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0709 10:34:29.217810 16617 solver.cpp:228] Iteration 100, loss = 0.548973
I0709 10:34:29.217882 16617 solver.cpp:244]     Train net output #0: loss = 0.548973 (* 1 = 0.548973 loss)
I0709 10:34:29.242552 16617 sgd_solver.cpp:106] Iteration 100, lr = 0.00992565
I0709 10:34:34.147747 16617 solver.cpp:228] Iteration 200, loss = 0.49398
I0709 10:34:34.147788 16617 solver.cpp:244]     Train net output #0: loss = 0.49398 (* 1 = 0.49398 loss)
I0709 10:34:34.161823 16617 sgd_solver.cpp:106] Iteration 200, lr = 0.00985258
I0709 10:34:39.072245 16617 solver.cpp:228] Iteration 300, loss = 0.829506
I0709 10:34:39.072304 16617 solver.cpp:244]     Train net output #0: loss = 0.829506 (* 1 = 0.829506 loss)
I0709 10:34:39.072336 16617 sgd_solver.cpp:106] Iteration 300, lr = 0.00978075
I0709 10:34:43.944337 16617 solver.cpp:228] Iteration 400, loss = 0.194765
I0709 10:34:43.944378 16617 solver.cpp:244]     Train net output #0: loss = 0.194765 (* 1 = 0.194765 loss)
I0709 10:34:43.944406 16617 sgd_solver.cpp:106] Iteration 400, lr = 0.00971013
I0709 10:34:48.838815 16617 solver.cpp:337] Iteration 500, Testing net (#0)
I0709 10:34:51.885664 16617 solver.cpp:404]     Test net output #0: accuracy = 0.1009
I0709 10:34:51.885836 16617 solver.cpp:404]     Test net output #1: loss = 87.3365 (* 1 = 87.3365 loss)
I0709 10:34:51.895921 16617 solver.cpp:228] Iteration 500, loss = 87.3365
I0709 10:34:51.895948 16617 solver.cpp:244]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
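
A note on the specific value in the log: 87.3365 matches -log(FLT_MIN), where FLT_MIN is the smallest positive normal single-precision float (2^-126, about 1.17549e-38). To my understanding, Caffe's softmax loss clamps the predicted probability at FLT_MIN before taking the log, so this number would appear when the probability assigned to the true class is zero (or below FLT_MIN) for every sample, which usually points to diverged or corrupted weights rather than a meaningful loss. The sketch below only reproduces the arithmetic; the function name `clamped_log_loss` is made up for illustration and is not Caffe's API.

```python
import math

# Smallest positive normal float32 value (FLT_MIN in C's <float.h>).
FLT_MIN = 2.0 ** -126


def clamped_log_loss(prob):
    """Negative log-likelihood with the probability clamped at FLT_MIN.

    Hypothetical helper: it mirrors the clamping that Caffe's softmax
    loss is assumed to apply, so log(0) never produces infinity.
    """
    return -math.log(max(prob, FLT_MIN))


# A true-class probability of zero gives the value seen in the log.
saturated_loss = clamped_log_loss(0.0)
print(f"{saturated_loss:.4f}")  # 87.3365

# A healthy prediction gives a small loss instead.
healthy_loss = clamped_log_loss(0.9)
print(f"{healthy_loss:.4f}")
```

If this reading is right, the test accuracy of 0.1009 at iteration 500 (roughly chance for 10 classes) is consistent with it: the network has stopped producing useful predictions, and the loss is pinned at its clamped maximum.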
