Fixes 2 issues involving races and run-to-run inconsistencies. #3
thatguymike wants to merge 10 commits into cypof:multi_gpu from
Conversation
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
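A minimal sketch of the interrupt-before-join pattern described above, assuming boost::thread (which Caffe's internal thread wraps); the class and method names here are illustrative, not the PR's exact code:

```cpp
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <glog/logging.h>

class LoopingThread {
 public:
  virtual ~LoopingThread() {}

  void Start() {
    CHECK(!thread_) << "Thread already started";
    thread_.reset(new boost::thread(&LoopingThread::Entry, this));
  }

  void Stop() {
    CHECK(thread_) << "Thread was never started";
    thread_->interrupt();  // ask the loop to exit before waiting on join
    try {
      thread_->join();
    } catch (boost::thread_interrupted&) {
      // join() is an interruption point; ignore interruption of the caller.
    }
  }

 protected:
  // Looping bodies poll this to exit on demand.
  bool must_stop() { return thread_ && thread_->interruption_requested(); }

  virtual void Entry() {
    while (!must_stop()) {
      // do one unit of work; blocking Boost calls act as interruption points
    }
  }

 private:
  boost::shared_ptr<boost::thread> thread_;
};
```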
- Uses a blocking queue to transfer data to data_layer
- Asynchronously pushes data to the GPU using a stream
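A rough sketch of the blocking hand-off between the prefetch thread and the data layer, assuming Boost threading primitives; the branch's actual class likely has more to it (try_pop, peeking, sizing):

```cpp
#include <queue>
#include <boost/thread.hpp>

template <typename T>
class BlockingQueue {
 public:
  void push(const T& t) {
    boost::mutex::scoped_lock lock(mutex_);
    queue_.push(t);
    lock.unlock();
    condition_.notify_one();
  }

  // Blocks until an element is available. condition_.wait() is a Boost
  // interruption point, so a stopping prefetch thread can exit while waiting.
  T pop() {
    boost::mutex::scoped_lock lock(mutex_);
    while (queue_.empty()) {
      condition_.wait(lock);
    }
    T t = queue_.front();
    queue_.pop();
    return t;
  }

 private:
  std::queue<T> queue_;
  boost::mutex mutex_;
  boost::condition_variable condition_;
};
```

The asynchronous push to the GPU would then be a cudaMemcpyAsync of the popped batch on a dedicated stream, overlapped with reading the next batch.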
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in a round-robin way for determinism
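An illustrative distributor loop for the round-robin/determinism points above, reusing the BlockingQueue sketch; Batch and the function name are placeholders, not the branch's actual types:

```cpp
#include <cstddef>
#include <vector>

struct Batch;  // placeholder for a prefetched host-memory batch

// One reader thread scans the DB sequentially and fills `prefetched`; this
// loop deals batches to the per-solver queues in a fixed order, so each
// solver always sees the same subset of the data from run to run.
void DistributeLoop(BlockingQueue<Batch*>* prefetched,
                    const std::vector<BlockingQueue<Batch*>*>& solver_queues) {
  for (std::size_t i = 0; ; ++i) {
    Batch* batch = prefetched->pop();  // blocks until the reader has data
    solver_queues[i % solver_queues.size()]->push(batch);
  }
}
```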
- Split batches between GPUs, and tree-reduce the gradients
- Detects machine topology (twin-GPU boards, P2P connectivity)
- Inserts a callback in the solver for minimal code change
- Changed caffe.cpp to use all available GPUs by default
- Deterministic architecture for reproducible runs
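For the topology-detection bullet, a hedged standalone probe built on the stock CUDA runtime call cudaDeviceCanAccessPeer; the branch's detection (twin-GPU boards, etc.) is more involved, this only shows the primitive it can build on:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  // Check every ordered pair of devices for peer-to-peer access.
  for (int a = 0; a < count; ++a) {
    for (int b = 0; b < count; ++b) {
      if (a == b) continue;
      int access = 0;
      cudaDeviceCanAccessPeer(&access, a, b);
      std::printf("GPU %d -> GPU %d: P2P %s\n", a, b,
                  access ? "available" : "unavailable");
    }
  }
  return 0;
}
```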
First is a barrier to prevent the parent from smashing weights while weight updates are proceeding in the children. Second is to fix random number initialization on secondary threads to use the device number.
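A minimal sketch of the two fixes, assuming boost::barrier and Caffe's existing Caffe::SetDevice / Caffe::set_random_seed calls; the thread entry function and seed-offset scheme are illustrative, not the exact diff:

```cpp
#include <boost/thread/barrier.hpp>
#include "caffe/common.hpp"

static const unsigned int kSolverThreads = 2;  // parent + children that synchronize
static boost::barrier update_barrier(kSolverThreads);

void SolverThreadEntry(int device, unsigned int base_seed) {
  caffe::Caffe::SetDevice(device);
  // Fix #2: offset the seed by the device number so each solver thread
  // draws a different random sequence.
  caffe::Caffe::set_random_seed(base_seed + device);

  for (;;) {  // one training iteration per pass
    // ... forward, backward, apply the weight update ...

    // Fix #1: the parent waits on the same barrier before pushing new
    // weights, so it cannot smash them while children are still updating.
    update_barrier.wait();
  }
}
```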
Thanks Mike! Looks good, but I am worried about the fix in solver.cpp. All solvers might run with the same random sequence. The first fix in parallel.cpp should be enough; it sets the seed for all non-root solvers. I am still not sure why the random seed doesn't get initialized correctly during thread creation in internal_thread, but I don't know how to debug that right now.
I am also not in love with that change, but it does fix the reproducibility issues. The first fix in parallel.cpp is actually not enough; it took me a long time to discover that.
One option is to add solver state that separates "child" solvers from root solvers and add an || Caffe::child_solver() check. Another is to modulate by the device number like we did in parallel.cpp, but that is not clean in the CPU-only case.
Okay, after more testing, at least on 2 GPUs, the change to solver.cpp is not needed. Revalidating on other configs.
OK, good news. There might be more magic somewhere, but I need to understand what is going on. Looking at your fix, I think I am starting to see why the barrier is necessary. When I split the work into multiple PRs, I put the solver refactoring in BVLC#2397. I forgot that one of the things it does is move the net_->Update() call into SGDSolver, not Solver, so it is only called for the root, not the workers. I am not sure yet, but that could finally explain why I don't have those instability issues. If you have time to try with BVLC#2397, please do; I will do some more testing with both cases. I also should probably reverse the map and reduce phases: first send all the weights, then run SGD, then reduce the gradients. It should be equivalent, but I realized I would have to do that for the distributed version anyway, as the weights are not equal on each box initially.
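For reference, the behaviour described reduces to a guard like the one below; Caffe::root_solver() is assumed here as the root/worker test, and the actual check in BVLC#2397 may differ:

```cpp
#include "caffe/common.hpp"
#include "caffe/net.hpp"

template <typename Dtype>
void ApplyUpdateSketch(caffe::Net<Dtype>* net) {
  // Only the root solver writes the shared weights; worker solvers just
  // compute and hand off gradients. (Sketch, not the PR's code.)
  if (caffe::Caffe::root_solver()) {
    net->Update();
  }
}
```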
I'll give that a go in the morning. Yes, the race was on the parent nodes updating the weights of the children while the children were doing their own updates. As for the map and reduce phases, there are several options for how the gradients and weights are exchanged, each with different performance tradeoffs. If you are doing SGD, in theory you only have to send the gradients, as long as everyone adds them together in exactly the same order; then you can use bidirectional swapping. I have some of those changes in flight in my branches, but I wanted to get your PR "clean" so people can start. Perf is good on 2 GPUs, and all of the optimizations I'm working on give only a small improvement in the 2-GPU case; they are really targeting the 4+ GPU scaling case.
FWIW, I just pulled and ran AlexNet with batch size 100 on two GPUs. Here's the memory usage as reported by nvidia-smi:
No longer needed against "parallel" branch. |