Fixes 2 issues involving races and run-to-run inconsistencies. #3
thatguymike wants to merge 10 commits into cypof:multi_gpu from
Conversation
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
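A minimal sketch of the interrupt-before-join pattern described above, assuming boost::thread (which Caffe's internal thread wraps); the class and method names here are illustrative, not the PR's exact code:

```cpp
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <glog/logging.h>

class LoopingThread {
 public:
  virtual ~LoopingThread() {}

  void Start() {
    CHECK(!thread_) << "Thread already started";
    thread_.reset(new boost::thread(&LoopingThread::Entry, this));
  }

  void Stop() {
    CHECK(thread_) << "Thread was never started";
    thread_->interrupt();  // ask the loop to exit before waiting on join
    try {
      thread_->join();
    } catch (boost::thread_interrupted&) {
      // join() is an interruption point; ignore interruption of the caller.
    }
  }

 protected:
  // Looping bodies poll this to exit on demand.
  bool must_stop() { return thread_ && thread_->interruption_requested(); }

  virtual void Entry() {
    while (!must_stop()) {
      // do one unit of work; blocking Boost calls act as interruption points
    }
  }

 private:
  boost::shared_ptr<boost::thread> thread_;
};
```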
- Uses a blocking queue to transfer data to data_layer
- Asynchronously pushes data to the GPU using a stream
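A rough sketch of the blocking hand-off between the prefetch thread and the data layer, assuming Boost threading primitives; the branch's actual class likely has more to it (try_pop, peeking, sizing):

```cpp
#include <queue>
#include <boost/thread.hpp>

template <typename T>
class BlockingQueue {
 public:
  void push(const T& t) {
    boost::mutex::scoped_lock lock(mutex_);
    queue_.push(t);
    lock.unlock();
    condition_.notify_one();
  }

  // Blocks until an element is available. condition_.wait() is a Boost
  // interruption point, so a stopping prefetch thread can exit while waiting.
  T pop() {
    boost::mutex::scoped_lock lock(mutex_);
    while (queue_.empty()) {
      condition_.wait(lock);
    }
    T t = queue_.front();
    queue_.pop();
    return t;
  }

 private:
  std::queue<T> queue_;
  boost::mutex mutex_;
  boost::condition_variable condition_;
};
```

The asynchronous push to the GPU would then be a cudaMemcpyAsync of the popped batch on a dedicated stream, overlapped with reading the next batch.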
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in a round-robin way for determinism
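An illustrative distributor loop for the round-robin/determinism points above, reusing the BlockingQueue sketch; Batch and the function name are placeholders, not the branch's actual types:

```cpp
#include <cstddef>
#include <vector>

struct Batch;  // placeholder for a prefetched host-memory batch

// One reader thread scans the DB sequentially and fills `prefetched`; this
// loop deals batches to the per-solver queues in a fixed order, so each
// solver always sees the same subset of the data from run to run.
void DistributeLoop(BlockingQueue<Batch*>* prefetched,
                    const std::vector<BlockingQueue<Batch*>*>& solver_queues) {
  for (std::size_t i = 0; ; ++i) {
    Batch* batch = prefetched->pop();  // blocks until the reader has data
    solver_queues[i % solver_queues.size()]->push(batch);
  }
}
```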
- Split batches between GPUs, and tree-reduce the gradients
- Detects machine topology (twin-GPU boards, P2P connectivity)
- Inserts a callback in the solver for minimal code change
- Changed caffe.cpp to use all available GPUs by default
- Deterministic architecture for reproducible runs
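For the topology-detection bullet, a hedged standalone probe built on the stock CUDA runtime call cudaDeviceCanAccessPeer; the branch's detection (twin-GPU boards, etc.) is more involved, this only shows the primitive it can build on:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  // Check every ordered pair of devices for peer-to-peer access.
  for (int a = 0; a < count; ++a) {
    for (int b = 0; b < count; ++b) {
      if (a == b) continue;
      int access = 0;
      cudaDeviceCanAccessPeer(&access, a, b);
      std::printf("GPU %d -> GPU %d: P2P %s\n", a, b,
                  access ? "available" : "unavailable");
    }
  }
  return 0;
}
```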
First is a barrier to prevent the parent from smashing weights while weight updates are proceeding in the children. Second is to fix random number initialization on secondary threads to use the device number.
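A minimal sketch of the two fixes, assuming boost::barrier and Caffe's existing Caffe::SetDevice / Caffe::set_random_seed calls; the thread entry function and seed-offset scheme are illustrative, not the exact diff:

```cpp
#include <boost/thread/barrier.hpp>
#include "caffe/common.hpp"

static const unsigned int kSolverThreads = 2;  // parent + children that synchronize
static boost::barrier update_barrier(kSolverThreads);

void SolverThreadEntry(int device, unsigned int base_seed) {
  caffe::Caffe::SetDevice(device);
  // Fix #2: offset the seed by the device number so each solver thread
  // draws a different random sequence.
  caffe::Caffe::set_random_seed(base_seed + device);

  for (;;) {  // one training iteration per pass
    // ... forward, backward, apply the weight update ...

    // Fix #1: the parent waits on the same barrier before pushing new
    // weights, so it cannot smash them while children are still updating.
    update_barrier.wait();
  }
}
```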
Thanks Mike! Looks good, but I am worried about the fix in solver.cpp. All solvers might run with the same random sequence. The first fix in parallel.cpp should be enough; it sets the seed for all non-root solvers. I am still not sure why the random seed doesn't get initialized correctly during thread creation in internal_thread, but I don't know how to debug that right now.
I am also not in love with that change, but it does fix the reproducibility issues. The first fix in parallel.cpp is actually not enough; it took me a long time to discover that.
One option is to add solver state that separates "child" solvers from root solvers and add an || Caffe::child_solver() check. Another is to modulate by the device number like we did in parallel.cpp, but that is not clean in the CPU-only case.
Okay, after more testing, at least on 2 GPUs, the change to solver.cpp is not needed. Revalidating on other configs.
OK, good news. There might be more magic somewhere, but I need to understand what is going on. Looking at your fix, I think I am starting to see why the barrier is necessary. When I split the work into multiple PRs, I put the solver refactoring in BVLC#2397. I forgot that one of the things it does is move the net_->Update() call into SGDSolver, not Solver, so it is only called for the root, not the workers. I am not sure yet, but that could finally explain why I don't have those instability issues. If you have time to try with BVLC#2397, please do; I will do some more testing with both cases. I also should probably reverse the map and reduce phases: first send all the weights, then run SGD, then reduce the gradients. It should be equivalent, but I realized I would have to do that for the distributed version anyway, as the weights are not equal on each box initially.
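For reference, the behaviour described reduces to a guard like the one below; Caffe::root_solver() is assumed here as the root/worker test, and the actual check in BVLC#2397 may differ:

```cpp
#include "caffe/common.hpp"
#include "caffe/net.hpp"

template <typename Dtype>
void ApplyUpdateSketch(caffe::Net<Dtype>* net) {
  // Only the root solver writes the shared weights; worker solvers just
  // compute and hand off gradients. (Sketch, not the PR's code.)
  if (caffe::Caffe::root_solver()) {
    net->Update();
  }
}
```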
I'll give that a go in the morning. Yes, the race was on the parent nodes updating the weights of the children while the children were doing their own updates. As for the map and reduce phases, there are several options for how the gradients and weights are exchanged, each with different performance tradeoffs. If you are doing SGD, in theory you only have to send the gradients, as long as everyone adds them together in exactly the same order; then you can use bidirectional swapping. I have some of those changes in flight in my branches, but I wanted to get your PR "clean" so people can start. Perf is good on 2 GPUs, and all of the optimizations I'm working on give only a small improvement in the 2-GPU case; they are really targeting the 4+ GPU scaling case.
FWIW, I just pulled and ran AlexNet with batch size 100 on two GPUs. Here's the memory usage as reported by nvidia-smi:
No longer needed against "parallel" branch. |