Conversation
All of my quick sanity tests are passing. I know that by default this is weak scaling, i.e. the batch size specified in train_val.prototxt is multiplied by the number of GPUs you run on, but I forgot that when validating the accuracy graphs. I still fear this is going to bite users.
Force-pushed from ba35568 to e46996b
OK, training works for me. The thread launch code is much better without the fields; that's great.
Thanks for testing @thatguymike and @cypof. My short test worked, so once we hear from @cdoersch about the EC2 test I think this is ready to merge.
src/caffe/solver.cpp
Must add 'timer.Start();' here to restart the timer; otherwise the timing for the gradients at line 266 may be incorrect.
Added to line 224 before forward + backward, thanks.
Training seems to be working fine on EC2.
Force-pushed from f165d86 to 2b51a08
After discussion with @longjon we decided the timing code is too intrusive to bundle in this change. I have stripped it but archived the branch with timing at
src/caffe/solver.cpp
Probably a bit late to comment on this, and to me not necessary for merge, but these conditional LOG(INFO) calls could be made a bit more compact using LOG_IF, e.g. LOG_IF(INFO, Caffe::root_solver()) << "Iteration..."
Force-pushed from 87a69ea to 53a4dca
src/caffe/parallel.cpp
This call to params() and the two other calls below should be replaced with learnable_params() after #2866, I think? (I was debating whether the public params() method should just be removed, or if params() should just return learnable_params_, or...)
Agreed. Making the switch seems to have no effect, though, and I see the same test failures before and after.
Force-pushed from 80dbdaa to dd3e064
@cypof @thatguymike it turns out #2114 was not rigorously checking solver updates; see #2114 (comment). Fixing the test net targets reveals that all the multi-GPU solver tests fail. Apart from the tests, my experiments to check parallel training on real nets make progress, so there's hope. #2866 is not the problem, as the same failures show up in the multi-GPU branch before the latest rebase once the test is fixed. This can be seen in
I'm fairly positive this is a test artifact due to the random Gaussian targets: the multiple solvers can't reproduce random draws equivalent to the single-solver sequence. The solution seems to be making the solver tests take fixed external data, such as the HDF5 data used in the
Force-pushed from bb75c36 to 186d453
This is now based on #2887, but the multi-GPU solver tests still fail. I believe this is because
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism
thanks to discussion by @thatguymike and @flx42
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices (batch size multiplied by the device count)
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept a list for the gpu flag of the caffe tool, e.g. '-gpu 0,1' or '-gpu all'; run on the default GPU if no ID is given
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
- Start with distant nodes in broadcast
- Fix outside loop to run for the full tree depth
I was off yesterday, but I'm looking at it now.
Everyone see #2903 for the rigorously tested and passing multi-GPU branch. @ronghanghu has developed a parallel data layer solution.
Merged in #2903
The PR for multi-GPU training has been merged into the master branch of Caffe. BVLC/caffe#2870
This is my packaging of #2114 for merge. I figured @cypof @thatguymike and company had made plenty of revisions and that I could help.
This PR is ready to use for data-parallel training of networks, but has issues with DataReader which are resolved for merge by #2903. That branch uses @ronghanghu's parallel data layer in place of DataReader and replaces CHECK(false) with LOG(FATAL). @cypof @thatguymike @longjon @jeffdonahue please take a look.
@cdoersch could you fire up your parallel training test again?
Reviews and testing by the community are welcome!