
Fixes 2 issues involving races and run-to-run inconsistencies. #3

Closed
thatguymike wants to merge 10 commits into cypof:multi_gpu from thatguymike:race_fix

Conversation

@thatguymike

First is a barrier to prevent the parent from smashing weights while weight updates are proceeding in the children. Second is a fix to random number initialization on secondary threads so that it uses the device number.
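
As a rough illustration of the first fix (not the actual parallel.cpp code), a barrier shared by the root and the worker solvers can keep the parent from touching the weights until every child has finished applying its update. The class and member names below are hypothetical.

```cpp
// Illustrative sketch only: a shared barrier makes the root wait until every
// worker has applied its gradient step before the root overwrites weights.
// SyncedSolver and OnGradientsReady are hypothetical names, not Caffe APIs.
#include <boost/thread/barrier.hpp>

class SyncedSolver {
 public:
  explicit SyncedSolver(unsigned int solver_count) : barrier_(solver_count) {}

  // Called by the root and by every worker at the end of an iteration.
  void OnGradientsReady() {
    // ... gradients have been reduced across devices ...
    ApplyUpdate();    // each solver applies its own update
    barrier_.wait();  // nobody proceeds until all updates have been applied
    // ... the root may now broadcast or overwrite weights safely ...
  }

 private:
  void ApplyUpdate() { /* net_->Update() or equivalent */ }
  boost::barrier barrier_;
};
```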

cypof and others added 9 commits April 28, 2015 14:28
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
 - Uses a blocking queue to transfer data to data_layer
 - Asynchronously pushes data to the GPU using a stream
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in round-robin way for determinism
- Split batches between GPUs, and tree-reduce the gradients
- Detects machine topology (twin-GPU boards, P2P connectivity)
- Inserts a callback in the solver for minimal code change
- Changed caffe.cpp to use all available GPUs by default
- Deterministic architecture for reproducible runs
Fixes 2 issues involving races and run-to-run inconsistencies: a barrier to prevent the parent from smashing weights while weight updates are proceeding in the children, and a fix to random number initialization on secondary threads so that it uses the device number.
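
For context on the prefetching commits above, here is a generic sketch of the blocking-queue pattern they describe: a host prefetch thread pushes batches and the data layer pops them. This illustrates the pattern only; it is not Caffe's actual BlockingQueue implementation.

```cpp
// Generic bounded blocking queue for host-side prefetching: a prefetch
// thread calls Push(batch), the data layer calls Pop(). Illustration only.
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BoundedBlockingQueue {
 public:
  explicit BoundedBlockingQueue(std::size_t capacity) : capacity_(capacity) {}

  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
    queue_.push(std::move(item));
    not_empty_.notify_one();
  }

  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty(); });
    T item = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

 private:
  const std::size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_empty_, not_full_;
  std::queue<T> queue_;
};
```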
@cypof
Owner

cypof commented May 14, 2015

Thanks Mike! Looks good, but I am worried about the fix in solver.cpp. All solvers might run with the same random sequence. The first fix in parallel.cpp should be enough; it sets the seed for all non-root solvers.

I am still not sure why the random seed doesn't get initialized correctly during thread creation in internal_thread, but I don't know how to debug that right now.
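
For reference, a minimal sketch of per-device seeding along the lines of the parallel.cpp fix. Caffe::set_random_seed is the upstream Caffe call; the function name, root_seed parameter, and seed formula below are illustrative assumptions.

```cpp
// Illustrative only: give each worker solver a distinct, deterministic seed
// derived from its device id, so runs are reproducible but the workers do
// not all draw the same random sequence. The +1 offset keeps worker seeds
// distinct from the root's seed; the formula is an assumption, not the
// actual parallel.cpp code.
#include "caffe/common.hpp"

void SeedWorkerSolver(int device_id, unsigned int root_seed) {
  Caffe::set_random_seed(root_seed + 1 + static_cast<unsigned int>(device_id));
}
```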

@thatguymike
Author

I am also not in love with that change, but it does fix the reproducibility issues. The first fix in parallel.cpp is actually not enough on its own; it took me a long time to find that out.

@thatguymike
Author

One option is to add solver state that separates "child" solvers from root solvers and add an || Caffe::child_solver() check. Another is to modulate by the device number like we did in parallel.cpp, but that is not clean in the CPU-only case.

@thatguymike
Author

Okay, after more testing, at least on 2 GPUs, the change to solver.cpp is not needed. Revalidating on other configs.

@cypof
Owner

cypof commented May 15, 2015

OK, good news. There might be more magic somewhere, but I need to understand what is going on. Looking at your fix, I think I might start to see why the barrier is necessary. When I split the work into multiple PRs, I put the solver refactoring in BVLC#2397. I forgot that one of the things it does is move the net_->Update() call into SGDSolver, not Solver, so it is only called for the root, not the workers. I am not sure yet, but that could finally explain why I don't have those instability issues. If you have time to try with BVLC#2397, please do; I will do some more testing with both cases.

I also should probably reverse the map and reduce phases: first send all the weights, then SGD, then gradient reduce. It should be equivalent, but I realized I would have to do that for the distributed version anyway, since the weights are not equal on each box initially.
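
A minimal sketch of the reversed phase ordering described above (broadcast weights, then SGD, then gradient reduce); the type and function names are hypothetical placeholders, not the code in this PR or in BVLC#2397.

```cpp
// Illustrative only: per-iteration phases ordered as
// broadcast -> forward/backward -> reduce -> update. Names are hypothetical.
struct DeviceGroup {
  void BroadcastWeights() { /* root -> workers: copy current weights   */ }
  void ForwardBackward()  { /* each device: gradients on its sub-batch */ }
  void ReduceGradients()  { /* tree-reduce gradients toward the root   */ }
  void ApplyUpdate()      { /* root: SGD step on the reduced gradients */ }
};

void TrainIteration(DeviceGroup& group) {
  group.BroadcastWeights();  // all devices start from identical weights
  group.ForwardBackward();
  group.ReduceGradients();
  group.ApplyUpdate();       // only the root updates; the next broadcast
                             // re-synchronizes the workers
}
```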

@thatguymike
Author

I'll give that a go in the morning. Yes, the race was the parent nodes updating the weights of the children while the children were doing their own updates.

As for the map and reduce, there are several options for how the gradients and weights are updated, all with different performance tradeoffs. If you are doing SGD, in theory you only have to send the gradients, as long as everyone adds them together in exactly the same order; then you can use bidirectional swapping.

I have some of those changes in flight in my branches, but I wanted to get your PR "clean" so people can start. Perf is good on 2 GPUs, and the optimizations I'm working on give only a small improvement in the 2-GPU case; they are really targeting the 4+ GPU scaling case.
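
To make the determinism point above concrete, a small sketch, assuming per-device gradient buffers are summed on the host: iterating devices in a fixed order gives bit-identical results run to run, whereas summing in arrival order does not. The function and buffer layout are hypothetical.

```cpp
// Illustrative only: accumulate per-device gradients in a fixed order
// (device 0, 1, 2, ...) so floating-point summation is reproducible,
// regardless of which device finished its backward pass first.
#include <cstddef>
#include <vector>

void AccumulateDeterministic(std::vector<float>& grad_sum,
                             const std::vector<std::vector<float> >& per_device) {
  for (std::size_t d = 0; d < per_device.size(); ++d) {
    for (std::size_t i = 0; i < grad_sum.size(); ++i) {
      grad_sum[i] += per_device[d][i];
    }
  }
}
```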

@lukeyeager

FWIW, I just pulled thatguymike/race_fix and built it. Everything compiled just fine, and back-to-back runs with the same random seed produced exactly the same loss numbers.

I ran AlexNet with batch size 100 on two GPUs. Here's the memory usage as reported by nvidia-smi:
GTX 980 - 3305/4095 MiB
Tesla K20 - 1298/4799 MiB

@thatguymike
Author

No longer needed against the "parallel" branch.
