Fit many replicas in parallel #1039
Conversation
|
This is not the same as training N single replicas, right? E.g. it is logically possible that the loss is very bad for some replicas and very good for others.
|
Yes. The plan is to make it so that it looks at all replicas separately (right now it only looks at the sum), but for certain things it doesn't matter. For instance, if any of the hyperopt replicas is bad then I am ok with trashing that run.
|
Exciting!
|
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
Force-pushed 289ed4f to 5f0cef4
|
I wasn't really expecting the stopping to be the most not-ready-for-parallelization part of the code. While I was at it I've removed stuff that wasn't useful anymore. There are a few checks that I need to do before having this ready for review.
I won't implement the parallel modes in hyperopt in this PR; it is already big enough as it is. Once I'm happy with it* I'll ask for the merging of this one (which is useless on its own, in that it just allows you to do many non-replica'd parallel runs) and then I'll create a few other ones with the possibility of using different data replicas (always with the same tr/vl split!) and of using this for hyperopt.
*meaning: once I can look at the stopping without becoming this meme
|
Force-pushed b50d9ad to 503112f
|
Ok... this is ready to be reviewed. I feel sorry for whoever steps forward... On the bright side, some of the changes I had to make in order to allow GPU running actually simplify several parts of the code. The main simplification is that the models no longer produce the values of the predictions by default, but rather the values of the chi2 (not divided by N) for each experiment.
This is a big PR and can't really be broken down easily. I guess this is the kind of thing that would greatly benefit from being able to sit in the same place for a few days to fix some of the most stupid choices I've made (but I honestly think it removes quite a few of those!).
With respect to the failing tests in Travis: I would rather skip some tests on mac when they use features of TF > 2.0.0, and drop support for TF < 2.2, instead of having a number of if(version) conditions which are only used by Travis... (if you or others are actually using older versions that's a different story).
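To make the "models output chi2 per experiment" change concrete, here is a minimal numpy sketch of a per-experiment chi2 that is not divided by the number of points. The function name and signature are illustrative only, not the actual n3fit API:

```python
import numpy as np

def experiment_chi2(predictions, data, covmat):
    """Chi2 for a single experiment, NOT divided by the number of points,
    which is the kind of quantity the parallel models output directly."""
    diff = predictions - data
    return float(diff @ np.linalg.inv(covmat) @ diff)

# Toy example: 3 data points with a diagonal covariance matrix
data = np.array([1.0, 2.0, 3.0])
preds = np.array([1.1, 1.9, 3.2])
covmat = np.diag([0.01, 0.01, 0.04])
# experiment_chi2(preds, data, covmat) -> 0.1**2/0.01 + 0.1**2/0.01 + 0.2**2/0.04 = 3.0
```

Having the model emit this scalar per experiment, rather than the full prediction vector, is what lets the per-replica losses be summed directly inside the graph.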
|
That is the version I'm using, yes. When I search for tensorflow I think TF 2.0 is the only version on conda for mac (eigen or otherwise), whereas a cursory search on linux yielded versions up to 2.3.
|
On PyPI they have up to 2.4, so it should be possible. I guess they are lacking a packager for mac os in conda. Ok, then I'll try to work around the issue.
|
Hmm, the situation with TF on conda is worrisome. E.g. conda-forge/tensorflow-feedstock#107 (defaults is more up to date, but still). @scarlehoff @wilsonmr does `pip install tensorflow` in a conda environment "just work"? I'd rather not go that route by default at the moment (for one, setting up the CIs would be a pain) but we should start looking into alternatives...
|
Defaults works well, at least on Linux (the problem with conda-forge is actually quite bad, because they also have the policy of strict conda-forge > defaults priority, which breaks a lot of recipes).
|
I will have a look at using pip TF on MacOS. I suppose the thing to watch out for is that it doesn't modify or install conflicting packages when I do the pip installation.
Co-authored-by: Roy Stegeman <roystegeman@live.nl>
Force-pushed da385a0 to 3546780
|
Closing in favour of #1153
Given that one of the problems that we have towards a future (better) hyperparameter scan is that we are limited by how many replicas we can use to inform the hyperparametrization algorithm, I thought it would be nice to exploit the fact that many of the calculations done for different replicas are shared.
The basis of this PR (once it is finished; there are others to come* but I've decided to do things step by step) is to create a model that concatenates all PDFs, so that the PDF is `(n_replicas, xgrid, flavours)`. Then one continues doing everything in the normal way and at the end `n3fit` computes `Total_Loss = \sum_{replicas} L_{i}`, where each `L_{i}` depends only on one of the PDFs, so that the gradient descent will try to minimize all of them. As the fit advances some of the replicas will stop training (still to do), which means that towards the end `n3fit` will still be calculating 50 gradients even though 49 will have weight 0, but the performance gain outweighs that little inefficiency.
The way the optimizers in TensorFlow work, they will try to minimize `Total_Loss`, meaning any badly behaved replica will dominate; this is fixable, but I'll leave it for the future. The short (for whatever value of short) term goal is to improve on the hyperoptimization, for which badly behaved replicas would mean bad architectures, so we don't want them anyway.
Note that the optimizer to be used is quite important here: any GA would be very bad, and only gradient descent with a learning rate per weight/layer can be expected to train.
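The summed loss with weight-0 stopped replicas can be sketched in a few lines of numpy; the masking mechanism shown here is an illustration of the idea, not the actual n3fit implementation:

```python
import numpy as np

def total_loss(per_replica_losses, training_mask):
    """Sum of the per-replica losses L_i. Replicas that have stopped
    training get weight 0, so they no longer contribute to the total
    (or its gradient) even though their losses are still computed."""
    weights = training_mask.astype(float)
    return float(np.sum(weights * per_replica_losses))

# Toy example: four replicas, the last one has already stopped
losses = np.array([2.0, 5.0, 1.5, 40.0])
mask = np.array([True, True, True, False])
# total_loss(losses, mask) -> 2.0 + 5.0 + 1.5 = 8.5
```

This also makes the "bad replica dominates" issue visible: without masking, the replica with loss 40.0 would dwarf the gradient contribution of the others.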
I'm also guessing this will be very useful for closure tests or even for using closure tests as the reward of the hyperoptimization procedure.
My to do list for this PR is:
- `evaluate` the model

Please let me know if you see any issues with this or if you think anything should be added (so that I either add it to this PR or to the list of things to be completed afterwards below).
The usage is quite simple: it is enough to add a `parallel_models` flag to the runcard, and then running the fit will fit from replica 1 to 50 in one go.
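A minimal sketch of what the runcard change could look like; the section the key lives in, and whether it takes a boolean or a replica count, are my guesses here rather than details taken from the PR:

```yaml
# Hypothetical runcard fragment: only the parallel_models key is new,
# and its exact placement and value type are illustrative.
fitting:
  parallel_models: true   # fit all requested replicas within a single model
```

The replica range itself would then be given as usual when launching the fit, so that a single run covers replicas 1 to 50.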
The code here is not that much better on a CPU (it is even very bad if you try to fit too many replicas; crashes have been seen) but I've managed to fit 50 replicas in less than 8 GPU hours on a discrete GPU. There's also a certain flexibility to be exploited: for instance, many replicas at once might not be that useful in the end, but fitting many different architectures at once for a hyperparameter scan would be.
Note: pointing to tf2.4 because I now have it installed in my PC, so I rebased; but it works on 2.3 as well. An older version is in here, which was a quick test I did.
*the other PRs for which I have (not necessarily functional) prototypes are: