Fit many replicas in parallel #1039
Conversation
|
This is not the same as training N single replicas, right? E.g. it is logically possible that the loss is very bad for some replicas and very good for others.
|
Yes. The plan is to make it so that it looks at all replicas separately (right now it only looks at the sum), but for certain things it doesn't matter. For instance, if any of the hyperopt replicas is bad then I am ok with trashing that run.
|
Exciting!
|
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
Force-pushed 289ed4f to 5f0cef4
|
I wasn't really expecting the stopping to be the most not-ready-for-parallelization part of the code. While I was at it I've removed stuff that wasn't useful anymore. There are a few checks that I need to do before having this ready for review.
I won't implement the parallel modes in hyperopt in this PR; it is already big enough as it is. Once I'm happy with it* I'll ask for the merging of this one (which is useless on its own, in that it just allows you to do many non-replica'd parallel runs) and then I'll create a few other ones with the possibility of using different data replicas (always with the same tr/vl split!) and of using this for hyperopt.
*meaning: once I can look at the stopping without becoming this meme
|
Force-pushed b50d9ad to 503112f
|
Ok... this is ready to be reviewed. I feel sorry for whoever steps forward... On the bright side, some of the changes I had to make in order to allow GPU running actually simplify several parts of the code. The main simplification is that the models no longer produce the values of the predictions by default, but rather the values of the chi2 (not divided by N) for each experiment.
This is a big PR and can't really be broken down easily. I guess this is the kind of thing that would greatly benefit from being able to sit in the same place for a few days to fix some of the most stupid choices I've made (but I honestly think it removes quite a few of those!).
With respect to the failing tests in Travis: I would rather skip some tests on mac when they use features of TF > 2.0.0, and drop support for TF < 2.2, instead of having a number of if(version) conditions which are only used by Travis... (if you or others are actually using older versions that's a different story).
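To make the "models output chi2 per experiment" change concrete, here is a minimal numpy sketch of a per-experiment chi2 that is not divided by the number of points. The function name and signature are illustrative only, not the actual n3fit API:

```python
import numpy as np

def experiment_chi2(predictions, data, covmat):
    """Chi2 for a single experiment, NOT divided by the number of points,
    which is the kind of quantity the parallel models output directly."""
    diff = predictions - data
    return float(diff @ np.linalg.inv(covmat) @ diff)

# Toy example: 3 data points with a diagonal covariance matrix
data = np.array([1.0, 2.0, 3.0])
preds = np.array([1.1, 1.9, 3.2])
covmat = np.diag([0.01, 0.01, 0.04])
# experiment_chi2(preds, data, covmat) -> 0.1**2/0.01 + 0.1**2/0.01 + 0.2**2/0.04 = 3.0
```

Having the model emit this scalar per experiment, rather than the full prediction vector, is what lets the per-replica losses be summed directly inside the graph.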
|
That is the version I'm using, yes. When I search for tensorflow I think TF 2.0 is the only version on conda for mac (eigen or otherwise), whereas a cursory search on linux yielded versions up to 2.3.
|
On PyPI they have up to 2.4, so it should be possible. I guess they are lacking a packager for mac os in conda. Ok, then I'll try to work around the issue.
|
Hmm, the situation with TF on conda is worrisome. E.g. conda-forge/tensorflow-feedstock#107 (defaults is more up to date, but still). @scarlehoff @wilsonmr does `pip install tensorflow` in a conda environment "just work"? I'd rather not go that route by default at the moment (for one, setting up the CIs would be a pain) but we should start looking into alternatives...
|
Defaults works well, at least on Linux (the problem with conda-forge is actually quite bad, because they also have the policy of strict conda-forge > defaults priority, which breaks a lot of recipes).
|
I will have a look at using pip TF on MacOS. I suppose the thing to watch out for is that it doesn't modify or install conflicting packages when I do the pip installation.
Co-authored-by: Roy Stegeman <roystegeman@live.nl>
Force-pushed da385a0 to 3546780
|
Closing in favour of #1153
Given that one of the problems that we have towards a future (better) hyperparameter scan is that we are limited by how many replicas we can use to inform the hyperparametrization algorithm, I thought it would be nice to exploit the fact that many of the calculations done for different replicas are shared.
The basis of this PR (once it is finished; there are others to come* but I've decided to do things step by step) is to create a model that concatenates all PDFs, so that the PDF is `(n_replicas, xgrid, flavours)`. Then one continues doing everything in the normal way and at the end `n3fit` computes `Total_Loss = \sum_{replicas} L_{i}`, where each `L_{i}` depends only on one of the PDFs, so that the gradient descent will try to minimize all of them. As the fit advances some of the replicas will stop training (still to do), which means that towards the end `n3fit` will still be calculating 50 gradients even though 49 will have weight 0, but the performance gain outweighs that little inefficiency.
The way the optimizers in TensorFlow work, they will try to minimize `Total_Loss`, meaning any badly behaved replica will dominate; this is fixable, but I'll leave it for the future. The short (for whatever value of short) term goal is to improve on the hyperoptimization, for which badly behaved replicas would mean bad architectures, so we don't want them anyway.
Note that the optimizer to be used is quite important here: any GA would be very bad, and only gradient descent with a learning rate per weight/layer can be expected to train.
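The summed loss with weight-0 stopped replicas can be sketched in a few lines of numpy; the masking mechanism shown here is an illustration of the idea, not the actual n3fit implementation:

```python
import numpy as np

def total_loss(per_replica_losses, training_mask):
    """Sum of the per-replica losses L_i. Replicas that have stopped
    training get weight 0, so they no longer contribute to the total
    (or its gradient) even though their losses are still computed."""
    weights = training_mask.astype(float)
    return float(np.sum(weights * per_replica_losses))

# Toy example: four replicas, the last one has already stopped
losses = np.array([2.0, 5.0, 1.5, 40.0])
mask = np.array([True, True, True, False])
# total_loss(losses, mask) -> 2.0 + 5.0 + 1.5 = 8.5
```

This also makes the "bad replica dominates" issue visible: without masking, the replica with loss 40.0 would dwarf the gradient contribution of the others.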
I'm also guessing this will be very useful for closure tests or even for using closure tests as the reward of the hyperoptimization procedure.
My to do list for this PR is:
- `evaluate` the model

Please let me know if you see any issues with this or if you think anything should be added (so that I either add it to this PR or to the list of things to be completed afterwards below).
The usage is quite simple: it is enough to add a `parallel_models` flag to the runcard, and then running the fit will fit from replica 1 to 50 in one go.
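A minimal sketch of what the runcard change could look like; the section the key lives in, and whether it takes a boolean or a replica count, are my guesses here rather than details taken from the PR:

```yaml
# Hypothetical runcard fragment: only the parallel_models key is new,
# and its exact placement and value type are illustrative.
fitting:
  parallel_models: true   # fit all requested replicas within a single model
```

The replica range itself would then be given as usual when launching the fit, so that a single run covers replicas 1 to 50.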
The code here is not that much better on a CPU (it is even very bad if you try to fit too many replicas; crashes have been seen) but I've managed to fit 50 replicas in less than 8 GPU hours on a discrete GPU. There's also a certain flexibility to be exploited: for instance, many replicas at once might not be that useful in the end, but fitting many different architectures at once for a hyperparameter scan would be.
Note: pointing to tf2.4 because I now have it installed in my PC, so I rebased; but it works on 2.3 as well. An older version is in here, which was a quick test I did.
*the other PRs for which I have (not necessarily functional) prototypes are: