Fit many replicas in parallel#1039

Closed
scarlehoff wants to merge 27 commits into master from multireplica_n3fit_mk2

Conversation

@scarlehoff
Member

@scarlehoff scarlehoff commented Dec 16, 2020

Given that one of the problems that we have towards a future (better) hyperparameter scan is that we are limited by how many replicas we can use to inform the hyperparametrization algorithm, I thought it would be nice to exploit the fact that many of the calculations done for different replicas are shared.

The basis of this PR (once it is finished; there are others to come*, but I've decided to do things step by step) is to create a model that concatenates all PDFs, so that the PDF output has shape (n_replicas, xgrid, flavours). Everything then continues in the normal way, and at the end n3fit computes:

Total_Loss = \sum_{replicas} L_{i}

where each L_{i} depends only on one of the PDFs, so that gradient descent will try to minimize all of them. As the fit advances, some of the replicas will stop training (still to do), which means that towards the end n3fit will still be computing 50 gradients even though 49 of them will have weight 0; the performance gain outweighs that small inefficiency.
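The masked sum described above can be sketched as follows. This is a minimal numpy illustration, not the actual n3fit code; the function name and mask convention are hypothetical:

```python
import numpy as np

# Hypothetical sketch: the parallel model yields one loss per replica,
# and the quantity the optimizer sees is their masked sum. Replicas
# whose stopping has triggered get weight 0, so they no longer steer
# the gradient even though their loss is still evaluated.
def total_loss(per_replica_losses, training_mask):
    # per_replica_losses: shape (n_replicas,)
    # training_mask: 1.0 while training, 0.0 once stopped
    return float(np.sum(per_replica_losses * training_mask))

losses = np.array([1.25, 0.75, 3.5])
mask = np.array([1.0, 1.0, 0.0])  # third replica already stopped
print(total_loss(losses, mask))  # 2.0
```

In the real parallel fit the dead replicas' gradients are still computed (as the text notes), but multiplying by zero removes their pull on the shared weights update.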

The way the optimizers in TensorFlow work, they will try to minimize Total_Loss, which means any badly behaved replica will dominate. This is fixable, but I'll leave it for the future. The short-term goal (for whatever value of short) is to improve the hyperoptimization, where badly behaved replicas would point to bad architectures, so we don't want them anyway.

Note that the choice of optimizer is quite important here: any GA would perform very badly, and only gradient descent with a per-weight/per-layer learning rate can be expected to train.

I'm also guessing this will be very useful for closure tests or, even, for using closure tests as the reward of the hyperoptimization procedure.

My to do list for this PR is:

  • Fit many replicas in one go
  • Change the output of the model predict to be the loss, so that there's no need to evaluate the model
  • Keep track of all replicas separately in the stopping
  • Stop training when the stopping decides that it is time to stop
  • Apply positivity separately per replica
  • Checks so that this feature is only used with options which are known to work
  • Test that a standard fit works
  • Test that non-standard fitting techniques work

Please let me know if you see any issues with this or if you think anything should be added (so that I either add it to this PR or to the list of things to be completed afterwards below).

The usage is quite simple: it is enough to add a parallel_models flag to the runcard

parallel_models: 50

and then

n3fit runcard.yml 1

will fit from replica 1 to 50 in one go.

The code here is not that much better on a CPU (it can even be very bad if you try to fit too many replicas; crashes have been seen), but I've managed to fit 50 replicas in less than 8 GPU hours on a discrete GPU. There's also a certain flexibility to be exploited: for instance, fitting many replicas at once might not be that useful in the end, but fitting many different architectures at once for a hyperparameter scan would be.

Note: pointing to tf2.4 because I now have it installed on my PC, so I rebased, but it works on 2.3 as well; an older version is in here, which was a quick test I did.

*the other PRs for which I have (not necessarily functional) prototypes are:

  • Fit to different data per replica
  • Allow the usage within hyperopt (multireplica/multimodel)
  • Good separation of minimization per partial loss

@scarlehoff scarlehoff added n3fit Issues and PRs related to n3fit run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Dec 16, 2020
@Zaharid
Contributor

Zaharid commented Dec 16, 2020

This is not the same as training N single replicas, right? E.g. it is logically possible that the loss is very bad for some replicas and very good for others.

@scarlehoff
Member Author

Yes. The plan is to make it look at all replicas separately (right now it only looks at the sum), but for certain things it doesn't matter: for instance, if any of the hyperopt replicas is bad, then I am ok with trashing that run.

@wilsonmr
Contributor

Exciting!

@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Base automatically changed from tf2.4 to master December 16, 2020 15:29
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Dec 22, 2020
@scarlehoff scarlehoff force-pushed the multireplica_n3fit_mk2 branch from 289ed4f to 5f0cef4 Compare December 22, 2020 15:16
@scarlehoff
Member Author

I wasn't really expecting the stopping to be the least parallelization-ready part of n3fit, so while the rest of the code hasn't changed much (most of the changes are either a change of axis or a generalization of what it was already doing), the stopping has completely changed.

While I was at it, I've removed stuff that wasn't useful anymore (for instance, save_weights_each is useless when you can just use tensorboard). I've completely changed FitState to be much more useful and added a ReplicaStatus object that holds the information about the best replica (the History is now practically a list of FitState and ReplicaStatus objects).
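To illustrate the idea of per-replica tracking: the real FitState/ReplicaStatus live in n3fit's stopping code; the version below is a hypothetical minimal sketch, not the actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of per-replica best-state tracking: each replica
# records the best validation loss seen so far and the epoch at which
# it occurred, independently of the other replicas.
@dataclass
class ReplicaStatus:
    best_vl: float = float("inf")
    best_epoch: int = -1

    def update(self, epoch, vl_loss):
        """Register a new epoch; return True if it improved on the best."""
        if vl_loss < self.best_vl:
            self.best_vl = vl_loss
            self.best_epoch = epoch
            return True
        return False

status = ReplicaStatus()
status.update(1, 2.3)
status.update(2, 1.9)
status.update(3, 2.5)  # worse: the best stays at epoch 2
print(status.best_epoch, status.best_vl)  # 2 1.9
```

The stopping logic can then consult one such object per replica, rather than a single global best for the whole sum.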

There's a few checks that I need to do before having this ready for review:

  • Check that diagonalization works (not sure!)
  • Check that hyperoptimization works (it doesn't)

I won't implement the parallel modes in hyperopt in this PR. It is already big enough as it is.

Once I'm happy with it*, I'll ask for this one to be merged (which is useless in the sense that it just allows you to do many non-replica'd parallel runs), and then I'll create a few other PRs with the possibility of using different data replicas (always with the same tr/vl split!) and of using this for hyperopt.

*meaning: once I can look at the stopping without becoming this meme

@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Dec 22, 2020
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff scarlehoff force-pushed the multireplica_n3fit_mk2 branch from b50d9ad to 503112f Compare January 5, 2021 12:37
@scarlehoff scarlehoff changed the base branch from master to test_use_json January 5, 2021 12:38
@scarlehoff
Member Author

Ok... this is ready to be reviewed.

I feel sorry for whoever steps forward... On the bright side, some of the changes I had to make in order to allow GPU running actually simplify several parts of the code. The main simplification is that the models no longer produce the values of the predictions by default, but rather the values of the chi2 (not divided by N) for each experiment.
The predictions can still be obtained quite easily if needed, but in reality they are never needed during training, so a lot of stuff simplifies away. I don't know why I didn't think of this before; now it looks so obvious...
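Schematically, the idea is that the last step of the model contracts predictions and data into a chi2 per experiment. This is a hypothetical numpy sketch of that contraction, not n3fit's actual layers:

```python
import numpy as np

# Hypothetical sketch: instead of returning the predictions and
# computing the loss outside, the model's output per experiment is
# the chi2 itself (not divided by the number of data points):
#   chi2 = (pred - data)^T . invcov . (pred - data)
def chi2_output(predictions, data, invcovmat):
    diff = predictions - data
    return float(diff @ invcovmat @ diff)

data = np.array([1.0, 2.0, 3.0])
preds = np.array([1.5, 2.0, 2.0])
invcov = np.eye(3)  # toy uncorrelated, unit-uncertainty covariance
print(chi2_output(preds, data, invcov))  # 0.5**2 + 0 + 1**2 = 1.25
```

Since training only ever needs the loss, folding this contraction into the model avoids materializing the predictions on every step.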

This is a big PR and can't really be broken down easily, I guess this is the kind of thing that would greatly benefit of being able to sit in the same place for a few days to fix some of the most stupid choices I've made (but I honestly think it removes quite a few of those!).

With respect to the failing tests in Travis:
@wilsonmr when you use TF on macOS, which version are you using? TF 2.0 is the only one for macOS in conda for tf-eigen, but I'm not sure whether it is actually used by anyone...

I would rather skip some tests on mac when they use features of TF > 2.0.0, and drop support for TF < 2.2, instead of having a number of if(version) conditions which are only used by Travis... (if you or others are actually using older versions, that's a different story).

@scarlehoff scarlehoff marked this pull request as ready for review January 5, 2021 18:36
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 5, 2021
@github-actions

github-actions Bot commented Jan 5, 2021

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@wilsonmr
Contributor

wilsonmr commented Jan 6, 2021

> With respect to the failing tests in Travis:
> @wilsonmr when you use TF on macOS, which version are you using? TF 2.0 is the only one for macOS in conda for tf-eigen, but not sure whether it is actually used by anyone...

That is the version I'm using, yes. When I search for tensorflow, I think TF 2.0 is the only version on conda for mac (eigen or otherwise), whereas a cursory search on linux yielded versions up to 2.3.

@scarlehoff
Member Author

On PyPI they have up to 2.4, so it should be possible. I guess they are lacking a packager for macOS in conda. Ok, then I'll try to work around the issue.

@scarlehoff scarlehoff changed the title [WIP] Fit many replicas in parallel Fit many replicas in parallel Jan 6, 2021
@Zaharid
Contributor

Zaharid commented Jan 7, 2021

Hmm, the situation with TF on conda is worrisome. E.g. conda-forge/tensorflow-feedstock#107 (defaults is more up to date, but still).

@scarlehoff @wilsonmr does pip install tensorflow in a conda environment "just work"? I'd rather not go that route by default at the moment (for one thing, setting up the CIs would be a pain), but we should start looking into alternatives...

@scarlehoff
Member Author

Defaults works well, at least on Linux (the problem with conda-forge is actually quite bad, because they also have the policy of strict conda-forge > defaults priority, which breaks a lot of recipes).

@wilsonmr
Contributor

wilsonmr commented Jan 7, 2021

I will have a look at using pip TF on macOS. I suppose the thing to watch out for is that it doesn't modify or install conflicting packages when I do the pip installation.

@wilsonmr wilsonmr mentioned this pull request Jan 12, 2021
@scarlehoff scarlehoff force-pushed the multireplica_n3fit_mk2 branch from da385a0 to 3546780 Compare January 28, 2021 10:34
@scarlehoff scarlehoff marked this pull request as draft February 4, 2021 15:34
@scarlehoff
Member Author

Closing in favour of #1153

@scarlehoff scarlehoff closed this Mar 16, 2021
@scarrazza scarrazza deleted the multireplica_n3fit_mk2 branch August 27, 2021 21:42
Labels

n3fit Issues and PRs related to n3fit


4 participants