Fit many replicas in parallel by scarlehoff · Pull Request #1153 · NNPDF/nnpdf

scarlehoff · 2021-03-16T13:41:45Z

Rebasing #1039 seemed far more complicated than manually porting the changes, so I did that instead.

Still missing: a few review comments that I need to address, and checking that I haven't broken anything along the way.

github-actions · 2021-03-16T17:34:04Z

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Fit Name: NNBOT-c2e30882f-2021-03-16
Fit Report: https://vp.nnpdf.science/Z79LCwGgRTeEYTlxgrrU7A==
Fit Data: https://data.nnpdf.science/fits/NNBOT-c2e30882f-2021-03-16.tar.gz

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

wilsonmr · 2021-03-17T11:06:22Z

+  mcseed: 3
+  sum_rules: "All"
+
+  genrep: False     # true = generate MC replicas, false = use real data


Was this to save time because generating the replicas is slow, or for some other reason? I worry about this propagating if it's set in the example runcard.

scarlehoff (reply):

No, just because I didn't do it in the original.
It shouldn't be complicated, since the logic is the same as when running with 1 -r 30, just in parallel. Doing it while wasting as little memory as possible might need some work, though (or might not; I need to check).

scarlehoff · 2021-03-17T11:10:01Z

This is now ready. I'm running things like hyperopt to make sure nothing is broken. Nothing was broken in #1039, but this merge/rebase was not trivial, so I want to double-check.
It addresses the review comments from @RoyStegeman in #1039.

The main limitation is that, for now, one cannot run in parallel while generating pseudodata (so it is useless for an actual fit!). It shouldn't be difficult (especially now with #1081), since it is just a matter of telling the provider we want X replicas, but it wasn't part of the original, already-reviewed PR, so it will go into a separate one.
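To illustrate what "telling the provider we want X replicas" could mean, here is a minimal sketch that draws all pseudodata replicas in a single vectorized call instead of one per replica run. It assumes plain multivariate Gaussian fluctuations around the central values (not the actual NNPDF pseudodata prescription), and the helper names `generate_replicas` and `iter_replica_chunks` are hypothetical, not part of the validphys or n3fit API. The chunked variant shows one way to address the memory concern raised above: only `chunk_size` replicas are held in memory at a time.

```python
import numpy as np


def generate_replicas(central, covmat, n_replicas, seed=0):
    """Hypothetical helper: draw all pseudodata replicas in one call.

    central: (ndata,) central values; covmat: (ndata, ndata) covariance.
    Returns an array of shape (n_replicas, ndata).
    Assumes Gaussian fluctuations only (illustration, not the NNPDF recipe).
    """
    rng = np.random.default_rng(seed)
    # Lower-triangular factor such that covmat == chol @ chol.T
    chol = np.linalg.cholesky(covmat)
    # One row of independent standard normals per replica
    noise = rng.standard_normal((n_replicas, central.shape[0]))
    # Row r equals central + chol @ noise[r], computed for all replicas at once
    return central + noise @ chol.T


def iter_replica_chunks(central, covmat, n_replicas, chunk_size, seed=0):
    """Hypothetical lower-memory variant: yield replicas in blocks."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(covmat)
    for start in range(0, n_replicas, chunk_size):
        size = min(chunk_size, n_replicas - start)
        noise = rng.standard_normal((size, central.shape[0]))
        yield central + noise @ chol.T


if __name__ == "__main__":
    central = np.array([1.0, 2.0])
    covmat = np.array([[0.04, 0.01], [0.01, 0.09]])
    replicas = generate_replicas(central, covmat, n_replicas=100)
    print(replicas.shape)  # (100, 2)
```

The trade-off is the usual one: a single call is simplest and fastest, but its memory grows as `n_replicas * ndata`, whereas the generator keeps the peak footprint bounded at the cost of a Python loop.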

scarlehoff · 2021-03-17T15:05:18Z

The failed fit: https://vp.nnpdf.science/ccTgYKZbRdO7l97w8RAShg==/
Hyperopt: https://vp.nnpdf.science/RRzuKZfeRl2B47MsJoC0oQ==/

RoyStegeman

This looks good and I haven't noticed any problems while running the code. I left two comments for things I would personally change, but otherwise it's fine.

scarlehoff · 2021-03-31T12:51:56Z

Given that we are pushing NNPDF back a bit, I'd rather have this merged. However, while doing the final tests I noticed that, although with TF 2.2 there was no noticeable difference in speed, with TF 2.4.1 (the current Conda default) this branch is about 10-15% slower. I need to investigate that, since the main way of running the code will still be on CPU.

scarlehoff · 2021-04-01T09:14:29Z

OK, it was the use of einsum in a few places (/cc @scarrazza, your horror stories with qibo were useful :P), which is good on GPU but not on CPU.
I'll continue with the tests so it hopefully can be merged next week.
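As an illustration of the fix later committed as "only use einsum in GPU", the sketch below writes the same per-replica dense contraction in two ways and picks one depending on the device. NumPy stands in for TensorFlow here (TensorFlow offers the analogous `tf.einsum` and `tf.tensordot`, and a GPU check could be done with `tf.config.list_physical_devices("GPU")`). The layer shapes and the `replica_dense*` names are hypothetical and not taken from the n3fit code; the point is only that both paths give identical results, so the choice between them is purely a performance decision that depends on the backend.

```python
import numpy as np


def replica_dense_einsum(x, weights):
    """Apply one dense kernel per replica using einsum.

    x: (batch, n_in); weights: (n_replicas, n_in, n_out)
    returns: (batch, n_replicas, n_out)
    """
    return np.einsum("bi,rio->bro", x, weights)


def replica_dense_tensordot(x, weights):
    """Same contraction via tensordot, summing x axis 1 with weights axis 1."""
    # Remaining axes are x's (batch) followed by weights' (n_replicas, n_out)
    return np.tensordot(x, weights, axes=([1], [1]))


def replica_dense(x, weights, on_gpu):
    """Hypothetical dispatcher: einsum on GPU, tensordot otherwise."""
    if on_gpu:
        return replica_dense_einsum(x, weights)
    return replica_dense_tensordot(x, weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 3))     # 4 points, 3 input features
    w = rng.standard_normal((5, 3, 2))  # 5 replicas, each mapping 3 -> 2
    a = replica_dense(x, w, on_gpu=True)
    b = replica_dense(x, w, on_gpu=False)
    print(a.shape, np.allclose(a, b))   # (4, 5, 2) True
```

Whether einsum or tensordot is faster is backend- and version-dependent, so a change like this is best validated with the same kind of speed benchmark mentioned later in this thread rather than assumed.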

RoyStegeman · 2021-04-01T09:17:47Z

That's actually good to know! I like einsum but wasn't aware it's significantly slower on CPU than some alternatives.

But how does that explain the difference between TF 2.2 and TF 2.4.1 (or the related numpy version)?

scarlehoff · 2021-04-01T09:20:27Z

Probably the speed difference was drowned out by something else, so I didn't notice it (both are now faster in Boogiepop than before).

scarlehoff · 2021-04-01T11:46:43Z

Ok, tests performed:

Standard:

Travis
Bot

Non standard fits:

25 replicas with only a few datasets, to check that 1) it runs, 2) positivity works well, 3) nothing obviously breaks

1 net per flavour
pch
feature scaling
l2 and diagonalization

Benchmarks

While doing this I noticed the einsum problem.

Speed
Memory

If I haven't forgotten anything important, this can be merged.

scarlehoff · 2021-05-13T09:48:01Z

First I need a good fit-bot run with the new data (#1218); then I'll test the bot on this branch, and then it can be merged.


github-actions · 2021-05-13T16:41:14Z

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Fit Name: NNBOT-c0912ed3f-2021-05-13
Fit Report: https://vp.nnpdf.science/s8nAEnIVRU-nzFYvshE0zQ==
Fit Data: https://data.nnpdf.science/fits/NNBOT-c0912ed3f-2021-05-13.tar.gz

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

scarlehoff · 2021-05-14T13:40:42Z

I'll merge this as soon as #1211 is green-lighted.

scarlehoff marked this pull request as draft March 16, 2021 13:42

scarlehoff added the run-fit-bot label Mar 16, 2021

scarlehoff mentioned this pull request Mar 16, 2021

Fit many replicas in parallel #1039

Closed

8 tasks

scarlehoff added the n3fit and run-fit-bot labels and removed the run-fit-bot label Mar 17, 2021

scarlehoff marked this pull request as ready for review March 17, 2021 10:58

wilsonmr reviewed Mar 17, 2021


scarlehoff added the run-fit-bot label and removed the run-fit-bot label Mar 17, 2021

RoyStegeman approved these changes Mar 23, 2021


Comment thread n3fit/runcards/Basic_runcard_parallel.yml Outdated

Comment thread n3fit/src/n3fit/performfit.py Outdated

RoyStegeman reviewed Mar 23, 2021


Comment thread n3fit/src/n3fit/performfit.py Outdated

scarlehoff added the NNPDF4.1 label Mar 24, 2021

scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 53a4914 to 8d1e363 March 26, 2021 07:42

scarlehoff marked this pull request as draft March 31, 2021 12:52

scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 8b5dd60 to 19d235f March 31, 2021 12:57

scarlehoff marked this pull request as ready for review April 1, 2021 09:13

Zaharid reviewed Apr 7, 2021


Comment thread n3fit/src/n3fit/layers/losses.py

scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 9b2edce to 04f2a22 May 4, 2021 09:04

Zaharid reviewed May 13, 2021


Comment thread n3fit/src/n3fit/checks.py Outdated

scarlehoff and others added 15 commits May 13, 2021 14:54

merge the unconflicting parts from multireplica_n3fit_mk2

fa03b29

finish merging multireplica_n3fit_mk2

ad71e99

fix remaining problems

d8a39d1

typo

7c105f7

data transformation checked

e1e4c76

use reasonable naming for the layers

84e32ff

make the MSR rely on just one PDF layer

4000bc9

dont rely on specific order

45e6e62

typos

3d2fd5f

fill in lists before exiting loop..

379c1ef

Update n3fit/src/n3fit/performfit.py

1f210ad

Co-authored-by: Roy Stegeman <roystegeman@live.nl>

remove unnecesary runcard information

fde2299

dont rely on hardcoded basis size when having flavour dictionary

5b2fb62

forgot to pass scaler through

02e03b1

only use einsum in GPU

69ab060

scarlehoff removed the run-fit-bot label May 13, 2021

remove spurious default

1e4ab93

scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 7e773f5 to 1e4ab93 May 13, 2021 12:57

scarlehoff changed the base branch from master to update_bot_to_40 May 13, 2021 12:57

scarlehoff added the run-fit-bot label and removed the NNPDF4.1 label May 13, 2021

Base automatically changed from update_bot_to_40 to master May 14, 2021 13:39

scarlehoff merged commit fe5e00c into master May 17, 2021

scarlehoff deleted the multireplica_n3fit_mk3 branch May 17, 2021 07:17

wilsonmr mentioned this pull request May 26, 2021

Add option to use the same tr/vl split for different replicas #1244

Closed

scarlehoff mentioned this pull request Jun 1, 2021

Generate replicas when fitting models in parallel #1261

Merged


4 participants
