Skip to content

Fit many replicas in parallel#1153

Merged
scarlehoff merged 16 commits into
masterfrom
multireplica_n3fit_mk3
May 17, 2021
Merged

Fit many replicas in parallel#1153
scarlehoff merged 16 commits into
masterfrom
multireplica_n3fit_mk3

Conversation

@scarlehoff
Copy link
Copy Markdown
Member

@scarlehoff scarlehoff commented Mar 16, 2021

A rebase for #1039 seemed way more complicated than manually porting the changes so I did that.

Missing: a few review comments that I still need to implement and ensuring I haven't broken anything along the way.

@scarlehoff scarlehoff marked this pull request as draft March 16, 2021 13:42
@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Mar 16, 2021
@scarlehoff scarlehoff mentioned this pull request Mar 16, 2021
8 tasks
@github-actions
Copy link
Copy Markdown

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff scarlehoff added n3fit Issues and PRs related to n3fit run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Mar 17, 2021
@scarlehoff scarlehoff marked this pull request as ready for review March 17, 2021 10:58
mcseed: 3
sum_rules: "All"

genrep: False # true = generate MC replicas, false = use real data
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this to save time because the generation of reps is slow? Or was it some other reason? I worry about this propagating if it's set in the example.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just because I didn't do it in the original.
It shouldn't be complicated because the logic is the same as the one with 1 -r 30 but now in parallel. But doing it while wasting as little memory as possible might need some work (or might not, but need to check).

@scarlehoff
Copy link
Copy Markdown
Member Author

This is now ready. I'm running things like hyperopt to ensure nothing is broken. Nothing was in #1039 but this merge/rebase was not trivial so I want to check.
It addresses the review comments from @RoyStegeman in #1039

The main limitation is that at the moment one cannot run in parallel while generating pseudodata (so it is useless for a fit!). It shouldn't be difficult (specially now with #1081), just telling the provider we want X replicas, but wasn't part of the original PR that was already reviewed so it will go to a separate one.

@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Mar 17, 2021
@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 17, 2021

@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Mar 17, 2021
Copy link
Copy Markdown
Member

@RoyStegeman RoyStegeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and I haven't noticed any problems while running the code. I left two comments for things I would personally change, but otherwise it's fine.

Comment thread n3fit/runcards/Basic_runcard_parallel.yml Outdated
Comment thread n3fit/src/n3fit/performfit.py Outdated
Comment thread n3fit/src/n3fit/performfit.py Outdated
@scarlehoff
Copy link
Copy Markdown
Member Author

Given that we are pushing back a bit NNPDF I'd rather have this merged, but in doing the final tests I've noticed that while for TF 2.2 I didn't notice a difference on speed, with 2.4.1 (the current Conda default) this branch is about 10/15% slower, so I need to investigate that (as the main way of running the code will still be in CPU).

@scarlehoff scarlehoff marked this pull request as draft March 31, 2021 12:52
@scarlehoff scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 8b5dd60 to 19d235f Compare March 31, 2021 12:57
@scarlehoff scarlehoff marked this pull request as ready for review April 1, 2021 09:13
@scarlehoff
Copy link
Copy Markdown
Member Author

Ok, it was the usage of einsum in a few places (/cc @scarrazza your horror stories with qibo were useful :P) which is good for GPU but not for CPU.
I'll continue with the tests so it hopefully can be merged next week.

@RoyStegeman
Copy link
Copy Markdown
Member

That's actually good to know! I like einsum but wasn't aware it's significantly slower on cpu than some alternatives.

But how does that explain the difference between tf2.2 and tf2.4.1 (or related numpy version)?

@scarlehoff
Copy link
Copy Markdown
Member Author

Probably the speed difference was drowned by something else so I didn't notice it (both are faster now in Boogiepop than before).

@scarlehoff
Copy link
Copy Markdown
Member Author

Ok, tests performed:

Standard:

  • Travis
  • Bot

Non standard fits:

25 replicas with only a few datasets, to see that 1) it runs 2) positivity work well 3) nothing obviously break

  • 1 net per flavour
  • pch
  • feature scaling
  • l2 and diagonalization

Benchmarks

On doing this I noted the einsum problem.

  • Speed
  • Memory

If I haven't forgotten anything important, this can be merged.

Comment thread n3fit/src/n3fit/layers/losses.py
@scarlehoff scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 9b2edce to 04f2a22 Compare May 4, 2021 09:04
@scarlehoff
Copy link
Copy Markdown
Member Author

First I need to have a good fit bot with the new data (#1218) and then test the bot in this branch and then it can merged.

Comment thread n3fit/src/n3fit/checks.py Outdated
@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label May 13, 2021
@scarlehoff scarlehoff force-pushed the multireplica_n3fit_mk3 branch from 7e773f5 to 1e4ab93 Compare May 13, 2021 12:57
@scarlehoff scarlehoff changed the base branch from master to update_bot_to_40 May 13, 2021 12:57
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed NNPDF4.1 labels May 13, 2021
@github-actions
Copy link
Copy Markdown

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Base automatically changed from update_bot_to_40 to master May 14, 2021 13:39
@scarlehoff
Copy link
Copy Markdown
Member Author

I'll merge this as soon as #1211 is green-lighted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

n3fit Issues and PRs related to n3fit run-fit-bot Starts fit bot from a PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants