Make a n3fit.io provider #1081
Conversation
to do:

WIP but also please take a look and tell me if it's horrible.
I should say this links to #1080, because I would vote to move the new module in this PR to that subdir.
Thanks! I'll have a look at it tomorrow.
scarlehoff
left a comment
The positivity is trivial if and only if you remember that both integrability and positivity go through the same place and you set all the correct flags (which for now means the flag is set to True if `"INTEG" in dataset.name`). For the rest it is just a simpler dataset.
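A minimal sketch of the convention described above (the helper name is hypothetical; only the `"INTEG" in dataset.name` check comes from the comment):

```python
def is_integrability(dataset):
    """Hypothetical helper: positivity and integrability go through the same
    code path, distinguished only by a flag derived from the dataset name."""
    return "INTEG" in dataset.name
```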
```diff
  # run once and all_exp_infos is a list of just that one element
  stopwatch.register_times("data_loaded")
- for replica_number, exp_info, nnseed in zip(replica, all_exp_infos, nnseeds):
+ for replica_number, exp_info, nnseed in zip(replicas, all_exp_infos, replicas_nnseed):
```
I guess at this point (maybe for the future) it would make sense to have one single object which contains `all_exp_infos` and also the replica number and the corresponding seed?
So that one can do `exps_replicas_fitting_data_dict.iterate_replicas()` or `exps_replicas_fitting_data_dict.iterate_experiments()`.
But for now it surely makes sense to keep it with the "old structure" (sketched below).
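A hedged sketch of what such a container might look like (all names are either lifted from the comment above or invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class ExpsReplicasFittingData:
    """Hypothetical bundle of the experiment info together with the
    per-replica numbers and seeds."""
    all_exp_infos: list
    replicas: list
    replicas_nnseed: list

    def iterate_replicas(self):
        # Mirrors the zip in the diff above.
        yield from zip(self.replicas, self.all_exp_infos, self.replicas_nnseed)

    def iterate_experiments(self):
        yield from self.all_exp_infos
```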
```diff
  from NNPDF import RandomGenerator
  from reportengine import collect

+ from n3fit.io.reader import common_data_reader_experiment
```
I guess this `common_data_reader_experiment` would be made obsolete by the python commondata?
Possibly. I guess a large amount of n3fit relies on the dictionaries which are returned by the `common_data_reader_*` functions.
Obviously the easiest thing would be to make some kind of "wrapper" which took the python commondata and then filled the same dictionary, so the replacement might look quite similar (see the sketch below). It's been a while since I saw how deep these dictionaries get passed.
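A hedged sketch of that wrapper idea (the dictionary keys are illustrative guesses, not the actual reader output):

```python
def commondata_to_reader_dict(cd):
    """Hypothetical adapter: take a python commondata object and fill the
    same kind of dictionary the common_data_reader_* functions return,
    so that downstream n3fit code is untouched."""
    return {
        "name": cd.setname,               # keys are guesses, for illustration
        "ndata": cd.ndata,
        "central_values": cd.central_values,
    }
```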
Btw, feel free to "break" any part of the reader you dislike. It was basically written during a time when […]. In summary, if at any point you wonder "should I use this clever vp function instead of this convoluted loop of loops" the answer is probably "yes".
The positivity datasets were easy enough to "actionify"; the integrability less so, because of the weird combo of parsing and production rule and the chance that they might not exist. In this case I just moved the part in `performfit` to a different location. Since `fitting` gets expanded as a namespace I was able to remove it as input and take the actual parameters (see the sketch below). The obvious downside is that `performfit` takes loads of parameters now. I think the upside is that it's a bit less obfuscated what should be present in `fitting`. I still need to update the docstrings in the actions. Probably also the names; they're a bit disgusting right now.
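A hedged sketch of the signature change being described (all parameter names apart from `fitting` are illustrative):

```python
# Before: the whole `fitting` dict comes in and gets unpacked inside.
def performfit(fitting):
    parameters = fitting["parameters"]
    ...

# After: since `fitting` is expanded as a namespace, the action declares the
# individual keys it needs and reportengine resolves them directly.
def performfit(parameters, savepseudodata, debug):
    ...
```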
siranipour
left a comment
Looks good. As you say, a fit bot run will be nice to see everything works as it should.
I will finish this... later
This is ready for review; I moved the reader to validphys now. I also added an action that returns the pseudodata. From what I can tell the output should be pretty much identical to […]. Quite keen for somebody to take a look at this because I'd like to dump the pseudodata from the closure tests for something Luigi and I are looking into; equally I don't want to rush this through because it touches some pretty fundamental parts of running a fit (hopefully without changing anything though). If the tests run I will run the fit bot.
Btw, as well as making several things easier (such as #1059) and allowing us to slot in the python replica generation, it is also one of the steps towards making separate packages for n3fit and validphys.
```python
load=None,
sum_rules=True,
tensorboard=None,
save=None,
```
I would prefer to have the parameters, since there are so many, a bit more organized, like `load`/`save` being together and the same for `tensorboard`/`save_weights_each`/`debug`.
Also `kfold_parameters` should have `=None` looking at the docstring.
That makes sense; I think the ordering I was aiming for was order of use. But I agree that it would probably be easier to read if they were grouped by context. With regards to your other comment, would you prefer the check-only arguments first or last?
> Also `kfold_parameters` should have `=None` looking at the docstring.
This doesn't need a default because the default gets filled in by the production rule in the `Config` class (sketched below). I think the docstring is consistent with its actual values.
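A hedged sketch of how such a production rule works (the class is a stand-in and the body is illustrative; only the `produce_*` naming convention comes from reportengine):

```python
class Config:
    """Stand-in for the real reportengine Config class, illustration only."""

    def produce_kfold_parameters(self, hyperscan=None):
        # Hypothetical body: when no hyperopt scan is configured there is
        # nothing to fold, so the "default" is produced here rather than as
        # a keyword default on performfit itself.
        if hyperscan is None:
            return None
        return hyperscan["kfold"]
```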
That way it also goes with the other hyperopt options.
Regarding the checks, I don't mind; maybe a comment is enough. It might confuse people at some point in the future (I guess pylint, for instance, will tell you to remove them).
```python
# place tables in last replica path
self.table_folder = path / TAB_FOLDER
self.table_folder.mkdir(exist_ok=True)
```
Can this be moved to `if file_content["fitting"].get("savepseudodata"):`? Or do we want the tables folder to always exist even when it will be empty?
I put it here because it modifies the environment attributes. I felt like having empty tables folders wasn't a disaster, but I think saying I want that would be a step too far :P
The alternative would be to just set the table path to be exactly the replica path, and the table would get put with all the other files. I don't think that would be so bad.
I've done my alternative suggestion, I personally quite like it:

```
$ n3fit test.yml 1
...
$ ls test/nnfit/replica_1
chi2exps.log                                  test.exportgrid  test.params    test.time
datacuts_theory_fitting_pseudodata_table.csv  test.fitinfo     test.preproc   version.info
tboard                                        test.json        test.sumrules  weights.h5
```
I prefer it to the tables folder.
Yes, the name isn't ideal, but also I don't mind that.

I think it would be good to have a check for the moment, and if more than one replica is requested then pseudodata can't be saved and the fit won't run.

That would be one solution, perhaps the easiest. I was thinking that if one loaded and concatenated all the tables from all the replicas then it wouldn't matter if somebody did the above. Then the function would yield the columns of that super table in an order which corresponded to the final PDF replica. Something like the sketch below.
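A minimal sketch of that super-table idea (the CSV name matches the listing above; the read options and the numeric sort are my assumptions):

```python
from pathlib import Path
import pandas as pd

def load_concatenated_pseudodata(fit_path):
    """Load the pseudodata table from every replica folder and concatenate
    them, so the result is independent of how the replicas were batched."""
    reps = sorted(
        Path(fit_path).glob("nnfit/replica_*"),
        key=lambda p: int(p.name.split("_")[1]),  # numeric, not lexicographic
    )
    frames = [
        pd.read_csv(rep / "datacuts_theory_fitting_pseudodata_table.csv", index_col=0)
        for rep in reps
    ]
    return pd.concat(frames, axis=1)
```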
https://github.com/NNPDF/nnpdf/pull/1081/checks?check_run_id=2014283823#step:6:43 Why is Tensorflow 1.14 being installed with the conda package for the bot? Travis tests have passed so the conda recipe should be ok...
and why is it coming from […]?
I thought I had virtually copied the original seed behaviour when I refactored this part. Have I messed it up potentially? Are there any serious ramifications?
You maybe did; I'm not suggesting you caused the issue, I'm more concerned that you've been using […]. There are no real ramifications, apart from if somebody was relying on reproducing the pseudodata used in n3fit.*

*whereas running […] would give you some different replicas
Ah my apologies everyone! Fortunately I ended up not really needing that function anyway. Sorry for missing this at the time.
No need to apologise. So the test that I can't easily get to pass is […]
… behaviour has changed slightly but now is reproducible independently of number of replicas being fitted at once
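A minimal sketch of seeding that behaves the way the commit above describes (not the actual n3fit code): the stream for replica N depends only on the fit-level seed and N itself, never on the batch of replicas being run.

```python
from numpy.random import SeedSequence, default_rng

def replica_rng(global_seed, replica_index):
    # Replica N sees the same random stream whether it is fitted alone or
    # alongside any number of other replicas.
    return default_rng(SeedSequence(global_seed, spawn_key=(replica_index,)))
```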
Are there any further comments on this? Or can we merge?
Another bump. Appreciate everyone is busy, but as mentioned on Friday I would like this for the closure fits, and as somebody else mentioned I would ideally have those fits done in advance of my pdf4lhc talk (clearly having them by the practice this week is unfeasible).
Sure thing mate, this is high on my priority list now, will have it reviewed by end of the day.
siranipour
left a comment
I think it'd be nice if you could save the training and validation splits as is done in the current C++ code, but I appreciate you want this merged as a matter of urgency.
Thanks for taking a look, I'll make appropriate changes tomorrow morning.
Hi @siranipour, I think I addressed all of the points I agreed with - thanks for the review. I also added an action which dumps the training masks with an index. This action is basically only callable from the API right now (see the sketch below), but in the future people are welcome to save this with the data in the same way the pseudodata is saved. I've expanded the documentation on the relevant actions and updated the action in […].
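A hedged sketch of calling it through the validphys API (the action name and required keys are guesses, not the actual interface):

```python
from validphys.api import API

# Hypothetical action name and arguments; the real ones may differ.
masks = API.replica_training_mask_table(fit="test", replica=1)
```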
Docs and examples from the new commits look great, cheers Mikey!
Don't think I'll be looking at this in detail but just to echo that it indeed looks very good.
Yes, lgtm
The aim here is to duplicate the parts of `n3fit.io.reader` which get called by `fit`, but transformed into validphys-style providers.
I haven't looked at the integrability or positivity sets, although they looked kind of trivial in comparison from a scan (famous last words).
The idea here was to change to using providers in n3fit while still being driven by libnnpdf, so hopefully nothing changes in a fit. I don't think this breaks hyperopt or conflicts with any PRs out there. Obviously a run of the fit bot will test this fully.
The next step would be to replace the calling of libnnpdf with providers from the various "destroy C++" PRs. I would probably move the location of this new module somewhere else and call it something better as well - this was more of a proof of principle. Changing the module name and moving it just requires updating the providers list in `n3fit_exec`, so it's very easy to do without breaking anything (sketched below).
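A hedged sketch of the kind of one-line edit meant here (the list contents are illustrative, not the actual `n3fit_exec`):

```python
from validphys.app import App

class N3FitApp(App):
    # Relocating or renaming the new module only means editing this list.
    default_providers = [
        "n3fit.performfit",
        "validphys.n3fit_data",  # wherever the new reader module ends up
    ]
```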
I added some inefficiency with the mask functions, but the upside is we expose the masks for each dataset. This action is easy to modify to remove the dependency on `theoryid`, taking `group_name` and `dataset_inputs_loaded_cd_with_cuts` instead (see the sketch below). I don't think these functions are anywhere near the bottleneck though, so I can't be bothered to think of a better way to do this.
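A hedged sketch of the dependency swap (the function name and any arguments beyond those mentioned above are illustrative):

```python
# Current (sketch): the masks action needs theoryid in order to load the data.
def tr_masks(theoryid, dataset_inputs):
    ...

# Proposed: take the already-loaded, cut commondata directly instead.
def tr_masks(group_name, dataset_inputs_loaded_cd_with_cuts):
    ...
```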