Make a n3fit.io provider #1081
Conversation
to do:

WIP but also please take a look and tell me if it's horrible.
I should say this links to #1080, because I would vote to move the new module in this PR to that subdir.
Thanks! I'll have a look at it tomorrow.
scarlehoff
left a comment
The positivity is trivial if and only if you remember that both integrability and positivity go through the same place and you set all the correct flags (which for now means the flag is set to True if `"INTEG" in dataset.name`). For the rest it is just a simpler dataset.
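A minimal sketch of the convention described above (the helper name is hypothetical; only the `"INTEG" in dataset.name` check comes from the comment):

```python
def is_integrability(dataset):
    """Hypothetical helper: positivity and integrability go through the same
    code path, distinguished only by a flag derived from the dataset name."""
    return "INTEG" in dataset.name
```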
```diff
  # run once and all_exp_infos is a list of just that one element
  stopwatch.register_times("data_loaded")
- for replica_number, exp_info, nnseed in zip(replica, all_exp_infos, nnseeds):
+ for replica_number, exp_info, nnseed in zip(replicas, all_exp_infos, replicas_nnseed):
```
I guess at this point (maybe for the future) it would make sense to have one single object which contains `all_exp_infos` and also the replica number and the corresponding seed?
So that one can do `exps_replicas_fitting_data_dict.iterate_replicas()` or `exps_replicas_fitting_data_dict.iterate_experiments()`.
But for now it surely makes sense to keep it with the "old structure" (sketched below).
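A hedged sketch of what such a container might look like (all names are either lifted from the comment above or invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class ExpsReplicasFittingData:
    """Hypothetical bundle of the experiment info together with the
    per-replica numbers and seeds."""
    all_exp_infos: list
    replicas: list
    replicas_nnseed: list

    def iterate_replicas(self):
        # Mirrors the zip in the diff above.
        yield from zip(self.replicas, self.all_exp_infos, self.replicas_nnseed)

    def iterate_experiments(self):
        yield from self.all_exp_infos
```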
```diff
  from NNPDF import RandomGenerator
  from reportengine import collect

+ from n3fit.io.reader import common_data_reader_experiment
```
I guess this `common_data_reader_experiment` would be made obsolete by the python commondata?
Possibly. I guess a large amount of n3fit relies on the dictionaries which are returned by the `common_data_reader_*` functions.
Obviously the easiest thing would be to make some kind of "wrapper" which took the python commondata and then filled the same dictionary, so the replacement might look quite similar (see the sketch below). It's been a while since I saw how deep these dictionaries get passed.
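A hedged sketch of that wrapper idea (the dictionary keys are illustrative guesses, not the actual reader output):

```python
def commondata_to_reader_dict(cd):
    """Hypothetical adapter: take a python commondata object and fill the
    same kind of dictionary the common_data_reader_* functions return,
    so that downstream n3fit code is untouched."""
    return {
        "name": cd.setname,               # keys are guesses, for illustration
        "ndata": cd.ndata,
        "central_values": cd.central_values,
    }
```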
Btw, feel free to "break" any part of the reader you dislike. It was basically written during a time when […]. In summary, if at any point you wonder "should I use this clever vp function instead of this convoluted loop of loops" the answer is probably "yes".
The positivity datasets were easy enough to "actionify"; the integrability less so, because of the weird combo of parsing and production rule and the chance that they might not exist. In this case I just moved the part in `performfit` to a different location. Since `fitting` gets expanded as a namespace I was able to remove it as input and take the actual parameters (see the sketch below). The obvious downside is that `performfit` takes loads of parameters now. I think the upside is that it's a bit less obfuscated what should be present in `fitting`. I still need to update the docstrings in the actions. Probably also the names; they're a bit disgusting right now.
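A hedged sketch of the signature change being described (all parameter names apart from `fitting` are illustrative):

```python
# Before: the whole `fitting` dict comes in and gets unpacked inside.
def performfit(fitting):
    parameters = fitting["parameters"]
    ...

# After: since `fitting` is expanded as a namespace, the action declares the
# individual keys it needs and reportengine resolves them directly.
def performfit(parameters, savepseudodata, debug):
    ...
```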
siranipour
left a comment
Looks good. As you say, a fit bot run will be nice to see everything works as it should.
I will finish this... later
This is ready for review; I moved the reader to validphys now. I also added an action that returns the pseudodata. From what I can tell the output should be pretty much identical to […]. Quite keen for somebody to take a look at this because I'd like to dump the pseudodata from the closure tests for something Luigi and I are looking into; equally I don't want to rush this through because it touches some pretty fundamental parts of running a fit (hopefully without changing anything though). If the tests run I will run the fit bot.
Btw, as well as making several things easier (such as #1059) and allowing us to slot in the python replica generation, it is also one of the steps towards making separate packages for n3fit and validphys.
```python
load=None,
sum_rules=True,
tensorboard=None,
save=None,
```
I would prefer to have the parameters, since there are so many, a bit more organized, like `load`/`save` being together and the same for `tensorboard`/`save_weights_each`/`debug`.
Also `kfold_parameters` should have `=None` looking at the docstring.
That makes sense; I think the ordering I was aiming for was order of use. But I agree that it would probably be easier to read if they were grouped by context. With regards to your other comment, would you prefer the check-only arguments first or last?
> Also `kfold_parameters` should have `=None` looking at the docstring.
This doesn't need a default because the default gets filled in by the production rule in the `Config` class (sketched below). I think the docstring is consistent with its actual values.
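A hedged sketch of how such a production rule works (the class is a stand-in and the body is illustrative; only the `produce_*` naming convention comes from reportengine):

```python
class Config:
    """Stand-in for the real reportengine Config class, illustration only."""

    def produce_kfold_parameters(self, hyperscan=None):
        # Hypothetical body: when no hyperopt scan is configured there is
        # nothing to fold, so the "default" is produced here rather than as
        # a keyword default on performfit itself.
        if hyperscan is None:
            return None
        return hyperscan["kfold"]
```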
That way it also goes with the other hyperopt options.
Regarding the checks, I don't mind; maybe a comment is enough. It might confuse people at some point in the future (I guess pylint, for instance, will tell you to remove them).
```python
# place tables in last replica path
self.table_folder = path / TAB_FOLDER
self.table_folder.mkdir(exist_ok=True)
```
Can this be moved to `if file_content["fitting"].get("savepseudodata"):`? Or do we want the tables folder to always exist even when it will be empty?
I put it here because it modifies the environment attributes. I felt like having empty tables folders wasn't a disaster, but I think saying I want that would be a step too far :P
The alternative would be to just set the table path to be exactly the replica path, and the table would get put with all the other files. I don't think that would be so bad.
I've done my alternative suggestion, I personally quite like it:

```
$ n3fit test.yml 1
...
$ ls test/nnfit/replica_1
chi2exps.log                                  test.exportgrid  test.params    test.time
datacuts_theory_fitting_pseudodata_table.csv  test.fitinfo     test.preproc   version.info
tboard                                        test.json        test.sumrules  weights.h5
```
I prefer it to the tables folder.
Yes, the name isn't ideal, but also I don't mind that.

I think it would be good to have a check for the moment, and if more than one replica is requested then pseudodata can't be saved and the fit won't run.

That would be one solution, perhaps the easiest. I was thinking that if one loaded and concatenated all the tables from all the replicas then it wouldn't matter if somebody did the above. Then the function would yield the columns of that super table in an order which corresponded to the final PDF replica. Something like the sketch below.
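A minimal sketch of that super-table idea (the CSV name matches the listing above; the read options and the numeric sort are my assumptions):

```python
from pathlib import Path
import pandas as pd

def load_concatenated_pseudodata(fit_path):
    """Load the pseudodata table from every replica folder and concatenate
    them, so the result is independent of how the replicas were batched."""
    reps = sorted(
        Path(fit_path).glob("nnfit/replica_*"),
        key=lambda p: int(p.name.split("_")[1]),  # numeric, not lexicographic
    )
    frames = [
        pd.read_csv(rep / "datacuts_theory_fitting_pseudodata_table.csv", index_col=0)
        for rep in reps
    ]
    return pd.concat(frames, axis=1)
```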
https://github.com/NNPDF/nnpdf/pull/1081/checks?check_run_id=2014283823#step:6:43 Why is Tensorflow 1.14 being installed with the conda package for the bot? Travis tests have passed so the conda recipe should be ok...
and why is it coming from […]?
I thought I had virtually copied the original seed behaviour when I refactored this part. Have I messed it up potentially? Are there any serious ramifications?
You maybe did; I'm not suggesting you caused the issue, I'm more concerned that you've been using […]. There are no real ramifications, apart from if somebody was relying on reproducing the pseudodata used in n3fit.*

*whereas running […] would give you some different replicas
Ah my apologies everyone! Fortunately I ended up not really needing that function anyway. Sorry for missing this at the time.
No need to apologise. So the test that I can't easily get to pass is […]
… behaviour has changed slightly but now is reproducible independently of number of replicas being fitted at once
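A minimal sketch of seeding that behaves the way the commit above describes (not the actual n3fit code): the stream for replica N depends only on the fit-level seed and N itself, never on the batch of replicas being run.

```python
from numpy.random import SeedSequence, default_rng

def replica_rng(global_seed, replica_index):
    # Replica N sees the same random stream whether it is fitted alone or
    # alongside any number of other replicas.
    return default_rng(SeedSequence(global_seed, spawn_key=(replica_index,)))
```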
Are there any further comments on this? Or can we merge?
Another bump. Appreciate everyone is busy, but as mentioned on Friday I would like this for the closure fits, and as somebody else mentioned I would ideally have those fits done in advance of my pdf4lhc talk (clearly having them by the practice this week is unfeasible).
Sure thing mate, this is high on my priority list now, will have it reviewed by end of the day.
siranipour
left a comment
I think it'd be nice if you could save the training and validation splits as is done in the current C++ code, but I appreciate you want this merged as a matter of urgency.
Thanks for taking a look, I'll make appropriate changes tomorrow morning.
Hi @siranipour, I think I addressed all of the points I agreed with - thanks for the review. I also added an action which dumps the training masks with an index. This action is basically only callable from the API right now (see the sketch below), but in the future people are welcome to save this with the data in the same way the pseudodata is saved. I've expanded the documentation on the relevant actions and updated the action in […].
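A hedged sketch of calling it through the validphys API (the action name and required keys are guesses, not the actual interface):

```python
from validphys.api import API

# Hypothetical action name and arguments; the real ones may differ.
masks = API.replica_training_mask_table(fit="test", replica=1)
```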
Docs and examples from the new commits look great, cheers Mikey!
Don't think I'll be looking at this in detail but just to echo that it indeed looks very good.
Yes, lgtm
The aim here is to duplicate the parts of `n3fit.io.reader` which get called by `fit`, but transformed into validphys-style providers.
I haven't looked at the integrability or positivity sets, although they looked kind of trivial in comparison from a scan (famous last words).
The idea here was to change to using providers in n3fit while still being driven by libnnpdf, so hopefully nothing changes in a fit. I don't think this breaks hyperopt or conflicts with any PRs out there. Obviously a run of the fit bot will test this fully.
The next step would be to replace the calling of libnnpdf with providers from the various "destroy C++" PRs. I would probably move the location of this new module somewhere else and call it something better as well - this was more of a proof of principle. Changing the module name and moving it just requires updating the providers list in `n3fit_exec`, so it's very easy to do without breaking anything (sketched below).
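A hedged sketch of the kind of one-line edit meant here (the list contents are illustrative, not the actual `n3fit_exec`):

```python
from validphys.app import App

class N3FitApp(App):
    # Relocating or renaming the new module only means editing this list.
    default_providers = [
        "n3fit.performfit",
        "validphys.n3fit_data",  # wherever the new reader module ends up
    ]
```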
I added some inefficiency with the mask functions, but the upside is we expose the masks for each dataset. This action is easy to modify to remove the dependency on `theoryid`, taking `group_name` and `dataset_inputs_loaded_cd_with_cuts` instead (see the sketch below). I don't think these functions are anywhere near the bottleneck though, so I can't be bothered to think of a better way to do this.
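A hedged sketch of the dependency swap (the function name and any arguments beyond those mentioned above are illustrative):

```python
# Current (sketch): the masks action needs theoryid in order to load the data.
def tr_masks(theoryid, dataset_inputs):
    ...

# Proposed: take the already-loaded, cut commondata directly instead.
def tr_masks(group_name, dataset_inputs_loaded_cd_with_cuts):
    ...
```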