Automatic port old commondata to the new format #1931
Conversation
Thanks @scarlehoff! I will have a look this afternoon. Re the 3rd point on positivity: why can't you just use the same script to also convert the positivity datasets? These should be exactly the same as regular datasets.
They are not exactly the same. I need to change the reader to read them normally.
Indeed! For the time being, if positivity datasets are provided in the new format, everything inside
I think I'm planning to do this (let me know if it makes sense):
Kinematics are needed.
Yes, I think this definitely makes sense! The most important thing is that the value of
I have updated the names according to the discussions today. @felixhekhorn @enocera @Radonirinaunimi @t7phy @comane Please have a look. If you could add suggestions under the names that you think should be changed, it would be much appreciated.
Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com>
This is ready for review. I've rebased this branch on top of master and the reader on top of this branch. So, by checking out the reader branch #1678 you should be able to use the new commondata with everything that is in master right now: git checkout final_reader_for_new_commondata_mk2
To test new commondata not included here (i.e., not part of the port) you can just copy it to
I've tried a few plots, chi2 comparisons and several fit runcards, but of course, the range of things that can go wrong is very wide. Please let me know if you find any other problems.
There is this dataset
In addition, there are also the ATLAS 3D distributions (
PS: there are also the FPF datasets, but I don't think we need to port them now. If absolutely needed, this can be done later.
We might want to leave them so that when the corresponding fktables are generated, the (new) commondata is generated accordingly.
variables:
  k1:
    description: Variable k1
    label: k1
    units: ''
  k2:
    description: Variable k2
    label: k2
    units: ''
  k3:
    description: Variable k3
    label: k3
    units: ''
(just picked a random dataset, so not sure about others) Is this PR also supposed to get the correct metadata? Like, e.g., here k1=x, k2=Q2, k3=y.
No. This is an automatic port also in that sense: there were no labels in the old files.
I see - so we still need PRs to fine-tune the outcome here, right?
The original idea (and the reason why it has taken so long) was to manually reimplement all datasets so that we could correct the many bugs that might have accumulated.
However, I think that's an impossible ideal... so... don't count on it ^^U
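As an illustration of the fine-tuning a follow-up PR could do, here is a hand-tuned version of the quoted snippet using the x, Q2, y meanings suggested earlier in the thread. This is only a sketch: the descriptions and the GeV^2 unit are my guesses and would need checking against the actual convention.

```yaml
variables:
  k1:
    description: Bjorken x
    label: x
    units: ''
  k2:
    description: Squared momentum transfer
    label: Q2
    units: GeV^2
  k3:
    description: Inelasticity y
    label: y
    units: ''
```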
I'm finding some performance issues that would actually be solved by just having one dataset per folder (sort of like the old implementation, which was one dataset per file), since that would allow us to use the filesystem as a database. There are many places where it is assumed that reading the dataset names is cheap, while now it requires reading the metadata + observables. There are workarounds of course, but I feel that's adding complexity to a really simple problem... and it would put an end to the naming discussion.
Just to get an idea, when you say performance issues, how much worse is it?
A factor of almost infinite! Anyway, the reasons are:
I'm thinking that a way of dealing with all three problems at once is to compile a cache of pickled commondata as they get read for the first time, with more or less the following algorithm:
Whether this is a good solution or not will depend on how fast we can take the
Instead of _ we could even do /.uncert, /.data, etc., so that each dataframe is by itself and one can inspect the dataframes. I know I said I didn't like increasing the complexity, but now I'm thinking it might be a net positive after all. Especially this last point, I think, would be very useful.
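A rough sketch of the pickle-cache idea discussed above (function names and layout are invented here, not the actual validphys API): the reader first looks for a cached pickle that is newer than the metadata file, and only falls back to the expensive parse on a miss.

```python
import os
import pickle


def load_commondata_cached(metadata_path, parse_fn, cache_dir):
    """Hypothetical sketch of the pickle cache.

    `parse_fn` stands in for whatever expensive reader turns the
    metadata (+ observables) into a commondata object.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, os.path.basename(metadata_path) + ".pkl")
    # Trust the cache only if it is at least as new as the metadata it came from
    if (
        os.path.exists(cache_file)
        and os.path.getmtime(cache_file) >= os.path.getmtime(metadata_path)
    ):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    data = parse_fn(metadata_path)
    with open(cache_file, "wb") as f:
        pickle.dump(data, f)
    return data
```

The mtime comparison means editing a metadata file automatically invalidates its cache entry, so nothing needs to track versions explicitly.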
Proposal to unify the variables, to be discussed later today:
DIS variables:
Squared quantities will simply have a
This covers all variables used, while keeping some conventions that are used across validphys (like
Once agreed, I'll simply do a search-and-replace.
It seems a good idea; consistency is good. Although I do think that for the invariant dijet mass, m_jj would be a better name, as it's very clear what it means at a glance.
Agreed, I've updated the list. It will also make it consistent with
Maybe it is better to choose a different name for the rapidity, as it can easily be confused with the inelasticity in DIS.
But in principle, wouldn't it be better to always have y_<some_particle_name>, and so on for other variables like pT? This way we have a consistent name that works for the same process types but is also verbose for readers. That also avoids its confusion with the DIS y.
We should at the very least use
Another option is to use
I've now standardized the variable names. Regarding the y/eta/etay problem, I've decided not to touch those. I would suggest we use
If nobody has any complaints, I will merge the new commondata files into master later today. It would also allow people to start fixing the bugs directly as PRs to master, or implementing new data as PRs to master. I'll keep rebasing the reader on top of master, which hopefully will only be needed for a few weeks more.
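For reference, the search-and-replace mentioned above could be as simple as the following sketch. The function name and the rename map are illustrative (mjj → m_jj is just the example discussed in this thread); a word-boundary regex is used so that renaming one variable cannot clobber another name that contains it as a substring.

```python
import pathlib
import re


def rename_variables(root, renames):
    """Apply a search-and-replace of variable names to every
    metadata.yaml under `root`. Illustrative sketch, not the
    actual script used for the port."""
    for path in sorted(pathlib.Path(root).rglob("metadata.yaml")):
        text = path.read_text()
        for old, new in renames.items():
            # \b guards against partial matches (e.g. 'mjj' inside 'mjjx')
            text = re.sub(rf"\b{re.escape(old)}\b", new, text)
        path.write_text(text)
```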
This is the port of all the old commondata into the new format.
It is mostly working and mostly automated. However, before pushing the new set of data I need some input.
In the file buildmaster/old_new_porting_map.yml you can find the mapping that I'm using. The mapping maps each dataset to:
ℹ️ HELP and FEEDBACK WANTED:
Please review the file buildmaster/old_new_porting_map.yml; it should be one of the ones at the top if you go to Files Changed.
If people could already go through the naming, that would be very helpful. Especially the choice of name for the observable.
And please check the energies for CHORUS / SLAC / NUTEV / etc. I've put the average energy as @enocera suggested, but I'm not sure what to put for SLAC there.
For the positivity datasets, all of them are now part of the POS process and the NNPDF experiment. As far as I am aware, experiment and process for positivity datasets should not mean anything, but if they do somehow, please let me know.
Anything else you find funny or think should be changed, please point it out!
Thank you very much!
(note that the mapping should not include datasets already implemented)
Todo:
- Add a legacy version to some of the datasets already implemented if necessary (or change the variant name to legacy, instead of bugged)
- Go through the entire dataset, not only 4.0. The amount of metadata there will be of, err, lesser quality. But I think we can live with that.
- Once everything is working I will remove all old commondata and the possibility of reading it with validphys.
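Since the porting map is the main thing under review, a small sanity check could help catch accidental name collisions before the merge. This assumes, hypothetically, that the map can be flattened into an old-name → new-name dictionary (the real YAML file may carry more structure per entry):

```python
def find_duplicate_targets(porting_map):
    """Return new names that more than one old dataset would be
    ported to, as {new_name: [old_names]}. Sketch only: assumes a
    flat old -> new mapping, which may not match the real file."""
    seen = {}
    duplicates = {}
    for old, new in porting_map.items():
        if new in seen:
            duplicates.setdefault(new, [seen[new]]).append(old)
        else:
            seen[new] = old
    return duplicates
```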