
Automatic port old commondata to the new format#1931

Merged
scarlehoff merged 20 commits into new_commondata_collected from automatic_port_old_commondata
Feb 15, 2024

Conversation

@scarlehoff
Member

@scarlehoff scarlehoff commented Feb 5, 2024

This is the port of all the old commondata into the new format.

It is mostly working and it is mostly automated. However, before pushing the new set of data I need some input.

In the file buildmaster/old_new_porting_map.yml you can find the mapping that I'm using. The mapping maps each dataset to:

  1. A new name
  2. The arxiv and journal reference (thanks @enocera !). There's also a hepdata reference for some, since I already had them from pinecards. But it might be misleading. Feel free to point out when this is the case and I will remove it.
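
For illustration, an entry in the mapping might look roughly like this; the key names and values below are hypothetical placeholders, meant only to show the pieces of information listed above, not the actual schema of the file:

```yaml
# Hypothetical sketch of one entry in buildmaster/old_new_porting_map.yml;
# all key names and values are illustrative placeholders.
OLD_DATASET_NAME:
  new_name: EXP_PROCESS_ENERGY_OBSERVABLE   # 1. the new name
  arxiv: 1234.56789                         # 2. arXiv reference
  journal: Eur.Phys.J. CXX (20XX) XXX       #    journal reference
  hepdata: https://www.hepdata.net/record/insXXXXXXX  # optional, removed if misleading
```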

ℹ️ HELP and FEEDBACK WANTED:

Please, review the file
buildmaster/old_new_porting_map.yml
it should be one of the ones at the top if you go to Files Changed.


If people could already go through the naming, that would be very helpful. Especially the choice of name for the observable.

And, please, check the energies for CHORUS / SLAC / NUTEV / etc. I've put the average energy as @enocera suggested, but I'm not sure what to put for SLAC there.

For the positivity datasets, all of them are now part of the POS process and the NNPDF experiment. As far as I am aware, experiment and process for positivity datasets should not mean anything, but if it does somehow please let me know.

Anything else you find funny or think that should be changed, please point it out!

Thank you very much!

(note that the mapping should not include datasets already implemented)


Todo:

  • In the next few days I will push also the actual data, together with reports separated by experiment to check that everything works with the reader. For now I've only checked I could read / compute covmat / compute predictions. I want to check that the plotting also works.
  • I will add a legacy version to some of the datasets already implemented if necessary (or change the variant name to legacy, instead of bugged)
  • Still not sure how to deal with #1908 (Anticipate how POS FK tables are parsed in the new commondata reader), but I think I will add a fake dataset.
  • Go through all datasets, not only the 4.0 ones. The metadata there will be of, err, lesser quality. But I think we can live with that.

Once everything is working I will remove all old commondata and the possibility of reading it with validphys.

@Radonirinaunimi
Member

Thanks @scarlehoff! I will have a look this afternoon.

Re the 3rd point on Positivity, why can't you just use the same script to also convert the positivity datasets? These should be exactly the same as regular datasets.

@scarlehoff
Member Author

why can't you just use the same script to also convert the positivity datasets?

They are not exactly the same. I need to change the reader to read it normally.
Also, I would like to avoid some of the vices of the previous implementation.

@Radonirinaunimi
Member

They are not exactly the same. I need to change the reader to read it normally.

Indeed! For the time being, if positivity datasets are provided in the new format, everything inside theory in the metadata.yaml is ignored, as this information is/was still read from the yamldb.

@scarlehoff
Member Author

I think I'm planning to do this (let me know if it makes sense)

NNPDF_POS_5GEV as the dataset holding all metadata.

  • Kinematics are needed.
  • Data is needed (even if set to 0, since that allows for cuts).
  • Uncertainties are not needed.
  • Theory as normal.
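
A minimal sketch of what such a metadata file might contain under this plan; all field names and values here are illustrative guesses, not the actual schema:

```yaml
# Hypothetical sketch only; key names and values are illustrative.
setname: NNPDF_POS_5GEV
observables:
  - observable_name: XGL          # e.g. a gluon positivity observable
    ndata: 20
    kinematics: kinematics.yaml   # needed
    data_central: data.yaml       # needed; may be all zeros, but allows cuts
    # no uncertainties file: not needed for positivity
    theory:                       # as for a normal dataset
      operation: 'NULL'
      FK_tables: [...]
```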

@Radonirinaunimi
Member

I think I'm planning to do this (let me know if it makes sense)

NNPDF_POS_5GEV as the dataset holding all metadata.

Kinematics are needed.
Data is needed (even if set to 0, but that allows for cuts).
Uncertainties not needed.
Theory as normal.

Yes, I think this definitely makes sense! The important thing is that the value of theory:operation would later on (once parsed) become part of the POS object.

@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch 2 times, most recently from dec57f2 to dfa62ab Compare February 6, 2024 15:19
@scarlehoff scarlehoff force-pushed the new_commondata_collected branch from 929b692 to 0c9b2bd Compare February 7, 2024 10:45
@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch from 51da5d2 to 95c157b Compare February 7, 2024 10:46
@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch from a990c53 to 3760ee4 Compare February 7, 2024 17:16
@scarlehoff scarlehoff force-pushed the new_commondata_collected branch from 0c9b2bd to 902e9bb Compare February 7, 2024 17:16
@scarlehoff
Member Author

I have updated the names according to the discussions today.

@felixhekhorn @enocera @Radonirinaunimi @t7phy @comane

Please have a look. If you could add suggestions under the names that you think that should be changed it would be much appreciated.
I've dealt with some of the comments that were here in the PR.
There are some inconsistencies between the names of the DY datasets between ATLAS and CMS. In order to resolve some of these I would need to go through all of the papers...

@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch from 659a9ce to 3c65d51 Compare February 12, 2024 09:39
@scarlehoff scarlehoff marked this pull request as ready for review February 12, 2024 09:43
@scarlehoff
Member Author

This is ready for review.

I've rebased this branch on top of master, and the reader on top of this branch.

So, by checking out the reader branch #1678 you should be able to use the new commondata with everything that is in master right now.

git checkout final_reader_for_new_commondata_mk2

To test new commondata not included here (i.e., not part of the port) you can just copy it to validphys2/src/validphys/datafiles/new_commondata (remember to install the code in editable mode: pip install -e .).

I've tried a few plots, chi2 comparisons and several fit runcards, but of course, the range of things that can go wrong is very wide.
For now the only thing I've found that is broken is the kinematic coverage (x-Q2) plot, since some of the new commondata don't include this information (#1934).

Please let me know if you find any other problems.
Also, if you need me to port some data not already included here let me know. I will probably remove the old commondata from the code (so when you install you only get the new), so if you rely on some fringe dataset please let me know so I can add it to the map yaml file.

@Radonirinaunimi
Member

There is this dataset ATLAS_WMU_8TEV for which we have old commondata files but no FK tables. If the old commondata will be totally scrapped, maybe we should also convert it (I couldn't find it in the current list).

In addition, there are also the ATLAS 3D distributions (ATLAS_Z0_8TEV_3D_CRAP, ATLAS_Z0_8TEV_3D_FRAP) for which we have the filters but no commondata files yet and no FK tables. Do we want to generate these old commondata files, convert them, and add them here?

PS: there are also the FPF datasets, but I don't think we need to port them now. If absolutely needed, this can be done later.

@scarlehoff
Member Author

Do we want to generate these old commondata files

We might want to leave them so that when the corresponding fktables are generated, the (new) commondata is generated accordingly.

Comment on lines +260 to +272
variables:
  k1:
    description: Variable k1
    label: k1
    units: ''
  k2:
    description: Variable k2
    label: k2
    units: ''
  k3:
    description: Variable k3
    label: k3
    units: ''
Contributor

(just picked a random dataset so not sure about others) is this PR also supposed to get the correct metadata? like e.g. here k1=x, k2=Q2, k3=y

Member Author

No. This is an automatic port also in that sense, there were no labels in the old files.

Contributor

I see - so we still need PRs to fine tune the outcome here, right?

Member Author

The original idea (and the reason why it has taken so long) was to manually reimplement all datasets so that we can correct the many bugs that might have accumulated.

However, I think that's an impossible ideal... so... don't count on it ^^U

@scarlehoff
Member Author

I'm finding some performance issues that would actually be solved by just having a dataset-per-folder (sort of like the old implementation, which was a dataset-per-file), since that would allow us to use the filesystem as a database.

There are many places in which it is assumed that reading the dataset names is cheap, while now it requires reading the metadata + observables. There are workarounds of course, but I feel that's adding complexity to a really simple problem... and it would put an end to the naming discussion.
We can discuss more about this tomorrow.

@Radonirinaunimi
Member

Radonirinaunimi commented Feb 13, 2024

I'm finding some performance issues that would actually be solved by just having a dataset-per-folder (sort of like in the old implementation was a dataset-per-file) since that would allow us to use the filesystem as a database.

Just to get an idea, when you say performance issues, how much worse is it?

@scarlehoff
Member Author

scarlehoff commented Feb 13, 2024

A factor of almost infinite!
Not really, but the problem is that we are comparing "asking the filesystem" with "asking the filesystem for the folder + the metadata file + reading it + looping over all observables until we find the one we want".
And because before the time was close to 0, there was no measure in place to avoid it, which is fair. So for simple things it is very noticeable; for more complicated stuff it will of course be less noticeable (or not at all).

Anyway, the reasons are:

  1. Because now we have the information in several .yaml files that need to be reconstructed, whereas before the commondata dataframe was one single file. In particular, in files with hundreds of systematics this is bad.
  2. There are several "datapoints" that before were fixed and now are derived quantities (such as the number of datapoints or the number of systematics). ndata, even though we agreed it is derived, is in practice fixed in most metadata. And nsys could be fixed just the same (and updated through variants). See point 1.
  3. Not having direct access to the dataset names (they need to be reconstructed from the metadata). This is the most obvious problem and so the first one I found, but it's subleading with respect to 1.

I'm thinking that a way of dealing with all three problems at once is to compile a cache of pickled commondata as they get read for the first time, with more or less the following algorithm:

  1. Try to read one dataset
  2. Look in ~/.local/share/NNPDF/vp-cache/ for _
  3. If _ exists, load the information you want from there; otherwise, recompute it and put it there.
  4. Before loading, check the hash; if it has changed, recompute it and put it there.

Whether this is a good solution or not will depend on how fast we can take the hash of the files (https://docs.python.org/3/library/hashlib.html).

Instead of _ we could even do /.uncert, /.data, etc., so that each dataframe is by itself and one can inspect the dataframes.

I know I said I didn't like increasing the complexity, but now I'm thinking it might be a net positive after all. Especially this last point, I think, would be very useful.
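
The caching algorithm above could be sketched roughly like this; the cache path follows the comment, while the function names and the shape of the cache entry are my own assumptions, not the eventual implementation:

```python
import hashlib
import pickle
from pathlib import Path

# Cache location from the comment above; the layout below it is hypothetical.
DEFAULT_CACHE = Path.home() / ".local" / "share" / "NNPDF" / "vp-cache"


def _file_hash(path):
    """Hash the raw bytes of a commondata file (hashlib makes this fast)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_with_cache(dataset_file, compute, cache_dir=DEFAULT_CACHE):
    """Return the parsed dataset, recomputing only when the source file changed.

    ``compute`` is a callable doing the expensive parsing (metadata +
    observables); its result is pickled next to the file's hash.
    """
    dataset_file = Path(dataset_file)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (dataset_file.name + ".pkl")
    current_hash = _file_hash(dataset_file)

    # Steps 2-4: look for the cache entry and check the hash before loading.
    if cache_file.exists():
        stored_hash, payload = pickle.loads(cache_file.read_bytes())
        if stored_hash == current_hash:
            return payload

    # Cache miss or stale hash: recompute and refresh the entry.
    payload = compute(dataset_file)
    cache_file.write_bytes(pickle.dumps((current_hash, payload)))
    return payload
```

The per-dataframe variant mentioned at the end would just split the single pickle into one file per dataframe under the same hashing scheme.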

@scarlehoff
Member Author

scarlehoff commented Feb 14, 2024

Proposal to unify the variables, to be discussed later today:

pT_t: pT of the top quark
y_t: rapidity of the top quark
m_t: mass of the top
y_ttBar: rapidity of the top-antitop pair
m_ttBar: mass of the top-antitop pair
etay : rapidity or pseudorapidity
m_jj: mass of an invariant pair (of jets)
m_ll: mass of an invariant pair (of leptons)
sqrts: square root of the center of mass energy
pT: transverse momentum
M: invariant mass of a gauge boson
ydiff: a difference of rapidity

DIS variables: Q2, x

Squared quantities will simply have a 2 at the end (i.e., mass squared m_ttBar2, or pt squared pT2).

This covers all variables used, while keeping some conventions that are used across validphys (like sqrts or 2 for the squares), and it seems not to break anything (the most important bit!).
There are a few more, but they are really variations or combinations of these (like the pT of the top pair: pT_ttBar).

Once agreed I'll simply do a search-and-replace.
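
As a sketch, the search-and-replace could be driven by a rename map applied with word boundaries; the old spellings on the left below are hypothetical examples, only the standardized names on the right come from the list above:

```python
import re

# Old spellings (left) are hypothetical examples; the standardized
# names (right) are the ones proposed above.
RENAME_MAP = {
    "ptt": "pT_t",
    "mtt": "m_ttBar",
    "ytt": "y_ttBar",
    "sqrt_s": "sqrts",
}


def standardize_variables(text, rename_map=RENAME_MAP):
    """Replace whole-word occurrences only, so a short name such as 'ytt'
    cannot clobber part of a longer identifier containing it."""
    for old, new in rename_map.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text
```

Running this over every metadata file and writing the result back would implement the search-and-replace in one pass.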

@t7phy
Collaborator

t7phy commented Feb 14, 2024

It seems a good idea, consistency is good. Although I do think that for the invariant dijet mass, m_jj would be a better name, as it's very clear what it means by looking at it.

@scarlehoff
Member Author

Agreed, I've updated the list. It will also make it consistent with m_ll, currently used for leptons.

@Radonirinaunimi
Member

Proposal to unify the variables, to be discussed later today:

pT_t: pT of the top quark
y_t: rapidity of the top quark
y_ttBar: rapidity of the top-antitop pair
m_ttBar: mass of the top-antitop pair
y: rapidity
eta: pseudorapidity
m_jj: mass of an invariant pair (of jets)
m_ll: mass of an invariant pair (of leptons)
sqrts: square root of the center of mass energy
pT: transverse momentum
M: invariant mass of a gauge boson

DIS variables: Q2, x

Squared quantities will simply have a 2 at the end (i.e., mass squared m_ttBar2, or pt squared pT2).

This covers all variables used, while keeping some conventions that are used across validphys (like sqrts or 2 for the squares), and it seems not to break anything (the most important bit!). There are a few more, but they are really variations or combinations of these (like the pT of the top pair: pT_ttBar).

Once agreed I'll simply do a search-and-replace.

Maybe it is better to choose a different name for the rapidity as it can easily be confused with the inelasticity in DIS.

@t7phy
Collaborator

t7phy commented Feb 14, 2024

But in principle, wouldn't it be better to always have y_<some_particle_name>, and so on for other variables like pT? This way we have a consistent name that works for the same process types but is also verbose for readers. That also avoids its confusion with the DIS y.

@scarlehoff
Member Author

We should at the very least use V for both Z and W when relevant.

Another option is to use etay since most of the time the distinction is not that relevant (and it is what was used before)

@scarlehoff
Member Author

scarlehoff commented Feb 15, 2024

I've standardized now the variable names.

Regarding the y/eta/etay problem, I've decided not to touch those. I would suggest we use etay moving forward, but I've preferred not to touch the ones that are already there.

If nobody has any complaints, I will merge the new commondata files into master later today.
This does not include the reader (so the files will be in master, but still usable only with the reader), but it is a way of canonising the current implementation.

It would also allow people to start fixing the bugs directly as PRs to master, or implementing new data as PRs to master. I'll keep the reader rebased on top of master, which will hopefully be needed only for a few weeks more.

@scarlehoff scarlehoff merged commit 970cac1 into new_commondata_collected Feb 15, 2024
@scarlehoff scarlehoff deleted the automatic_port_old_commondata branch February 15, 2024 14:03