
Automatic port old commondata to the new format#1931

Merged
scarlehoff merged 20 commits into new_commondata_collected from automatic_port_old_commondata
Feb 15, 2024

Conversation

@scarlehoff
Member

@scarlehoff scarlehoff commented Feb 5, 2024

This is the port of all the old commondata into the new format.

It is mostly working and it is mostly automated. However, before pushing the new set of data I need some input.

In the file buildmaster/old_new_porting_map.yml you can find the mapping that I'm using. The mapping maps each dataset to:

  1. A new name
  2. The arxiv and journal reference (thanks @enocera !). There's also a hepdata reference for some, since I already had them from pinecards. But it might be misleading. Feel free to point out when this is the case and I will remove it.
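
For illustration, an entry in the mapping might look roughly like this; the key names and values below are hypothetical placeholders, meant only to show the pieces of information listed above, not the actual schema of the file:

```yaml
# Hypothetical sketch of one entry in buildmaster/old_new_porting_map.yml;
# all key names and values are illustrative placeholders.
OLD_DATASET_NAME:
  new_name: EXP_PROCESS_ENERGY_OBSERVABLE   # 1. the new name
  arxiv: 1234.56789                         # 2. arXiv reference
  journal: Eur.Phys.J. CXX (20XX) XXX       #    journal reference
  hepdata: https://www.hepdata.net/record/insXXXXXXX  # optional, removed if misleading
```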

ℹ️ HELP and FEEDBACK WANTED:

Please, review the file
buildmaster/old_new_porting_map.yml
it should be one of the ones at the top if you go to Files Changed.


If people could already go through the naming, that would be very helpful. Especially the choice of name for the observable.

And, please, check the energies for CHORUS / SLAC / NUTEV / etc. I've put the average energy as @enocera suggested, but I'm not sure what to put for SLAC there.

For the positivity datasets, all of them are now part of the POS process and the NNPDF experiment. As far as I am aware, experiment and process for positivity datasets should not mean anything, but if it does somehow please let me know.

Anything else you find funny or think that should be changed, please point it out!

Thank you very much!

(note that the mapping should not include datasets already implemented)


Todo:

  • In the next few days I will push also the actual data, together with reports separated by experiment to check that everything works with the reader. For now I've only checked I could read / compute covmat / compute predictions. I want to check that the plotting also works.
  • I will add a legacy version to some of the datasets already implemented if necessary (or change the variant name to legacy, instead of bugged)
  • Still not sure how to deal with #1908 (Anticipate how POS FK tables are parsed in the new commondata reader), but I think I will add a fake dataset.
  • Go through all datasets, not only the 4.0 ones. The metadata there will be of, err, lesser quality. But I think we can live with that.

Once everything is working I will remove all old commondata and the possibility of reading it with validphys.

@Radonirinaunimi
Member

Thanks @scarlehoff! I will have a look this afternoon.

Re the 3rd point on Positivity, why can't you just use the same script to also convert the positivity datasets? These should be exactly the same as regular datasets.

@scarlehoff
Member Author

why can't you just use the same script to also convert the positivity datasets?

They are not exactly the same. I need to change the reader to read it normally.
Also, I would like to avoid some of the vices of the previous implementation.

@Radonirinaunimi
Member

They are not exactly the same. I need to change the reader to read it normally.

Indeed! For the time being, if positivity datasets are provided in the new format, everything inside theory in the metadata.yaml is ignored, as this information is/was still read from the yamldb.

@scarlehoff
Member Author

I think I'm planning to do this (let me know if it makes sense)

NNPDF_POS_5GEV as the dataset holding all metadata.

  • Kinematics are needed.
  • Data is needed (even if set to 0, since that allows for cuts).
  • Uncertainties are not needed.
  • Theory as normal.
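
A minimal sketch of what such a metadata file might contain under this plan; all field names and values here are illustrative guesses, not the actual schema:

```yaml
# Hypothetical sketch only; key names and values are illustrative.
setname: NNPDF_POS_5GEV
observables:
  - observable_name: XGL          # e.g. a gluon positivity observable
    ndata: 20
    kinematics: kinematics.yaml   # needed
    data_central: data.yaml       # needed; may be all zeros, but allows cuts
    # no uncertainties file: not needed for positivity
    theory:                       # as for a normal dataset
      operation: 'NULL'
      FK_tables: [...]
```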

@Radonirinaunimi
Member

I think I'm planning to do this (let me know if it makes sense)

NNPDF_POS_5GEV as the dataset holding all metadata.

Kinematics are needed.
Data is needed (even if set to 0, but that allows for cuts).
Uncertainties not needed.
Theory as normal.

Yes, I think this definitely makes sense! The important thing is that the value of theory:operation would later on (once parsed) become part of the POS object.

@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch 2 times, most recently from dec57f2 to dfa62ab Compare February 6, 2024 15:19
@scarlehoff scarlehoff force-pushed the new_commondata_collected branch from 929b692 to 0c9b2bd Compare February 7, 2024 10:45
@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch from 51da5d2 to 95c157b Compare February 7, 2024 10:46
@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch from a990c53 to 3760ee4 Compare February 7, 2024 17:16
@scarlehoff scarlehoff force-pushed the new_commondata_collected branch from 0c9b2bd to 902e9bb Compare February 7, 2024 17:16
@scarlehoff
Member Author

I have updated the names according to the discussions today.

@felixhekhorn @enocera @Radonirinaunimi @t7phy @comane

Please have a look. If you could add suggestions under the names that you think that should be changed it would be much appreciated.
I've dealt with some of the comments that were here in the PR.
There are some inconsistencies between the names of the DY datasets between ATLAS and CMS. In order to resolve some of these I would need to go through all of the papers...

@scarlehoff scarlehoff force-pushed the automatic_port_old_commondata branch from 659a9ce to 3c65d51 Compare February 12, 2024 09:39
@scarlehoff scarlehoff marked this pull request as ready for review February 12, 2024 09:43
@scarlehoff
Member Author

This is ready for review.

I've rebased this branch on top of master, and the reader on top of this branch.

So, by checking out the reader branch #1678 you should be able to use the new commondata with everything that is in master right now.

git checkout final_reader_for_new_commondata_mk2

To test new commondata not included here (i.e., not part of the port) you can just copy it to validphys2/src/validphys/datafiles/new_commondata (remember to install the code in editable mode: pip install -e .).

I've tried a few plots, chi2 comparisons and several fit runcards, but of course, the range of things that can go wrong is very wide.
For now the only thing I've found that is broken is the kinematic coverage (x-Q2) plot, since some of the new commondata don't include this information (#1934).

Please let me know if you find any other problems.
Also, if you need me to port some data not already included here let me know. I will probably remove the old commondata from the code (so when you install you only get the new), so if you rely on some fringe dataset please let me know so I can add it to the map yaml file.

@Radonirinaunimi
Member

There is this dataset ATLAS_WMU_8TEV for which we have old commondata files but no FK tables. If the old commondata will be totally scrapped, maybe we should also convert it (I couldn't find it in the current list).

In addition, there are also the ATLAS 3D distributions (ATLAS_Z0_8TEV_3D_CRAP, ATLAS_Z0_8TEV_3D_FRAP) for which we have the filters but no commondata files yet and no FK tables. Do we want to generate these old commondata files, convert them, and add them here?

PS: there are also the FPF datasets, but I don't think we need to port them now. If absolutely needed, this can be done later.

@scarlehoff
Member Author

Do we want to generate these old commondata files

We might want to leave them so that when the corresponding fktables are generated, the (new) commondata is generated accordingly.

Comment on lines +260 to +272
variables:
  k1:
    description: Variable k1
    label: k1
    units: ''
  k2:
    description: Variable k2
    label: k2
    units: ''
  k3:
    description: Variable k3
    label: k3
    units: ''
Contributor

(just picked a random dataset so not sure about others) is this PR also supposed to get the correct metadata? like e.g. here k1=x, k2=Q2, k3=y

Member Author

No. This is an automatic port also in that sense, there were no labels in the old files.

Contributor

I see - so we still need PRs to fine tune the outcome here, right?

Member Author

The original idea (and the reason why it has taken so long) was to manually reimplement all datasets so that we can correct the many bugs that might have accumulated.

However, I think that's an impossible ideal... so... don't count on it ^^U

@scarlehoff
Member Author

I'm finding some performance issues that would actually be solved by just having a dataset-per-folder (sort of like the old implementation, which was a dataset-per-file), since that would allow us to use the filesystem as a database.

There are many places in which it is assumed that reading the dataset names is cheap, while now it requires reading the metadata + observables. There are workarounds of course, but I feel that's adding complexity to a really simple problem... and it would put an end to the naming discussion.
We can discuss more about this tomorrow.

@Radonirinaunimi
Member

Radonirinaunimi commented Feb 13, 2024

I'm finding some performance issues that would actually be solved by just having a dataset-per-folder (sort of like in the old implementation was a dataset-per-file) since that would allow us to use the filesystem as a database.

Just to get an idea, when you say performance issues, how much worse is it?

@scarlehoff
Member Author

scarlehoff commented Feb 13, 2024

A factor of almost infinite!
Not really, but the problem is that we are comparing "asking the filesystem" with "asking the filesystem for the folder + the metadata file + reading it + looping over all observables until we find the one we want".
And because before the time was close to 0, there was no measure in place to avoid it, which is fair. So for simple things it is very noticeable; for more complicated stuff it will of course be less noticeable (or not at all).

Anyway, the reasons are:

  1. Because now we have the information in several .yaml files that need to be reconstructed, whereas before the commondata dataframe was one single file. In particular, in files with hundreds of systematics this is bad.
  2. There are several "datapoints" that before were fixed and now are derived quantities (such as the number of datapoints or the number of systematics). ndata, even though we agreed it is derived, is in practice fixed in most metadata. And nsys could be fixed just the same (and updated through variants). See point 1.
  3. Not having direct access to the dataset names (they need to be reconstructed from the metadata). This is the most obvious problem and so the first one I found, but it's subleading with respect to 1.

I'm thinking that a way of dealing with all three problems at once is to compile a cache of pickled commondata as they get read for the first time, with more or less the following algorithm:

  1. Try to read one dataset
  2. Look in ~/.local/share/NNPDF/vp-cache/ for _
  3. If _ exists, load the information you want from there; otherwise, recompute it and put it there.
  4. Before loading, check the hash; if it has changed, recompute it and put it there.

Whether this is a good solution or not will depend on how fast we can take the hash of the files (https://docs.python.org/3/library/hashlib.html).

Instead of _ we could even do /.uncert, /.data, etc., so that each dataframe is by itself and one can inspect the dataframes.

I know I said I didn't like increasing the complexity, but now I'm thinking it might be a net positive after all. Especially this last point, I think, would be very useful.
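
The caching algorithm above could be sketched roughly like this; the cache path follows the comment, while the function names and the shape of the cache entry are my own assumptions, not the eventual implementation:

```python
import hashlib
import pickle
from pathlib import Path

# Cache location from the comment above; the layout below it is hypothetical.
DEFAULT_CACHE = Path.home() / ".local" / "share" / "NNPDF" / "vp-cache"


def _file_hash(path):
    """Hash the raw bytes of a commondata file (hashlib makes this fast)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_with_cache(dataset_file, compute, cache_dir=DEFAULT_CACHE):
    """Return the parsed dataset, recomputing only when the source file changed.

    ``compute`` is a callable doing the expensive parsing (metadata +
    observables); its result is pickled next to the file's hash.
    """
    dataset_file = Path(dataset_file)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (dataset_file.name + ".pkl")
    current_hash = _file_hash(dataset_file)

    # Steps 2-4: look for the cache entry and check the hash before loading.
    if cache_file.exists():
        stored_hash, payload = pickle.loads(cache_file.read_bytes())
        if stored_hash == current_hash:
            return payload

    # Cache miss or stale hash: recompute and refresh the entry.
    payload = compute(dataset_file)
    cache_file.write_bytes(pickle.dumps((current_hash, payload)))
    return payload
```

The per-dataframe variant mentioned at the end would just split the single pickle into one file per dataframe under the same hashing scheme.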

@scarlehoff
Member Author

scarlehoff commented Feb 14, 2024

Proposal to unify the variables, to be discussed later today:

pT_t: pT of the top quark
y_t: rapidity of the top quark
m_t: mass of the top
y_ttBar: rapidity of the top-antitop pair
m_ttBar: mass of the top-antitop pair
etay : rapidity or pseudorapidity
m_jj: mass of an invariant pair (of jets)
m_ll: mass of an invariant pair (of leptons)
sqrts: square root of the center of mass energy
pT: transverse momentum
M: invariant mass of a gauge boson
ydiff: a difference of rapidity

DIS variables: Q2, x

Squared quantities will simply have a 2 at the end (i.e., mass squared m_ttBar2, or pt squared pT2).

This covers all variables used, while keeping some conventions that are used across validphys (like sqrts or 2 for the squares), and it seems not to break anything (the most important bit!).
There are a few more, but they are really variations or combinations of these (like the pT of the top pair: pT_ttBar).

Once agreed I'll simply do a search-and-replace.
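
As a sketch, the search-and-replace could be driven by a rename map applied with word boundaries; the old spellings on the left below are hypothetical examples, only the standardized names on the right come from the list above:

```python
import re

# Old spellings (left) are hypothetical examples; the standardized
# names (right) are the ones proposed above.
RENAME_MAP = {
    "ptt": "pT_t",
    "mtt": "m_ttBar",
    "ytt": "y_ttBar",
    "sqrt_s": "sqrts",
}


def standardize_variables(text, rename_map=RENAME_MAP):
    """Replace whole-word occurrences only, so a short name such as 'ytt'
    cannot clobber part of a longer identifier containing it."""
    for old, new in rename_map.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text
```

Running this over every metadata file and writing the result back would implement the search-and-replace in one pass.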

@t7phy
Collaborator

t7phy commented Feb 14, 2024

It seems a good idea, consistency is good. Although I do think that for the invariant dijet mass, m_jj would be a better name, as it's very clear what it means by looking at it.

@scarlehoff
Member Author

Agreed, I've updated the list. It will also make it consistent with m_ll, currently used for leptons.

@Radonirinaunimi
Member

Proposal to unify the variables, to be discussed later today:

pT_t: pT of the top quark
y_t: rapidity of the top quark
y_ttBar: rapidity of the top-antitop pair
m_ttBar: mass of the top-antitop pair
y: rapidity
eta: pseudorapidity
m_jj: mass of an invariant pair (of jets)
m_ll: mass of an invariant pair (of leptons)
sqrts: square root of the center of mass energy
pT: transverse momentum
M: invariant mass of a gauge boson

DIS variables: Q2, x

Squared quantities will simply have a 2 at the end (i.e., mass squared m_ttBar2, or pt squared pT2).

This covers all variables used, while keeping some conventions that are used across validphys (like sqrts or 2 for the squares), and it seems not to break anything (the most important bit!). There are a few more, but they are really variations or combinations of these (like the pT of the top pair: pT_ttBar).

Once agreed I'll simply do a search-and-replace.

Maybe it is better to choose a different name for the rapidity as it can easily be confused with the inelasticity in DIS.

@t7phy
Collaborator

t7phy commented Feb 14, 2024

But in principle, wouldn't it be better to always have y_<some_particle_name>, and so on for other variables like pT? This way we have a consistent name that works for the same process types but is also verbose for readers. That also avoids its confusion with the DIS y.

@scarlehoff
Member Author

We should at the very least use V for both Z and W when relevant.

Another option is to use etay since most of the time the distinction is not that relevant (and it is what was used before)

@scarlehoff
Member Author

scarlehoff commented Feb 15, 2024

I've standardized now the variable names.

Regarding the y/eta/etay problem, I've decided not to touch those. I would suggest we use etay moving forward, but I've preferred not to touch the ones that are already there.

If nobody has any complaints, I will merge the new commondata files into master later today.
This does not include the reader (so the files will be in master, but still usable only with the reader), but it is a way of canonising the current implementation.

It would also allow people to start fixing the bugs directly as PRs to master, or implementing new data as PRs to master. I'll keep the reader rebased on top of master, which will hopefully be needed only for a few weeks more.

@scarlehoff scarlehoff merged commit 970cac1 into new_commondata_collected Feb 15, 2024
@scarlehoff scarlehoff deleted the automatic_port_old_commondata branch February 15, 2024 14:03