Docs for new commondata format by scarlehoff · Pull Request #1708 · NNPDF/nnpdf

scarlehoff · 2023-03-30T15:50:02Z

I've added a first draft of the new commondata format. This deals with #1691.

At the moment I've only added a description that follows the implementation of @t7phy in #1684. I believe the format is now more or less final (we are at a point where we are deciding whether it should be written definition or definitions so really the details).

Since I was at it I've changed what @cschwan had added about the naming convention to rst. I've added in the new-commondata.rst file the sentence:

In all cases one can reconstruct the name of the folder by separating the observable name on the last ``_``

which si what we discussed yesterday.

Please feel free to add new documents to this PR, there are many things missing (a TODO below, feel free to add more items by editing this comment).

TODO:

Format of the theory definition
Motivation etc
Tools to facilitate the hepdata integration
Integrating it with the rest of the documentation
Utilities and implementation of new files

Zaharid · 2023-04-03T10:00:27Z

How many versions and version comments can there be? Is it a complete changelog or do we forget about everything but the last one?

Zaharid · 2023-04-03T10:01:36Z

+Organization following the naming convention
+--------------------------------------------
+
+All dataset in the new data format follow the exact same naming convention::


We should probably find a word different from "new" to refer to this format.

Zaharid · 2023-04-03T10:02:05Z

+
+    <experiment>_<process>_<energy>{_<extras>}_<observable>
+
+The data is contained in folders, each containing one single hepdata publication. 


I assume we can do other things that are not hepdata?

Of course vp doesn't know where the data come from but I think the idea is that you open a folder and it comes from one single hepdata publication.

In the odd case in which it doesn't we hope the person implement it add comments or something explaining what happened there.

The comment was more in the direction that these docs should be rather general.

Zaharid · 2023-04-03T10:02:44Z

@@ -0,0 +1,155 @@
+Organization following the naming convention


Is the convention enforced mechanically? Is it needed to process the names?

Is it needed to process the names?

Yes. The folder in which the data is will always be name.rsplit("_", 1)[0] and the choice of the observable name.rsplit("_", 1)[1]

Zaharid · 2023-04-03T10:05:22Z

We should discuss things like versions vs variants? I do remember being persuaded we needed both but completely forgot why.

Zaharid · 2023-04-03T10:48:09Z

If I remember correctly now, we were supposed to keep all the old version metadata files and add a full new one when needed.

scarlehoff · 2023-04-03T11:06:03Z

How many versions and version comments can there be? Is it a complete changelog or do we forget about everything but the last one?

The versions are for actual fixes and the comment should address whatever the change was. Since we have version control I think we can forget about everything but the last one.

We should discuss things like versions vs variants?

The version is a fix of whatever thing. The variant is a variation of the dataset (for instance, a variation of the dataset in which a particular uncertainty is also included).

I'll add some lines about that.

scarlehoff · 2023-04-03T11:29:54Z

I've added some information on variants and theory.

Btw, please feel (@ everyone) free to modify this branch to complete this document.

Zaharid · 2023-04-03T11:41:40Z

@enocera @scarlehoff At some point one goal of this effort was mainly that it should be understandable and hopefully implementable by external people. I think it works pretty well for that (in principle, less clear to me in practice), with the exception of the theory section. It requires external names which index into external tool chains. The keys themselves "fktables" are NNPDF jargon. (And there are the ugly double lists...). I believe it that section would look intimidating to an external person trying to understand what is going on.

With that, I think that if the strategic goal is to publicize the format, we'd be better off if this section was in a separate nnpdf_theory.yaml or similar.

scarlehoff · 2023-04-03T11:45:53Z

With that, I think that if the strategic goal is to publicize the format, we'd be better off if this section was in a separate nnpdf_theory.yaml or similar.

I guess in practice this is already the case with the yamldb folder in the new theories. We might want to leave it as part of the theory and never move it to the data.

On the flip side, people wanting to implement new data to use them in a fit will have to deal with that section anyway so I'm not sure whether having it separate is really an advantage at all.

felixhekhorn

I added some comments

felixhekhorn · 2023-06-14T15:20:38Z

+Data
+----
+
+The format of the data is a ``yaml`` file with an entry ```data_central``` which is a list for all values for all bins.


I guess we should stress somehow more that this is the crucial part

scarlehoff · 2023-06-14T15:27:14Z

I added some comments

I've applied the suggestions. Feel free to add the other changes to the docs directly!

scarlehoff · 2023-10-09T14:11:39Z

Adding here a few notes RE the latest changes to the commondata format (/cc @Radonirinaunimi )

I will also update the documentation, but just in case I don't finish today.

Plotting:
The information in PLOTTING_ and corresponding PLOTTINGTYPE_ should be under a plotting: section in the metadata.yaml for each dataset.
The nnpdf31_process at the top of the metadata should match the old names

Kinematics:
It will the order of the kinematics from kinematics_coverage. That means that if a dataset has more than 3, only the 3 defined in kinematics_coverage will be used

Datasets_names.yaml
There will be a datasets_names.yaml file which will define the mapping between the old and the new datasets.

scarlehoff · 2023-10-10T09:42:30Z

cc @t7phy @Radonirinaunimi
An update with respect to the discussion of yesterday:

kinematic_coverage should not go inside the plotting dictionary but rather be at the same level. This is used to define which set of kinematics to use and is not part of the plotting. Also, the kinematic coverage can (and maybe should) be a list.

In the old plotting file you will see the variable x which is usually (k1, k2, k3 or idat). Please change x to plot_x: <variable>.

And to simplify the part of the cuts, which use the process type, I'd add a process_type as well to the observable...

Radonirinaunimi · 2023-10-10T09:44:13Z

cc @t7phy @Radonirinaunimi An update with respect to the discussion of yesterday:

kinematic_coverage should not go inside the plotting dictionary but rather be at the same level. This is used to define which set of kinematics to use and is not part of the plotting.

In the old plotting file you will see the variable x which is usually (k1, k2, k3 or idat). Please change x to plot_x: <variable>.

Thanks for the heads up @scarlehoff!

scarlehoff · 2023-10-20T11:27:19Z

For the people implementing new commondata:

the process_type key can usually be found in the old commondata files (it's the second column, e.g):
https://github.com/NNPDF/nnpdf/blob/master/nnpdfcpp/data/commondata/DATA_ATLASTTBARTOT8TEV.dat

if you are implementing a new one, so you don't have a reference, the best thing you can do is have a look at the list and use the one that looks the closest. This is used for some internal vp operations (for instance for internal cuts) so it is a required key.

nnpdf/validphys2/src/validphys/plotoptions/kintransforms.py

Line 35 in e005a42

{'DIS': ('$x$', '$Q^2 (GeV^2)$', '$y$'),

comane · 2023-10-27T12:33:44Z

+    implemented_observables:
+    - observable_name: "OBS"


Suggested change

implemented_observables:

- observable_name: "OBS"

implemented_observables:

- observable_name: "OBS"

I think that there should be a tab here.

first draft of the documented new commondata format add definition of the version key add explanation for the variants and the theory Apply suggestions from code review Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com> Update new-commondata.rst update docs Update doc/sphinx/source/data/new-commondata.rst Update doc/sphinx/source/data/new-commondata.rst Update doc/sphinx/source/data/new-commondata.rst Update new-commondata.rst update docs with the definition of the old:new mapping Update doc/sphinx/source/data/new-commondata.rst

…n for the old format

scarlehoff · 2024-03-03T12:14:27Z

This is not perfect, but since we are merging the new commondata format today I wanted to have the documentation in a state in which it can be merged.

I've removed some stuff that was plainly wrong or referred to the old format.

I've added a (more or less detailed) explanation to every entry but plotting. The plotting metadata hasn't really changed that much so I've left a reference to the old plotting metadata with a warning that this information is outdated.

The changes in the last commit I need for the docs to compile correctly in my laptop with py3.12.

Radonirinaunimi

Thanks a lot @scarlehoff for this extensive documentation. I haven't read it fully but I only have two main comments regarding the:

filter: so far, there is no mention of how the commondata are generated from input hepdata. A (couple of) sentence(s) should suffice.
positivity and integrability which mainly affects the Observable specific information section.

Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com>

scarlehoff · 2024-03-04T12:53:58Z

filter: so far, there is no mention of how the commondata are generated from input hepdata. A (couple of) sentence(s) should suffice.

I agree but I haven't added any filters so his should be added by people who did it...

RE positivity and integrability, that should be added for whoever ends up in charge of updating the commondata docs (we can assign tasks in the next code meeting).

scarlehoff added documentation Issues and PRs related to documentation data toolchain labels Mar 30, 2023

scarlehoff assigned enocera, scarlehoff and t7phy Mar 30, 2023

t7phy reviewed Mar 30, 2023

View reviewed changes

Comment thread doc/sphinx/source/data/dataset-naming-convention.rst Outdated

scarlehoff changed the title ~~[WIP] New commondata format~~ [WIP] Docs for new commondata format Mar 31, 2023

Zaharid reviewed Apr 3, 2023

View reviewed changes

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated

felixhekhorn reviewed Jun 14, 2023

View reviewed changes

felixhekhorn mentioned this pull request Jun 19, 2023

Implement generators for nDIS processes NNPDF/yadism#160

Closed

scarlehoff mentioned this pull request Oct 9, 2023

New Common Datasets #1813

Merged

scarlehoff force-pushed the document_new_commondata_format branch from 55763a5 to c3afca3 Compare October 9, 2023 16:15

scarlehoff commented Oct 10, 2023

View reviewed changes

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated

scarlehoff commented Oct 11, 2023

View reviewed changes

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated

scarlehoff commented Oct 20, 2023

View reviewed changes

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated

Radonirinaunimi mentioned this pull request Oct 26, 2023

Re-implementation of the LHCb Collider DY Datasets #1826

Closed

comane reviewed Oct 27, 2023

View reviewed changes

scarlehoff mentioned this pull request Nov 12, 2023

Status of the new commondata format implementation #1709

Closed

77 tasks

scarlehoff force-pushed the document_new_commondata_format branch from 0ac4832 to b593578 Compare November 17, 2023 16:20

scarlehoff commented Nov 27, 2023

View reviewed changes

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated

RoyStegeman mentioned this pull request Mar 1, 2024

New CommonData Reader #1678

Merged

3 tasks

scarlehoff added 2 commits March 3, 2024 10:37

add documentation for the new commondata format ; remove documentatio…

af1be35

…n for the old format

scarlehoff force-pushed the document_new_commondata_format branch from f403223 to af1be35 Compare March 3, 2024 11:55

scarlehoff changed the title ~~[WIP] Docs for new commondata format~~ Docs for new commondata format Mar 3, 2024

fix miscelaneous warnings and errors

1f1f676

Radonirinaunimi mentioned this pull request Mar 3, 2024

Removing traces of Old Commondata #1974

Open

19 tasks

scarlehoff mentioned this pull request Mar 4, 2024

Changes needed to enable Python 3.12 #1970

Closed

3 tasks