Skip to content

Docs for new commondata format#1708

Merged
scarlehoff merged 4 commits into
masterfrom
document_new_commondata_format
Mar 4, 2024
Merged

Docs for new commondata format#1708
scarlehoff merged 4 commits into
masterfrom
document_new_commondata_format

Conversation

@scarlehoff
Copy link
Copy Markdown
Member

I've added a first draft of the new commondata format. This deals with #1691.

At the moment I've only added a description that follows the implementation of @t7phy in #1684. I believe the format is now more or less final (we are at a point where we are deciding whether it should be written definition or definitions so really the details).

Since I was at it I've changed what @cschwan had added about the naming convention to rst. I've added in the new-commondata.rst file the sentence:

In all cases one can reconstruct the name of the folder by separating the observable name on the last ``_``

which si what we discussed yesterday.

Please feel free to add new documents to this PR, there are many things missing (a TODO below, feel free to add more items by editing this comment).

TODO:

  • Format of the theory definition
  • Motivation etc
  • Tools to facilitate the hepdata integration
  • Integrating it with the rest of the documentation
  • Utilities and implementation of new files

@scarlehoff scarlehoff added documentation Issues and PRs related to documentation data toolchain labels Mar 30, 2023
Comment thread doc/sphinx/source/data/dataset-naming-convention.rst Outdated
@scarlehoff scarlehoff changed the title [WIP] New commondata format [WIP] Docs for new commondata format Mar 31, 2023
@Zaharid
Copy link
Copy Markdown
Contributor

Zaharid commented Apr 3, 2023

How many versions and version comments can there be? Is it a complete changelog or do we forget about everything but the last one?

Organization following the naming convention
--------------------------------------------

All dataset in the new data format follow the exact same naming convention::
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably find a word different from "new" to refer to this format.


<experiment>_<process>_<energy>{_<extras>}_<observable>

The data is contained in folders, each containing one single hepdata publication.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we can do other things that are not hepdata?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course vp doesn't know where the data come from but I think the idea is that you open a folder and it comes from one single hepdata publication.

In the odd case in which it doesn't we hope the person implement it add comments or something explaining what happened there.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was more in the direction that these docs should be rather general.

@@ -0,0 +1,155 @@
Organization following the naming convention
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the convention enforced mechanically? Is it needed to process the names?

Copy link
Copy Markdown
Member Author

@scarlehoff scarlehoff Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed to process the names?

Yes. The folder in which the data is will always be name.rsplit("_", 1)[0] and the choice of the observable name.rsplit("_", 1)[1]

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
@Zaharid
Copy link
Copy Markdown
Contributor

Zaharid commented Apr 3, 2023

We should discuss things like versions vs variants? I do remember being persuaded we needed both but completely forgot why.

@Zaharid
Copy link
Copy Markdown
Contributor

Zaharid commented Apr 3, 2023

If I remember correctly now, we were supposed to keep all the old version metadata files and add a full new one when needed.

@scarlehoff
Copy link
Copy Markdown
Member Author

How many versions and version comments can there be? Is it a complete changelog or do we forget about everything but the last one?

The versions are for actual fixes and the comment should address whatever the change was. Since we have version control I think we can forget about everything but the last one.

We should discuss things like versions vs variants?

The version is a fix of whatever thing. The variant is a variation of the dataset (for instance, a variation of the dataset in which a particular uncertainty is also included).

I'll add some lines about that.

@scarlehoff
Copy link
Copy Markdown
Member Author

I've added some information on variants and theory.

Btw, please feel (@ everyone) free to modify this branch to complete this document.

@Zaharid
Copy link
Copy Markdown
Contributor

Zaharid commented Apr 3, 2023

@enocera @scarlehoff At some point one goal of this effort was mainly that it should be understandable and hopefully implementable by external people. I think it works pretty well for that (in principle, less clear to me in practice), with the exception of the theory section. It requires external names which index into external tool chains. The keys themselves "fktables" are NNPDF jargon. (And there are the ugly double lists...). I believe it that section would look intimidating to an external person trying to understand what is going on.

With that, I think that if the strategic goal is to publicize the format, we'd be better off if this section was in a separate nnpdf_theory.yaml or similar.

@scarlehoff
Copy link
Copy Markdown
Member Author

With that, I think that if the strategic goal is to publicize the format, we'd be better off if this section was in a separate nnpdf_theory.yaml or similar.

I guess in practice this is already the case with the yamldb folder in the new theories. We might want to leave it as part of the theory and never move it to the data.

On the flip side, people wanting to implement new data to use them in a fit will have to deal with that section anyway so I'm not sure whether having it separate is really an advantage at all.

Copy link
Copy Markdown
Contributor

@felixhekhorn felixhekhorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Data
----

The format of the data is a ``yaml`` file with an entry ```data_central``` which is a list for all values for all bins.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should stress somehow more that this is the crucial part

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
@scarlehoff
Copy link
Copy Markdown
Member Author

I added some comments

I've applied the suggestions. Feel free to add the other changes to the docs directly!

@scarlehoff
Copy link
Copy Markdown
Member Author

Adding here a few notes RE the latest changes to the commondata format (/cc @Radonirinaunimi )

I will also update the documentation, but just in case I don't finish today.

Plotting:
The information in PLOTTING_ and corresponding PLOTTINGTYPE_ should be under a plotting: section in the metadata.yaml for each dataset.
The nnpdf31_process at the top of the metadata should match the old names

Kinematics:
It will the order of the kinematics from kinematics_coverage. That means that if a dataset has more than 3, only the 3 defined in kinematics_coverage will be used

Datasets_names.yaml
There will be a datasets_names.yaml file which will define the mapping between the old and the new datasets.

@scarlehoff scarlehoff mentioned this pull request Oct 9, 2023
@scarlehoff scarlehoff force-pushed the document_new_commondata_format branch from 55763a5 to c3afca3 Compare October 9, 2023 16:15
@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Oct 10, 2023

cc @t7phy @Radonirinaunimi
An update with respect to the discussion of yesterday:

kinematic_coverage should not go inside the plotting dictionary but rather be at the same level. This is used to define which set of kinematics to use and is not part of the plotting. Also, the kinematic coverage can (and maybe should) be a list.

In the old plotting file you will see the variable x which is usually (k1, k2, k3 or idat). Please change x to plot_x: <variable>.

And to simplify the part of the cuts, which use the process type, I'd add a process_type as well to the observable...

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
@Radonirinaunimi
Copy link
Copy Markdown
Member

cc @t7phy @Radonirinaunimi An update with respect to the discussion of yesterday:

kinematic_coverage should not go inside the plotting dictionary but rather be at the same level. This is used to define which set of kinematics to use and is not part of the plotting.

In the old plotting file you will see the variable x which is usually (k1, k2, k3 or idat). Please change x to plot_x: <variable>.

Thanks for the heads up @scarlehoff!

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
@scarlehoff
Copy link
Copy Markdown
Member Author

For the people implementing new commondata:

the process_type key can usually be found in the old commondata files (it's the second column, e.g):
https://github.com/NNPDF/nnpdf/blob/master/nnpdfcpp/data/commondata/DATA_ATLASTTBARTOT8TEV.dat

if you are implementing a new one, so you don't have a reference, the best thing you can do is have a look at the list and use the one that looks the closest. This is used for some internal vp operations (for instance for internal cuts) so it is a required key.

{'DIS': ('$x$', '$Q^2 (GeV^2)$', '$y$'),

Comment on lines +47 to +48
implemented_observables:
- observable_name: "OBS"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
implemented_observables:
- observable_name: "OBS"
implemented_observables:
- observable_name: "OBS"

I think that there should be a tab here.

Comment thread doc/sphinx/source/data/new-commondata.rst Outdated
@RoyStegeman RoyStegeman mentioned this pull request Mar 1, 2024
3 tasks
first draft of the documented new commondata format

add definition of the version key

add explanation for the variants and the theory

Apply suggestions from code review

Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com>

Update new-commondata.rst

update docs

Update doc/sphinx/source/data/new-commondata.rst

Update doc/sphinx/source/data/new-commondata.rst

Update doc/sphinx/source/data/new-commondata.rst

Update new-commondata.rst

update docs with the definition of the old:new mapping

Update doc/sphinx/source/data/new-commondata.rst
@scarlehoff scarlehoff force-pushed the document_new_commondata_format branch from f403223 to af1be35 Compare March 3, 2024 11:55
@scarlehoff scarlehoff changed the title [WIP] Docs for new commondata format Docs for new commondata format Mar 3, 2024
@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 3, 2024

This is not perfect, but since we are merging the new commondata format today I wanted to have the documentation in a state in which it can be merged.

I've removed some stuff that was plainly wrong or referred to the old format.

I've added a (more or less detailed) explanation to every entry but plotting. The plotting metadata hasn't really changed that much so I've left a reference to the old plotting metadata with a warning that this information is outdated.

The changes in the last commit I need for the docs to compile correctly in my laptop with py3.12.

Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Comment thread doc/sphinx/source/data/commondata.rst Outdated
Copy link
Copy Markdown
Member

@Radonirinaunimi Radonirinaunimi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @scarlehoff for this extensive documentation. I haven't read it fully but I only have two main comments regarding the:

  • filter: so far, there is no mention of how the commondata are generated from input hepdata. A (couple of) sentence(s) should suffice.
  • positivity and integrability which mainly affects the Observable specific information section.

Comment thread doc/sphinx/source/data/commondata.rst
Comment thread doc/sphinx/source/data/commondata.rst
Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com>
@scarlehoff
Copy link
Copy Markdown
Member Author

filter: so far, there is no mention of how the commondata are generated from input hepdata. A (couple of) sentence(s) should suffice.

I agree but I haven't added any filters so his should be added by people who did it...

RE positivity and integrability, that should be added for whoever ends up in charge of updating the commondata docs (we can assign tasks in the next code meeting).

@scarlehoff scarlehoff merged commit 255a13a into master Mar 4, 2024
@scarlehoff scarlehoff deleted the document_new_commondata_format branch March 4, 2024 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data toolchain documentation Issues and PRs related to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants