[New commondata format] FTDY by scarlehoff · Pull Request #1679 · NNPDF/nnpdf

scarlehoff · 2023-02-23T13:45:23Z

I'm reopening #1610 here @tgiani
This is just a rebase of your commits there.

These metadata.yaml files however contains some errors (and some things that we might want to change?)

The npoints field in hepdata. Do we need this field @enocera ? (if so I'll add it to the reader)
The theory field must always come with an operation even if it is NULL
The FK_tables are actually a list of list.

So, the theory right now is:

theory:
   FK_tables:
     - DYE605

but it should be

theory:
   FK_tables:
     - - DYE605
   operation: "NULL"

(this is due to some FKTables being a concatenation of grids now)

Point 2 is a bit silly in hindsight but for now it is like that...

Zaharid · 2023-02-23T14:50:01Z

FWIW I find

     - - DYE605

ugly. Can we at least make the examples read like

     - [DYE605]

?

More generally, I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point. In addition the current definition has a bunch of backwards compatible and "transactional" semantics. I do wonder if we are better off having all that in a separate file, that would be less stable for the moment.

scarlehoff · 2023-02-23T14:56:27Z

I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point

What do you mean? I don't think it has changed since we defined it in Milan one year ago?

In addition the current definition has a bunch of backwards compatible and "transactional" semantics

like?

Zaharid · 2023-02-23T15:01:04Z

I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point

What do you mean? I don't think it has changed since we defined it in Milan one year ago?

Probably not, but then again one year has passed.

In addition the current definition has a bunch of backwards compatible and "transactional" semantics

like?

https://github.com/NNPDF/nnpdf/pull/1678/files#diff-eb4ccea229f5af5e39ccb1cc2984780e879f823a50477da80cc65ff859717c8bR112-R115

scarlehoff · 2023-02-23T15:04:41Z

https://github.com/NNPDF/nnpdf/pull/1678/files#diff-eb4ccea229f5af5e39ccb1cc2984780e879f823a50477da80cc65ff859717c8bR112-R115

Ah, but only on the theory part. Not sure how transactional they will be though since we need them for the old theories, unless we completely regenerate the affected grids (specially the ones that skip points in the middle).

enocera · 2023-02-23T15:57:50Z

The npoints field in hepdata. Do we need this field @enocera ? (if so I'll add it to the reader)

This is not required and should be deleted.

scarlehoff · 2023-03-23T12:31:15Z

I've updated E605, E866P/R and E906 to follow the new format proposed by @t7phy

For the name convention I'm following what has been proposed by @cschwan, i.e.,

EXPERIMENT_PROCESS_ENERGY_OBSERVABLE_EXTRA

stopping at the point in which the information makes the dataset unique. In this case for E605 and E906, the commondata is under the E605/E906 folders. Instead for E866 there were two publications and so the data is under the folders E866_FTDY_800GEV_PXSEC and E866_FTDY_800GEV_PDXSECRATIO. Please @cschwan and @felixhekhorn have a look at the name convention and let me know whether it is correct.

I've also added two keys under implemented_observables:

The first is dataset which contains the full name of the dataset to be used in runcards. Note that it is at the top of the observable. This doesn't matter for yaml but it might be a nice way of separating the observables.
I think this is useful since otherwise the user needs to "reconstruct" by hand the name which can lead to mistakes (cc @t7phy @enocera I think this would be useful to add in all new commondata).

The second is process, which is necessary in order for the name convention to work.

And one top-level key with energy (same, it is necessary for the naming convention).

I also think the file key under data_uncertainties and data_central should be removed since it is redundant (these two keys already refer to files).

@tgiani one question, you added also a folder DYE866P_old_kinematic but this is in yet a different format. Was this just a leftover? Should I remove it? Or is it a valid variant of DYE866P?

Please, let me know whether all this is correct and whether I can move forward with the reader using these files as target.

cschwan · 2023-03-23T12:58:47Z

I'd

always give at least the experiment and the process (maybe also the energy?). For E605 there are also datasets in which the muon-pair comes from an upsilon resonance, so specifying the process makes it unique again;
use DY instead FTDY because the process part is supposed to specify the final state and the experiment/variant usually specify the initial states. Drell-Yan at Tevatron (ppbar) we also call Drell-Yan.

cschwan · 2023-03-23T13:16:17Z

   version: 1
-   tables: [1,2,3,4,5,6,7]
-   npoints: [17,18,18,18,18,18,12] # number of datapoints in each table
+energy: "38.8EV"


Should probably be GEV instead of EV.

t7phy · 2023-03-23T13:18:01Z

@scarlehoff if you have a look at https://github.com/NNPDF/nnpdf/blob/more_efficient_metadata_for_new_commondata/buildmaster/CMS_ttBar_8TeV_lj_dif/metadata.yaml#L3 and also have a look at line 23, 48, 73, etc. which contain the observable_name, if you were to concat the line 3 setname with observable_name, you can uniquely identify every observable. In the above example, I will have to capitalize them ofc. In your example, the observable_name would just be empty as the implementation is only with a single observable which is uniquely identified by the setname itself.
I would therefore suggest to drop the dataset key in every implemented observable, which is redundant (due to setname). I would also suggest dropping of process and energy from every single implemented observable, because again, the setname at the top should explain everything. I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.
Also, the user is expected to be able to construct the dataset name without making errors because how else do you name the directory where the implementation will be made in the first place?
Concerning the file key, I guess it helps main consistency with the kinematics part of the metadata block, but I don't have any strong preference there.

scarlehoff · 2023-03-23T13:30:17Z

I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.

Actually for the reader I would very much like to have all fields: experiment, energy, process and observable. And 4 of these 3 are required anyway (the process is needed for MHOU). Energy is new but I'm guessing it will be necessary in all LHC datasets.

Of course, we can say that the minimum name for the commondata is EXPERIMENT_ENERGY_PROCESS* and then it is not necessary to have it in the metadata. That will also be fine. Can we have one paper with two different processes/energies?

*with extra information being added in a case-by-case basis

t7phy · 2023-03-23T13:33:18Z

I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.

Actually for the reader I would very much like to have all fields: experiment, energy, process and observable. And 4 of these 3 are required anyway (the process is needed for MHOU). Energy is new but I'm guessing it will be necessary in all LHC datasets.

Of course, we can say that the minimum name for the commondata is EXPERIMENT_ENERGY_PROCESS and then it is not necessary to have it in the metadata. That will also be fine. Can we have one paper with two different processes/energies?

Actually I have implemented both cases. Papers with different processes and papers with different energies. For example H1 datasets containing jets and dijets, or LHC papers containing both 7 and 8 TeV datsets. In these cases we separate the implementations for the processes or energies (suggest by @enocera), so the setname+observable_name can be used for unique identification without any issues.

scarlehoff · 2023-03-23T13:35:16Z

So I should rename:

E605 -> E605_DY_38P8GEV
E906 -> E906_DY_120GEV

And if one wants to add another process from E605 then that will be a separate folder even if it happens to be the same paper.

Did I understand it correctly?

t7phy · 2023-03-23T13:36:26Z

So I should rename:

E605 -> E605_DY_38P8GEV E906 -> E906_DY_120GEV

And if one wants to add another process from E605 then that will be a separate folder even if it happens to be the same paper.

Did I understand it correctly?

Si

t7phy · 2023-03-23T13:48:37Z

I would say even the dataset key in every observable is redundant. All you need is the global key: setname, and the observable specific key: observable_name for unique identification

scarlehoff · 2023-03-23T13:51:06Z

Ok, I've updated now with all the suggested changes.

I've left the dataset key since in the current naming convention it is necessary due to the following:

In general the name of the dataset will be: _ and is EXPERIMENT_PROCESS_ENERGY, so the full name is EXPERIMENT_PROCESS_ENERGY_OBSERVABLE

However, in E866 the folder also includes the observable name, therefore the observable would get added twice. Note that this is also a problem currently in #1684 which is called CMS_TTBAR_8TEV_LJ_DIFF where under this convention LJ_DIFF would be the observable however the actual observables still need to concatenate things at the end: CMS_TTBAR_8TEV_LJ_DIFF_PT, etc

What is the proposed solution for this? As I've said before I would like to avoid having keys that appear in some metadata files and not in others regarding the names.

(I don't have any strong opinion regarding what the correct way should be, I just want to know for sure that I have only one rule to create the name and that I can apply the same rule to all commondata).

t7phy · 2023-03-23T13:58:11Z

Ok, I've updated now with all the suggested changes.

I've left the dataset key since in the current naming convention it is necessary due to the following:

In general the name of the dataset will be: _ and is EXPERIMENT_PROCESS_ENERGY, so the full name is EXPERIMENT_PROCESS_ENERGY_OBSERVABLE

However, in E866 the folder also includes the observable name, therefore the observable would get added twice. Note that this is also a problem currently in #1684 which is called CMS_TTBAR_8TEV_LJ_DIFF where under this convention LJ_DIFF would be the observable however the actual observables still need to concatenate things at the end: CMS_TTBAR_8TEV_LJ_DIFF_PT, etc

What is the proposed solution for this? As I've said before I would like to avoid having keys that appear in some metadata files and not in others regarding the names.

(I don't have any strong opinion regarding what the correct way should be, I just want to know for sure that I have only one rule to create the name and that I can apply the same rule to all commondata).

As I wrote above, for the E866, the simplest solution is to just have an empty value for observable_name as the setname itself uniquely identifies it. As for the convention, it is EXPERIMENT_PROCESS_ENERGY_MISC.INFO_OBSERVABLENAME, this misc info is important as there are datasets in different papers with slight variations (luminosity for instance). The reader could get the exp name by looking at the what comes before 1st _, process before 2nd _ and so on, however the observable name could be obtained by looking at what comes after the last _. I will need to make lots of changes too, because I usually named observables as M_TTBAR or Y_TTBAR which I will modify to MTTBAR and YTTBAR to ensure this convention. Does this seem fine?

tgiani · 2023-03-23T14:00:15Z

@tgiani one question, you added also a folder DYE866P_old_kinematic but this is in yet a different format. Was this just a leftover? Should I remove it? Or is it a valid variant of DYE866P?

Please, let me know whether all this is correct and whether I can move forward with the reader using these files as target.

@scarlehoff that was implemented to cross-check some old implementation of the data. I added it for completeness as asked by @enocera, not sure if we want to keep it

scarlehoff · 2023-03-23T14:05:41Z

Ok, that will basically leave the following convention:

<EXPERIMENT>_<PROCESS>_<ENERGY>_<FREE FIELDS>_<OBSERVABLENAME>

where OBSERVABLENAME is always included in the metadata.yaml and if necessary (case of E866) left empty. The free fields can separate only at the topmost level but I think that's ok, basically any possible ambiguity is "casted" into a separation of folders.

I'm happy with it :D

If @cschwan and @felixhekhorn (also @enocera ofc but I'm assuming he's also ok with it) confirm I will update the datasets here and add it to the docs when dealing with #1691.

scarlehoff · 2023-03-24T07:56:27Z

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

t7phy · 2023-03-24T08:02:36Z

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

I think it's good to go.

enocera · 2023-03-24T08:14:14Z

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

@scarlehoff Let me check what that was about before dropping it, please.

Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com>

scarlehoff · 2023-03-29T15:20:32Z

I've updated the names of the two E866 observables. I've also disconnected this branch from the reader (it's easy enough to copy the relevant folders every time I want to test the reader).

scarlehoff · 2023-03-30T12:16:30Z

@enocera @tgiani I think the MULT and ADD systematics uncertainties are swapped (at least in the E605 dataset). Do to that I think the ADD uncertainties are computed wrong (since I guess the single ADD is a combination of all the other ADDs).

However, I don't see the 'add in quadrature' operation in the filter.py (and there are not many comments there...) so I cannot help but wonder... why are the number of systematics uncertainties different in the old and new metadata? Which one of the two is bugged?

enocera · 2023-03-30T14:20:54Z

@enocera @tgiani I think the MULT and ADD systematics uncertainties are swapped (at least in the E605 dataset). Do to that I think the ADD uncertainties are computed wrong (since I guess the single ADD is a combination of all the other ADDs).

However, I don't see the 'add in quadrature' operation in the filter.py (and there are not many comments there...) so I cannot help but wonder... why are the number of systematics uncertainties different in the old and new metadata? Which one of the two is bugged?

@scarlehoff Let me also check this. However, as I mentioned in #1684 may it be that the difference in the number of sys uncertainties is that we only have absolute uncs (in the new format) instead of absolute and relative (in the old format)?

scarlehoff · 2023-03-30T14:49:53Z

Upon further inspection, they are not swapped as (MULT <-> ADD) but rather as (1st <-> 2nd) (I was comparing old vs new, so they came in a different order).

The MULT is ok.

But the ADD still looks wrong (scratch the added in quadrature, it's just a factor of two...)
Also, the values of stat seem slightly different:

For the first 10 points we have:

new:

entry
1     106.0
2      45.9
3      40.3
4      45.5
5      21.5
6      31.2
7      28.5
8      19.8
9      13.4
10     13.9

old:

entry
1     106.0
2      46.0
3      40.0
4      46.0
5      22.0
6      31.0
7      28.0
8      20.0
9      13.0
10     14.0

It's like in the new one we keep 3 digit while the old ones keep only 2?

enocera · 2023-03-30T15:10:43Z

It's like in the new one we keep 3 digit while the old ones keep only 2?

@tgiani can correct me if I'm wrong, but I think that this difference is due to the input rawdata which, in the new format, is taken from Hepdata tables, while in the old format comes either from the paper or from some files of unknown origin.

enocera · 2023-03-30T15:11:36Z

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.
@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

@scarlehoff Let me check what that was about before dropping it, please.

You can drop it.

scarlehoff · 2023-03-30T15:13:27Z

Ah, ok! I thought they should match the old ones. Are these kinds of differences always expected? To know what level of accuracy should I expect when testing... I can just do "ok, if it's ok at 5% then I'm ok with it". I would like to automatize this testing as much as possible.

enocera · 2023-03-30T15:15:39Z

Ah, ok! I thought they should match the old ones. Are these kinds of differences always expected? To know what level of accuracy should I expect when testing... I can just do "ok, if it's ok at 5% then I'm ok with it". I would like to automatize this testing as much as possible.

For old data sets (that is data sets NOT taken from Hepdata) these differences can happen.

scarlehoff · 2023-03-31T09:23:32Z

@enocera follow up questions

DYE866P is completely different. I guess this is the reason an older ersion was also done. Do you know the reason for these differences? So that I can add them as comments.

For now the differences for this branch that I don't understand are:

DYE605: a factor of two in the additive uncertainties.
DYE886P: see above
DYE886R: only kinematics are different
DYE906R: there is a factor of 1 for some of the uncertainties

Kinematics are always different. I'll redo the kinematics.yaml table so that it is the same as the old ones.

Of course, this is the "raw" datasets... but all the dw (or dw_ite) are still to be done, right?

enocera · 2023-03-31T11:19:05Z

@enocera follow up questions

* DYE866P is completely different. I guess this is the reason an older ersion was also done. Do you know the reason for these differences? So that I can add them as comments.

Not on top of my head. Let me check.

For now the differences for this branch that I don't understand are:

* DYE605: a factor of two in the additive uncertainties.

* DYE886P: see above

* DYE886R: only kinematics are different

* DYE906R: there is a factor of 1 for some of the uncertainties

Let me check all these.

Of course, this is the "raw" datasets... but all the dw (or dw_ite) are still to be done, right?

Correct. And that's because I told @tgiani to ignore variants for the time being. Let me implement these variants.

scarlehoff · 2023-03-31T12:19:01Z

Just to clarify, the factor of 1 is actually of -1 (otherwise it would be silly of me to mention it!) and it happens for sys_4 and sys_5 and I think they are to the ambiguity of taking + or - when taking the square root (so at the end of the day the produced covmat would be the same).
The question is whether this differences between old and new are ok.

scarlehoff · 2023-10-10T10:08:07Z

Hi @enocera, are these datasets finished? (I remember you mentioned there were some issues that you might want to update, but I don't remember exactly what was the situation...)

If these are finished I was thinking of updating them with the plotting options etc to start collecting datasets in #1813

scarlehoff · 2024-12-06T20:26:55Z

At this point this is very much obsolete. The information is not lost in case it is necessary so I'm just closing the PR.

scarlehoff assigned enocera and tgiani Feb 23, 2023

scarlehoff added the data toolchain label Feb 23, 2023

scarlehoff mentioned this pull request Feb 23, 2023

[New commondata format] DY #1610

Closed

scarlehoff force-pushed the E605 branch from 96ea438 to b53c0ed Compare February 23, 2023 15:00

Zaharid mentioned this pull request Feb 23, 2023

New CommonData Reader #1678

Merged

3 tasks

scarlehoff self-assigned this Mar 22, 2023

cschwan reviewed Mar 23, 2023

View reviewed changes

felixhekhorn reviewed Mar 23, 2023

View reviewed changes

Comment thread buildmaster/E605_DY_38P8GEV/metadata.yaml Outdated

felixhekhorn reviewed Mar 24, 2023

View reviewed changes

Comment thread buildmaster/E906_DY_120GEV/metadata.yaml Outdated

scarlehoff and others added 6 commits March 29, 2023 17:18

applied cs comments

bcc7b63

update names and removed redundant information

a0dc7ab

forgot E906

571d0d8

update the naming

02c0ca9

Update buildmaster/E906_DY_120GEV/metadata.yaml

5f8ae78

Co-authored-by: Felix Hekhorn <felixhekhorn@users.noreply.github.com>

change the name of E866

d6c9298

scarlehoff force-pushed the E605 branch from ad10063 to d6c9298 Compare March 29, 2023 15:18

scarlehoff changed the base branch from final_reader_for_new_commondata_mk2 to master March 29, 2023 15:19

definition to definitions and remove old data

eddf579

scarlehoff mentioned this pull request Mar 31, 2023

Status of the new commondata format implementation #1709

Closed

77 tasks

fix to E866

431ee4f

scarlehoff added 2 commits March 31, 2023 12:42

update E605 and filter script

ca1dfda

update filter files and data

2ce7bfd

scarlehoff changed the title ~~[New commondata format] DY~~ [New commondata format] FTDY Oct 11, 2023

scarlehoff mentioned this pull request Nov 27, 2023

Re-implementation of some of the ATLAS Collider DY Datasets #1866

Closed

1 task

scarlehoff closed this Dec 6, 2024

scarlehoff deleted the E605 branch December 6, 2024 20:26

Conversation

scarlehoff commented Feb 23, 2023

Uh oh!

Zaharid commented Feb 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scarlehoff commented Feb 23, 2023

Uh oh!

Zaharid commented Feb 23, 2023

Uh oh!

scarlehoff commented Feb 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enocera commented Feb 23, 2023

Uh oh!

scarlehoff commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cschwan commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cschwan Mar 23, 2023

Choose a reason for hiding this comment

Uh oh!

t7phy commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scarlehoff commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

t7phy commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scarlehoff commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

t7phy commented Mar 23, 2023

Uh oh!

t7phy commented Mar 23, 2023

Uh oh!

scarlehoff commented Mar 23, 2023

Uh oh!

t7phy commented Mar 23, 2023

Uh oh!

tgiani commented Mar 23, 2023

Uh oh!

scarlehoff commented Mar 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

scarlehoff commented Mar 24, 2023

Uh oh!

t7phy commented Mar 24, 2023

Uh oh!

enocera commented Mar 24, 2023

Uh oh!

Uh oh!

scarlehoff commented Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scarlehoff commented Mar 30, 2023

Uh oh!

enocera commented Mar 30, 2023

Uh oh!

scarlehoff commented Mar 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enocera commented Mar 30, 2023

Uh oh!

enocera commented Mar 30, 2023

Uh oh!

scarlehoff commented Mar 30, 2023

Uh oh!

enocera commented Mar 30, 2023

Uh oh!

scarlehoff commented Mar 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enocera commented Mar 31, 2023

Uh oh!

Zaharid commented Feb 23, 2023 •

edited

Loading

scarlehoff commented Feb 23, 2023 •

edited

Loading

scarlehoff commented Mar 23, 2023 •

edited

Loading

cschwan commented Mar 23, 2023 •

edited

Loading

t7phy commented Mar 23, 2023 •

edited

Loading

scarlehoff commented Mar 23, 2023 •

edited

Loading

t7phy commented Mar 23, 2023 •

edited

Loading

scarlehoff commented Mar 23, 2023 •

edited

Loading

scarlehoff commented Mar 23, 2023 •

edited

Loading

scarlehoff commented Mar 29, 2023 •

edited

Loading

scarlehoff commented Mar 30, 2023 •

edited

Loading

scarlehoff commented Mar 31, 2023 •

edited

Loading

scarlehoff commented Mar 31, 2023 •

edited

Loading