Skip to content

[New commondata format] FTDY#1679

Closed
scarlehoff wants to merge 22 commits into
masterfrom
E605
Closed

[New commondata format] FTDY#1679
scarlehoff wants to merge 22 commits into
masterfrom
E605

Conversation

@scarlehoff
Copy link
Copy Markdown
Member

I'm reopening #1610 here @tgiani
This is just a rebase of your commits there.

These metadata.yaml files however contains some errors (and some things that we might want to change?)

  1. The npoints field in hepdata. Do we need this field @enocera ? (if so I'll add it to the reader)
  2. The theory field must always come with an operation even if it is NULL
  3. The FK_tables are actually a list of list.

So, the theory right now is:

theory:
   FK_tables:
     - DYE605

but it should be

theory:
   FK_tables:
     - - DYE605
   operation: "NULL"

(this is due to some FKTables being a concatenation of grids now)

Point 2 is a bit silly in hindsight but for now it is like that...

@Zaharid
Copy link
Copy Markdown
Contributor

Zaharid commented Feb 23, 2023

FWIW I find

     - - DYE605

ugly. Can we at least make the examples read like

     - [DYE605]

?

More generally, I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point. In addition the current definition has a bunch of backwards compatible and "transactional" semantics. I do wonder if we are better off having all that in a separate file, that would be less stable for the moment.

@scarlehoff
Copy link
Copy Markdown
Member Author

I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point

What do you mean? I don't think it has changed since we defined it in Milan one year ago?

In addition the current definition has a bunch of backwards compatible and "transactional" semantics

like?

@Zaharid
Copy link
Copy Markdown
Contributor

Zaharid commented Feb 23, 2023

I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point

What do you mean? I don't think it has changed since we defined it in Milan one year ago?

Probably not, but then again one year has passed.

In addition the current definition has a bunch of backwards compatible and "transactional" semantics

like?

https://github.com/NNPDF/nnpdf/pull/1678/files#diff-eb4ccea229f5af5e39ccb1cc2984780e879f823a50477da80cc65ff859717c8bR112-R115

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Feb 23, 2023

https://github.com/NNPDF/nnpdf/pull/1678/files#diff-eb4ccea229f5af5e39ccb1cc2984780e879f823a50477da80cc65ff859717c8bR112-R115

Ah, but only on the theory part. Not sure how transactional they will be though since we need them for the old theories, unless we completely regenerate the affected grids (specially the ones that skip points in the middle).

@Zaharid Zaharid mentioned this pull request Feb 23, 2023
3 tasks
@enocera
Copy link
Copy Markdown
Contributor

enocera commented Feb 23, 2023

  1. The npoints field in hepdata. Do we need this field @enocera ? (if so I'll add it to the reader)

This is not required and should be deleted.

@scarlehoff scarlehoff self-assigned this Mar 22, 2023
@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 23, 2023

I've updated E605, E866P/R and E906 to follow the new format proposed by @t7phy

For the name convention I'm following what has been proposed by @cschwan, i.e.,

EXPERIMENT_PROCESS_ENERGY_OBSERVABLE_EXTRA

stopping at the point in which the information makes the dataset unique. In this case for E605 and E906, the commondata is under the E605/E906 folders. Instead for E866 there were two publications and so the data is under the folders E866_FTDY_800GEV_PXSEC and E866_FTDY_800GEV_PDXSECRATIO. Please @cschwan and @felixhekhorn have a look at the name convention and let me know whether it is correct.

I've also added two keys under implemented_observables:

The first is dataset which contains the full name of the dataset to be used in runcards. Note that it is at the top of the observable. This doesn't matter for yaml but it might be a nice way of separating the observables.
I think this is useful since otherwise the user needs to "reconstruct" by hand the name which can lead to mistakes (cc @t7phy @enocera I think this would be useful to add in all new commondata).

The second is process, which is necessary in order for the name convention to work.

And one top-level key with energy (same, it is necessary for the naming convention).

I also think the file key under data_uncertainties and data_central should be removed since it is redundant (these two keys already refer to files).

@tgiani one question, you added also a folder DYE866P_old_kinematic but this is in yet a different format. Was this just a leftover? Should I remove it? Or is it a valid variant of DYE866P?

Please, let me know whether all this is correct and whether I can move forward with the reader using these files as target.

@cschwan
Copy link
Copy Markdown
Contributor

cschwan commented Mar 23, 2023

I'd

  • always give at least the experiment and the process (maybe also the energy?). For E605 there are also datasets in which the muon-pair comes from an upsilon resonance, so specifying the process makes it unique again;
  • use DY instead FTDY because the process part is supposed to specify the final state and the experiment/variant usually specify the initial states. Drell-Yan at Tevatron (ppbar) we also call Drell-Yan.

Comment thread buildmaster/E605/metadata.yaml Outdated
version: 1
tables: [1,2,3,4,5,6,7]
npoints: [17,18,18,18,18,18,12] # number of datapoints in each table
energy: "38.8EV"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be GEV instead of EV.

@t7phy
Copy link
Copy Markdown
Collaborator

t7phy commented Mar 23, 2023

@scarlehoff if you have a look at https://github.com/NNPDF/nnpdf/blob/more_efficient_metadata_for_new_commondata/buildmaster/CMS_ttBar_8TeV_lj_dif/metadata.yaml#L3 and also have a look at line 23, 48, 73, etc. which contain the observable_name, if you were to concat the line 3 setname with observable_name, you can uniquely identify every observable. In the above example, I will have to capitalize them ofc. In your example, the observable_name would just be empty as the implementation is only with a single observable which is uniquely identified by the setname itself.
I would therefore suggest to drop the dataset key in every implemented observable, which is redundant (due to setname). I would also suggest dropping of process and energy from every single implemented observable, because again, the setname at the top should explain everything. I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.
Also, the user is expected to be able to construct the dataset name without making errors because how else do you name the directory where the implementation will be made in the first place?
Concerning the file key, I guess it helps main consistency with the kinematics part of the metadata block, but I don't have any strong preference there.

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 23, 2023

I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.

Actually for the reader I would very much like to have all fields: experiment, energy, process and observable. And 4 of these 3 are required anyway (the process is needed for MHOU). Energy is new but I'm guessing it will be necessary in all LHC datasets.

Of course, we can say that the minimum name for the commondata is EXPERIMENT_ENERGY_PROCESS* and then it is not necessary to have it in the metadata. That will also be fine. Can we have one paper with two different processes/energies?

*with extra information being added in a case-by-case basis

@t7phy
Copy link
Copy Markdown
Collaborator

t7phy commented Mar 23, 2023

I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.

Actually for the reader I would very much like to have all fields: experiment, energy, process and observable. And 4 of these 3 are required anyway (the process is needed for MHOU). Energy is new but I'm guessing it will be necessary in all LHC datasets.

Of course, we can say that the minimum name for the commondata is EXPERIMENT_ENERGY_PROCESS and then it is not necessary to have it in the metadata. That will also be fine. Can we have one paper with two different processes/energies?

Actually I have implemented both cases. Papers with different processes and papers with different energies. For example H1 datasets containing jets and dijets, or LHC papers containing both 7 and 8 TeV datsets. In these cases we separate the implementations for the processes or energies (suggest by @enocera), so the setname+observable_name can be used for unique identification without any issues.

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 23, 2023

So I should rename:

E605 -> E605_DY_38P8GEV
E906 -> E906_DY_120GEV

And if one wants to add another process from E605 then that will be a separate folder even if it happens to be the same paper.

Did I understand it correctly?

@t7phy
Copy link
Copy Markdown
Collaborator

t7phy commented Mar 23, 2023

So I should rename:

E605 -> E605_DY_38P8GEV E906 -> E906_DY_120GEV

And if one wants to add another process from E605 then that will be a separate folder even if it happens to be the same paper.

Did I understand it correctly?

Si

@t7phy
Copy link
Copy Markdown
Collaborator

t7phy commented Mar 23, 2023

I would say even the dataset key in every observable is redundant. All you need is the global key: setname, and the observable specific key: observable_name for unique identification

@scarlehoff
Copy link
Copy Markdown
Member Author

Ok, I've updated now with all the suggested changes.

I've left the dataset key since in the current naming convention it is necessary due to the following:

In general the name of the dataset will be: _ and is EXPERIMENT_PROCESS_ENERGY, so the full name is EXPERIMENT_PROCESS_ENERGY_OBSERVABLE

However, in E866 the folder also includes the observable name, therefore the observable would get added twice. Note that this is also a problem currently in #1684 which is called CMS_TTBAR_8TEV_LJ_DIFF where under this convention LJ_DIFF would be the observable however the actual observables still need to concatenate things at the end: CMS_TTBAR_8TEV_LJ_DIFF_PT, etc

What is the proposed solution for this? As I've said before I would like to avoid having keys that appear in some metadata files and not in others regarding the names.

(I don't have any strong opinion regarding what the correct way should be, I just want to know for sure that I have only one rule to create the name and that I can apply the same rule to all commondata).

@t7phy
Copy link
Copy Markdown
Collaborator

t7phy commented Mar 23, 2023

Ok, I've updated now with all the suggested changes.

I've left the dataset key since in the current naming convention it is necessary due to the following:

In general the name of the dataset will be: _ and is EXPERIMENT_PROCESS_ENERGY, so the full name is EXPERIMENT_PROCESS_ENERGY_OBSERVABLE

However, in E866 the folder also includes the observable name, therefore the observable would get added twice. Note that this is also a problem currently in #1684 which is called CMS_TTBAR_8TEV_LJ_DIFF where under this convention LJ_DIFF would be the observable however the actual observables still need to concatenate things at the end: CMS_TTBAR_8TEV_LJ_DIFF_PT, etc

What is the proposed solution for this? As I've said before I would like to avoid having keys that appear in some metadata files and not in others regarding the names.

(I don't have any strong opinion regarding what the correct way should be, I just want to know for sure that I have only one rule to create the name and that I can apply the same rule to all commondata).

As I wrote above, for the E866, the simplest solution is to just have an empty value for observable_name as the setname itself uniquely identifies it. As for the convention, it is EXPERIMENT_PROCESS_ENERGY_MISC.INFO_OBSERVABLENAME, this misc info is important as there are datasets in different papers with slight variations (luminosity for instance). The reader could get the exp name by looking at the what comes before 1st _, process before 2nd _ and so on, however the observable name could be obtained by looking at what comes after the last _. I will need to make lots of changes too, because I usually named observables as M_TTBAR or Y_TTBAR which I will modify to MTTBAR and YTTBAR to ensure this convention. Does this seem fine?

@tgiani
Copy link
Copy Markdown
Contributor

tgiani commented Mar 23, 2023

@tgiani one question, you added also a folder DYE866P_old_kinematic but this is in yet a different format. Was this just a leftover? Should I remove it? Or is it a valid variant of DYE866P?

Please, let me know whether all this is correct and whether I can move forward with the reader using these files as target.

@scarlehoff that was implemented to cross-check some old implementation of the data. I added it for completeness as asked by @enocera, not sure if we want to keep it

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 23, 2023

Ok, that will basically leave the following convention:

<EXPERIMENT>_<PROCESS>_<ENERGY>_<FREE FIELDS>_<OBSERVABLENAME>

where OBSERVABLENAME is always included in the metadata.yaml and if necessary (case of E866) left empty. The free fields can separate only at the topmost level but I think that's ok, basically any possible ambiguity is "casted" into a separation of folders.

I'm happy with it :D

If @cschwan and @felixhekhorn (also @enocera ofc but I'm assuming he's also ok with it) confirm I will update the datasets here and add it to the docs when dealing with #1691.

Comment thread buildmaster/E605_DY_38P8GEV/metadata.yaml Outdated
@scarlehoff
Copy link
Copy Markdown
Member Author

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

@t7phy
Copy link
Copy Markdown
Collaborator

t7phy commented Mar 24, 2023

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

I think it's good to go.

@enocera
Copy link
Copy Markdown
Contributor

enocera commented Mar 24, 2023

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

@scarlehoff Let me check what that was about before dropping it, please.

Comment thread buildmaster/E906_DY_120GEV/metadata.yaml Outdated
@scarlehoff scarlehoff changed the base branch from final_reader_for_new_commondata_mk2 to master March 29, 2023 15:19
@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 29, 2023

I've updated the names of the two E866 observables. I've also disconnected this branch from the reader (it's easy enough to copy the relevant folders every time I want to test the reader).

@scarlehoff
Copy link
Copy Markdown
Member Author

@enocera @tgiani I think the MULT and ADD systematics uncertainties are swapped (at least in the E605 dataset). Do to that I think the ADD uncertainties are computed wrong (since I guess the single ADD is a combination of all the other ADDs).

However, I don't see the 'add in quadrature' operation in the filter.py (and there are not many comments there...) so I cannot help but wonder... why are the number of systematics uncertainties different in the old and new metadata? Which one of the two is bugged?

@enocera
Copy link
Copy Markdown
Contributor

enocera commented Mar 30, 2023

@enocera @tgiani I think the MULT and ADD systematics uncertainties are swapped (at least in the E605 dataset). Do to that I think the ADD uncertainties are computed wrong (since I guess the single ADD is a combination of all the other ADDs).

However, I don't see the 'add in quadrature' operation in the filter.py (and there are not many comments there...) so I cannot help but wonder... why are the number of systematics uncertainties different in the old and new metadata? Which one of the two is bugged?

@scarlehoff Let me also check this. However, as I mentioned in #1684 may it be that the difference in the number of sys uncertainties is that we only have absolute uncs (in the new format) instead of absolute and relative (in the old format)?

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 30, 2023

Upon further inspection, they are not swapped as (MULT <-> ADD) but rather as (1st <-> 2nd) (I was comparing old vs new, so they came in a different order).

The MULT is ok.

But the ADD still looks wrong (scratch the added in quadrature, it's just a factor of two...)
Also, the values of stat seem slightly different:

For the first 10 points we have:

new:

entry
1     106.0
2      45.9
3      40.3
4      45.5
5      21.5
6      31.2
7      28.5
8      19.8
9      13.4
10     13.9

old:

entry
1     106.0
2      46.0
3      40.0
4      46.0
5      22.0
6      31.0
7      28.0
8      20.0
9      13.0
10     14.0

It's like in the new one we keep 3 digit while the old ones keep only 2?

@enocera
Copy link
Copy Markdown
Contributor

enocera commented Mar 30, 2023

It's like in the new one we keep 3 digit while the old ones keep only 2?

@tgiani can correct me if I'm wrong, but I think that this difference is due to the input rawdata which, in the new format, is taken from Hepdata tables, while in the old format comes either from the paper or from some files of unknown origin.

@enocera
Copy link
Copy Markdown
Contributor

enocera commented Mar 30, 2023

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.
@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

@scarlehoff Let me check what that was about before dropping it, please.

You can drop it.

@scarlehoff
Copy link
Copy Markdown
Member Author

Ah, ok! I thought they should match the old ones. Are these kinds of differences always expected? To know what level of accuracy should I expect when testing... I can just do "ok, if it's ok at 5% then I'm ok with it". I would like to automatize this testing as much as possible.

@enocera
Copy link
Copy Markdown
Contributor

enocera commented Mar 30, 2023

Ah, ok! I thought they should match the old ones. Are these kinds of differences always expected? To know what level of accuracy should I expect when testing... I can just do "ok, if it's ok at 5% then I'm ok with it". I would like to automatize this testing as much as possible.

For old data sets (that is data sets NOT taken from Hepdata) these differences can happen.

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 31, 2023

@enocera follow up questions

  • DYE866P is completely different. I guess this is the reason an older ersion was also done. Do you know the reason for these differences? So that I can add them as comments.

Screenshot 2023-03-31 at 11 07 56

For now the differences for this branch that I don't understand are:

  • DYE605: a factor of two in the additive uncertainties.
  • DYE886P: see above
  • DYE886R: only kinematics are different
  • DYE906R: there is a factor of 1 for some of the uncertainties

Kinematics are always different. I'll redo the kinematics.yaml table so that it is the same as the old ones.

Of course, this is the "raw" datasets... but all the dw (or dw_ite) are still to be done, right?

@enocera
Copy link
Copy Markdown
Contributor

enocera commented Mar 31, 2023

@enocera follow up questions

* DYE866P is completely different. I guess this is the reason an older ersion was also done. Do you know the reason for these differences? So that I can add them as comments.

Not on top of my head. Let me check.

For now the differences for this branch that I don't understand are:

* DYE605: a factor of two in the additive uncertainties.

* DYE886P: see above

* DYE886R: only kinematics are different

* DYE906R: there is a factor of 1 for some of the uncertainties

Let me check all these.

Of course, this is the "raw" datasets... but all the dw (or dw_ite) are still to be done, right?

Correct. And that's because I told @tgiani to ignore variants for the time being. Let me implement these variants.

@scarlehoff
Copy link
Copy Markdown
Member Author

scarlehoff commented Mar 31, 2023

Just to clarify, the factor of 1 is actually of -1 (otherwise it would be silly of me to mention it!) and it happens for sys_4 and sys_5 and I think they are to the ambiguity of taking + or - when taking the square root (so at the end of the day the produced covmat would be the same).
The question is whether this differences between old and new are ok.

@scarlehoff
Copy link
Copy Markdown
Member Author

Hi @enocera, are these datasets finished? (I remember you mentioned there were some issues that you might want to update, but I don't remember exactly what was the situation...)

If these are finished I was thinking of updating them with the plotting options etc to start collecting datasets in #1813

@scarlehoff scarlehoff changed the title [New commondata format] DY [New commondata format] FTDY Oct 11, 2023
@scarlehoff
Copy link
Copy Markdown
Member Author

At this point this is very much obsolete. The information is not lost in case it is necessary so I'm just closing the PR.

@scarlehoff scarlehoff closed this Dec 6, 2024
@scarlehoff scarlehoff deleted the E605 branch December 6, 2024 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants