Skip to content

Polish legacy jet datasets#2056

Merged
comane merged 26 commits into
masterfrom
polish_legacy_jet_datasets
Jun 10, 2024
Merged

Polish legacy jet datasets#2056
comane merged 26 commits into
masterfrom
polish_legacy_jet_datasets

Conversation

@comane
Copy link
Copy Markdown
Member

@comane comane commented Apr 17, 2024

The scope of this PR is to (try to) polish the new implementation of the legacy jet datasets, namely, CMS_2JET_7TEV, CMS_1JET_8TEV, ATLAS_1JET_8TEV_R06, ATLAS_2JET_7TEV_R06.

What the PR does:

  1. Adds a filter_utils.utils.py module. This module should collect the functions that are generalisable and could, in principle, be used for all datasets.
  2. Adds a filter_utils.legacy_jets_utils.py module that collects all of the utils functions for the filter files. Polishing these is going to be much harder given the nature of the raw data.

Copy link
Copy Markdown
Member

@Radonirinaunimi Radonirinaunimi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @comane for this! I will have a detailed look soon, but when skimming through I think there are at least two main issues with the CMS_1JET_8TEV_PTY dataset:

  • the variable in the kinematic has changed from p_T to p_T2 while in the metadata it's still kept as p_T
  • the file uncertainties_bugged.yaml no longer exists but it is still referenced in the metadata

@RoyStegeman
Copy link
Copy Markdown
Member

RoyStegeman commented Apr 24, 2024

This was not done in this PR, but I changed it since I don't think we should start getting in the habit of suppressing warnings if there is another option

@comane
Copy link
Copy Markdown
Member Author

comane commented Apr 24, 2024

This was not done in this PR, but I changed it since I don't think we should start getting in the habit of suppressing warnings if there is another option

Thanks @RoyStegeman, just to be sure I understand: removing the read "r" from with open avoids the pandas warnings?

@RoyStegeman
Copy link
Copy Markdown
Member

RoyStegeman commented Apr 24, 2024

No, removing "r" is just because that's the default option anyway.

The warnings are fixed around line 900 in legacy_jets_utils.py.

A similar error, with similar solution, that I didn't fix still exists for ATLAS_2JET_7TEV_R06. I didn't fix it now because that filter.py is broken anyway.

@RoyStegeman
Copy link
Copy Markdown
Member

Since filter files will now share utilities, and we may need to change/fix those shared utilities at some point - do you think it's feasible to run all filters in the CI to test if the output remains unchanged?

@RoyStegeman
Copy link
Copy Markdown
Member

p_T2 = "p_T2" # This one is wrong, should be pT2

@scarlehoff what should be done in this PR?

@RoyStegeman RoyStegeman force-pushed the polish_legacy_jet_datasets branch from 7138ebe to 517b046 Compare April 24, 2024 18:37
@scarlehoff
Copy link
Copy Markdown
Member

scarlehoff commented Apr 24, 2024

@scarlehoff what should be done in this PR?

Change it to pT2 I'd say.

@RoyStegeman
Copy link
Copy Markdown
Member

@comane I see you addressed the CI fails (as pointed out by @Radonirinaunimi), but could you rename p_T2 to pT2?

@comane
Copy link
Copy Markdown
Member Author

comane commented May 7, 2024

@comane I see you addressed the CI fails (as pointed out by @Radonirinaunimi), but could you rename p_T2 to pT2?

Sure, I am not sure why though?
Who wrote that comment there and why is p_T2 wrong? The tests seem to be checking for p_T2 at the moment, not for pT2

So maybe I can simply remove that comment?

@RoyStegeman
Copy link
Copy Markdown
Member

RoyStegeman commented May 7, 2024

I assume @scarlehoff wrote it. Either way I agree with it for consistency with pT and pT_t

@comane comane force-pushed the polish_legacy_jet_datasets branch from 1276509 to 8cf8687 Compare May 7, 2024 16:43
Comment thread validphys2/src/validphys/cuts/filters.yaml Outdated
Comment thread validphys2/src/validphys/cuts/lockfiles/31_filters.lock.yaml Outdated
label: $|y|$
units: ''
pT:
pT2:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did was this change made (I see the corresponding change was also done in kinematics.yaml)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not completely sure, but I think this was discussed with @scarlehoff at some point.
The reason is probably that tanishq started using pT2 instead of pT, hence just to uniform to that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it back to pT. I think this makes more sense as is the kin var for the process JET

return {"min": min, "mid": mid, "max": max}


# ==================================================================== CMS_2JET_7TEV ====================================================================#
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are only used for a single dataset, why are they not inside the folder of the corresponding dataset?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is how it was before, where I had a filter_utils.py module in each of the datasets.

I guess that the idea is that in the dataset folders we only have a filter.py that produces the dataset, all of the utils instead go in nnpdf_data/filter_utils.
Otherwise we would have some utils in nnpdf_data/filter_utils and other within the dataset folder, which would be more confusing.

Comment thread nnpdf_data/nnpdf_data/filter_utils/legacy_jets_utils.py
@scarlehoff scarlehoff mentioned this pull request May 20, 2024
33 tasks
Copy link
Copy Markdown
Member

@scarlehoff scarlehoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a __init__.py file to filter_utils as well.

In all honesty, I'm not entirely sure I see the point of having a legacy_jet_utils.py function where the functions inside are actually specific for each dataset.

However, I'm ok with concentrating all that stuff into filter_utils so that it can be easily skipped by the installation with the right rules in pyproject.toml.

Comment thread nnpdf_data/nnpdf_data/new_commondata/ATLAS_2JET_7TEV_R06/uncertainties.yaml Outdated
Comment thread nnpdf_data/nnpdf_data/filter_utils/legacy_jets_utils.py Outdated
@comane comane force-pushed the polish_legacy_jet_datasets branch from f7dd15a to 88fd513 Compare June 6, 2024 13:56
Copy link
Copy Markdown
Member

@Radonirinaunimi Radonirinaunimi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM now! Thanks @comane.

@scarlehoff, are you fine with merging this now? Such that I can continue with #2100 and #2099.

@scarlehoff
Copy link
Copy Markdown
Member

Yes. I think it should be ok.

@comane comane merged commit 4472155 into master Jun 10, 2024
@comane comane deleted the polish_legacy_jet_datasets branch June 10, 2024 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants