
Add similarity cuts#1057

Merged
scarrazza merged 9 commits into master from morecuts on Mar 3, 2021

Conversation

@Zaharid
Contributor

Zaharid commented Jan 20, 2021

Add a cut option that throws away data for which the predictions in two
different namespaces are too different. This uses the existing
cuts-from-intersection functionality and further refines it to filter out
points that do not fulfill

(threshold < ratio) & (ratio < 1/threshold)

where the threshold is set by the cut_similarity_threshold parameter
in the runcard. The two namespaces are defined by the
cuts_intersection_spec list, and the ratio is the prediction in the first
namespace over that in the second.
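
A minimal sketch of the resulting selection, assuming two arrays of central predictions for the same points (illustrative names, not the actual validphys code):

import numpy as np

def similarity_mask(preds_a, preds_b, threshold):
    # Keep points where the ratio lies in (threshold, 1/threshold);
    # assumes 0 < threshold < 1.
    ratio = preds_a / preds_b
    return (threshold < ratio) & (ratio < 1 / threshold)

# With threshold 0.99 only near-identical predictions survive:
similarity_mask(np.array([1.0, 1.5]), np.array([1.005, 1.0]), 0.99)
# -> array([ True, False])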

Some discussion:

  • This isn't the smartest code I have ever written.

  • I am using actual predictions instead of cfactors. This is because cfactors are incidental rather than physically meaningful, but more importantly because the cfactor approach starts looking very cumbersome once one considers things like compound observables. In the end it pays off to do the job properly and solve the problem in general.

  • The use of central_predictions inside load is a bit of a hack, but
    it is not clear to me it can be done much better in a more
    "reportengine idiomatic" style. The logic is rather convoluted if you
    sit down to think about it. In any case validphys.convolution is too
    convenient for this to use anything else.

  • Moreover central_predictions is slow because it needs to load
    fktables as CSV. We should probably stick lru_cache calls in various
    places (see the sketch after this list).

  • Tests and documentation are missing at the moment.
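
A sketch of the caching idea (illustrative names, not the actual validphys code). Note that anything cached this way must be hashable, which ties into the TupleComp discussion below:

import functools

class FKTableSpec:
    # Toy spec: identity defined by the path, so equal specs share a cache entry.
    def __init__(self, path):
        self.path = path

    def __eq__(self, other):
        return isinstance(other, FKTableSpec) and self.path == other.path

    def __hash__(self):
        return hash(self.path)

    @functools.lru_cache(maxsize=None)
    def load(self):
        print(f"parsing {self.path}")  # stand-in for the slow CSV parsing
        return {"path": self.path}

FKTableSpec("data.csv").load()
FKTableSpec("data.csv").load()  # cache hit: parsing happens only once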

@Zaharid
Contributor Author

Zaharid commented Jan 20, 2021

Example usage

dataset_input: {dataset: NMC}

theoryid: 53
pdf: NNPDF31_nnlo_as_0118

use_cuts: fromsimilarpredictions
cut_similarity_threshold: 0.99

cuts_intersection_spec:
    - theoryid: 52
      pdf: NNPDF31_nlo_as_0118
    - theoryid: 53
      pdf: NNPDF31_nlo_as_0118



template_text: |
    {@plot_fancy@}

actions_:
    - report(main=True)

@Zaharid
Contributor Author

Zaharid commented Jan 20, 2021

@tgiani could you perhaps test this by running some kind of fit?
@RosalynLP @wilsonmr could you see if there is an obviously better way to do this?
@enocera does the logic of the cuts look sound to you?

@Zaharid
Contributor Author

Zaharid commented Jan 21, 2021

A test case that didn't work and works now is

experiments:
  - experiment: N
    datasets:  [{dataset: NMC},  {dataset: ATLAS1JET11}]

theoryid: 53
pdf: NNPDF31_nnlo_as_0118

use_cuts: fromsimilarpredictions
cut_similarity_threshold: 0.99

cuts_intersection_spec:
    - theoryid: 52
      pdf: NNPDF31_nlo_as_0118
    - theoryid: 53
      pdf: NNPDF31_nlo_as_0118



template_text: |
    {@experiments::experiment plot_fancy@}

actions_:
    - report(main=True)

@Zaharid Zaharid added the run-fit-bot Starts fit bot from a PR. label Jan 21, 2021
@Zaharid Zaharid added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 21, 2021
@Zaharid
Contributor Author

Zaharid commented Jan 21, 2021

@scarrazza @scarlehoff the fit failure doesn't seem to be my fault. Would you have a look?

@scarlehoff
Member

scarlehoff commented Jan 21, 2021

Interesting, the fit is in the server. How long does it take the server to update the list of fits? Maybe it would be good to add a few seconds of sleep to avoid race conditions there? @scarrazza

@scarrazza
Member

I think there is some bug in the script which updates the json containing the list of fits, sometimes I had to restart it manually. @Zaharid any idea?

@Zaharid
Contributor Author

Zaharid commented Jan 22, 2021

Open a new issue describing the problem and I'll have a look. In any case there is #223.

Comment thread validphys2/src/validphys/config.py Outdated
    raise ValueError("Expecting cuts to be the same for all datasets")
self.inputs = inputs
self.threshold = threshold
super().__init__(self.inputs, self.threshold)
Contributor

@Zaharid just a general question about why we are inheriting from TupleComp here and in the other related classes because I can't see anywhere where the methods from TupleComp are used?

Contributor

wilsonmr commented Jan 25, 2021

The methods of TupleComp mean, amongst other things, that the subclasses are hashable. For example, you can create two instances with the same arguments and you will see that they are equal. See with PDF:

>>> from validphys.core import PDF
>>> pdf1 = PDF("foo")
>>> pdf2 = PDF("foo")
>>> pdf1 == pdf2
True

Which is useful because as far as we are concerned in this case a PDF is fully defined by its name (there should only be one NNPDF31_nnlo_as_0118 and that's all I need to know to get the correct pdf from the LHAPDF share folder).

Answering where this gets used: probably by the reportengine resource builder at some point, to find unique dependencies (but I don't know; Zahari can answer). However, all the classes which have their load method decorated with @functools.lru_cache must also be hashable, because the lru_cache stores the result of calling a function with a given set of arguments, and to check whether it has already cached a result it needs to be able to tell whether the arguments are the same. Since load always takes self as the first positional argument, you need to be able to check that self is indeed the same across calls, which TupleComp facilitates.

There is also another double underscore method which means that classes print nicely:

>>> pdf1
PDF(name='foo')
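
A minimal sketch of the pattern TupleComp provides (illustrative, not the actual validphys.core implementation):

class TupleCompLike:
    # Identity is defined by the tuple of constructor arguments.
    def __init__(self, *comp):
        self.comp = comp

    def __eq__(self, other):
        return type(self) is type(other) and self.comp == other.comp

    def __hash__(self):
        return hash(self.comp)

    def __repr__(self):
        # The "other double underscore method": nice printing.
        return f"{type(self).__name__}({', '.join(map(repr, self.comp))})"

class MyPDF(TupleCompLike):
    def __init__(self, name):
        self.name = name
        super().__init__(name)

assert MyPDF("foo") == MyPDF("foo")  # equal (and hashable) by name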

Contributor

Oh and I should say by default the first thing I said isn't the case:

>>> class TestClass:
...  def __init__(self, a, b): self.a = a; self.b = b
... 
>>> cls1 = TestClass(1, 2)
>>> cls2 = TestClass(1, 2)
>>> cls1 == cls2
False

Perhaps I misunderstood something but that's my understanding anyway

Contributor Author

TupleComp is mostly a simple-minded attempt to implement something like dataclasses from the standard library before they existed (and with only the required properties). I would probably use those nowadays, despite some unwelcome complexity (and in fact this is already done in various newer interfaces, such as the fkparser). Note there is #408.
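
For comparison, a frozen dataclass gives the equality, hashing, and nice repr for free (a sketch of the equivalent, not how validphys actually defines its classes):

from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True (with the default eq=True) makes instances hashable
class PDF:
    name: str

assert PDF("foo") == PDF("foo")
print(PDF("foo"))  # PDF(name='foo')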

Contributor

Thank you both!

Contributor

wilsonmr left a comment

I can't help but feel like the cuts should be an explicit node because there is a lot of functionality in the production rule.

The main hurdle is that loading data with the similarity cuts would require matched cuts -> dataset with matched cuts -> similarity cuts -> dataset with similarity cuts.

I suppose this could be achieved by creating a prod rule which set use_cuts: fromintersection, then collected the dataset over that and then the similarity cuts would take that as input. I'm not sure in the end whether this would actually make the code better but since the original comment talks about discussing a more reportengine idiomatic way of doing this, that's the best I can come up with.

@Zaharid
Contributor Author

Zaharid commented Jan 25, 2021

> I can't help but feel like the cuts should be an explicit node because there is a lot of functionality in the production rule.
>
> The main hurdle is that loading data with the similarity cuts would require matched cuts -> dataset with matched cuts -> similarity cuts -> dataset with similarity cuts.
>
> I suppose this could be achieved by creating a prod rule which set use_cuts: fromintersection, then collected the dataset over that and then the similarity cuts would take that as input. I'm not sure in the end whether this would actually make the code better but since the original comment talks about discussing a more reportengine idiomatic way of doing this, that's the best I can come up with.

This was the first thing I tried, and I found it to be unworkable, even with some pretty ideal version of reportengine that was designed to handle explicit nodes from the start. There is a bigger problem, namely that at the moment we have cuts, which is a production rule, and dataset, which is also a production rule. If cuts was made a node instead then everything that depends on it would have to follow, which in turn would break a lot of existing code (e.g. we couldn't have checks that depend on dataset). This alone makes it impossible (AFAICT), but with the current version of reportengine one would also need to come up with the proper namespaces for these things (since the nodes currently depend on that), and that is also not so easy (because of the dataset > cuts > dataset thing). As you say, it can be done with enough hacks (a collect over the intersection namespace), but it is not pretty, and when I did it, I realized that things collapse massively when one tries to use dataset.

I think in order to have this behaviour one would need to be able to say: "I want this node to be attached to all the nodes where production rules containing cuts are used as input, and attach its result to the load method". Which sounds... hard, and I don't think it's something a mythological reportengine 1.0 would support. In addition, overall ISTM that the current method is rather simpler than all that and only comes at the theoretical cost of not taking advantage of some previous computation (which at the moment we can't handle better than with an lru_cache anyway).

@wilsonmr
Contributor

> There is a bigger problem, namely that at the moment we have cuts, which is a production rule, and dataset, which is also a production rule. If cuts was made a node instead then everything that depends on it would have to follow, which in turn would break a lot of existing code (e.g. we couldn't have checks that depend on dataset).

Right, I hadn't even thought of that! Ok well from my perspective this gets an approval, subject to the test fits.

@Zaharid
Contributor Author

Zaharid commented Jan 25, 2021

As a side note, a possible workaround is to give nodes the ability to present themselves as "fake compile-time objects", which would be useful for check purposes. It would also be useful in a context like:

# action

# Mostly making things up here
def fake_rw(reweighted_name, output_path, pdf, magic_context):
    spec = make_pdf_spec(name=reweighted_name, path=f"{output_path}/pdfs", **pdf.to_spec())
    magic_context["fake_pdfs"][reweighted_name] = spec
    return spec

@fake(fake_rw)
def reweighted_pdf(pdf, reweighted_name):
    """Produce pdf set"""

# runcard
Rw:
    pdf: NNPDF
    reweighted_name: reweighted

Plots:
    Q: 10
    pdfs:
      - NNPDF
      - reweighted

actions_:
   - Rw reweighted_pdf
   - Plots plot_pdfs

Then every production rule touched by this would become a node that sets itself as "fake".
Not really sure the complexity is worth the trouble.

@Zaharid Zaharid added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 27, 2021
@scarlehoff scarlehoff mentioned this pull request Jan 27, 2021
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@Zaharid
Contributor Author

Zaharid commented Jan 27, 2021

@scarlehoff I am confused, should I expect any differences at all here?

@scarlehoff
Member

This is why I did #1025, to reduce the fluctuations a bit; the change to 1.37 is expected.

@Zaharid
Contributor Author

Zaharid commented Jan 29, 2021

Right. So it seems that this needs to be tweaked a bit: specifically, it should look at the ratio between the difference in predictions and the experimental uncertainty. This could be done with pretty much the same structure, but changing the .load() function.
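
A sketch of the tweaked criterion, assuming per-point experimental uncertainties alongside the two prediction arrays (illustrative names, not the actual implementation):

import numpy as np

def similarity_mask(preds_a, preds_b, exp_uncertainty, threshold):
    # Keep points where the prediction difference is small relative to
    # the experimental uncertainty.
    return np.abs(preds_a - preds_b) / exp_uncertainty < threshold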

@Zaharid
Contributor Author

Zaharid commented Feb 3, 2021

This is even more complicated: we need different data specifications for the NLO and NNLO namespaces, because of the cfactors.

@Zaharid
Contributor Author

Zaharid commented Feb 5, 2021

Right, this is not going to be particularly pretty (example runcard https://vp.nnpdf.science/5m_N3Dl-SAO9W__2rvlIrw==/input/runcard.yaml) and certainly not fast, but it works in a reproducible way.

Zaharid and others added 6 commits February 10, 2021 14:37
Add a cut option that throws away data for which predictions in two
different namespaces are too different, specifically those that do not fulfill

(threshold < ratio) & (ratio < 1/threshold)

where the threshold is set as the `cut_similarity_threshold` parameter
in the runcard. The two namespaces are defined by the
cuts_intersection_spec list.

Some discussion:

  - This isn't the smartest code I have ever written.

  - The use of central_predictions inside load is a bit of a hack, but
  it is not clear to me it can be done much better in a more
  "reportengine idiomatic" style. The logic is rather convoluted if you
  sit down to think about it. In any case validphys.convolution is too
  convenient for this to use anything else.

  - Moreover central_predictions is slow because it needs to load
  fktables as CSV. We should probably stick lru_cache calls in various
  places.

  - Tests and documentation are missing at the moment.
Use from_: None with the right namespace override instead of calling the
production rule directly.

There are a few reasons for this:

  - The rather trivial problem that the cut_similarity_threshold
  parameter was not propagated because it was not being passed to the
  function call. The alternative would have been to add the parameter
  everywhere `cuts` is required.

  - The more serious problem that we do want to resolve things like
  dataset with some funky namespace settings. For this to work properly
  we need to execute the production rule within the right context,
  namely one containing the relevant dataset_input within each
  experiment/data_input. Alternatives are not clear since the
  construction of `dataset` for the similarity cuts requires inspecting
  for the cuts.
Trust automated mechanisms to resolve each dataset given the correct
context instead of resolving some dependencies manually for no
particular reason. This also allows cleaning up the signatures.

Take advantage of the code to process `data` for the handling of
experiments.
Co-authored-by: wilsonmr <33907451+wilsonmr@users.noreply.github.com>
Read datasets from a dataset_inputs list defined for each of the
namespaces. Currently this is much slower than it could be (since we are
parsing the list every time) but I cannot think of a good way of doing
it quickly.
Comment thread validphys2/src/validphys/config.py Outdated
):
    _, ds = self.parse_from_(None, "dataset", write=False)
    _, pdf = self.parse_from_(None, "pdf", write=False)
    print(ds, ds.fkspecs[0].cfactors)
Contributor


you seem quite determined to include a print here

Comment thread validphys2/src/validphys/config.py Outdated
    di = next(d for d in dins if d.name == name)
except StopIteration as e:
    raise ConfigError(
        f"cuts_intersection_spec dataset inputs must define {name}"
Contributor


Seeing as the input is quite long and a bit hard to parse, I wonder if it might help to enumerate the namespaces and say which one is raising this error?
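
A sketch of what that could look like, reusing the names from the snippet above (the loop and its variable names are assumptions for illustration):

# Hypothetical: enumerate the namespaces so the error says which one failed.
for i, dins in enumerate(namespace_dataset_inputs):
    try:
        di = next(d for d in dins if d.name == name)
    except StopIteration as e:
        raise ConfigError(
            f"cuts_intersection_spec namespace {i}: "
            f"dataset inputs must define {name}"
        ) from e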

ns = self._curr_ns.new_child(
    {
        "dataset_input": di,
        "use_cuts": CutsPolicy.FROM_CUT_INTERSECTION_NAMESPACE,
Contributor

can we not just set "cuts": matched_cuts, or does that break something?

Contributor

seems suboptimal to have to reproduce the exact same set of cuts, especially considering we saved them earlier, but I don't know about patching objects into the namespace directly.

Contributor

wilsonmr left a comment

left some minor comments. Overall it is a bit ugly but I also think it's the most logical way to reliably do this.

Zaharid added 3 commits March 3, 2021 14:16
Improve error message a bit and re-use the already computed cuts.
This is added to the guide, which is otherwise outdated, but this stuff
isn't documented anywhere else at the moment.
@Zaharid
Contributor Author

Zaharid commented Mar 3, 2021

Ok, I would be happy for this to be merged and not happy to think about it more.

@scarrazza scarrazza merged commit 10e7b8d into master Mar 3, 2021
@scarrazza scarrazza deleted the morecuts branch March 3, 2021 15:52