
Add similarity cuts#1057

Merged
scarrazza merged 9 commits into master from morecuts on Mar 3, 2021

Conversation

@Zaharid
Contributor

Zaharid commented Jan 20, 2021

Add a cut option that throws away data for which the predictions in two
different namespaces are too different. This uses the existing
cuts-from-intersection functionality and further refines it to filter out
points that do not fulfill

(threshold < ratio) & (ratio < 1/threshold)

where the threshold is set by the cut_similarity_threshold parameter
in the runcard. The two namespaces are defined by the
cuts_intersection_spec list, and the ratio is the prediction in the first
namespace over that in the second.
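
A minimal sketch of the resulting selection, assuming two arrays of central predictions for the same points (illustrative names, not the actual validphys code):

import numpy as np

def similarity_mask(preds_a, preds_b, threshold):
    # Keep points where the ratio lies in (threshold, 1/threshold);
    # assumes 0 < threshold < 1.
    ratio = preds_a / preds_b
    return (threshold < ratio) & (ratio < 1 / threshold)

# With threshold 0.99 only near-identical predictions survive:
similarity_mask(np.array([1.0, 1.5]), np.array([1.005, 1.0]), 0.99)
# -> array([ True, False])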

Some discussion:

  • This isn't the smartest code I have ever written.

  • I am using actual predictions instead of cfactors. This is because cfactors are incidental rather than physically meaningful, but more importantly because the cfactor approach starts looking very cumbersome once one considers things like compound observables. In the end it pays off to do the job properly and solve the problem in general.

  • The use of central_predictions inside load is a bit of a hack, but
    it is not clear to me it can be done much better in a more
    "reportengine idiomatic" style. The logic is rather convoluted if you
    sit down to think about it. In any case validphys.convolution is too
    convenient for this to use anything else.

  • Moreover central_predictions is slow because it needs to load
    fktables as CSV. We should probably stick lru_cache calls in various
    places (see the sketch after this list).

  • Tests and documentation are missing at the moment.
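
A sketch of the caching idea (illustrative names, not the actual validphys code). Note that anything cached this way must be hashable, which ties into the TupleComp discussion below:

import functools

class FKTableSpec:
    # Toy spec: identity defined by the path, so equal specs share a cache entry.
    def __init__(self, path):
        self.path = path

    def __eq__(self, other):
        return isinstance(other, FKTableSpec) and self.path == other.path

    def __hash__(self):
        return hash(self.path)

    @functools.lru_cache(maxsize=None)
    def load(self):
        print(f"parsing {self.path}")  # stand-in for the slow CSV parsing
        return {"path": self.path}

FKTableSpec("data.csv").load()
FKTableSpec("data.csv").load()  # cache hit: parsing happens only once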

@Zaharid
Contributor Author

Zaharid commented Jan 20, 2021

Example usage

dataset_input: {dataset: NMC}

theoryid: 53
pdf: NNPDF31_nnlo_as_0118

use_cuts: fromsimilarpredictions
cut_similarity_threshold: 0.99

cuts_intersection_spec:
    - theoryid: 52
      pdf: NNPDF31_nlo_as_0118
    - theoryid: 53
      pdf: NNPDF31_nlo_as_0118



template_text: |
    {@plot_fancy@}

actions_:
    - report(main=True)

@Zaharid
Contributor Author

Zaharid commented Jan 20, 2021

@tgiani could you perhaps test this by running some kind of fit?
@RosalynLP @wilsonmr could you see if there is an obviously better way to do this?
@enocera does the logic of the cuts look sound to you?

@Zaharid
Contributor Author

Zaharid commented Jan 21, 2021

A test case that didn't work and works now is

experiments:
  - experiment: N
    datasets:  [{dataset: NMC},  {dataset: ATLAS1JET11}]

theoryid: 53
pdf: NNPDF31_nnlo_as_0118

use_cuts: fromsimilarpredictions
cut_similarity_threshold: 0.99

cuts_intersection_spec:
    - theoryid: 52
      pdf: NNPDF31_nlo_as_0118
    - theoryid: 53
      pdf: NNPDF31_nlo_as_0118



template_text: |
    {@experiments::experiment plot_fancy@}

actions_:
    - report(main=True)

@Zaharid Zaharid added the run-fit-bot Starts fit bot from a PR. label Jan 21, 2021
@Zaharid Zaharid added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 21, 2021
@Zaharid
Contributor Author

Zaharid commented Jan 21, 2021

@scarrazza @scarlehoff the fit failure doesn't seem to be my fault. Would you have a look?

@scarlehoff
Member

scarlehoff commented Jan 21, 2021

Interesting, the fit is in the server. How long does it take the server to update the list of fits? Maybe it would be good to add a few seconds of sleep to avoid race conditions there? @scarrazza

@scarrazza
Member

I think there is some bug in the script which updates the json containing the list of fits, sometimes I had to restart it manually. @Zaharid any idea?

@Zaharid
Contributor Author

Zaharid commented Jan 22, 2021

Open a new issue describing the problem and I'll have a look. In any case there is #223.

Comment thread validphys2/src/validphys/config.py Outdated
    raise ValueError("Expecting cuts to be the same for all datasets")
self.inputs = inputs
self.threshold = threshold
super().__init__(self.inputs, self.threshold)
Contributor

@Zaharid just a general question about why we are inheriting from TupleComp here and in the other related classes because I can't see anywhere where the methods from TupleComp are used?

Contributor

wilsonmr commented Jan 25, 2021

The methods of TupleComp mean, amongst other things, that the subclasses are hashable. For example, you can create two instances with the same arguments and you will see that they are equal. See with PDF:

>>> from validphys.core import PDF
>>> pdf1 = PDF("foo")
>>> pdf2 = PDF("foo")
>>> pdf1 == pdf2
True

Which is useful because as far as we are concerned in this case a PDF is fully defined by its name (there should only be one NNPDF31_nnlo_as_0118 and that's all I need to know to get the correct pdf from the LHAPDF share folder).

Answering where this gets used: probably by the reportengine resource builder at some point, to find unique dependencies (but I don't know; Zahari can answer). However, all the classes which have their load method decorated with @functools.lru_cache must also be hashable, because the lru_cache stores the result of calling a function with a given set of arguments, and to check whether it has already cached a result it needs to be able to tell whether the arguments are the same. Since load always takes self as the first positional argument, you need to be able to check that self is indeed the same across calls, which TupleComp facilitates.

There is also another double underscore method which means that classes print nicely:

>>> pdf1
PDF(name='foo')
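
A minimal sketch of the pattern TupleComp provides (illustrative, not the actual validphys.core implementation):

class TupleCompLike:
    # Identity is defined by the tuple of constructor arguments.
    def __init__(self, *comp):
        self.comp = comp

    def __eq__(self, other):
        return type(self) is type(other) and self.comp == other.comp

    def __hash__(self):
        return hash(self.comp)

    def __repr__(self):
        # The "other double underscore method": nice printing.
        return f"{type(self).__name__}({', '.join(map(repr, self.comp))})"

class MyPDF(TupleCompLike):
    def __init__(self, name):
        self.name = name
        super().__init__(name)

assert MyPDF("foo") == MyPDF("foo")  # equal (and hashable) by name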

Contributor

Oh and I should say by default the first thing I said isn't the case:

>>> class TestClass:
...  def __init__(self, a, b): self.a = a; self.b = b
... 
>>> cls1 = TestClass(1, 2)
>>> cls2 = TestClass(1, 2)
>>> cls1 == cls2
False

Perhaps I misunderstood something but that's my understanding anyway

Contributor Author

TupleComp is mostly a simple-minded attempt to implement something like dataclasses from the standard library before they existed (and with only the required properties). I would probably use those nowadays, despite some unwelcome complexity (and in fact this is already done in various newer interfaces, such as the fkparser). Note there is #408.
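
For comparison, a frozen dataclass gives the equality, hashing, and nice repr for free (a sketch of the equivalent, not how validphys actually defines its classes):

from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True (with the default eq=True) makes instances hashable
class PDF:
    name: str

assert PDF("foo") == PDF("foo")
print(PDF("foo"))  # PDF(name='foo')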

Contributor

Thank you both!

Contributor

wilsonmr left a comment

I can't help but feel like the cuts should be an explicit node because there is a lot of functionality in the production rule.

The main hurdle is that loading data with the similarity cuts would require matched cuts -> dataset with matched cuts -> similarity cuts -> dataset with similarity cuts.

I suppose this could be achieved by creating a prod rule which set use_cuts: fromintersection, then collected the dataset over that and then the similarity cuts would take that as input. I'm not sure in the end whether this would actually make the code better but since the original comment talks about discussing a more reportengine idiomatic way of doing this, that's the best I can come up with.

@Zaharid
Contributor Author

Zaharid commented Jan 25, 2021

> I can't help but feel like the cuts should be an explicit node because there is a lot of functionality in the production rule.
>
> The main hurdle is that loading data with the similarity cuts would require matched cuts -> dataset with matched cuts -> similarity cuts -> dataset with similarity cuts.
>
> I suppose this could be achieved by creating a prod rule which set use_cuts: fromintersection, then collected the dataset over that and then the similarity cuts would take that as input. I'm not sure in the end whether this would actually make the code better but since the original comment talks about discussing a more reportengine idiomatic way of doing this, that's the best I can come up with.

This was the first thing I tried, and I found it to be unworkable, even with some pretty ideal version of reportengine that was designed to handle explicit nodes from the start. There is a bigger problem, namely that at the moment we have cuts, which is a production rule, and dataset, which is also a production rule. If cuts was made a node instead then everything that depends on it would have to follow, which in turn would break a lot of existing code (e.g. we couldn't have checks that depend on dataset). This alone makes it impossible (AFAICT), but with the current version of reportengine one would also need to come up with the proper namespaces for these things (since the nodes currently depend on that), and that is also not so easy (because of the dataset > cuts > dataset thing). As you say, it can be done with enough hacks (a collect over the intersection namespace), but it is not pretty, and when I did it, I realized that things collapse massively when one tries to use dataset.

I think in order to have this behaviour one would need to be able to say: "I want this node to be attached to all the nodes where production rules containing cuts are used as input, and attach its result to the load method". Which sounds... hard, and I don't think it's something a mythological reportengine 1.0 would support. In addition, overall ISTM that the current method is rather simpler than all that and only comes at the theoretical cost of not taking advantage of some previous computation (which at the moment we can't handle better than with an lru_cache anyway).

@wilsonmr
Contributor

> There is a bigger problem, namely that at the moment we have cuts, which is a production rule, and dataset, which is also a production rule. If cuts was made a node instead then everything that depends on it would have to follow, which in turn would break a lot of existing code (e.g. we couldn't have checks that depend on dataset).

Right, I hadn't even thought of that! Ok well from my perspective this gets an approval, subject to the test fits.

@Zaharid
Contributor Author

Zaharid commented Jan 25, 2021

As a side note, a possible workaround is to give nodes the ability to present themselves as "fake compile-time objects", which would be useful for check purposes. It would also be useful in a context like:

# action

# Mostly making things up here
def fake_rw(reweighted_name, output_path, pdf, magic_context):
    spec = make_pdf_spec(name=reweighted_name, path=f"{output_path}/pdfs", **pdf.to_spec())
    magic_context["fake_pdfs"][reweighted_name] = spec
    return spec

@fake(fake_rw)
def reweighted_pdf(pdf, reweighted_name):
    """Produce pdf set"""

# runcard
Rw:
    pdf: NNPDF
    reweighted_name: reweighted

Plots:
    Q: 10
    pdfs:
      - NNPDF
      - reweighted

actions_:
   - Rw reweighted_pdf
   - Plots plot_pdfs

Then every production rule touched by this would become a node that sets itself as "fake".
Not really sure the complexity is worth the trouble.

@Zaharid Zaharid added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 27, 2021
@scarlehoff scarlehoff mentioned this pull request Jan 27, 2021
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@Zaharid
Contributor Author

Zaharid commented Jan 27, 2021

@scarlehoff I am confused, should I expect any differences at all here?

@scarlehoff
Member

This is why I did #1025, to reduce the fluctuations a bit; the change to 1.37 is expected.

@Zaharid
Contributor Author

Zaharid commented Jan 29, 2021

Right. So it seems that this needs to be tweaked a bit: specifically, it should look at the ratio between the difference in predictions and the experimental uncertainty. This could be done with pretty much the same structure, but changing the .load() function.
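
A sketch of the tweaked criterion, assuming per-point experimental uncertainties alongside the two prediction arrays (illustrative names, not the actual implementation):

import numpy as np

def similarity_mask(preds_a, preds_b, exp_uncertainty, threshold):
    # Keep points where the prediction difference is small relative to
    # the experimental uncertainty.
    return np.abs(preds_a - preds_b) / exp_uncertainty < threshold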

@Zaharid
Contributor Author

Zaharid commented Feb 3, 2021

This is even more complicated: we need different data specifications for the NLO and NNLO namespaces, because of the cfactors.

@Zaharid
Contributor Author

Zaharid commented Feb 5, 2021

Right, this is not going to be particularly pretty (example runcard https://vp.nnpdf.science/5m_N3Dl-SAO9W__2rvlIrw==/input/runcard.yaml) and certainly not fast, but it works in a reproducible way.

Zaharid and others added 6 commits February 10, 2021 14:37
Add a cut option that throws away data for which predictions in two
different namespaces are too different, specifically those that do not fulfill

(threshold < ratio) & (ratio < 1/threshold)

where the threshold is set as the `cut_similarity_threshold` parameter
in the runcard. The two namespaces are defined by the
cuts_intersection_spec list.

Some discussion:

  - This isn't the smartest code I have ever written.

  - The use of central_predictions inside load is a bit of a hack, but
  it is not clear to me it can be done much better in a more
  "reportengine idiomatic" style. The logic is rather convoluted if you
  sit down to think about it. In any case validphys.convolution is too
  convenient for this to use anything else.

  - Moreover central_predictions is slow because it needs to load
  fktables as CSV. We should probably stick lru_cache calls in various
  places.

  - Tests and documentation are missing at the moment.
Use from_: None with the right namespace override instead of calling the
production rule directly.

There are a few reasons for this:

  - The rather trivial problem that the cut_similarity_threshold
  parameter was not propagated because it was not being passed to the
  function call. The alternative would have been to add the parameter
  everywhere `cuts` is required.

  - The more serious problem that we do want to resolve things like
  dataset with some funky namespace settings. For this to work properly
  we need to execute the production rule within the right context,
  namely one containing the relevant dataset_input within each
  experiment/data_input. Alternatives are not clear since the
  construction of `dataset` for the similarity cuts requires inspecting
  for the cuts.
Trust automated mechanisms to resolve each dataset given the correct
context instead of resolving some dependencies manually for no
particular reason. This also allows cleaning up the signatures.

Take advantage of the code to process `data` for the handling of
experiments.
Co-authored-by: wilsonmr <33907451+wilsonmr@users.noreply.github.com>
Read datasets from a dataset_inputs list defined for each of the
namespaces. Currently this is much slower than it could be (since we are
parsing the list every time) but I cannot think of a good way of doing
it quickly.
Comment thread validphys2/src/validphys/config.py Outdated
):
    _, ds = self.parse_from_(None, "dataset", write=False)
    _, pdf = self.parse_from_(None, "pdf", write=False)
    print(ds, ds.fkspecs[0].cfactors)
Contributor


you seem quite determined to include a print here

Comment thread validphys2/src/validphys/config.py Outdated
    di = next(d for d in dins if d.name == name)
except StopIteration as e:
    raise ConfigError(
        f"cuts_intersection_spec dataset inputs must define {name}"
Contributor


Seeing as the input is quite long and a bit hard to parse, I wonder if it might help to enumerate the namespaces and say which one is raising this error?
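
A sketch of what that could look like, reusing the names from the snippet above (the loop and its variable names are assumptions for illustration):

# Hypothetical: enumerate the namespaces so the error says which one failed.
for i, dins in enumerate(namespace_dataset_inputs):
    try:
        di = next(d for d in dins if d.name == name)
    except StopIteration as e:
        raise ConfigError(
            f"cuts_intersection_spec namespace {i}: "
            f"dataset inputs must define {name}"
        ) from e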

ns = self._curr_ns.new_child(
    {
        "dataset_input": di,
        "use_cuts": CutsPolicy.FROM_CUT_INTERSECTION_NAMESPACE,
Contributor

can we not just set "cuts": matched_cuts, or does that break something?

Contributor

seems suboptimal to have to reproduce the exact same set of cuts, especially considering we saved them earlier, but I don't know about patching objects into the namespace directly.

Contributor

wilsonmr left a comment

left some minor comments. Overall it is a bit ugly but I also think it's the most logical way to reliably do this.

Zaharid added 3 commits March 3, 2021 14:16
Improve error message a bit and re-use the already computed cuts.
This is added to the guide, which is otherwise outdated, but this stuff
isn't documented anywhere else at the moment.
@Zaharid
Contributor Author

Zaharid commented Mar 3, 2021

Ok, I would be happy for this to be merged and not happy to think about it more.

@scarrazza scarrazza merged commit 10e7b8d into master Mar 3, 2021
@scarrazza scarrazza deleted the morecuts branch March 3, 2021 15:52