MakeReplica in python #866
```python
import numpy as np
import NNPDF
from validphys.covmats import make_replica
from validphys.loader import Loader

l = Loader()
for dataset in l.available_datasets:
    try:
        cd = l.check_commondata(dataset)
        ds = l.check_dataset(dataset, theoryid=53, cuts=None)
        exp = l.check_experiment('foo', [ds])
        ld = exp.load()
        # seed the C++ RNG, generate a replica, and check that the
        # python code reproduces it exactly
        NNPDF.RandomGenerator.InitRNG(0, 1)
        ld.MakeReplica()
        new_code = make_replica(cd)
        np.testing.assert_allclose(ld.get_cv(), new_code)
    except FileNotFoundError:
        continue
```

seems to be working
Will take a look soon! Out of interest, have you tested the difference in speed between the python and C++?
Yeah, I checked very briefly. It's difficult to make an apples-to-apples comparison because the question is what exactly you time. In general it's fairly comparable, but it's not blisteringly quicker. Btw, I may have been premature with the review requests; maybe just wait until the next commit, since I've yet to sort out the docstring.
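For anyone who wants to reproduce the comparison, a rough timing harness could look like the sketch below. It reuses `ld`, `cd`, and `make_replica` from the snippet earlier in the thread; the call count is arbitrary and the setup is an assumption, not the actual benchmark that was run.

```python
# Rough speed comparison between the C++ and python replica generation.
# Assumes `ld`, `cd`, and `make_replica` are set up as in the snippet above.
import timeit

cpp_time = timeit.timeit(ld.MakeReplica, number=100)
py_time = timeit.timeit(lambda: make_replica(cd), number=100)
print(f"C++: {cpp_time:.3f}s, python: {py_time:.3f}s (100 calls each)")
```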
Ok, this is ready for review now. I was wondering: what do people think about having some sort of temporary tests we can use to ensure the python replacement replicates the old C++ code? We could then delete these tests once we fully switch to python.
I don't think it would hurt to have something like that, so I'd be for it. We have something kind of similar here: https://github.com/NNPDF/nnpdf/blob/master/validphys2/src/validphys/tests/test_covmats.py in that we currently test that both the C++ and the python work, where one of the tests will eventually be deleted. I know what you suggest is slightly different, but the point is that we already have tests that we plan on deleting.
More tests don't hurt, especially in that they increase coverage.
I like how the code is nice and well commented. It will certainly help any effort to make it fast.
Ah yes, very good point! I forgot we implemented that test. Ok sure, I guess we can add a new test altogether that does all the comparisons between C++ and python. I'll do this in another PR so they're all kept together.
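For concreteness, such a temporary comparison test might look roughly like the sketch below. This is a hypothetical shape, not the actual contents of test_covmats.py; it reuses the loader calls from the snippet earlier in the thread and pins the dataset to NMC for speed.

```python
# Hypothetical temporary test: the python make_replica should reproduce the
# C++ MakeReplica exactly when both use the same RNG seed. To be deleted
# once the C++ code is gone.
import numpy as np
import NNPDF
from validphys.covmats import make_replica
from validphys.loader import Loader


def test_make_replica_matches_cpp():
    l = Loader()
    cd = l.check_commondata("NMC")
    ds = l.check_dataset("NMC", theoryid=53, cuts=None)
    ld = l.check_experiment("foo", [ds]).load()
    NNPDF.RandomGenerator.InitRNG(0, 1)
    ld.MakeReplica()
    np.testing.assert_allclose(ld.get_cv(), make_replica(cd))
```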
FWIW there is a 30-page paper (that I have no patience to read) on how to sample from a truncated Gaussian: https://www4.stat.ncsu.edu/~boos/library/papers/mimeo2649_Li.pdf On a somewhat related note, the fact that this is so complicated makes me think it's not such a great idea: if we are imposing positivity in the PDFs anyway, why are we also doing it in the sampling? More importantly, does it make any difference other than increasing the complexity and the surface for bugs?
I've quickly glanced at that paper and it certainly looks like it could be fun to take a look at. I just worry how easy it would be to accommodate some of the edge cases we have going on. Indeed, that's what forced me to give up on trying to understand the papers and to blindly translate the C++ code instead. I've also wondered whether having positive pseudodata is essential. It'd be interesting to see how many times the algorithm stumbles across a sample that fails this criterion. Even if it does, it could be that the fit doesn't really care too much. I guess the only way to answer that question would be to construct pseudodata that has negative entries and then run a fit to see.
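For context, the simple-minded alternative to the truncated-Gaussian machinery is plain rejection sampling: redraw the whole replica until every entry is positive, and count how often you had to retry. A toy sketch of that idea follows; the plain Gaussian draw stands in for the full shift machinery, and none of the names are from the actual code.

```python
import numpy as np


def sample_positive_replica(cv, covmat, rng, max_tries=10_000):
    """Toy rejection sampler: redraw correlated pseudodata until all
    entries are positive, and report how many draws were rejected."""
    for tries in range(1, max_tries + 1):
        replica = rng.multivariate_normal(cv, covmat)
        if (replica >= 0).all():
            return replica, tries - 1  # number of rejected draws
    raise RuntimeError("no positive replica found within max_tries")


rng = np.random.default_rng(0)
cv = np.array([1.0, 0.5, 0.1])  # toy central values, last one close to zero
covmat = np.diag([0.1, 0.1, 0.2]) ** 2
replica, rejected = sample_positive_replica(cv, covmat, rng)
print(replica, f"({rejected} draws rejected)")
```

Counting the rejections would directly answer the question of how often the algorithm stumbles across a sample that fails the criterion.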
Just a quick series of commits that makes
@tgiani would you be able to take another look at this, since you have experience with the replica generation in C++?
I double checked and what I was saying at the code meeting is true: we do not generate shifts for

Looking at datasets which performed badly on the KS test: ATLAS ZpT rapidity and WZ rapidity.

@RosalynLP how exactly did you generate the old data replicas? It occurs to me that because

Provided that's all fine, and that there were no errors in matching data point indices between new and old replicas, then one suggestion is that the purely multiplicative sets might require more replicas to get a stable result.
Yes sure, I'll look at this before the next code meeting.
Yeah, if you look at #1087 and line 43 of mc_gen_checks.py, a copy of the
Sure, I was more wondering if all of the results presented were definitely produced using that code or equivalent.
@wilsonmr I used adapted versions of those in
Great, just checking. The test isn't particularly simple here, so when the results look weird it seems worth looking for errors on both ends 😆

I just tried generating for just ATLASWZRAP with both codes and I get similar results (expected, since the ATLAS-wide systematic is hardly going to be dominant here). I'm trying with 10000 reps now, since it doesn't take as long as a global generation. I'm also producing the histograms for each bin. On the first pass with 1000 reps the histograms looked close enough to me. Results for 10000 reps should be ready in 15 mins.

EDIT: I should say I also picked 10000 because it uses a different method in auto mode of ks_2samp.
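For anyone reproducing this comparison, the check is essentially scipy's two-sample KS test applied data point by data point. A sketch with toy replica matrices; in practice `old` and `new` would be the (nreps, ndata) pseudodata arrays produced by the two codes.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# toy stand-ins for the (nreps, ndata) replica matrices from the two codes
old = rng.normal(loc=1.0, scale=0.1, size=(10_000, 34))
new = rng.normal(loc=1.0, scale=0.1, size=(10_000, 34))

for i in range(old.shape[1]):
    # in "auto" mode scipy switches to the asymptotic distribution for
    # samples this large, hence the remark about 10000 reps above
    stat, pval = ks_2samp(old[:, i], new[:, i])
    print(f"datapoint {i}: KS = {stat:.4f}, p = {pval:.3f}")
```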
Huh, so the KS statistic with 10000 reps was about 0.025, which seems pretty low, but the p-value was pretty similar to your plot. Having said that, look at the distributions for each bin: I really don't think there is a problem here. So I think we can stop worrying. https://vp.nnpdf.science/J2wE4X3jRuWGaBjHZXKM-A==
@wilsonmr thanks, yeah, this looks good re the bin-by-bin distributions. To me this check makes more intuitive sense than the p-values do.
Is each histogram a bin? Forgive my ignorance, but I'm a bit confused about what the x axis refers to here, or what qualifies as a count for the histogram.
Each histogram is for a data point, which I was referring to as a bin. ATLASWZRAP11 has 34 data points, so there are 34 histograms. The histograms are made up of 10000 replicas, and the x axis is the raw value of that particular data point. So in the first histogram there were ~550 replicas whose first data point had a value of ~580000.

In other words, each histogram is an unnormalised approximation of the (marginalised*) distribution from which the pseudodata has been drawn.

*Marginalised because we are looking data point by data point, so each plot is as though we have integrated over all other data points.
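In code, the plots amount to roughly the following (a sketch using a toy replica matrix; the matplotlib layout is an assumption about how the report was produced, not the actual report code):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
nreps, ndata = 10_000, 34
# toy pseudodata replicas; the real thing comes from make_replica/MakeReplica
replicas = rng.normal(loc=1.0, scale=0.1, size=(nreps, ndata))

fig, axes = plt.subplots(6, 6, figsize=(18, 18))
for i, ax in enumerate(axes.flat[:ndata]):
    # each panel: marginal distribution of one data point over all replicas
    ax.hist(replicas[:, i], bins=50)
    ax.set_title(f"datapoint {i}")
    ax.set_xlabel("value")
fig.tight_layout()
plt.show()
```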
Also, the variance estimated from the samples which were used to produce these histograms is the variance which Rosalyn calculated.
I was a bit sloppy with the heading and the x labels.
Ah, I follow!! In which case this is a very convincing comparison!! Weird about the p-value situation, though; I wonder what that's about. Thanks for these histograms!
Review thread on this excerpt of the new replica code:

```python
    1 + mult_corr_errors * rng.normal(size=(1, mult_corr_errors.shape[1])) / 100
).prod(axis=1)
...
mult_shifts.append(mult_shift)
```
I don't understand the logic here. I don't see how these things are being correlated across datasets. Random numbers for things like lumi should only be sampled once, and here it seems they are sampled once per dataset... That logic only seems to be there for the "special" systematics.
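(To spell out the concern: a cross-dataset systematic such as a luminosity must use a single draw shared by every dataset it touches, roughly as in this toy sketch; all names and numbers here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
datasets = {"ATLAS_A": np.ones(5), "ATLAS_B": np.ones(7)}  # toy central values
lumi_err = 0.018  # toy 1.8% luminosity uncertainty

# one draw for the shared systematic, reused for every affected dataset,
# rather than a fresh draw per dataset
lumi_shift = rng.normal()
shifted = {name: cv * (1 + lumi_err * lumi_shift) for name, cv in datasets.items()}
```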
`mult_corr_errors = mult_errors.loc[:, mult_errors.columns == "CORR"].to_numpy()`

This only gets applied to systematics which have a systype row of MULT, CORR. Everything that needs to be correlated across datasets, i.e. lumi, will have a special name which is not CORR and, as you point out, is handled below.
Ah, I see. I had missed the `~` in `mult_errors.loc[:, ~mult_errors.columns.isin(INTRA_DATASET_SYS_NAME)]`.
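(To make the split explicit, the column selection behaves roughly like the toy sketch below; the contents of `INTRA_DATASET_SYS_NAME` and the example systematic names here are illustrative, not copied from the code.)

```python
import numpy as np
import pandas as pd

# illustrative: systype names that are only correlated within one dataset
INTRA_DATASET_SYS_NAME = ("CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR")

# toy multiplicative errors for one dataset; columns are systype names
mult_errors = pd.DataFrame(np.ones((3, 3)), columns=["CORR", "UNCORR", "ATLASLUMI11"])

# correlated only within this dataset
intra = mult_errors.loc[:, mult_errors.columns == "CORR"]
# correlated across datasets: anything with a special name, e.g. a lumi
special = mult_errors.loc[:, ~mult_errors.columns.isin(INTRA_DATASET_SYS_NAME)]
print(special.columns.tolist())  # ['ATLASLUMI11']
```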
yeah, perhaps the comments could be made more obvious
I.e.:

```python
# all mult errors must be converted from percent to fraction
# generate all uncorrelated multiplicative shifts
mult_shift = (
    1 + mult_uncorr_errors * rng.normal(size=mult_uncorr_errors.shape) / 100
).prod(axis=1)
# generate shifts for systematics which are only correlated within this dataset
mult_corr_errors = mult_errors.loc[:, mult_errors.columns == "CORR"].to_numpy()
# store multiplicative shifts to apply after all additive shifts
mult_shifts.append(mult_shift)
```

etc.
Ok, I went over this, I tested it, and I think that the tests are sufficient.
Good, about time we merged this!
This is a direct translation of the C++ code for now; I will optimize it using numpy next, which hopefully should be fairly straightforward.

I've tried it with NMC so far, and if you initialize the RNG with the same seed it will generate exactly the same pseudodata. Also, I've not implemented the final check on the positivity of the pseudodata.