
MakeReplica in python #866

Merged: Zaharid merged 38 commits into master from python-pseudodata on Feb 26, 2021
Conversation

@siranipour (Contributor) commented Aug 19, 2020

This is a direct translation of the C++ code for now; I will optimize it with numpy next, which should hopefully be fairly straightforward.

I've tried it with NMC so far, and if you initialize the RNG with the same seed it generates exactly the same pseudodata.

Also, I've not yet implemented the final check on the positivity of the pseudodata.

@siranipour siranipour force-pushed the python-pseudodata branch 2 times, most recently from 8c94e4f to 0108a0e Compare August 19, 2020 14:06
@siranipour siranipour marked this pull request as draft August 19, 2020 14:46
@siranipour (Contributor, Author) commented Aug 20, 2020

@Zaharid I think you were thinking of issue #379, which this will close.

@siranipour siranipour force-pushed the python-pseudodata branch 2 times, most recently from d7013ba to 1c1633f Compare August 20, 2020 14:49
@siranipour (Contributor, Author) commented Aug 20, 2020

import numpy as np

import NNPDF
from validphys.covmats import make_replica
from validphys.loader import Loader

l = Loader()

for dataset in l.available_datasets:
    try:
        cd = l.check_commondata(dataset)
        ds = l.check_dataset(dataset, theoryid=53, cuts=None)
        exp = l.check_experiment('foo', [ds])
        ld = exp.load()
        # same seed as the C++ side so the pseudodata should match exactly
        NNPDF.RandomGenerator.InitRNG(0, 1)
        ld.MakeReplica()
        new_code = make_replica(cd)
        np.testing.assert_allclose(ld.get_cv(), new_code)
    except FileNotFoundError:
        continue

seems to be working

@siranipour siranipour marked this pull request as ready for review August 20, 2020 15:21
@voisey (Contributor) commented Aug 20, 2020

Will take a look soon! Out of interest, have you tested the difference in speed between the python and C++?

@siranipour (Contributor, Author)

Yeah, I checked very briefly. It's difficult to make an apples-to-apples comparison, because the question is whether you time the exp.load() line (I would be inclined to say yes).

In general it's fairly comparable, but the python is not blisteringly quicker.

Btw, I may have been premature with the review requests; maybe just wait until the next commit, since I've yet to sort out the docstring.

@siranipour (Contributor, Author)

Ok this is ready for review now.

I was wondering: what do people think about having some sort of temporary test we can use to ensure the python replacement replicates the old C++ code? We could then delete the test once we fully switch to python.

@siranipour siranipour force-pushed the python-pseudodata branch 2 times, most recently from 256f873 to a26ba80 Compare August 24, 2020 14:39
@voisey (Contributor) commented Aug 24, 2020

I don't think it would hurt to have something like that, so I'd be for it. We have something similar in https://github.com/NNPDF/nnpdf/blob/master/validphys2/src/validphys/tests/test_covmats.py, in that we currently test that both the C++ and the python work, and one of those tests will eventually be deleted. I know what you suggest is slightly different, but the point is that we already have tests that we plan on deleting.

@Zaharid (Contributor) commented Aug 24, 2020

More tests don't hurt especially in that they increase coverage.

@Zaharid (Contributor) commented Aug 24, 2020

I like how the code is nice and well commented. It will certainly help any efforts for making it fast.

@siranipour (Contributor, Author)

> I don't think it would hurt to have something like that, so I'd be for it. We have something similar in https://github.com/NNPDF/nnpdf/blob/master/validphys2/src/validphys/tests/test_covmats.py [...]

Ah yes, very good point! I forgot we implemented that test.

Ok sure, I guess we can add a new test altogether that does all the comparisons between C++ and python. I'll do this in another PR so they're all kept together.

Comment threads on validphys2/src/validphys/covmats.py (outdated)
@Zaharid (Contributor) commented Aug 25, 2020

FWIW there is a 30-page paper (that I have no patience to read) on how to sample from a truncated gaussian:

https://www4.stat.ncsu.edu/~boos/library/papers/mimeo2649_Li.pdf

On a somewhat related note, the fact that this is so complicated makes me think it's not such a great idea: if we are imposing positivity on the PDFs anyway, why are we also doing it in the sampling? More importantly, does it make any difference, other than increasing the complexity and the surface for bugs?
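For context, the positivity constraint discussed here amounts to sampling from a gaussian truncated to the positive orthant. A naive rejection-sampling sketch of that idea is below; this is a hypothetical, simplified stand-in, not the algorithm from the linked paper and not the validphys implementation (which translates the C++ logic):

```python
import numpy as np

def sample_truncated_normal(mean, cov, lower=0.0, max_tries=1000, rng=None):
    """Draw one multivariate normal sample with every component above
    ``lower``, by simple rejection: keep redrawing until a draw passes.
    This is inefficient when positivity violations are likely, which is
    roughly why the dedicated algorithms in the literature exist."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        draw = rng.multivariate_normal(mean, cov)
        if np.all(draw > lower):
            return draw
    raise RuntimeError("rejection sampling failed to produce a positive draw")

# toy example: two positively valued data points with a mild correlation
mean = np.array([3.0, 5.0])
cov = np.array([[1.0, 0.2],
                [0.2, 2.0]])
sample = sample_truncated_normal(mean, cov, rng=np.random.default_rng(0))
```

Note that rejection changes the sampled distribution (it is the gaussian conditioned on positivity), which is part of the question raised above about whether the truncation is worth its complexity.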

@siranipour (Contributor, Author)

I've quickly glanced at that paper and it certainly looks like it could be fun to take a look at. I just worry how easy it would be to accommodate some of the edge cases we have going on. Indeed, that's what forced me to give up on trying to understand the papers and to blindly translate the C++ code instead.

I've also wondered whether having positive pseudodata is essential. It'd be interesting to see how many times the algorithm stumbles across a sample that fails this criterion. Even if it does, it could be that the fit doesn't really care too much. I guess the only way to answer that question would be to construct pseudodata that has negative entries and then run a fit to see.

@siranipour siranipour force-pushed the python-pseudodata branch 5 times, most recently from 1f0275c to 3792b8b Compare August 27, 2020 14:39
@siranipour (Contributor, Author)

Just a quick series of commits that makes make_replica a provider, adds a provider that indexes the data, and adds a regression test.

@wilsonmr (Contributor)

@tgiani would you be able to take another look at this, since you have experience with the replica generation in c++?

@wilsonmr (Contributor)

I double-checked, and what I was saying at the code meeting is true: we do not generate shifts for THEORY systematics (corr or uncorr) in the old code, which btw shouldn't be confused with the theory covariance matrix.

Looking at the datasets which performed badly on the KS test, ATLAS ZpT rapidity and WZ rapidity: they both have only multiplicative systematics. I ran them through the non-regression tests and they passed, so I don't think there is a funny bug happening.

@RosalynLP how exactly did you generate the old data replicas? It occurs to me that because exp.MakeReplica() mutates the underlying C++ Experiment, you have to explicitly call load on the DataGroupSpec without caching each time you generate a replica, because otherwise it uses the shifted central value. Or make sure you copied the Experiment, as per lines 28-34 of mc_gen_checks.

Provided that's all fine, and that there were no errors in matching the data point indices between new and old replicas, then one suggestion is that the purely multiplicative sets might require more replicas to get a stable result.

@tgiani (Contributor) commented Feb 15, 2021

> @tgiani would you be able to take another look at this, since you have experience with the replica generation in c++?

Yes, sure, I'll look at this before the next code meeting.

@siranipour (Contributor, Author)

Yeah, if you look at #1087 and line 43 of mc_gen_checks.py, a copy of the Experiment object is made before calling MakeReplica, so it should be fine.

@wilsonmr (Contributor)

Sure, I was more wondering whether all of the results presented were definitely produced using that code or an equivalent.

@RosalynLP (Contributor)

@wilsonmr I used adapted versions of those in mc_gen_checks, so I copied the experiment.

@wilsonmr (Contributor) commented Feb 15, 2021

Great, just checking. The test isn't particularly simple here, so when the results look weird it seems worth looking for errors on both ends 😆

I just tried generating for just ATLASWZRAP with both codes and I get a similar result (expected, since the ATLAS-wide systematic is hardly going to be dominant here). I'm trying with 10000 reps now, since it doesn't take as long as a global generation. I'm also producing the histograms for each bin. On the first pass with 1000 reps the histograms looked close enough to me. Results for 10000 reps should be ready in 15 mins.

EDIT: I should say I also picked 10000 because it uses a different method in the auto mode of ks_2samp.

@wilsonmr (Contributor)

Huh, so the KS statistic with 10000 reps was about 0.025, which seems pretty low, but the p-value was pretty similar to your plot.

Having said that, looking at the distributions for each bin, I really don't think there is a problem here. So I think we can stop worrying: https://vp.nnpdf.science/J2wE4X3jRuWGaBjHZXKM-A==
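For reference, the test being discussed is scipy.stats.ks_2samp, whose default "auto" mode switches from the exact to the asymptotic p-value computation as the samples grow large (which is why 10000 reps exercises a different code path than 1000). A minimal usage sketch, with synthetic gaussian draws standing in for the two sets of replica values:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# synthetic stand-ins for the old (C++) and new (python) replica samples
# for a single data point; both drawn from the same distribution here
old_replicas = rng.normal(loc=0.0, scale=1.0, size=10_000)
new_replicas = rng.normal(loc=0.0, scale=1.0, size=10_000)

# the method defaults to "auto": exact for small samples, asymptotic for
# large ones such as these
result = ks_2samp(old_replicas, new_replicas)
```

A small KS statistic with a large p-value, as reported above, is consistent with the two codes sampling from the same distribution.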

@RosalynLP (Contributor)

@wilsonmr thanks, yeah, this looks good re the bin-by-bin distributions - to me this check makes more intuitive sense than the p-values do.

@siranipour (Contributor, Author) commented Feb 15, 2021

Is each histogram a bin? Forgive my ignorance, but I'm a bit confused about what the x axis refers to here and what qualifies as a count for the histogram.

@wilsonmr (Contributor)

Each histogram is for a data point, which I was referring to as a bin. ATLASWZRAP11 has 34 data points, so there are 34 histograms.

The histograms are made up of 10000 replicas, and the x axis is the raw value of that particular data point. So in the first histogram there were ~550 replicas whose first data point had a value of ~580000.

In other words, each histogram is an unnormalised approximation of the (marginalised*) distribution from which the pseudodata was drawn.

*marginalised because we are looking data point by data point, so each plot is as though we have integrated over all the other data points.
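The construction described above can be sketched as follows (with synthetic gaussian pseudodata standing in for the actual replica table; the 580000 scale is taken from the example value quoted above):

```python
import numpy as np

rng = np.random.default_rng(2)
n_replicas, n_points = 10_000, 34  # ATLASWZRAP11 has 34 data points

# synthetic stand-in for the replica table: one row per replica,
# one column per data point
replicas = rng.normal(loc=580_000.0, scale=5_000.0,
                      size=(n_replicas, n_points))

# the unnormalised marginal distribution of the first data point: each
# count is the number of replicas whose first data point fell in that bin
counts, edges = np.histogram(replicas[:, 0], bins=50)
```

One such histogram per column gives the 34 per-data-point plots in the report.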

@wilsonmr (Contributor)

Also, the variance estimated from the samples which were used to produce these histograms is the variance which Rosalyn calculated.

@wilsonmr (Contributor)

I was a bit sloppy with the heading and the x labels.

@siranipour (Contributor, Author)

Ah, I follow!! In which case this is a very convincing comparison!! Weird about the p-value situation though, I wonder what that's about. Thanks for these histograms!

1 + mult_corr_errors * rng.normal(size=(1, mult_corr_errors.shape[1])) / 100
).prod(axis=1)

mult_shifts.append(mult_shift)
Contributor:
I don't understand the logic here. I don't see how these things are being correlated across datasets. Random numbers for things like lumi should only be sampled once, and here it seems it is once per dataset... That logic only seems to be there for the "special" systematics.

Contributor:
mult_corr_errors = mult_errors.loc[:, mult_errors.columns == "CORR"].to_numpy()

this only gets applied to systematics which have a systype row of MULT, CORR. Everything that needs to be correlated across datasets, i.e. lumi, will have a special name which is not CORR and, as you point out, is handled below.

Contributor:
Ah, I see. I had missed the ~ in mult_errors.loc[:, ~mult_errors.columns.isin(INTRA_DATASET_SYS_NAME)].
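To spell out the column selection being discussed, here is a toy sketch. The real INTRA_DATASET_SYS_NAME constant lives in validphys; the names and values below are assumed for illustration only:

```python
import numpy as np
import pandas as pd

# assumed stand-in for validphys's INTRA_DATASET_SYS_NAME: the systype
# labels that are only correlated (or uncorrelated) within one dataset
INTRA_DATASET_SYS_NAME = ("CORR", "UNCORR", "THEORYCORR", "THEORYUNCORR")

# toy systematics table: columns are systype names, rows are data points;
# "ATLASLUMI11" is a hypothetical cross-dataset ("special") systematic
mult_errors = pd.DataFrame(
    np.ones((3, 4)),
    columns=["CORR", "UNCORR", "ATLASLUMI11", "CORR"],
)

# systematics correlated only within this dataset (systype CORR)
intra = mult_errors.loc[:, mult_errors.columns == "CORR"]

# the ~isin selects everything with a special name, i.e. the systematics
# (such as lumi) that must be correlated across datasets
special = mult_errors.loc[:, ~mult_errors.columns.isin(INTRA_DATASET_SYS_NAME)]
```

So the `columns == "CORR"` mask and the `~isin(...)` mask partition the multiplicative systematics into intra-dataset and cross-dataset groups, as described in the replies above.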

Contributor:
yeah, perhaps the comments could be made more obvious

Contributor:
i.e.

            # all mult errors must be converted from percent to fraction
            # generate all uncorrelated multiplicative shifts
            mult_shift = (
                1 + mult_uncorr_errors * rng.normal(size=mult_uncorr_errors.shape) / 100
            ).prod(axis=1)
            # generate shifts for systematics which are only correlated within this dataset
            mult_corr_errors = mult_errors.loc[:, mult_errors.columns == "CORR"].to_numpy()
            # store multiplicative shifts to apply after all additive shifts
            mult_shifts.append(mult_shift)

etc.

@Zaharid (Contributor) commented Feb 18, 2021

Ok, I went over this and tested it, and I think that the tests are sufficient.

@Zaharid (Contributor) commented Feb 26, 2021

Good, about time we merged this!

@Zaharid Zaharid merged commit a449869 into master Feb 26, 2021
@Zaharid Zaharid deleted the python-pseudodata branch February 26, 2021 11:50