Correlated covmats in python #955
Conversation
And accounting for correlations
|
I need to finish docstrings and tests
|
This would be the test we would like to run eventually, but it melts my computer:

Full test

```python
import numpy as np
from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader

l = Loader()
cds = []
dss = []
for ds in l.available_datasets:
    try:
        cds.append(l.check_commondata(ds))
        dss.append(l.check_dataset(ds, theoryid=53, cuts=None))
        print(ds)
    except:
        print("failed")
        continue

commondata_list = list(map(load_commondata, cds))
cd = combine_commondata(commondata_list)
new_covmat = covmat_from_systematics(cd)

exp = l.check_experiment("FOO", dss)
ld = exp.load()
old_covmat = ld.get_covmat()

assert np.allclose(old_covmat, new_covmat)
```

For now this test seems to work nicely:

Quick test

```python
import random

import numpy as np
from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader

l = Loader()
cds = []
dss = []
avails = l.available_datasets
samples = random.sample(avails, min(20, len(avails)))
for ds in samples:
    try:
        cd_spec = l.check_commondata(ds)
        ds_spec = l.check_dataset(ds, theoryid=53, cuts=None)
        cds.append(cd_spec)
        dss.append(ds_spec)
        print(ds)
    except:
        print("failed")
        continue

commondata_list = list(map(load_commondata, cds))
cd = combine_commondata(commondata_list)
new_covmat = covmat_from_systematics(cd)

exp = l.check_experiment("FOO", dss)
ld = exp.load()
old_covmat = ld.get_covmat()

try:
    assert np.allclose(old_covmat, new_covmat)
except AssertionError:
    from IPython import embed; embed()
```
Force-pushed d1e8c81 to c69a9aa
|
I'm not sure I'd randomly choose the datasets: as far as I can tell this could put together two datasets which are the same, except one has had a fix applied or something. I think we should systematically choose some datasets which cover all bases; the current testing datasets already cover some different configurations. I wouldn't try to do it for too many datasets at a time because the c++ code won't cope, and likely neither will python if you are constructing a dataframe with all the systematics.
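For what it's worth, a minimal sketch of what a fixed selection could look like in the quick test above, in place of the `random.sample` call; the dataset names here are placeholders and would need to be chosen to cover the different systematics configurations:

```python
from validphys.loader import Loader

# Hypothetical hand-picked selection; the real list should be chosen to cover
# the different configurations (ADD/MULT, CORR/UNCORR, cross-dataset systematics).
REPRESENTATIVE_DATASETS = ["NMC", "ATLASWZRAP36PB", "CDFZRAP"]

l = Loader()
cds = []
dss = []
for ds in REPRESENTATIVE_DATASETS:
    cds.append(l.check_commondata(ds))
    dss.append(l.check_dataset(ds, theoryid=53, cuts=None))
```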
|
Ah, this was just to make sure it was doing the global covmat correctly, since it's the function I'm least sure about. I'll come up with a quicker CI test tomorrow.
If the number of kins are not equal then raise ValueError
Force-pushed 1f8ad7d to 4fad3e2
|
Would say this is ready for review
|
Also, I just tried a time comparison using 19 random (but containing some correlations) datasets: python: 107.28161001205444 s. Edit: if I use 40 datasets the improvement is even more marked: python: 364.6011164188385 s.
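For reference, a sketch of how such a wall-clock comparison might be made, assuming `commondata_list` and `exp` have been built as in the test snippets above (this is not necessarily how the numbers quoted here were obtained):

```python
import time

from validphys.commondataparser import combine_commondata
from validphys.covmats import covmat_from_systematics

# Python construction: combine the commondata and build the covmat directly.
start = time.perf_counter()
new_covmat = covmat_from_systematics(combine_commondata(commondata_list))
print(f"python: {time.perf_counter() - start} s")

# C++ construction: covmat built by libnnpdf when the experiment is loaded.
start = time.perf_counter()
old_covmat = exp.load().get_covmat()
print(f"c++: {time.perf_counter() - start} s")
```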
|
Good! To be expected because the way the covmat is constructed in c++ is quite inefficient.
Force-pushed 4fad3e2 to ddac5ef
|
Isn't it easier to have covmats working with lists of commondata than introducing some "combined commondata" object that is not quite the same?
|
We could do, but I had always envisaged these functions operating on
…hout creating fake commondata object of multiple datasets
…reshuffling of functions
correlated covmats from lists of commondata
Force-pushed 4e6697b to fadc917
Making tests use these fixtures
Force-pushed fadc917 to 1b67c30
|
Ok doke, I've implemented the fixtures; I think this is ready for reviewing/merging?
|
I have to look at the code in detail, but the changeset feels right when you scroll down.
Changing `conftest` datasets to use the data keyword
Co-authored-by: wilsonmr <33907451+wilsonmr@users.noreply.github.com>
Zaharid left a comment
I only have relatively minor comments (except perhaps the one related to performance, but I will survive).
I did not go over the math, but the tests look sensible to me so I am going to trust those.
|
|
```python
def split_uncertainties(commondata):
    """Take the statistical uncertainty and systematics table from
    a :py:class:`validphys.coredata.CommonData` object
```
|
Could you please add this as a documented parameter? The reason is that I don't know how to read and didn't see it on the first pass (or add a type annotation; it would have some value here).
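For concreteness, a sketch of what the requested documentation might look like (the numpydoc-style section is an assumption about the convention in use; the return description is left out here):

```python
def split_uncertainties(commondata):
    """Take the statistical uncertainty and systematics table from
    a :py:class:`validphys.coredata.CommonData` object.

    Parameters
    ----------
    commondata : validphys.coredata.CommonData
        The loaded commondata whose statistical and systematic
        uncertainties are to be split up.
    """
```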
|
If I run something like

```
In [22]: cd = l.check_commondata("NMC")

In [23]: lcd = load_commondata(cd)

In [24]: %timeit split_uncertainties(lcd)
293 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

which is a bit much (or maybe I just need a new laptop). Profiling this a bit and seeing e.g. if some kind of groupby operation (possibly supplemented by indexing the original coredata dataframe by the systype) would be more efficient than the repeated use of things like
```python
thunc = abs_sys_errors[:, sys_name == "THEORYUNCORR"]
unc = abs_sys_errors[:, sys_name == "UNCORR"]
thcorr = abs_sys_errors[:, sys_name == "THEORYCORR"]
corr = abs_sys_errors[:, sys_name == "CORR"]
```

would definitely be in the nice-to-have category.
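As a rough illustration of the suggestion (a toy example, not the actual validphys data structures), grouping the columns by systematic name gives all the blocks in a single pass instead of one boolean mask per name:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the absolute systematics table, with columns labelled by sys name.
sys_name = ["CORR", "UNCORR", "THEORYCORR", "UNCORR"]
abs_sys_errors_df = pd.DataFrame(np.random.rand(5, 4), columns=sys_name)

# One groupby over the columns instead of repeated boolean indexing.
blocks = {
    name: block.to_numpy()
    for name, block in abs_sys_errors_df.groupby(abs_sys_errors_df.columns, axis=1)
}
corr = blocks.get("CORR", np.empty((5, 0)))
unc = blocks.get("UNCORR", np.empty((5, 0)))
```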
|
Perhaps we could use an `isin`; let me take a look. BTW, what did you use to profile the function?
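A sketch of what the `isin` idea could look like (again a toy frame; `INTRA_DATASET_SYS_NAME` is taken to contain the four names used above, and "ATLASLUMI" is just a made-up cross-dataset systematic):

```python
import numpy as np
import pandas as pd

INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR")

df = pd.DataFrame(np.random.rand(5, 3), columns=["CORR", "UNCORR", "ATLASLUMI"])

# Single mask selecting all intra-dataset systematics at once; missing names
# simply select nothing, so there is no KeyError to worry about.
special_mask = df.columns.isin(INTRA_DATASET_SYS_NAME)
intra = df.loc[:, special_mask]
inter = df.loc[:, ~special_mask]
```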
|
Just the %timeit magic of IPython.
|
Hmm, the awkward thing with this indexing is dealing with the cases when the key isn't present. The groupby seems promising, i.e.

```python
isspecial = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
split_dict = {
    group: df for group, df in abs_sys_errors_df.groupby(by=isspecial, axis=1)
}
```

But perhaps the covmat can be constructed directly from the groupby rather than having this split uncertainty function?
I also wonder if this is a little slow:

```python
abs_sys_errors_df = sys_errors_df.apply(
    lambda x: [
        i.add if i.sys_type == "ADD" else (i.mult * j / 100)
        for i, j in zip(x, commondata.central_values)
    ]
)
```

Although I don't see much alternative with the current way we store the errors. I do half wonder if storing the uncertainties as two dataframes (one for mult, one for add) with sysnames as the column index, where the mult dataframe would just hold the percentages (as raw numbers) and the additive one the absolute values, would make slightly more sense in the context of what we do with the systematics?
The current method of storage is quite fancy, but the raw data which we want to access ends up being nested quite far in.
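To make the proposed storage concrete, here is a small mock-up (my reading of the suggestion, not the actual `CommonData` layout): two plain float dataframes, `mult` holding raw percentages and `add` holding absolute values, both with the sysname as column index, so the conversion becomes a single vectorised operation:

```python
import numpy as np
import pandas as pd

central_values = np.array([10.0, 12.0, 9.0])

# MULT uncertainties stored as raw percentages, ADD as absolute values.
mult = pd.DataFrame([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]], columns=["CORR", "UNCORR"])
add = pd.DataFrame([[0.1], [0.2], [0.15]], columns=["THEORYCORR"])

# No applymap over per-element objects: one multiplication converts everything.
converted_mult = mult.multiply(central_values, axis=0) / 100
abs_sys = pd.concat([converted_mult, add], axis=1)
```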
|
Ah apologies, the posts crossed. Ok, I guess there is no way to avoid this? So we could speed up the df.apply then, since it seems to be the bottleneck
|
So an alternative to the df.apply is the following, but it doesn't give all that big a saving:

```python
sys_type = commondata.systype_table["type"]
mult_errors = sys_type == "MULT"
# mult uncertainties are stored as percentages, hence the division by 100
mult_sys_errors = sys_errors_df.loc[:, mult_errors].applymap(lambda x: x.mult)
converted_mult_sys_errors = mult_sys_errors.multiply(commondata.central_values, axis=0) / 100
abs_sys_errors_df = sys_errors_df.applymap(lambda x: x.add)
abs_sys_errors_df.loc[:, mult_errors] = converted_mult_sys_errors
```

Let me know what you think.

Edit: in fact, if anything, it's slower.
|
Hmm, well, returning to my previous comment: I wonder if a slight refactoring of how the systematics are stored would benefit us here. I mean, AFAIK `applymap` is essentially `for column in columns: map(func, column)`, so it's not super far away from a nested for loop, which I think we want to avoid in python. When I think about the systematics file, it's unclear to me why we want every element to be an object, because all elements in the same column have the same systematic name and type. When we load the data we have to manually convert it into this format only to have to unpack it like this, which seems a little suboptimal.
In the end there is this historic doubling of information in the commondata files; however, changing that would involve changing buildmaster, which I don't think anybody wants to do. With that in mind, I think that just storing the multiplicative columns for MULT uncertainties and the additive (absolute) uncertainties for the ADD uncertainties as dataframes, with the sysname as column index, would mean we could then do something like:
```python
mat = np.diag(commondata.stat_errors.to_numpy())
is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
converted_mult = commondata.mult * central_values[:, np.newaxis] / 100
for abs_sys_df in (converted_mult, commondata.add):
    for sys_name, sys_table in abs_sys_df.groupby(by=is_special, axis=1):
        if sys_name == "UNCORR":
            mat += np.diag((sys_table.values ** 2).sum(axis=1))
        ...
```

I think this would be quite a bit faster than what we have, because we convert the mult uncertainties in a vectorised way and we pick out the different cases using groupby. BTW, we don't even have to drop SKIP here, we just don't add it to the mat. In some sense this is more similar to the c++ code, but we avoid explicit nested loops (and let numpy handle it).
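To check my understanding of the proposal, here is a self-contained toy version with a guess at what the elided branches would do (correlated blocks contributing via an outer product, SKIP contributing nothing); this is only my reading of the sketch above, not the final implementation:

```python
import numpy as np
import pandas as pd

INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR", "SKIP")

def toy_covmat(stat_errors, mult, add, central_values):
    """Covmat from percentage (mult) and absolute (add) systematics dataframes."""
    is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
    # Statistical uncertainties enter squared on the diagonal.
    mat = np.diag(stat_errors ** 2)
    converted_mult = mult.multiply(central_values, axis=0) / 100
    for abs_sys_df in (converted_mult, add):
        for sys_name, sys_table in abs_sys_df.groupby(by=is_special, axis=1):
            block = sys_table.to_numpy()
            if sys_name in ("UNCORR", "THEORYUNCORR"):
                mat += np.diag((block ** 2).sum(axis=1))
            elif sys_name in ("CORR", "THEORYCORR", "SPECIAL"):
                mat += block @ block.T
            # sys_name == "SKIP": contributes nothing
    return mat

# Tiny usage example with made-up numbers.
stat = np.array([0.5, 0.4, 0.6])
cv = np.array([10.0, 12.0, 9.0])
mult = pd.DataFrame([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]], columns=["CORR", "UNCORR"])
add = pd.DataFrame([[0.1], [0.2], [0.15]], columns=["THEORYCORR"])
cov = toy_covmat(stat, mult, add, cv)
```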
|
I should note that INTRA_DATASET_SYS_NAME would need to include SKIP in that example, or else those entries would erroneously be treated as experimental uncertainties.
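In other words (a small illustration of that remark; the extra systematic name is made up):

```python
INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR", "SKIP")

is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"

assert is_special("SKIP") == "SKIP"          # explicitly recognised, then ignored
assert is_special("ATLASLUMI") == "SPECIAL"  # genuine cross-dataset systematic
# Without "SKIP" in the tuple it would map to "SPECIAL" and wrongly enter the covmat.
```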
|
Another note: the sys_errors property of coredata.CommonData calls the function which does a nested for loop to construct the complicated format, so I wonder whether it's really the indexing, or whether it's the sys_errors taking ages to construct.
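One way to check that hypothesis, assuming the property is recomputed on each access and that `split_uncertainties` lives where it is imported from below (both assumptions on my part):

```python
import timeit

from validphys.commondataparser import load_commondata
from validphys.covmats import split_uncertainties  # assumed import location
from validphys.loader import Loader

lcd = load_commondata(Loader().check_commondata("NMC"))

# If most of the ~300 ms is spent building sys_errors, the two numbers will be close.
t_property = timeit.timeit(lambda: lcd.sys_errors, number=10) / 10
t_split = timeit.timeit(lambda: split_uncertainties(lcd), number=10) / 10
print(f"sys_errors: {t_property:.3f} s, split_uncertainties: {t_split:.3f} s")
```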
Updating docstring to reflect this
I've added a function that takes in a list of `CommonData` objects and combines them into one 'effective' `CommonData` where the (possible) correlations between the datasets are correctly accounted for. So you can now do something like:
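A minimal sketch of the intended usage, following the test snippets above (the dataset names are placeholders):

```python
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.covmats import covmat_from_systematics
from validphys.loader import Loader

l = Loader()
commondata_list = [
    load_commondata(l.check_commondata(name))
    for name in ("ATLASWZRAP36PB", "ATLASZHIGHMASS49FB")  # placeholder dataset names
]
combined = combine_commondata(commondata_list)
# Covariance matrix with cross-dataset correlations correctly accounted for.
covmat = covmat_from_systematics(combined)
```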
This will in principle remedy #866 (comment) too, though I've not tested it yet.
The relevant piece of C++ code is here: nnpdf/libnnpdf/src/experiments.cc, line 437 in f038c95.