Correlated covmats in python #955

Merged: Zaharid merged 16 commits into master from python-correlated-covmats on Nov 4, 2020

Conversation

@siranipour (Contributor)

I've added a function that takes in a list of CommonData objects and combines them into one 'effective' CommonData where the (possible) correlations between the datasets are correctly accounted for.

So you can now do something like

import numpy as np

from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader

l = Loader()

cd1 = l.check_commondata("ATLASLOMASSDY11EXT")
cd2 = l.check_commondata("ATLASZHIGHMASS49FB")

commondata_list = list(map(load_commondata, (cd1, cd2)))

cd = combine_commondata(commondata_list)

new_covmat = covmat_from_systematics(cd)

ds1 = l.check_dataset("ATLASLOMASSDY11EXT", theoryid=53, cuts=None)
ds2 = l.check_dataset("ATLASZHIGHMASS49FB", theoryid=53, cuts=None)

exp = l.check_experiment("FOO", [ds1, ds2])
ld = exp.load()

old_covmat = ld.get_covmat()

np.testing.assert_allclose(old_covmat, new_covmat)

This will in principle remedy #866 (comment) too, though I've not tested it yet.

The relevant piece of C++ code is here:

void Experiment::PullData()

And accounting for correlations
@siranipour (Contributor, Author)

I need to finish docstrings and tests

Comment thread on validphys2/src/validphys/commondataparser.py (outdated)
@siranipour (Contributor, Author) commented Oct 8, 2020

This would be the test we would like to run eventually, but it melts my computer:

Full test
import numpy as np

from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader
l = Loader()
    
cds = []
dss = []
for ds in l.available_datasets:
    try:
        cds.append(l.check_commondata(ds))
        dss.append(l.check_dataset(ds, theoryid=53, cuts=None))
        print(ds)
    except Exception:
        print(f"failed: {ds}")
        continue

commondata_list = list(map(load_commondata, cds))

cd = combine_commondata(commondata_list)

new_covmat = covmat_from_systematics(cd)

exp = l.check_experiment("FOO", dss)
ld = exp.load()

old_covmat = ld.get_covmat()

assert np.allclose(old_covmat, new_covmat)

for now this test seems to work nicely:

Quick test
import random

import numpy as np

from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader
l = Loader()


cds = []
dss = []
avails = l.available_datasets
samples = random.sample(avails, min(20, len(avails)))
for ds in samples:
    try:
        cd_spec = l.check_commondata(ds)
        ds_spec = l.check_dataset(ds, theoryid=53, cuts=None)
        cds.append(cd_spec)
        dss.append(ds_spec)
        print(ds)
    except Exception:
        print(f"failed: {ds}")
        continue

commondata_list = list(map(load_commondata, cds))

cd = combine_commondata(commondata_list)

new_covmat = covmat_from_systematics(cd)

exp = l.check_experiment("FOO", dss)
ld = exp.load()

old_covmat = ld.get_covmat()

try:
    assert np.allclose(old_covmat, new_covmat)
except AssertionError:
    from IPython import embed; embed()

@siranipour force-pushed the python-correlated-covmats branch from d1e8c81 to c69a9aa on October 8, 2020 at 15:53
@wilsonmr (Contributor) commented Oct 8, 2020

I'm not sure I'd randomly choose the datasets; as far as I can tell, this could put together two datasets which are the same except that one has had a fix applied, or something like that.

I think we should systematically choose some datasets which cover all bases. The current testing datasets cover some different configurations, like sys: 10 and using cfactors; we then additionally need datasets with correlated multiplicative and additive systematics across datasets. It would probably be sufficient to construct NMC (both datasets) and then a selection from ATLAS and CMS which fulfil the above criteria (see the sketch below).

I wouldn't try to do it for too many datasets at a time, because the C++ code won't cope, and likely neither will Python if you are constructing a dataframe with all the systematics.
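A sketch of what such a hand-picked test might look like, following the loader pattern from the snippets above (this particular dataset list is illustrative, not the one the PR settled on):

import numpy as np

from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader

# Illustrative hand-picked list: NMC (both datasets) plus the two ATLAS
# datasets from the example at the top of the thread.
DATASETS = ["NMC", "NMCPD", "ATLASLOMASSDY11EXT", "ATLASZHIGHMASS49FB"]

def test_combined_covmat_matches_cpp():
    l = Loader()
    lcds = [load_commondata(l.check_commondata(name)) for name in DATASETS]
    new_covmat = covmat_from_systematics(combine_commondata(lcds))

    dss = [l.check_dataset(name, theoryid=53, cuts=None) for name in DATASETS]
    old_covmat = l.check_experiment("FOO", dss).load().get_covmat()

    np.testing.assert_allclose(old_covmat, new_covmat)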

@siranipour (Contributor, Author)

Ah, this was just to make sure it was doing the global covmat correctly, since it's the function I'm least sure about. I'll come up with a quicker CI test tomorrow.

If the number of kins are not equal then raise ValueError
@siranipour force-pushed the python-correlated-covmats branch 3 times, most recently from 1f8ad7d to 4fad3e2 on October 9, 2020 at 16:00
@siranipour requested a review from Zaharid on October 9, 2020 at 16:03
@siranipour (Contributor, Author)

I'd say this is ready for review.

@siranipour (Contributor, Author) commented Oct 12, 2020

Also, I just tried a time comparison using 19 random (but having some correlation) datasets:

python: 107.3 s
C++: 372.7 s

Edit: if I use 40 datasets the improvement is even more marked:

python: 364.6 s
C++: 1675.8 s

@wilsonmr (Contributor)

Good! That's to be expected, because the way the covmat is constructed in C++ is quite inefficient.

@siranipour requested a review from voisey on October 12, 2020 at 13:06
@siranipour force-pushed the python-correlated-covmats branch from 4fad3e2 to ddac5ef on October 13, 2020 at 10:09
@siranipour mentioned this pull request on Oct 13, 2020
@Zaharid (Contributor) commented Oct 13, 2020

Isn't it easier to have covmats working with lists of commondata than introducing some "combined commondata" object that is not quite the same?

@siranipour (Contributor, Author)

We could do that, but I had always envisaged these functions operating on CommonData objects. So if you wrote a function that worked on a per-dataset basis, you could be certain it would work on a collection of datasets automatically, because you'd give it an 'effective' CommonData.
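For illustration, a minimal sketch reusing `commondata_list` from the example at the top of the thread:

# Because combine_commondata returns an 'effective' CommonData, a
# per-dataset function such as covmat_from_systematics works on a
# collection without any special casing:
effective_cd = combine_commondata(commondata_list)
covmat_from_systematics(effective_cd)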

Comment thread on validphys2/src/validphys/tests/test_covmats.py (outdated)
Comment thread on validphys2/src/validphys/tests/test_covmats.py (outdated)
@siranipour force-pushed the python-correlated-covmats branch 3 times, most recently from 4e6697b to fadc917 on October 29, 2020 at 14:07
@siranipour force-pushed the python-correlated-covmats branch from fadc917 to 1b67c30 on October 29, 2020 at 15:33
@siranipour (Contributor, Author)

Okey doke, I've implemented the fixtures; I think this is ready for reviewing/merging?

@Zaharid (Contributor) commented Oct 29, 2020

I have to look at the code in detail, but the changeset feels right when you scroll down.

@wilsonmr (Contributor) left a comment

Just one small change to the new tests. @Zaharid did you get a chance to look at this yet?

Comment thread on validphys2/src/validphys/tests/conftest.py
Comment thread on validphys2/src/validphys/tests/conftest.py (outdated)
Changing `conftest` datasets to use the data keyword

Co-authored-by: wilsonmr <33907451+wilsonmr@users.noreply.github.com>
@Zaharid (Contributor) left a comment

I only have relatively minor comments (except perhaps the one related to performance, but I will survive).

I did not go over the math, but the tests look sensible to me so I am going to trust those.


def split_uncertainties(commondata):
"""Take the statistical uncertainty and systematics table from
a :py:class:`validphys.coredata.CommonData` object
Contributor:

Could you please add this as a documented parameter? The reason is that I don't know how to read and didn't see it on the first pass (or add a type annotation; it would have some value here).

Contributor:

If I run something like

In [22]: cd = l.check_commondata("NMC")

In [23]: lcd = load_commondata(cd)

In [24]: %timeit split_uncertainties(lcd)
293 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is a bit much (or maybe I just need a new laptop). Profiling this a bit, and seeing e.g. whether some kind of groupby operation (possibly supplemented by indexing the original coredata dataframe by the systype) would be more efficient than the repeated use of things like

    thunc = abs_sys_errors[:, sys_name == "THEORYUNCORR"]
    unc = abs_sys_errors[:, sys_name == "UNCORR"]
    thcorr = abs_sys_errors[:, sys_name == "THEORYCORR"]
    corr = abs_sys_errors[:, sys_name == "CORR"]

would definitely be in the nice-to-have category.
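For illustration, here is the kind of single grouped pass that could replace those four boolean masks (a toy table; the column labels and values are invented):

import numpy as np
import pandas as pd

# Toy absolute systematics: one column per systematic, labelled by its name.
sys_name = np.array(["THEORYUNCORR", "UNCORR", "THEORYCORR", "CORR", "CORR"])
abs_sys_errors = pd.DataFrame(np.arange(15.0).reshape(3, 5), columns=sys_name)

# One groupby over the column labels yields every block in a single pass;
# names that do not occur are simply absent from the dict.
blocks = {
    name: table.to_numpy()
    for name, table in abs_sys_errors.groupby(by=sys_name, axis=1)
}
assert blocks["CORR"].shape == (3, 2)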

Contributor (Author):

Perhaps we could use an isin; let me take a look. By the way, what did you use to profile the function?

Contributor:

Just the %timeit magic of IPython.
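For a function-level breakdown there is also IPython's %prun magic, or the standard library's cProfile (a sketch, with `lcd` and `split_uncertainties` as in the session above):

%prun -l 10 split_uncertainties(lcd)   # ten most expensive entries

# Or, outside IPython, dump and inspect a profile with the standard library:
import cProfile
import pstats

cProfile.run("split_uncertainties(lcd)", "split.prof")
pstats.Stats("split.prof").sort_stats("cumulative").print_stats(10)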

Contributor:

Hmm, the awkward thing with this indexing is dealing with the cases when the key isn't present. The groupby seems promising, i.e.

isspecial = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
split_dict = {group: df for group, df in
    abs_sys_errors_df.groupby(by=isspecial, axis=1)}

But perhaps the covmat can be constructed directly from the groupby rather than having this split uncertainty function?

I also wonder if this is a little slow:

    abs_sys_errors_df = sys_errors_df.apply(
        lambda x: [
            i.add if i.sys_type == "ADD" else (i.mult * j / 100)
            for i, j in zip(x, commondata.central_values)
        ]
    )

Although I don't see much of an alternative with the current way we store the errors. I do half wonder if it would make slightly more sense, in the context of what we do with the systematics, to store the uncertainties as two dataframes (one for mult, one for add) with sysnames as the column index: the mult dataframe would just hold the percentages (as raw numbers) and the additive one the absolute values (see the sketch below).

The current method of storage is quite fancy, but the raw data which we want to access ends up nested quite deeply.
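A rough sketch of that alternative storage (the mult/add tables here are hypothetical, not the current coredata API; the numbers are invented):

import numpy as np
import pandas as pd

# Hypothetical layout: two plain-float tables keyed by sysname, instead of
# one table whose every element is an (add, mult, sys_type) object.
names = ["CORR", "UNCORR"]
mult = pd.DataFrame([[1.0, 2.0], [0.5, 1.5]], columns=names)  # percentages
add = pd.DataFrame([[0.1, 0.0], [0.2, 0.3]], columns=names)   # absolute values
central_values = np.array([10.0, 20.0])

# Converting MULT systematics to absolute values becomes one vectorised op,
# with no per-element unpacking.
converted_mult = mult * central_values[:, np.newaxis] / 100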

Contributor (Author):

Ah, apologies, the posts crossed. OK, I guess there is no way to avoid this? So we could speed up the df.apply then, since it seems to be the bottleneck.

@siranipour (Contributor, Author) commented Nov 4, 2020

So, an alternative to the df.apply is the following, but it doesn't give all that big a saving:

    sys_type = commondata.systype_table["type"]
    mult_errors = sys_type == "MULT"
    mult_sys_errors = sys_errors_df.loc[:, mult_errors].applymap(lambda x: x.mult)
    converted_mult_sys_errors = mult_sys_errors.multiply(commondata.central_values, axis=0) / 100
    abs_sys_errors_df = sys_errors_df.applymap(lambda x: x.add)
    abs_sys_errors_df.loc[:, mult_errors] = converted_mult_sys_errors

Let me know what you think.

Edit: in fact, if anything, it's slower.

@wilsonmr (Contributor) commented Nov 4, 2020

Hmm, well, returning to my previous comment: I wonder if a slight refactoring of how the systematics are stored would benefit us here. I mean, AFAIK applymap is essentially a `for column in columns: map(func, column)`, so it's not far from a nested for loop, which I think we want to avoid in Python. When I think about the systematic file, it's unclear to me why we want every element to be an object, because all elements in the same column have the same systematic name and type. When we load the data we have to manually convert it into this format, only to have to unpack it like this, which seems a little suboptimal.

In the end there is this historic doubling of information in the commondata files; however, changing that would involve changing buildmaster, which I don't think anybody wants to do. With that in mind, I think that storing just the multiplicative columns for MULT uncertainties and the additive (i.e. absolute) uncertainties for the ADD ones as dataframes, with the sysname as the column index, would mean we could do something like:

mat = np.diag(commondata.stat_errors.to_numpy())
is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
converted_mult = commondata.mult * central_values[:, np.newaxis] / 100
for abs_sys_df in (converted_mult, commondata.add):
    for sys_name, sys_table in abs_sys_df.groupby(by=is_special, axis=1):
        if sys_name == "UNCORR":
            mat += np.diag((sys_table.values ** 2).sum(axis=1))
    ...

which I think would be quite a bit faster than what we have, because we convert the mult uncertainties in a vectorised way and pick out the different cases using groupby. BTW, we don't even have to drop SKIP here; we just don't add it to the mat. In some sense this is more similar to the C++ code, but we avoid explicit nested loops (and let numpy handle it).
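For concreteness, a self-contained toy version of that loop (all names and numbers are invented; the correlated branch adds the usual outer product, and statistical variances sit on the diagonal):

import numpy as np
import pandas as pd

INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR", "SKIP")
is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"

# Invented data: three points, statistical errors, two absolute systematics.
stat_errors = np.array([0.5, 0.4, 0.3])
abs_sys = pd.DataFrame(
    [[1.0, 0.2], [0.8, 0.1], [0.9, 0.3]], columns=["CORR", "UNCORR"]
)

mat = np.diag(stat_errors ** 2)  # statistical variances on the diagonal
for sys_name, sys_table in abs_sys.groupby(by=is_special, axis=1):
    if sys_name == "SKIP":
        continue  # SKIP systematics never enter the covariance
    if sys_name in ("UNCORR", "THEORYUNCORR"):
        mat += np.diag((sys_table.values ** 2).sum(axis=1))
    else:  # CORR, THEORYCORR and SPECIAL are fully correlated: outer product
        mat += sys_table.values @ sys_table.values.T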

Contributor:

I should note that INTRA_DATASET_SYS_NAME would need to include SKIP in that example, or else SKIP systematics would erroneously be treated as experimental uncertainties.

Contributor:

Another note: the sys_errors property of coredata.CommonData calls the function which does a nested for loop to construct the complicated format, so I wonder whether it's really the indexing, or rather sys_errors itself, that takes ages to construct.
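One way to tell the two costs apart (a sketch; `lcd` and `split_uncertainties` as in the earlier session):

%timeit lcd.sys_errors            # building the object-valued table alone
%timeit split_uncertainties(lcd)  # the total; the difference is the indexing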

Comment thread on validphys2/src/validphys/covmats.py (outdated)
Comment thread on validphys2/src/validphys/covmats.py
@Zaharid merged commit 31b993f into master on Nov 4, 2020
@Zaharid deleted the python-correlated-covmats branch on November 4, 2020 at 16:26