correlated covmats from lists of commondata #970
Conversation
…hout creating fake commondata object of multiple datasets
| """Take the systematics table from a commondata and remove any systematics | ||
| which have name "SKIP". Next transform the systematics into the correct units. | ||
| Additive (ADD) systematics are left unchanged, whereas multiplicative (MULT) | ||
| systematics need to be converted from a percentage into an uncertainty | ||
| by multiplying by the central value and dividing by 100. | ||
|
|
||
| We now pick datapoint i and loop over all other datapoints j. For each such pair of points, | ||
| loop over all the systematics, if the l'th systematic of data point i has name ``SKIP`` we ignore | ||
| it. This is handled by scanning over all sytematic errors in the ``sys_errors`` dataframe and dropping | ||
| any columns which correspond to a systematic error name of ``SKIP``, thus the ``sys_errors`` dataframe | ||
| defined below contains only the systematics without the ``SKIP`` name. | ||
| Finally the systematics are split into the different components which | ||
| are treated differently when constructing the covariance matrix. | ||
|
|
||
| Note that in the switch statement an ADDitive or MULTiplicative systype of systematic i is handled | ||
| by either multiplying the additive or multiplicative uncertainties respectively. We instead opt to | ||
| create a purely additive systematic table where all elements are to be understood as additive | ||
| systematics and systematics of type MULT have been correctly transformed to their additive | ||
| values. This dataframe we call ``additive_mat``. | ||
| Returns a tuple of the statistical uncertainty and 5 possible groups | ||
| of systematics uncertainties: UNCORR, CORR, THEORYUNCORR, THEORYCORR and | ||
| inter-dataset systematics |
like I said, I'll make this proper English later
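For reference, a minimal sketch of the transformation the docstring describes, i.e. dropping ``SKIP`` systematics and converting MULT (percentage) entries into additive uncertainties. The table layout, column names and helper dicts here are hypothetical, not the actual `sys_errors` structure:

```python
import pandas as pd

# Hypothetical layout: one column per systematic, one row per datapoint.
central_values = pd.Series([10.0, 20.0])
sys_errors = pd.DataFrame(
    {
        "sys_1": [0.5, 0.3],  # ADD: already an absolute uncertainty
        "sys_2": [1.0, 1.0],  # name SKIP: to be dropped entirely
        "sys_3": [2.0, 1.5],  # MULT: a percentage of the central value
    }
)
treatments = {"sys_1": "ADD", "sys_2": "ADD", "sys_3": "MULT"}
names = {"sys_1": "CORR", "sys_2": "SKIP", "sys_3": "CORR"}

# Drop any systematic whose *name* is SKIP.
sys_errors = sys_errors[[c for c in sys_errors.columns if names[c] != "SKIP"]]

# Convert the MULT (percentage) columns into additive uncertainties,
# leaving a purely additive table analogous to ``additive_mat``.
additive_mat = sys_errors.copy()
for col in additive_mat.columns:
    if treatments[col] == "MULT":
        additive_mat[col] = additive_mat[col] * central_values / 100
```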
|
I also find this method to be quicker. But then I also find the C++ method to be quicker, admittedly testing both in a jupyter notebook on a limited number of datasets:

```python
# Assuming the usual validphys imports; commondatas_covmat_from_systematics,
# covmat_from_systematics and combine_commondata come from this branch.
from validphys.loader import Loader
from validphys.commondataparser import load_commondata

def old_method():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    THEORYID = 162
    dss = [
        l.check_dataset(i, theoryid=THEORYID, cuts=None) for i in correlated_datasets
    ]
    exp = l.check_experiment("null", dss)
    ld_exp = exp.load.__wrapped__(exp)
    ld_exp.get_covmat()

%timeit old_method()
373 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
def list_cd():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    cds = list(map(l.check_commondata, correlated_datasets))
    ld_cds = list(map(load_commondata, cds))
    covmat = commondatas_covmat_from_systematics(ld_cds)

%timeit list_cd()
391 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
def combine_cd():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    cds = list(map(l.check_commondata, correlated_datasets))
    ld_cds = list(map(load_commondata, cds))
    combo_cd = combine_commondata(ld_cds)
    covmat = covmat_from_systematics(combo_cd)

%timeit combine_cd()
1.3 s ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
This was well received at the time!! To be honest, I'm more of a fan of the idea that these functions should work with
Without having looked at this in detail, I must say that the combination of my prior bias and the +140 −273 diff makes me think very favourably of this PR.
Well, if there is overwhelming support we can put it back. I just don't see what it adds on top of us having the Python source code, which I think shows more clearly that some parts are added to the diagonal and then the matrix product of the other parts is taken.
I can see why you like combining the commondata, because then you can reuse certain bits of code, and it is quite an alluring prospect. I think the baseline code was really good, and I basically just slightly changed the return of your original covmat function and renamed it, because it no longer returns a covmat. The pros for this:

The cons are that we will kind of implicitly have to generate data for both a single commondata and a list of commondata. Actually, I don't think this is strictly true: in general we only need the latter, because we generally want to generate data for multiple commondata, and the latter can still do the job of the former by making a list of a single dataset anyway; btw, that's sort of what happened in

As a final comment, I mentioned before

which could potentially simplify #866 and, later on, could even be used to speed up the generation of pseudodata when we used
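On that note, the single-commondata case really is just the one-element-list case. A quick sketch, assuming the loader calls from the snippets above and the covmat function from this branch:

```python
from validphys.loader import Loader
from validphys.commondataparser import load_commondata
# Assumed to live in validphys.covmats on this branch; the function name
# changed during the PR, so adjust to whichever variant was merged.
from validphys.covmats import datasets_covmat_from_systematics

l = Loader()
cd = load_commondata(l.check_commondata("ATLASWZRAP36PB"))
# A single dataset is handled by passing a one-element list.
covmat = datasets_covmat_from_systematics([cd])
```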
This is ready for review btw |
siranipour left a comment:
LGTM, certainly happy to merge.
Ok, I will merge; then we can focus on merging the other PR.
Btw, would it be worthwhile making the test use
Well, I think that is testing some higher-level functionality which uses the underlying C++ function that we test here, so I don't think that's necessary. Btw, the way the cd is loaded here means that certain things aren't added, like the sys errors, which are explicitly what the testing datasets check (and must be specified). If I carefully load the exact commondata and settings:

```python
>>> from conftest import base_config
>>> from validphys.api import API
>>> data = API.data(**base_config)
`experiments` has been deprecated, specify data using `dataset_inputs`. Any grouping defined by `experiments` is being ignored.
>>> cds = [ds.commondata for ds in data.datasets]
>>> from validphys.commondataparser import load_commondata
>>> ld_cds = list(map(load_commondata, cds))
>>> from validphys.covmats import datasets_covmat_from_systematics
>>> covmat = datasets_covmat_from_systematics(ld_cds)
>>> from validphys.tableloader import parse_exp_mat
>>> df = parse_exp_mat("regressions/test_expcovmat.csv")
>>> import numpy as np
>>> np.testing.assert_allclose(df.to_numpy(), covmat)
>>> np.allclose(df.to_numpy(), covmat)
True
```

then I think we're good.
To be clear, I mean in the testing, in this specific line: `cds = list(map(l.check_commondata, correlated_datasets))`
Probably, to be safe, we should actually use full dataset settings with cfactors and `sys 10`, to make sure it all gets properly accounted for. Something like the sketch below.
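Assuming `check_dataset` still takes the `cfac` and `sysnum` arguments, and with placeholder values for both:

```python
from validphys.loader import Loader

l = Loader()
# Placeholder settings: the point is just that the test should exercise
# cfactors and a non-default systematics file (e.g. sys 10).
ds = l.check_dataset(
    "ATLASWZRAP36PB",
    theoryid=162,
    cfac=("QCD",),  # hypothetical cfactor choice
    sysnum=10,      # picks up the "sys 10" SYSTYPE file
    cuts=None,
)
cd = ds.commondata
```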
I think that this method is easier to understand. In theory, the split systematics could also be split further into multiplicative and additive systematics, which might be useful for generating the pseudodata. The special experimental systematics are correlated by using the systematic names as columns and concatenating the dataframes; the rest only contribute to block diagonals anyway (see the sketch after the next note).
Note that this implicitly handles the empty cases, since an empty group of systematics just gives an empty dataframe, which contributes nothing to the covariance matrix.
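A minimal sketch of that construction, with made-up numbers and column names rather than the actual validphys tables:

```python
import numpy as np
import pandas as pd

# CORR-type systematics from two datasets: the systematic *name* is the
# column label, so shared names (here "LUMI") line up in the same column
# after concatenation, which is what correlates points across datasets.
corr_ds1 = pd.DataFrame({"LUMI": [0.1, 0.2], "SYS_A": [0.3, 0.1]})
corr_ds2 = pd.DataFrame({"LUMI": [0.4], "SYS_B": [0.2]})
additive = pd.concat([corr_ds1, corr_ds2], ignore_index=True).fillna(0)

# Statistical and UNCORR pieces only ever hit the diagonal.
stat = np.array([0.5, 0.6, 0.7])
uncorr = np.array([0.05, 0.0, 0.1])

A = additive.to_numpy()
covmat = np.diag(stat**2 + uncorr**2) + A @ A.T

# An empty CORR table (no shared systematics) would simply contribute
# nothing to A @ A.T, which is the "empty case" mentioned above.
```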
Since it's a rather large change, I thought a PR might be more appropriate.

Let me know what you think. It probs needs better docs and reformatting, but you get the idea. I considerably shortened the docs on constructing the covmat, because I think having a dump of the old C++ code is a bit much.