
correlated covmats from lists of commondata#970

Merged
wilsonmr merged 4 commits into python-correlated-covmats from mw-python-covmat
Oct 28, 2020
Conversation

@wilsonmr
Contributor

I think that this method is easier to understand. In theory, the split systematics could also be separated into multiplicative and additive components, which might be useful for generating the pseudodata. The special experimental systematics are correlated by using the systematic names as columns and concatenating the dataframes; the rest only contribute to block diagonals anyway.
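The name-aligned concatenation described above can be sketched like this (toy tables and hypothetical systematic names, not the real commondata files):

```python
# Sketch of the approach described above (toy tables, hypothetical
# systematic names): per-dataset systematics only fill block diagonals,
# while the special experiment-wide systematics are aligned by *name*
# across datasets, so shared names produce inter-dataset correlations.
import numpy as np
import pandas as pd

# Rows are data points, columns are systematic names.
corr_sys_a = pd.DataFrame([[1.0], [0.5]], columns=["LUMI"])
corr_sys_b = pd.DataFrame([[0.2, 0.3]], columns=["LUMI", "JES"])

# Concatenating aligns columns by name; a systematic absent from one
# dataset becomes NaN there, which we treat as zero contribution.
special = pd.concat([corr_sys_a, corr_sys_b], ignore_index=True).fillna(0)
cross = special.to_numpy() @ special.to_numpy().T  # fully correlated part

# Block-diagonal part, e.g. squared statistical errors per point.
covmat = np.diag([0.1, 0.1, 0.2]) + cross
```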

Note that this implicitly handles the empty cases, because:

>>> a = np.empty((34, 0))
>>> a
array([], shape=(34, 0), dtype=float64)
>>> a @ a.T
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
>>> (a**2).sum(axis=1)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Since it's a rather large change I thought a PR might be more appropriate.

Let me know what you think. It probably needs better docs and reformatting, but you get the idea. I considerably shortened the docs on constructing the covmat because I think having a dump of the old C++ code is a bit much.

…hout creating fake commondata object of multiple datasets
Comment thread validphys2/src/validphys/covmats.py Outdated
Comment on lines +28 to +39
"""Take the systematics table from a commondata and remove any systematics
which have name "SKIP". Next transform the systematics into the correct units.
Additive (ADD) systematics are left unchanged, whereas multiplicative (MULT)
systematics need to be converted from a percentage into an uncertainty
by multiplying by the central value and dividing by 100.

We now pick datapoint i and loop over all other datapoints j. For each such pair of points,
loop over all the systematics, if the l'th systematic of data point i has name ``SKIP`` we ignore
it. This is handled by scanning over all sytematic errors in the ``sys_errors`` dataframe and dropping
any columns which correspond to a systematic error name of ``SKIP``, thus the ``sys_errors`` dataframe
defined below contains only the systematics without the ``SKIP`` name.
Finally the systematics are split into the different components which
are treated differently when constructing the covariance matrix.

Note that in the switch statement an ADDitive or MULTiplicative systype of systematic i is handled
by either multiplying the additive or multiplicative uncertainties respectively. We instead opt to
create a purely additive systematic table where all elements are to be understood as additive
systematics and systematics of type MULT have been correctly transformed to their additive
values. This dataframe we call ``additive_mat``.
Returns a tuple of the statistical uncertainty and 5 possible groups
of systematics uncertainties: UNCORR, CORR, THEORYUNCORR, THEORYCORR and
inter-dataset systematics
Contributor Author

Like I said, I'll make this proper English later.
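As a sketch of the MULT → additive conversion the docstring describes (toy values and hypothetical column names, not the real parser output):

```python
# Toy sketch of the MULT -> additive conversion (hypothetical column
# names; the real tables come from the commondata parser).
import numpy as np
import pandas as pd

central_values = np.array([100.0, 50.0])
sys_errors = pd.DataFrame({"sys_1": [2.0, 4.0], "sys_2": [1.0, 0.5]})
sys_types = {"sys_1": "MULT", "sys_2": "ADD"}  # percentage vs absolute

additive_mat = sys_errors.copy()
for name, systype in sys_types.items():
    if systype == "MULT":
        # a percentage of the central value -> absolute (additive) error
        additive_mat[name] = additive_mat[name] * central_values / 100
```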

Comment thread validphys2/src/validphys/covmats.py
@wilsonmr
Contributor Author

I also find this method to be quicker than combining the commondata. But then I also find the C++ method to be quicker still, though admittedly I tested both in a Jupyter notebook on a limited number of datasets:

def old_method():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    THEORYID = 162
    dss = [
        l.check_dataset(i, theoryid=THEORYID, cuts=None) for i in correlated_datasets
    ]
    exp = l.check_experiment("null", dss)
    ld_exp = exp.load.__wrapped__(exp)
    ld_exp.get_covmat()

%timeit old_method()
373 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def list_cd():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]

    cds = list(map(l.check_commondata, correlated_datasets))
    ld_cds = list(map(load_commondata, cds))

    covmat = commondatas_covmat_from_systematics(ld_cds)

%timeit list_cd()
391 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def combine_cd():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]

    cds = list(map(l.check_commondata, correlated_datasets))
    ld_cds = list(map(load_commondata, cds))

    combo_cd = combine_commondata(ld_cds)
    covmat = covmat_from_systematics(combo_cd)

%timeit combine_cd()
1.3 s ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@siranipour
Contributor

I considerably shortened the docs on constructing the covmat because I think having a dump of the old C++ code is a bit much.

This was well received at the time!!

To be honest, I'm more of a fan of the idea that these functions should work with CommonData objects rather than lists of commondata.

Comment thread validphys2/src/validphys/tests/test_covmats.py
@Zaharid
Contributor

Zaharid commented Oct 14, 2020

Without having looked at this in detail, I must say that the combination of my prior bias and the +140 −273 diff make me think very favourably of this PR.

@wilsonmr
Contributor Author

This was well received at the time!!

Well, if there is overwhelming support we can put it back. I just don't see what it adds on top of having the Python source code, which I think shows more clearly that we are adding some parts to the diagonal and then taking the matrix product of the other parts.

@wilsonmr
Contributor Author

I can see why you like combining the commondata: you can reuse certain bits of code, and it is quite an alluring prospect. I think the baseline code was really good; I basically just changed the return value of your original covmat function slightly and renamed it, since it no longer returns a covmat.

The pros for this:

  • The code for combining the commondatas felt a little hacky; it was complicated and hard to digest.
  • The code for constructing the covmat for a single commondata is laid out in a way that requires less explaining here, which is why I cut the docs. By breaking the systematics up logically, we don't need to set any elements to zero before performing a matrix multiplication that isn't quite ``matrix @ matrix.T``.
  • We don't create big block-diagonal sys tables, which I think would scale poorly with n_data and just feel a bit unnecessary.
  • The code for constructing the covmat for multiple commondatas is also very simple, relates back to the single-commondata case, and is easy to explain (I think it is more obvious which systematics contribute to inter-dataset correlations).
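The avoided zero-padding can be sketched with toy arrays: uncorrelated systematics only ever hit the diagonal, so summing their squares reproduces what a zero-padded block-diagonal table would give, without materialising it:

```python
# Toy sketch: uncorrelated systematics only contribute to the diagonal,
# so summing their squares replaces a zero-padded block-diagonal table.
import numpy as np

rng = np.random.default_rng(1)
n_data = 4
uncorr = rng.normal(size=(n_data, 2))  # uncorrelated systematics
corr = rng.normal(size=(n_data, 3))    # correlated systematics

# split construction: diagonal piece + dense correlated piece
covmat = np.diag((uncorr**2).sum(axis=1)) + corr @ corr.T

# the same result via the zero-padded table the text wants to avoid
padded = np.zeros((n_data, n_data * 2))
for i in range(n_data):
    padded[i, 2 * i : 2 * i + 2] = uncorr[i]
reference = padded @ padded.T + corr @ corr.T
```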

The con is that we would implicitly have to generate data for both a single commondata and a list of commondata. Actually, I don't think this is strictly true: in general we only need the latter, because we typically want to generate data for multiple commondata, and the latter can still do the job of the former if we pass a list containing a single dataset. Incidentally, that is more or less what happened in libnnpdf, where MakeReplica is a method of Experiment, except that ours would take a simple object (a list) of commondata as an argument.

As a final comment, I mentioned before that

split systematics could also split into multiplicative and additive systematics

which could potentially simplify #866, and later on could even be used to speed up the generation of pseudodata with numpy.random: we wouldn't have to construct the additive covmat and then take its square root again, but could instead generate deviates directly from the additive systematics, which I'm pretty sure must be faster (anyway, that's a problem for the future).
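A minimal sketch of that idea, with assumed toy shapes rather than the actual validphys API: if the additive systematics table is ``A``, then ``cov = A @ A.T``, and one standard-normal draw per systematic already has the right covariance:

```python
# Sketch (assumed toy shapes, not the actual validphys API): a
# standard-normal draw per systematic, pushed through the additive
# systematics table A, yields a deviate with covariance A @ A.T --
# no sqrt/Cholesky of the covmat needed.
import numpy as np

rng = np.random.default_rng(0)
n_data, n_sys = 5, 3
A = rng.normal(size=(n_data, n_sys))  # additive systematics table

deviates = A @ rng.normal(size=n_sys)  # one pseudodata shift

# Sanity check: the empirical covariance of many draws approaches A @ A.T.
samples = A @ rng.normal(size=(n_sys, 200_000))
emp_cov = np.cov(samples)
```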

@wilsonmr wilsonmr changed the title Proposal for constructing covmat of multiple commondatas in simpler manner correlated covmats from lists of commondata Oct 22, 2020
@wilsonmr
Contributor Author

This is ready for review btw

Contributor

@siranipour siranipour left a comment


LGTM, certainly happy to merge.

@wilsonmr
Contributor Author

OK, I will merge; then we can focus on merging the other PR.

@siranipour
Contributor

Btw, would it be worthwhile making the test use test_expcovmat.csv from the regressions? When I tried to do this, I got a mismatch between the saved covmat and the one we generate.

@wilsonmr
Contributor Author

Well, I think that is testing some higher-level functionality which uses the underlying C++ function we test here, so I don't think that's necessary. Btw, the way the cd is loaded here means that certain things, like the sys errors, aren't added, and those are explicitly what the testing datasets check (and must be specified). If I carefully load the exact commondata and settings:

>>> from conftest import base_config
>>> from validphys.api import API
>>> data = API.data(**base_config)
`experiments` has been deprecated, specify data using `dataset_inputs`. Any grouping defined by `experiments` is being ignored.
>>> cds = [ds.commondata for ds in data.datasets]
>>> from validphys.commondataparser import load_commondata
>>> ld_cds = list(map(load_commondata, cds))
>>> from validphys.covmats import datasets_covmat_from_systematics
>>> covmat = datasets_covmat_from_systematics(ld_cds)
>>> from validphys.tableloader import parse_exp_mat
>>> df = parse_exp_mat("regressions/test_expcovmat.csv")
>>> import numpy as np
>>> np.testing.assert_allclose(df.to_numpy(), covmat)
>>> np.allclose(df.to_numpy(), covmat)
True

then I think we're good

@wilsonmr
Contributor Author

Btw the way the cd is loaded here means that certain things aren't added like the sys errors

To be clear, I mean in the testing, in this specific line:

cds = list(map(l.check_commondata, correlated_datasets))

@wilsonmr
Contributor Author

Probably, to be safe, we should actually use full dataset settings with cfactors and 'sys 10' to make sure it all gets properly accounted for.

@wilsonmr wilsonmr merged commit 4a8d3a3 into python-correlated-covmats Oct 28, 2020
@wilsonmr wilsonmr deleted the mw-python-covmat branch October 28, 2020 12:01
