correlated covmats from lists of commondata #970
Conversation
…hout creating fake commondata object of multiple datasets
| """Take the systematics table from a commondata and remove any systematics | ||
| which have name "SKIP". Next transform the systematics into the correct units. | ||
| Additive (ADD) systematics are left unchanged, whereas multiplicative (MULT) | ||
| systematics need to be converted from a percentage into an uncertainty | ||
| by multiplying by the central value and dividing by 100. | ||
|
|
||
| We now pick datapoint i and loop over all other datapoints j. For each such pair of points, | ||
| loop over all the systematics, if the l'th systematic of data point i has name ``SKIP`` we ignore | ||
| it. This is handled by scanning over all sytematic errors in the ``sys_errors`` dataframe and dropping | ||
| any columns which correspond to a systematic error name of ``SKIP``, thus the ``sys_errors`` dataframe | ||
| defined below contains only the systematics without the ``SKIP`` name. | ||
| Finally the systematics are split into the different components which | ||
| are treated differently when constructing the covariance matrix. | ||
|
|
||
| Note that in the switch statement an ADDitive or MULTiplicative systype of systematic i is handled | ||
| by either multiplying the additive or multiplicative uncertainties respectively. We instead opt to | ||
| create a purely additive systematic table where all elements are to be understood as additive | ||
| systematics and systematics of type MULT have been correctly transformed to their additive | ||
| values. This dataframe we call ``additive_mat``. | ||
| Returns a tuple of the statistical uncertainty and 5 possible groups | ||
| of systematics uncertainties: UNCORR, CORR, THEORYUNCORR, THEORYCORR and | ||
| inter-dataset systematics |
like I said, I'll make this proper English later
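For reference, a minimal sketch of the transformation the docstring describes, i.e. dropping ``SKIP`` systematics and converting MULT (percentage) entries into additive uncertainties. The table layout, column names and helper dicts here are hypothetical, not the actual `sys_errors` structure:

```python
import pandas as pd

# Hypothetical layout: one column per systematic, one row per datapoint.
central_values = pd.Series([10.0, 20.0])
sys_errors = pd.DataFrame(
    {
        "sys_1": [0.5, 0.3],  # ADD: already an absolute uncertainty
        "sys_2": [1.0, 1.0],  # name SKIP: to be dropped entirely
        "sys_3": [2.0, 1.5],  # MULT: a percentage of the central value
    }
)
treatments = {"sys_1": "ADD", "sys_2": "ADD", "sys_3": "MULT"}
names = {"sys_1": "CORR", "sys_2": "SKIP", "sys_3": "CORR"}

# Drop any systematic whose *name* is SKIP.
sys_errors = sys_errors[[c for c in sys_errors.columns if names[c] != "SKIP"]]

# Convert the MULT (percentage) columns into additive uncertainties,
# leaving a purely additive table analogous to ``additive_mat``.
additive_mat = sys_errors.copy()
for col in additive_mat.columns:
    if treatments[col] == "MULT":
        additive_mat[col] = additive_mat[col] * central_values / 100
```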
|
I also find this method to be quicker. But then I also find the C++ method to be quicker, admittedly testing both in a jupyter notebook on a limited number of datasets:

```python
# Assuming the usual validphys imports; commondatas_covmat_from_systematics,
# covmat_from_systematics and combine_commondata come from this branch.
from validphys.loader import Loader
from validphys.commondataparser import load_commondata

def old_method():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    THEORYID = 162
    dss = [
        l.check_dataset(i, theoryid=THEORYID, cuts=None) for i in correlated_datasets
    ]
    exp = l.check_experiment("null", dss)
    ld_exp = exp.load.__wrapped__(exp)
    ld_exp.get_covmat()

%timeit old_method()
373 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
def list_cd():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    cds = list(map(l.check_commondata, correlated_datasets))
    ld_cds = list(map(load_commondata, cds))
    covmat = commondatas_covmat_from_systematics(ld_cds)

%timeit list_cd()
391 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
def combine_cd():
    l = Loader()
    correlated_datasets = [
        "ATLASWZRAP36PB",
        "ATLASZHIGHMASS49FB",
        "ATLASLOMASSDY11EXT",
        "ATLASWZRAP11",
    ]
    cds = list(map(l.check_commondata, correlated_datasets))
    ld_cds = list(map(load_commondata, cds))
    combo_cd = combine_commondata(ld_cds)
    covmat = covmat_from_systematics(combo_cd)

%timeit combine_cd()
1.3 s ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
This was well received at the time!! To be honest, I'm more of a fan of the idea that these functions should work with
Without having looked at this in detail, I must say that the combination of my prior bias and the +140 −273 diff makes me think very favourably of this PR.
Well, if there is overwhelming support we can put it back. I just don't see what it adds on top of us having the Python source code, which I think shows more clearly that some parts are added to the diagonal and then the matrix product of the other parts is taken.
I can see why you like combining the commondata, because then you can reuse certain bits of code, and it is quite an alluring prospect. I think the baseline code was really good, and I basically just slightly changed the return of your original covmat function and renamed it, because it no longer returns a covmat. The pros for this:

The cons are that we will kind of implicitly have to generate data for both a single commondata and a list of commondata. Actually, I don't think this is strictly true: in general we only need the latter, because we generally want to generate data for multiple commondata, and the latter can still do the job of the former by making a list of a single dataset anyway; btw, that's sort of what happened in

As a final comment, I mentioned before

which could potentially simplify #866 and, later on, could even be used to speed up the generation of pseudodata when we used
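On that note, the single-commondata case really is just the one-element-list case. A quick sketch, assuming the loader calls from the snippets above and the covmat function from this branch:

```python
from validphys.loader import Loader
from validphys.commondataparser import load_commondata
# Assumed to live in validphys.covmats on this branch; the function name
# changed during the PR, so adjust to whichever variant was merged.
from validphys.covmats import datasets_covmat_from_systematics

l = Loader()
cd = load_commondata(l.check_commondata("ATLASWZRAP36PB"))
# A single dataset is handled by passing a one-element list.
covmat = datasets_covmat_from_systematics([cd])
```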
This is ready for review btw |
siranipour left a comment:
LGTM, certainly happy to merge.
Ok, I will merge; then we can focus on merging the other PR.
Btw, would it be worthwhile making the test use
Well, I think that is testing some higher-level functionality which uses the underlying C++ function that we test here, so I don't think that's necessary. Btw, the way the cd is loaded here means that certain things aren't added, like the sys errors, which are explicitly what the testing datasets check (and must be specified). If I carefully load the exact commondata and settings:

```python
>>> from conftest import base_config
>>> from validphys.api import API
>>> data = API.data(**base_config)
`experiments` has been deprecated, specify data using `dataset_inputs`. Any grouping defined by `experiments` is being ignored.
>>> cds = [ds.commondata for ds in data.datasets]
>>> from validphys.commondataparser import load_commondata
>>> ld_cds = list(map(load_commondata, cds))
>>> from validphys.covmats import datasets_covmat_from_systematics
>>> covmat = datasets_covmat_from_systematics(ld_cds)
>>> from validphys.tableloader import parse_exp_mat
>>> df = parse_exp_mat("regressions/test_expcovmat.csv")
>>> import numpy as np
>>> np.testing.assert_allclose(df.to_numpy(), covmat)
>>> np.allclose(df.to_numpy(), covmat)
True
```

then I think we're good.
To be clear, I mean in the testing, in this specific line: `cds = list(map(l.check_commondata, correlated_datasets))`
Probably, to be safe, we should actually use full dataset settings with cfactors and `sys 10`, to make sure it all gets properly accounted for. Something like the sketch below.
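Assuming `check_dataset` still takes the `cfac` and `sysnum` arguments, and with placeholder values for both:

```python
from validphys.loader import Loader

l = Loader()
# Placeholder settings: the point is just that the test should exercise
# cfactors and a non-default systematics file (e.g. sys 10).
ds = l.check_dataset(
    "ATLASWZRAP36PB",
    theoryid=162,
    cfac=("QCD",),  # hypothetical cfactor choice
    sysnum=10,      # picks up the "sys 10" SYSTYPE file
    cuts=None,
)
cd = ds.commondata
```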
I think that this method is easier to understand. In theory, the split systematics could also be split further into multiplicative and additive systematics, which might be useful for generating the pseudodata. The special experimental systematics are correlated by using the systematic names as columns and concatenating the dataframes; the rest only contribute to block diagonals anyway (see the sketch after the next note).
Note that this implicitly handles the empty cases, since an empty group of systematics just gives an empty dataframe, which contributes nothing to the covariance matrix.
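A minimal sketch of that construction, with made-up numbers and column names rather than the actual validphys tables:

```python
import numpy as np
import pandas as pd

# CORR-type systematics from two datasets: the systematic *name* is the
# column label, so shared names (here "LUMI") line up in the same column
# after concatenation, which is what correlates points across datasets.
corr_ds1 = pd.DataFrame({"LUMI": [0.1, 0.2], "SYS_A": [0.3, 0.1]})
corr_ds2 = pd.DataFrame({"LUMI": [0.4], "SYS_B": [0.2]})
additive = pd.concat([corr_ds1, corr_ds2], ignore_index=True).fillna(0)

# Statistical and UNCORR pieces only ever hit the diagonal.
stat = np.array([0.5, 0.6, 0.7])
uncorr = np.array([0.05, 0.0, 0.1])

A = additive.to_numpy()
covmat = np.diag(stat**2 + uncorr**2) + A @ A.T

# An empty CORR table (no shared systematics) would simply contribute
# nothing to A @ A.T, which is the "empty case" mentioned above.
```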
Since it's a rather large change, I thought a PR might be more appropriate.

Let me know what you think. It probs needs better docs and reformatting, but you get the idea. I considerably shortened the docs on constructing the covmat, because I think having a dump of the old C++ code is a bit much.