Correlated covmats in python #955
Conversation
And accounting for correlations
|
I need to finish docstrings and tests
|
This would be the test we would like to run eventually, but it melts my computer:

Full test

```python
import numpy as np
from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader

l = Loader()
cds = []
dss = []
for ds in l.available_datasets:
    try:
        cds.append(l.check_commondata(ds))
        dss.append(l.check_dataset(ds, theoryid=53, cuts=None))
        print(ds)
    except:
        print("failed")
        continue

commondata_list = list(map(load_commondata, cds))
cd = combine_commondata(commondata_list)
new_covmat = covmat_from_systematics(cd)

exp = l.check_experiment("FOO", dss)
ld = exp.load()
old_covmat = ld.get_covmat()

assert np.allclose(old_covmat, new_covmat)
```

For now this test seems to work nicely:

Quick test

```python
import random

import numpy as np
from validphys.covmats import covmat_from_systematics
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.loader import Loader

l = Loader()
cds = []
dss = []
avails = l.available_datasets
samples = random.sample(avails, min(20, len(avails)))
for ds in samples:
    try:
        cd_spec = l.check_commondata(ds)
        ds_spec = l.check_dataset(ds, theoryid=53, cuts=None)
        cds.append(cd_spec)
        dss.append(ds_spec)
        print(ds)
    except:
        print("failed")
        continue

commondata_list = list(map(load_commondata, cds))
cd = combine_commondata(commondata_list)
new_covmat = covmat_from_systematics(cd)

exp = l.check_experiment("FOO", dss)
ld = exp.load()
old_covmat = ld.get_covmat()

try:
    assert np.allclose(old_covmat, new_covmat)
except AssertionError:
    from IPython import embed; embed()
```
Force-pushed d1e8c81 to c69a9aa
|
I'm not sure I'd randomly choose the datasets: as far as I can tell this could put together two datasets which are the same, except one has had a fix applied or something. I think we should systematically choose some datasets which cover all bases; the current testing datasets already cover some different configurations. I wouldn't try to do it for too many datasets at a time because the c++ code won't cope, and likely neither will python if you are constructing a dataframe with all the systematics.
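For what it's worth, a minimal sketch of what a fixed selection could look like in the quick test above, in place of the `random.sample` call; the dataset names here are placeholders and would need to be chosen to cover the different systematics configurations:

```python
from validphys.loader import Loader

# Hypothetical hand-picked selection; the real list should be chosen to cover
# the different configurations (ADD/MULT, CORR/UNCORR, cross-dataset systematics).
REPRESENTATIVE_DATASETS = ["NMC", "ATLASWZRAP36PB", "CDFZRAP"]

l = Loader()
cds = []
dss = []
for ds in REPRESENTATIVE_DATASETS:
    cds.append(l.check_commondata(ds))
    dss.append(l.check_dataset(ds, theoryid=53, cuts=None))
```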
|
Ah, this was just to make sure it was doing the global covmat correctly, since it's the function I'm least sure about. I'll come up with a quicker CI test tomorrow.
If the number of kins are not equal then raise ValueError
Force-pushed 1f8ad7d to 4fad3e2
|
Would say this is ready for review
|
Also, I just tried a time comparison using 19 random (but containing some correlations) datasets: python: 107.28161001205444 s. Edit: if I use 40 datasets the improvement is even more marked: python: 364.6011164188385 s.
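For reference, a sketch of how such a wall-clock comparison might be made, assuming `commondata_list` and `exp` have been built as in the test snippets above (this is not necessarily how the numbers quoted here were obtained):

```python
import time

from validphys.commondataparser import combine_commondata
from validphys.covmats import covmat_from_systematics

# Python construction: combine the commondata and build the covmat directly.
start = time.perf_counter()
new_covmat = covmat_from_systematics(combine_commondata(commondata_list))
print(f"python: {time.perf_counter() - start} s")

# C++ construction: covmat built by libnnpdf when the experiment is loaded.
start = time.perf_counter()
old_covmat = exp.load().get_covmat()
print(f"c++: {time.perf_counter() - start} s")
```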
|
Good! To be expected because the way the covmat is constructed in c++ is quite inefficient.
Force-pushed 4fad3e2 to ddac5ef
|
Isn't it easier to have covmats working with lists of commondata than introducing some "combined commondata" object that is not quite the same?
|
We could do, but I had always envisaged these functions operating on
…hout creating fake commondata object of multiple datasets
…reshuffling of functions
correlated covmats from lists of commondata
Force-pushed 4e6697b to fadc917
Making tests use these fixtures
Force-pushed fadc917 to 1b67c30
|
Ok doke, I've implemented the fixtures; I think this is ready for reviewing/merging?
|
I have to look at the code in detail, but the changeset feels right when you scroll down.
Changing `conftest` datasets to use the data keyword
Co-authored-by: wilsonmr <33907451+wilsonmr@users.noreply.github.com>
Zaharid left a comment
I only have relatively minor comments (except perhaps the one related to performance, but I will survive).
I did not go over the math, but the tests look sensible to me so I am going to trust those.
|
|
```python
def split_uncertainties(commondata):
    """Take the statistical uncertainty and systematics table from
    a :py:class:`validphys.coredata.CommonData` object
```
|
Could you please add this as a documented parameter? The reason is that I don't know how to read and didn't see it on the first pass (or add a type annotation; it would have some value here).
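For concreteness, a sketch of what the requested documentation might look like (the numpydoc-style section is an assumption about the convention in use; the return description is left out here):

```python
def split_uncertainties(commondata):
    """Take the statistical uncertainty and systematics table from
    a :py:class:`validphys.coredata.CommonData` object.

    Parameters
    ----------
    commondata : validphys.coredata.CommonData
        The loaded commondata whose statistical and systematic
        uncertainties are to be split up.
    """
```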
|
If I run something like

```
In [22]: cd = l.check_commondata("NMC")

In [23]: lcd = load_commondata(cd)

In [24]: %timeit split_uncertainties(lcd)
293 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

which is a bit much (or maybe I just need a new laptop). Profiling this a bit and seeing e.g. if some kind of groupby operation (possibly supplemented by indexing the original coredata dataframe by the systype) would be more efficient than the repeated use of things like
```python
thunc = abs_sys_errors[:, sys_name == "THEORYUNCORR"]
unc = abs_sys_errors[:, sys_name == "UNCORR"]
thcorr = abs_sys_errors[:, sys_name == "THEORYCORR"]
corr = abs_sys_errors[:, sys_name == "CORR"]
```

would definitely be in the nice-to-have category.
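As a rough illustration of the suggestion (a toy example, not the actual validphys data structures), grouping the columns by systematic name gives all the blocks in a single pass instead of one boolean mask per name:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the absolute systematics table, with columns labelled by sys name.
sys_name = ["CORR", "UNCORR", "THEORYCORR", "UNCORR"]
abs_sys_errors_df = pd.DataFrame(np.random.rand(5, 4), columns=sys_name)

# One groupby over the columns instead of repeated boolean indexing.
blocks = {
    name: block.to_numpy()
    for name, block in abs_sys_errors_df.groupby(abs_sys_errors_df.columns, axis=1)
}
corr = blocks.get("CORR", np.empty((5, 0)))
unc = blocks.get("UNCORR", np.empty((5, 0)))
```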
|
Perhaps we could use an `isin`; let me take a look. BTW, what did you use to profile the function?
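A sketch of what the `isin` idea could look like (again a toy frame; `INTRA_DATASET_SYS_NAME` is taken to contain the four names used above, and "ATLASLUMI" is just a made-up cross-dataset systematic):

```python
import numpy as np
import pandas as pd

INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR")

df = pd.DataFrame(np.random.rand(5, 3), columns=["CORR", "UNCORR", "ATLASLUMI"])

# Single mask selecting all intra-dataset systematics at once; missing names
# simply select nothing, so there is no KeyError to worry about.
special_mask = df.columns.isin(INTRA_DATASET_SYS_NAME)
intra = df.loc[:, special_mask]
inter = df.loc[:, ~special_mask]
```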
|
Just the %timeit magic of IPython.
|
Hmm, the awkward thing with this indexing is dealing with the cases when the key isn't present. The groupby seems promising, i.e.

```python
isspecial = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
split_dict = {
    group: df for group, df in abs_sys_errors_df.groupby(by=isspecial, axis=1)
}
```

But perhaps the covmat can be constructed directly from the groupby rather than having this split uncertainty function?
I also wonder if this is a little slow:

```python
abs_sys_errors_df = sys_errors_df.apply(
    lambda x: [
        i.add if i.sys_type == "ADD" else (i.mult * j / 100)
        for i, j in zip(x, commondata.central_values)
    ]
)
```

Although I don't see much alternative with the current way we store the errors. I do half wonder if storing the uncertainties as two dataframes (one for mult, one for add) with sysnames as the column index, where the mult dataframe would just hold the percentages (as raw numbers) and the additive one the absolute values, would make slightly more sense in the context of what we do with the systematics?
The current method of storage is quite fancy, but the raw data which we want to access ends up being nested quite far in.
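To make the proposed storage concrete, here is a small mock-up (my reading of the suggestion, not the actual `CommonData` layout): two plain float dataframes, `mult` holding raw percentages and `add` holding absolute values, both with the sysname as column index, so the conversion becomes a single vectorised operation:

```python
import numpy as np
import pandas as pd

central_values = np.array([10.0, 12.0, 9.0])

# MULT uncertainties stored as raw percentages, ADD as absolute values.
mult = pd.DataFrame([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]], columns=["CORR", "UNCORR"])
add = pd.DataFrame([[0.1], [0.2], [0.15]], columns=["THEORYCORR"])

# No applymap over per-element objects: one multiplication converts everything.
converted_mult = mult.multiply(central_values, axis=0) / 100
abs_sys = pd.concat([converted_mult, add], axis=1)
```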
|
Ah apologies, the posts crossed. Ok, I guess there is no way to avoid this? So we could speed up the df.apply then, since it seems to be the bottleneck
|
So an alternative to the df.apply is the following, but it doesn't give all that big a saving:

```python
sys_type = commondata.systype_table["type"]
mult_errors = sys_type == "MULT"
# mult uncertainties are stored as percentages, hence the division by 100
mult_sys_errors = sys_errors_df.loc[:, mult_errors].applymap(lambda x: x.mult)
converted_mult_sys_errors = mult_sys_errors.multiply(commondata.central_values, axis=0) / 100
abs_sys_errors_df = sys_errors_df.applymap(lambda x: x.add)
abs_sys_errors_df.loc[:, mult_errors] = converted_mult_sys_errors
```

Let me know what you think.

Edit: in fact, if anything, it's slower.
|
Hmm, well, returning to my previous comment: I wonder if a slight refactoring of how the systematics are stored would benefit us here. I mean, AFAIK `applymap` is essentially `for column in columns: map(func, column)`, so it's not super far away from a nested for loop, which I think we want to avoid in python. When I think about the systematics file, it's unclear to me why we want every element to be an object, because all elements in the same column have the same systematic name and type. When we load the data we have to manually convert it into this format only to have to unpack it like this, which seems a little suboptimal.
In the end there is this historic doubling of information in the commondata files; however, changing that would involve changing buildmaster, which I don't think anybody wants to do. With that in mind, I think that just storing the multiplicative columns for MULT uncertainties and the additive (absolute) uncertainties for the ADD uncertainties as dataframes, with the sysname as column index, would mean we could then do something like:
```python
mat = np.diag(commondata.stat_errors.to_numpy())
is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
converted_mult = commondata.mult * central_values[:, np.newaxis] / 100
for abs_sys_df in (converted_mult, commondata.add):
    for sys_name, sys_table in abs_sys_df.groupby(by=is_special, axis=1):
        if sys_name == "UNCORR":
            mat += np.diag((sys_table.values ** 2).sum(axis=1))
        ...
```

I think this would be quite a bit faster than what we have, because we convert the mult uncertainties in a vectorised way and we pick out the different cases using groupby. BTW, we don't even have to drop SKIP here, we just don't add it to the mat. In some sense this is more similar to the c++ code, but we avoid explicit nested loops (and let numpy handle it).
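To check my understanding of the proposal, here is a self-contained toy version with a guess at what the elided branches would do (correlated blocks contributing via an outer product, SKIP contributing nothing); this is only my reading of the sketch above, not the final implementation:

```python
import numpy as np
import pandas as pd

INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR", "SKIP")

def toy_covmat(stat_errors, mult, add, central_values):
    """Covmat from percentage (mult) and absolute (add) systematics dataframes."""
    is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"
    # Statistical uncertainties enter squared on the diagonal.
    mat = np.diag(stat_errors ** 2)
    converted_mult = mult.multiply(central_values, axis=0) / 100
    for abs_sys_df in (converted_mult, add):
        for sys_name, sys_table in abs_sys_df.groupby(by=is_special, axis=1):
            block = sys_table.to_numpy()
            if sys_name in ("UNCORR", "THEORYUNCORR"):
                mat += np.diag((block ** 2).sum(axis=1))
            elif sys_name in ("CORR", "THEORYCORR", "SPECIAL"):
                mat += block @ block.T
            # sys_name == "SKIP": contributes nothing
    return mat

# Tiny usage example with made-up numbers.
stat = np.array([0.5, 0.4, 0.6])
cv = np.array([10.0, 12.0, 9.0])
mult = pd.DataFrame([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]], columns=["CORR", "UNCORR"])
add = pd.DataFrame([[0.1], [0.2], [0.15]], columns=["THEORYCORR"])
cov = toy_covmat(stat, mult, add, cv)
```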
|
I should note that INTRA_DATASET_SYS_NAME would need to include SKIP in that example, or else those entries would erroneously be treated as experimental uncertainties.
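In other words (a small illustration of that remark; the extra systematic name is made up):

```python
INTRA_DATASET_SYS_NAME = ("UNCORR", "CORR", "THEORYUNCORR", "THEORYCORR", "SKIP")

is_special = lambda x: x if x in INTRA_DATASET_SYS_NAME else "SPECIAL"

assert is_special("SKIP") == "SKIP"          # explicitly recognised, then ignored
assert is_special("ATLASLUMI") == "SPECIAL"  # genuine cross-dataset systematic
# Without "SKIP" in the tuple it would map to "SPECIAL" and wrongly enter the covmat.
```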
|
Another note: the sys_errors property of coredata.CommonData calls the function which does a nested for loop to construct the complicated format, so I wonder whether it's really the indexing, or whether it's the sys_errors taking ages to construct.
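One way to check that hypothesis, assuming the property is recomputed on each access and that `split_uncertainties` lives where it is imported from below (both assumptions on my part):

```python
import timeit

from validphys.commondataparser import load_commondata
from validphys.covmats import split_uncertainties  # assumed import location
from validphys.loader import Loader

lcd = load_commondata(Loader().check_commondata("NMC"))

# If most of the ~300 ms is spent building sys_errors, the two numbers will be close.
t_property = timeit.timeit(lambda: lcd.sys_errors, number=10) / 10
t_split = timeit.timeit(lambda: split_uncertainties(lcd), number=10) / 10
print(f"sys_errors: {t_property:.3f} s, split_uncertainties: {t_split:.3f} s")
```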
Updating docstring to reflect this
I've added a function that takes in a list of `CommonData` objects and combines them into one 'effective' `CommonData` where the (possible) correlations between the datasets are correctly accounted for. So you can now do something like:
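A minimal sketch of the intended usage, following the test snippets above (the dataset names are placeholders):

```python
from validphys.commondataparser import combine_commondata, load_commondata
from validphys.covmats import covmat_from_systematics
from validphys.loader import Loader

l = Loader()
commondata_list = [
    load_commondata(l.check_commondata(name))
    for name in ("ATLASWZRAP36PB", "ATLASZHIGHMASS49FB")  # placeholder dataset names
]
combined = combine_commondata(commondata_list)
# Covariance matrix with cross-dataset correlations correctly accounted for.
covmat = covmat_from_systematics(combined)
```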
This will in principle remedy #866 (comment) too, though I've not tested it yet.
The relevant piece of C++ code is here: nnpdf/libnnpdf/src/experiments.cc, line 437 in f038c95.