Correlated covmats in python #955
Merged
Changes from all commits (16 commits):

- `653791c` Combining list of commondatas
- `a1412f4` Removing toolz dependency
- `c69a9aa` Resetting labels for systematics after correlating
- `2c78e7b` Raise exception
- `1221d93` Adding docstrings
- `ddac5ef` Adding CI tests
- `3b44fc8` propose way of constructing covmat of list of commondata objects, wit…
- `98fccc1` (wilsonmr) Update names and docstrings of covmat construction functions
- `b601377` (wilsonmr) Add back the docstring from shayan editing to make sense with slight …
- `d8fb043` (wilsonmr) undo erroneous removal of blank lines
- `4a8d3a3` (wilsonmr) use list of commondata
- `589074c` (wilsonmr) Removing unused imports
- `1b67c30` Adding new fixtures for cut data and correlated data
- `e58f443` Apply suggestions from code review
- `780bcaf` (siranipour) Adding Parameters header to split_uncertainties function
- `1c45843` Making return type namedtuple
Conversations
Could you please add this as a documented parameter? The reason is that I don't know how to read and didn't see it on the first pass (also, a type annotation would have some value here).
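For illustration, a numpydoc-style `Parameters` entry is one way to address this; the signature and types below are assumed for the sketch, not taken from the PR:

```python
def split_uncertainties(commondata):
    """Split the uncertainties of a commondata object by treatment.

    Parameters
    ----------
    commondata : validphys.coredata.CommonData
        Loaded commondata whose systematic errors are to be split
        into additive and multiplicative contributions.

    Returns
    -------
    namedtuple
        The additive and multiplicative uncertainty tables.
    """
```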
If I run something like …, it is a bit much (or maybe I just need a new laptop). Profiling this a bit, and seeing e.g. if some kind of group-by operation (possibly supplemented by indexing the original coredata dataframe by the systype) would be more efficient than the repeated use of things like …, would definitely be in the nice-to-have category.
Perhaps we could use an `isin`. Let me take a look. BTW, what did you use to profile the function?
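A minimal sketch of what the `isin` selection could look like; the table and column names here are toys, not the real coredata layout:

```python
import pandas as pd

# Toy systematics table standing in for the real coredata structure.
sys_table = pd.DataFrame(
    {"name": ["CORR", "UNCORR", "SPECIAL", "THEORYCORR"],
     "value": [1.0, 2.0, 3.0, 4.0]}
)

intra_dataset = {"CORR", "UNCORR", "THEORYCORR"}

# One vectorised call builds the boolean mask; names absent from the
# set simply come out False, so no KeyError handling is needed.
mask = sys_table["name"].isin(intra_dataset)
intra = sys_table[mask]
inter = sys_table[~mask]
```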
Just the `%timeit` magic of IPython.
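For reference, the script-level equivalent of the `%timeit` magic is the stdlib `timeit` module; the timed function below is a placeholder, not the real covmat code:

```python
import timeit

def build_covmat():
    # Placeholder workload standing in for the covmat construction.
    return sum(i * i for i in range(1000))

# repeat=3, number=100 mimics %timeit's best-of-N style reporting.
times = timeit.repeat(build_covmat, number=100, repeat=3)
best_per_call = min(times) / 100  # seconds per call, best of 3
```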
Hmm, the awkward thing with this indexing is dealing with the cases where the key isn't present. The groupby seems promising, i.e. …

But perhaps the covmat can be constructed directly from the groupby, rather than having this split-uncertainties function?

I also wonder if this is a little slow: …

Although I don't see much alternative with the current way we store the errors. I do half wonder if storing the uncertainties as two dataframes (one for `MULT`, one for `ADD`), with sysnames as the column index, where the `MULT` dataframe holds the percentages (as raw numbers) and the `ADD` one the absolute values, would make slightly more sense in the context of what we do with the systematics.
The current method of storage is quite fancy, but the raw data which we want to access ends up being nested quite far in …
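A minimal sketch of the groupby-based split, showing that an absent key is not a problem (toy table with invented column names):

```python
import pandas as pd

sys_errors = pd.DataFrame(
    {"treatment": ["ADD", "MULT", "ADD"], "error": [0.1, 5.0, 0.2]}
)

# Unlike direct indexing, groupby never raises on a missing key:
# a treatment that doesn't occur simply yields no group.
groups = {name: df for name, df in sys_errors.groupby("treatment")}
empty = sys_errors.iloc[0:0]
add_errors = groups.get("ADD", empty)
mult_errors = groups.get("MULT", empty)
skip_errors = groups.get("SKIP", empty)  # absent -> empty frame, no KeyError
```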
Ah, apologies, the posts crossed. OK, I guess there is no way to avoid this? So we could speed up the `df.apply` then, since it seems to be the bottleneck.
So an alternative to the `df.apply` is the following, but it doesn't give all that big a saving: … Let me know what you think.

Edit: in fact, if anything, it's slower…
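The actual snippets are elided above, but the general shape of replacing a row-wise `df.apply` with a single broadcast multiply (here for converting `MULT` percentages to absolute values, on invented toy data) would be roughly:

```python
import numpy as np
import pandas as pd

central = pd.Series([100.0, 200.0, 300.0])  # toy central values
# Percentage (MULT) uncertainties, one column per systematic.
mult_pct = pd.DataFrame({"SYS1": [1.0, 2.0, 3.0], "SYS2": [0.5, 0.5, 0.5]})

# Row-wise apply: calls a Python function once per row.
abs_apply = mult_pct.apply(lambda row: row * central[row.name] / 100, axis=1)

# Vectorised: one broadcast multiply over the whole table.
abs_vec = mult_pct.mul(central / 100, axis=0)
```

Whether this actually wins depends on the table shape, which is presumably why the timing above was disappointing.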
Hmm, well, returning to my previous comment: I wonder if a slight refactoring of how the systematics are stored would benefit us here. AFAIK `applymap` is a `for column in columns: map(func, column)`, so it's not far from a nested for loop, which I think we want to avoid in Python. When I think about the systematics file, it's unclear to me why we want every element to be an object, because all elements in the same column have the same systematic name and type. When we load the data we have to manually convert it into this format, only to have to unpack it like this, which seems a little suboptimal.

In the end there is this historic doubling of information in the commondata files; however, changing that would involve changing buildmaster, which I don't think anybody wants to do. With that in mind, I think that storing the multiplicative columns for `MULT` uncertainties and the additive (or absolute) uncertainties for the `ADD` uncertainties as dataframes, with the sysname as the column index, would mean we could then do something like: …

I think that would be quite a bit faster than what we have, because we convert the `MULT` uncertainties in a vectorised way and pick out the different cases using groupby. BTW, we don't even have to drop `SKIP` here; we just don't add it to the matrix. In some sense this is more similar to the C++ code, but we avoid explicit nested loops (and let numpy handle it).
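Under the proposed storage (one dataframe of absolute uncertainties with sysname as the column index), the construction being gestured at might look roughly like the following; the names and treatment rules here are simplified toys, not the real validphys semantics:

```python
import numpy as np
import pandas as pd

# Toy sketch: absolute additive uncertainties, one column per sysname.
abs_unc = pd.DataFrame({
    "UNCORR": [0.1, 0.2, 0.3],  # uncorrelated point to point
    "CORR":   [1.0, 1.0, 1.0],  # fully correlated across points
    "SKIP":   [9.0, 9.0, 9.0],  # must not enter the covmat
})

covmat = np.zeros((len(abs_unc),) * 2)
for name, col in abs_unc.items():
    s = col.to_numpy()
    if name == "SKIP":
        continue                  # simply don't add it to the matrix
    elif name == "UNCORR":
        covmat += np.diag(s**2)   # uncorrelated: diagonal contribution
    else:
        covmat += np.outer(s, s)  # correlated: rank-one block
```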
I should note that `INTRA_DATASET_SYS_NAME` would need to include `SKIP` in that example, or else those systematics would erroneously be treated as experimental uncertainties.
Another note: the `sys_errors` property of `coredata.CommonData` is calling the function which does a nested for loop to construct the complicated format, so I wonder if it's really the indexing, or the `sys_errors` taking ages to construct.
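One way to settle that would be to profile the call with the stdlib `cProfile` rather than `%timeit`, so the time is broken down per function. The two functions below are stand-ins for the real `sys_errors` construction and the indexing step:

```python
import cProfile
import io
import pstats

def sys_errors():
    # Stand-in for the nested-loop construction of the error table.
    return [[i * j for j in range(100)] for i in range(100)]

def index_errors(table):
    # Stand-in for the subsequent indexing of that table.
    return [row[0] for row in table]

profiler = cProfile.Profile()
profiler.enable()
table = sys_errors()
index_errors(table)
profiler.disable()

# Dump the per-function breakdown, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

If `sys_errors` dominates the cumulative column, the indexing is off the hook.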