Optimise constructing ordered metadata (merge path) #587
rhattersley merged 1 commit into SciTools:master
Requires #586 to be merged; then I'll rebase so that the tests pass.
A future piece of work that could come out of this PR is a refactor of the datatypes used for these variables (some are classes in the _merge module, while others, which now duplicate functionality, are methods on the cube object).
lib/iris/_merge.py
I like that you've used a private API as opposed to just poking around in the internal state variables of the Cube. But if we're going to do this it'd be nice if it was via a public API.
That said, a public API providing coords-and-dims would duplicate the functionality already provided by Cube.coord/Cube.coords and Cube.coord_dims.
@esc24 - have you got any thoughts on this?
My objections are the same as yours: I see a lot of code duplication and the use of a private method. If you want to use this information from here, it needs to be public. If you make your method/property public, there would actually be two ways to get to the same information, which is not ideal but not a deal breaker. However, a need has to be demonstrated. I currently see no need for additional methods on the cube because, in my opinion,
coords_and_dims = [(coord, cube.coord_dims(coord)) for coord in cube.dim_coords]
is simple and readable. I can use the same pattern for cube.dim_coords, cube.aux_coords, cube.derived_coords or cube.coords() constrained in any way I like. Should we have a method for each of these? I do not think so. List comprehensions and generator expressions are clear and concise. A multitude of difficult-to-name methods on an already cluttered class is neither clear, concise, nor easy to maintain. 👎
One further point. If user code makes use of the dimension, it raises alarm bells. Clearly, when performing low-level manipulation of the cube we need to inspect and perhaps modify this mapping (I do so in a helper function that restores coordinates that have been collapsed/sliced to scalars). Adding more ways to get to this information raises similar-sounding alarm bells.
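As a rough illustration of the list-comprehension pattern described in this comment, here is a minimal sketch. `ToyCoord` and `ToyCube` are hypothetical stand-ins written for this example, not the Iris classes; only the names `dim_coords`, `aux_coords` and `coord_dims` are taken from the discussion.

```python
class ToyCoord:
    """Hypothetical stand-in for a coordinate (not the Iris Coord class)."""

    def __init__(self, name):
        self.name = name


class ToyCube:
    """Hypothetical stand-in for a cube, storing (coord, dims) pairs."""

    def __init__(self, dim_coords_and_dims, aux_coords_and_dims):
        self._dim_coords_and_dims = list(dim_coords_and_dims)
        self._aux_coords_and_dims = list(aux_coords_and_dims)

    @property
    def dim_coords(self):
        return [coord for coord, _ in self._dim_coords_and_dims]

    @property
    def aux_coords(self):
        return [coord for coord, _ in self._aux_coords_and_dims]

    def coord_dims(self, coord):
        # Linear search over every stored (coord, dims) pair.
        for candidate, dims in (self._dim_coords_and_dims +
                                self._aux_coords_and_dims):
            if candidate is coord:
                return dims
        raise KeyError(coord.name)


time = ToyCoord("time")
latitude = ToyCoord("latitude")
height = ToyCoord("height")
cube = ToyCube([(time, (0,)), (latitude, (1,))], [(height, ())])

# The same comprehension works for any selection of coordinates.
dim_pairs = [(coord, cube.coord_dims(coord)) for coord in cube.dim_coords]
aux_pairs = [(coord, cube.coord_dims(coord)) for coord in cube.aux_coords]
```

The point of the pattern is that no extra method is needed per coordinate category; the caller chooses the selection and the comprehension does the pairing.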
@esc24 - can you describe what's behind the "alarm bells"?
I haven't thought this through yet, but what if we did something similar to Python's "unbound method" vs. "bound method" and added a "bound Coord"? It could behave just like a Coord, except it also has a "dimension"/"dimensions" attribute. Then you'd only need a single method on the Cube: Cube.coord(), which would return a "bound Coord". Cube.coord_dims() could be retired, and there'd be no need for a Cube._coord_and_dims().
😱
Not such a crazy idea. I seem to remember the idea was floated when we implemented the shift to CF. I think @bblay may have suggested it (it may have been me 😄). I think it's worth exploring. Bear in mind that unbound methods have been removed from Python 3 - there may be a lesson there too.
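For readers trying to picture the "bound Coord" idea, one possible shape is sketched below. This is purely illustrative: `BoundCoord`, `ToyCoord` and the `dims` attribute are hypothetical names invented for this sketch, and nothing like this is claimed to exist in Iris.

```python
class ToyCoord:
    """Hypothetical stand-in for a coordinate (not the Iris Coord class)."""

    def __init__(self, standard_name, points):
        self.standard_name = standard_name
        self.points = points


class BoundCoord:
    """Hypothetical wrapper: acts like the wrapped coord, plus `dims`."""

    def __init__(self, coord, dims):
        self._coord = coord
        self.dims = tuple(dims)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so `dims` and
        # `_coord` are found directly and everything else is delegated.
        if name == "_coord":
            # Guard against infinite recursion if `_coord` is not yet set.
            raise AttributeError(name)
        return getattr(self._coord, name)


bound = BoundCoord(ToyCoord("latitude", [10.0, 20.0]), dims=(1,))
name = bound.standard_name   # delegated to the wrapped coordinate
dims = bound.dims            # extra information carried by the wrapper
```

Note that the wrapper carries the dimension mapping but no reference to any cube, so the mapping is only meaningful to code that already knows which cube the coordinate came from.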
I was looking at turning the existing cube attributes into dictionaries:
self._dim_coords_and_dims = []
self._aux_coords_and_dims = []
self._aux_factories = []
into something like:
self._dim_coords_and_dims = {coord_defn: (coord, dim), ...}
This would have a similar capability without architectural change, having a coord-dim association by using dictionary key lookup.
What do you think? My idea, however, does not make an architectural change, and it does feel like an architectural change may improve code clarity (at least in the medium term, to address how coordinate-dimension associations are stored and called).
@rhattersley, I like your idea as it addresses this architectural change, how about everyone else?
I got confused at first, since this is not strictly a case of bound and unbound methods in the Python sense: the coordinate as a method on the cube is bound to that instance irrespective of whether a coordinate exists away from the cube.
Should it be part of this PR? I had noticed this; however, it seemed to touch a greater scope of the code than this PR intended to change. I'm happy to make changes as part of this PR, though, if you do consider it to be within scope.
Sorry - I missed your original comment. If we decide we want to use a public interface then I think the de-duplication is definitely in scope. Otherwise I'm not sure. So let's wait and see what happens with the interface first.
Sorry about all the picky style stuff! I'm just wondering if you _really_ need to add more functions to the Cube object...
Alarm bells might be too strong. Perhaps I'm still scarred from coding
I think the issue of handling the coordinate-dimension association is clearly a blocker here and, from looking at the code in general, this problem appears widespread. I propose we create a separate issue, as this is perhaps not directly within the scope of this ticket.
I was discussing this with @cpelley, maintaining that this is not the same as having the coords_and_dims dictionary he was proposing. But I may not have understood properly... I must say that if that is right, then it seems a dodgy idea to me: although a BoundCoord would "know" which dimensions it maps to, it can't know which cube it belongs to -- so the 'dims' information is no use to the Coord itself, as it only has meaning in relation to the whole cube. Can I make a bid here to spin this thing out into a separate issue? I think it would be far more manageable if the code can be recast to avoid needing changes to the Cube API/implementation (and then reconsider if that won't work for some reason). Does that work, @cpelley?
^ Rebased and simplified.
More thoughts on this... I think the point about 'bound' methods is that --at least apparently-- they contain a reference to the "thing they are part of" (i.e. the instance or the class). I think this is also a bit like the problem we had ages ago about whether a coordinate's name should be a part of it, or just a tag it is "known by in the cube".
lib/iris/_merge.py
Can you not use the public API here now?
I'd expected we would now just have coords_and_dims = [(coord, cube.coord_dims(coord)) for coord in cube.coords()] here.
Or does that make it noticeably slower?
Sadly, moving away from using cube.coord_dims() is the main target of this optimisation PR. It's a lot slower than getting the coordinate and its dimension from the Cube at the same time.
Phew! @pp-mo has managed to shoot down my silly idea before it got too far! Thank you! Given we already have a uniqueness constraint on a Cube's coordinates, @cpelley's suggestion to recast
Apologies, I've now discussed this again with @cpelley and my understanding was wrong: I somehow thought this code was making multiple coord_dims calls per coordinate, but that is not the case. It is just that calling coord_dims on each coord is slow, as it has to search for the specific coordinate every time. So we either need to get this data all at once (as in the latest code here, which uses the private data), or make the coord_dims calls faster (as in @cpelley's recent suggestion to "recast Cube._dim_coords_and_dims and friends"). As this change is not essential to fixing the efficiency problem, I would suggest we could still extend the public Cube API instead (e.g. have a "coords_and_dims" call returning the info that is wanted here).
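The cost difference described above can be sketched with a toy model that simply counts equality comparisons. Everything here is an assumption made for illustration: `CountingCoord`, the module-level `coord_dims` function and the list of 50 pairs are hypothetical and do not reproduce the Iris implementation.

```python
class CountingCoord:
    """Hypothetical coordinate that counts how often it is compared."""

    comparisons = 0

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        CountingCoord.comparisons += 1
        return isinstance(other, CountingCoord) and self.name == other.name


def coord_dims(pairs, coord):
    """Search the (coord, dims) pairs from the start for every query."""
    for candidate, dims in pairs:
        if candidate == coord:
            return dims
    raise KeyError(coord.name)


pairs = [(CountingCoord("coord_%d" % i), (i,)) for i in range(50)]

# One lookup per coordinate: each call searches the list again.
CountingCoord.comparisons = 0
per_coord = [(coord, coord_dims(pairs, coord)) for coord, _ in pairs]
per_coord_comparisons = CountingCoord.comparisons

# Taking every pair in a single pass needs no comparisons at all.
CountingCoord.comparisons = 0
all_at_once = list(pairs)
all_at_once_comparisons = CountingCoord.comparisons
```

In this toy model the per-coordinate approach performs 1 + 2 + ... + 50 comparisons, so the work grows quadratically with the number of coordinates, while reading the pairs in one pass grows linearly.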
Ah, the coordinate definition is a LimitedAttributeDict instance, which is not hashable. I'm taking a step back and using the coordinate as the key, with the corresponding dimension as the value. I had been using a dictionary of named tuples.
lib/iris/_merge.py
Misleading name, coord_dim. Suggest coord_and_dim.
Warning: As things stand, you can't use a Coord instance as a key in a dictionary!
:( Oh dear, I've been going round in circles. I left my idea earlier because the coordinate _as_defn() is not hashable.
From the Python docs:
So the real problem is that everything that one might use to define the identity of a Coord is mutable (i.e. standard_name, long_name, var_name, units, attributes, coord_system). And this is the case whether you refer to the Coord directly, or use a CoordDefn. I guess we can't use a dictionary after all.
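The difficulty with mutable identity can be seen in a small sketch. `CoordLike` and `identity_key` are hypothetical names for this example; the only behaviour relied on is standard Python: a class that defines `__eq__` without `__hash__` is unhashable, and a dictionary keyed on a snapshot of mutable fields goes stale when those fields change. Whether this is exactly the mechanism at work in Iris is an assumption.

```python
class CoordLike:
    """Hypothetical coordinate whose identity comes from mutable fields."""

    def __init__(self, standard_name, units):
        self.standard_name = standard_name
        self.units = units

    def __eq__(self, other):
        return (self.standard_name, self.units) == \
            (other.standard_name, other.units)
    # No __hash__ defined: Python sets __hash__ to None for this class.


def identity_key(coord):
    """Snapshot the identifying fields as an immutable, hashable tuple."""
    return (coord.standard_name, coord.units)


# Using the object itself as a dictionary key fails outright.
try:
    {CoordLike("air_temperature", "K"): (0,)}
    unhashable = False
except TypeError:
    unhashable = True

# Keying on a snapshot works until the coordinate is modified.
coord = CoordLike("air_temperature", "K")
dims_by_key = {identity_key(coord): (0,)}
coord.units = "celsius"
still_found = identity_key(coord) in dims_by_key
```

After the `units` change the snapshot key no longer matches, so the dictionary entry is orphaned; that is the kind of silent inconsistency a dictionary-based mapping would have to guard against.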
:( Aww, it would have been so neat too. I was looking at the possibility of a two-way dictionary (dict subclass) for even more efficient lookup on the dim-coord association and everything.
In case a final implementation results in a change to coord_dims, I'll reference #609 to keep it in mind.
I would like to summarise the paths taken so far so that reviewers don't get lost in this:
option1: cube._dim_coord_dim and cube._aux_coord_dim private methods, which return a list of coordinate-dim pairs (used wherever required) using the cube._aux_coords_and_dims and cube._dim_coords_and_dims attributes. Private API changes were not favoured.
@rhattersley, can I ask whether you have any idea whether we can produce a hashable result from the coordinate efficiently? Or has this idea died a death? Thanks.
After discussion with @rhattersley, the simplest possible solution, with no architectural change, appears to work:
lib/iris/cube.py
This comment is now out of date. And the new code could do with a comment explaining why it exists: i.e. it's just an optimisation and makes no functional difference.
Most of the benefit with none of the pain. Nice one @cpelley 😀
Thanks @rhattersley, got there in the end, hehe.
This commit speeds up the comparison of a supplied coordinate with those on the cube by first checking whether it is the same object, and only comparing by metadata if the object is not found. This particularly targets CubeList merging.
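The identity-first lookup described in this commit message might look roughly like the sketch below. It is not the code from the PR: `ToyCoord`, its `metadata()` method and the standalone `coord_dims` function are hypothetical, and only the two-step strategy (same object first, metadata comparison as the fallback) comes from the description above.

```python
class ToyCoord:
    """Hypothetical coordinate with comparable metadata."""

    def __init__(self, name, units):
        self.name = name
        self.units = units

    def metadata(self):
        return (self.name, self.units)


def coord_dims(coords_and_dims, coord):
    """Return the dims for `coord`, trying the cheap identity test first."""
    # Fast path: the caller passed the very object stored on the cube.
    for candidate, dims in coords_and_dims:
        if candidate is coord:
            return dims
    # Fallback: an equivalent but distinct object, matched by metadata.
    target = coord.metadata()
    for candidate, dims in coords_and_dims:
        if candidate.metadata() == target:
            return dims
    raise KeyError(coord.name)


time = ToyCoord("time", "hours")
latitude = ToyCoord("latitude", "degrees")
stored = [(time, (0,)), (latitude, (1,))]

same_object = coord_dims(stored, latitude)
equal_copy = coord_dims(stored, ToyCoord("latitude", "degrees"))
```

Because the fallback keeps the original metadata comparison, the result is the same either way; the identity check only short-circuits the common case where the coordinate object came from the cube itself.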
On my machine this lops 10s off the time taken to run the test suite. Down from 110s to 100s. 😀 I'm happy to merge now. 👍 Last call for any objections! 😉
Time's up! I'm merging.
Optimise Cube.coord_dims()



Duplicated coord-dimension association lookups were found; removing them makes the
merge path about 10% faster overall.