Skip to content

Optimise constructing ordered metadata (merge path)#587

Merged
rhattersley merged 1 commit intoSciTools:masterfrom
cpelley:merge_path
Jul 11, 2013
Merged

Optimise constructing ordered metadata (merge path)#587
rhattersley merged 1 commit intoSciTools:masterfrom
cpelley:merge_path

Conversation

@cpelley
Copy link

@cpelley cpelley commented Jul 1, 2013

Duplicate coord-dimension association get found, resulting in 10% faster overall for
the merge path after removal.

@cpelley
Copy link
Author

cpelley commented Jul 1, 2013

The following diagrams are provided merely to indicate where the code is spending most of its time. A sample based technique was used.
figure_1
The above diagram indicates two significant (indicated by their thickens) branches. This thickness indicates number of occurrences in the sampling. One branch corresponds to the registering of the cube with its appropriate protocube. The other branch corresponds to the re-ordering of the protocube coordinates. You will notice how the cube register is of much greater thickness. This corresponds to a 67% runtime of the merge process, while the re-ordering corresponds to a 33% of the runtime.

plot

figure_1
The cube registering with its appropriate protocube consists of two major parts, one is a complex set of combined sorting operations (15% of total merge process run-time) and constructing ordered metadata then corresponding to 33%.

Over 10 samples:

Before:
min: 6.506858s max: 6.728443

After:
min: 5.900889 max: 6.085443

The merge process is now ~10% faster overall.

@cpelley
Copy link
Author

cpelley commented Jul 1, 2013

Requires #586 to be merged then I'll rebase for tests to pass

@cpelley
Copy link
Author

cpelley commented Jul 1, 2013

A future piece of work that could come out of this PR is with a refactor of variables used datatypes (some are classes in the _merge module while other which now duplicate functionality belong as methods to the cube object).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you've used a private API as opposed to just poking around in the internal state variables of the Cube. But if we're going to do this it'd be nice if it was via a public API.

That said, a public API providing coords-and-dims would duplicate the functionality already provided by Cube.coord/Cube.coords and Cube.coord_dims.

@esc24 - have you got any thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My objections are the same as yours, I see a lot of code duplication and the use of a private method. If you want to use this information from here it needs to be public. If you make your method/property public there would actually be two ways to get to the same information which is not ideal but not a deal breaker. However, a need has to be demonstrated. I currently see no need for additional methods on the cube because in my opinion

coords_and_dims = [(coord, cube.coord_dims(coord)) for coord in cube.dim_coords]

is simple and readable. I can use the same pattern for cube.dim_coords, cube.aux_coords, cube.derived_coords or cube.coords() constrained in anyway I like. Should we have a method for each of these? I do not think so. List comprehensions and generator expressions are clear and concise. A multitude of difficult to name methods on an already cluttered class is neither clear, concise or easy to maintain. 👎

One further point. If usercode makes use of the dimension, it raises alarm bells. Clearly when performing low level manipulation of the cube we need to inspect and perhaps modify this mapping (I do so in a helper function that restores coordinates that have been collapsed/sliced to scalars). Adding more ways to get to this information raises similar sounding alarm bells.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esc24 - can you describe what's behind the "alarm bells"?

I haven't thought this through yet, but what if we did something similar to Python's "unbound method" vs. "bound method" and added a "bound Coord"? It could behave just like a Coord, except it also has a "dimension"/"dimensions" attribute.

Then you'd only need a single method on the Cube: Cube.coord() which would return a "bound Coord". Cube.coord_dims() could be retired, and there'd be no need for a Cube._coord_and_dims().

BoundCoord class diagram
</crazy idea>

😱

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not such a crazy idea. I seem to remember the idea was floated when we implemented the shift to CF. I think @bblay may have suggested it (it may have been me 😄). I think it's worth exploring. Bear in mind that unbound methods have been removed from python3 - there may be a lesson there too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at turning the existing cube attributes into dictionaries:
self._dim_coords_and_dims = []
self._aux_coords_and_dims = []
self._aux_factories = []

into something like:
self._dim_coords_and_dims = {coord_defn: tuple(coord, dim), ..., ..., }

This would have a similar capability without architectural change, having a coord-dim association by using dictionary key lookup.
what do you think? My idea however does not make an architectural change and it does feel like an architectural change may improve code clarity (at least in med. term to address how coordinate-dimensions are associated and called).

@rhattersley, I like your idea as it addresses this architectural change, how about everyone else?
I got confused at first since this is not strictly a case of bound and unbound methods in the python sense, since the coordinate as a method on the cube is bound to that instance irrespective of whether a coordinate exists away from the cube.

@cpelley
Copy link
Author

cpelley commented Jul 2, 2013

@cpelley: A future piece of work that could come out of this PR is with a refactor of variables used datatypes (some are classes in the _merge module while other which now duplicate functionality belong as methods to the cube object).

@rhattersley: The _CoordAndDims namedtuple has exactly the same structure as the pairings currently obtained from cube._dim_coord_dim() + cube._aux_coord_dim(). Once we've resolved the Cube API issues these two structures should be unified.

should it be part of this PR? I had noticed this, however, it seemed to touch a greater scope of the code that this PR intended to change. I'm happy to make changes though as part of this PR if you do consider it to be within scope.

@rhattersley
Copy link
Member

should it be part of this PR? I had noticed this, however, it seemed to touch a greater scope of the code that this PR intended to change. I'm happy to make changes though as part of this PR if you do consider it to be within scope.

Sorry - I missed your original comment. If we decide we want to use a public interface then I think the de-duplication is definitely in scope. Otherwise I'm not sure. So let's wait and see what happens with the interface first.

@pp-mo
Copy link
Member

pp-mo commented Jul 2, 2013

Sorry about all the picky style stuff !!
The function is nice, and obviously represents a significant performance boost.

I'm just wondering if you _really_ need to add more functions to the Cube object...
It won't provide much benefit to anything but the merge code, as far as I can see.
If you could avoid using private Cube properties (_xxx_coord_and_dims), and instead work through the standard Cube API, then you could code the _dim_coord_dim and _aux_coord_dim lists construction within the merge module itself. And then you could also make the list elements into _CoordAndDims instances, so you can replace coord[0] and coord[1] with coord_info.coord and coord_info.dims. Which would be a lot easier on the eye + brain !
But that might slow execution again ...

@ghost ghost assigned bblay Jul 8, 2013
@esc24
Copy link
Member

esc24 commented Jul 9, 2013

can you describe what's behind the "alarm bells"?

Alarm bells might be too strong. Perhaps I'm still scarred from coding iris.iterate.zip 😉, but experience tells me usercode that uses dimensions rather than the coordinates is less robust, reliable and reusable. It may work today with your 2d lat-lon cube, but look at a 3d cube tomorrow that has a time coordinate and the code breaks. Similarly, it works with one dataset because lat and lon are on dimensions 0 and 1, but it falls over (or fails silently) with another dataset has them the other way round. It can't always be avoided, but I'd encourage people to think in terms of coordinates and let our implementation take care of this mapping.

@cpelley
Copy link
Author

cpelley commented Jul 9, 2013

I think the issue with handling coordinate-dimension association is obviously a blocker here and from looking at the code in general, this problem appears widespread, I propose we create a separate issue as this is perhaps not directly within scope of this ticket.
related: #583

@pp-mo
Copy link
Member

pp-mo commented Jul 9, 2013

@rhattersley
I haven't thought this through yet, but what if we did something similar to Python's "unbound method" vs. "bound method" and added a "bound Coord"? It could behave just like a Coord, except it also has a "dimension"/"dimensions" attribute.
Then you'd only need a single method on the Cube: Cube.coord() which would return a "bound Coord". Cube.coord_dims() could be retired, and there'd be no need for a Cube._coord_and_dims().
</crazy idea>

I was discussing this with @cpelley, maintaining that this is not the same as having the coords_and_dims dictionary he was proposing. But I may not have understood properly...
My understanding is that the coordinates added in a cube will be BoundCoords, which contain their dims instead of it being stored in a separate cube "mapping" property
-- is that what you mean @rhattersley ?

I must say that if that is right, then it seems a dodgy idea to me : Although a BoundCoord would "know" which dimensions it maps to, it can't know which cube it belongs to -- so the 'dims' information is no use to the Coord itself, as it only has meaning in relation to the whole cube.
So to me this looks like a convenience kludge -- the information really doesn't "belong" here, it belongs at the level of the cube, where it has a context that gives it some actual meaning.
( Alternatively, if a BoundCoord does contain a reference to its 'parent' cube, this could make some sense, but that will create awkward circular references. )

Can I make a bid here to spin this thing out into a separate issue? I think it would be far more manageable if the code can be recast to avoid needing changes to the Cube api/implementation. (and then reconsider if that won't work for some reason). Does that work @cpelley ?

@cpelley
Copy link
Author

cpelley commented Jul 9, 2013

^rebased and simplified.
Note: I have accessed the cubes private attributes in light with the above discussion.
Ready for review

@pp-mo
Copy link
Member

pp-mo commented Jul 9, 2013

@esc24 Not such a crazy idea ... Bear in mind that unbound methods have been removed from python3 - there may be a lesson there too.

More thoughts on this...
Unbound methods do still exist in Python3, of course, but they are now just plain functions, whereas in Python2 there is a fetch hook that makes them appear as an 'unbound method' wrapper -- which is there just to check the type of the context argument ("self" / "cls").
See http://mail.python.org/pipermail/python-dev/2005-January/050625.html

I think the point about 'bound' methods is that --at least apparently-- they contain a reference to the "thing they are part of" (i.e. the instance or the class).
As said above, doing that with BoundCoords would create a two-way linkage structure (parent knows its children, children know their parent), with attendant consistency and circular reference problems.

I think this is also a bit like the problem we had ages ago about whether a coordinate's name should be a part of it, or just a tag it is "known by in the cube".
For a while we had both -- i.e. the coord named 'x' in the cube had its own name property, which might be something different. E.G. stuff like cube.coords['longitude'].name --> 'grid_x'. That seemed to be unmanageably confusing.
I think in that case we ended up deciding that the name was part of the definition of a coordinate. Following that, it then made much sense to do away with the 'other name'.
The moral I would take from this is that if a Coordinate exists as an independent entity, then 'parent cube' and 'mapped dimensions' are not really sensible properties of it. Which I guess is an argument for the status quo..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you not use the public API here now?
I'd expected we would now just have coords_and_dims = [(coord, cube.coord_dims(coord)) for coord in cube.coords()] here.

Or does that make it noticeably slower ??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadl, moving away from using cube.coord_dims() is the main target of this optimisation PR. It's a lot slower than getting the coordinate and its dimension from the Cube at the same time.

@rhattersley
Copy link
Member

Although a BoundCoord would "know" which dimensions it maps to, it can't know which cube it belongs to -- so the 'dims' information is no use to the Coord itself, as it only has meaning in relation to the whole cube.
So to me this looks like a convenience kludge -- the information really doesn't "belong" here, it belongs at the level of the cube, where it has a context that gives it some actual meaning.

phew! @pp-mo has managed to shoot down my silly idea before it got too far! Thank you!

Given we already have a uniqueness constraint on a Cube's coordinates, @cpelley's suggestion to recast Cube._dim_coords_and_dims and friends as dictionaries seems the most interesting option right now. It has the potential to leave the public API unchanged and just boost its speed.

@pp-mo
Copy link
Member

pp-mo commented Jul 10, 2013

@ppeglar Can you not use the public API here now?

@rhattersley moving away from using cube.coord_dims() is the main target of this optimisation PR

Apologies, I've now discussed this again with @cpelley and my understanding was wrong : I somehow thought this code was making multiple coord_dims calls per coordinate, but that is not the case : It is just that calling coord_dims on each coord is slow as it has to search for the specific coordinate every time.

So we do either need to get this data all at once (as latest code here, which uses the private data), or make the coord_dim calls faster (as in @cpelley's recent suggestion to "recast Cube._dim_coords_and_dims and friends" ).

As this change is not essential to fixing the efficiency problem, I would suggest we could still extend the public Cube api instead (e.g. have a "coords_and_dims" call returning the info that is wanted her).
However, as @cpelley pointed out, @esc24 was already arguing against the earlier scheme to extend the API.
Reminder... (as now buried in outdated diff comments above) :

@esc24 if you make your method/property public there would actually be two ways to get to the same information which is not ideal but not a deal breaker. However, a need has to be demonstrated. I currently see no need for additional methods on the cube because in my opinion
coords_and_dims = [(coord, cube.coord_dims(coord)) for coord in cube.dim_coords]
is simple and readable. I can use the same pattern for cube.dim_coords, cube.aux_coords, cube.derived_coords or cube.coords() constrained in anyway I like. Should we have a method for each of these? I do not think so. List comprehensions and generator expressions are clear and concise. A multitude of difficult to name methods on an already cluttered class is neither clear, concise or easy to maintain.

@cpelley
Copy link
Author

cpelley commented Jul 10, 2013

ah the coordinate definition is a LimitedAttributeDict instance, which is not hashable. I'm taking a step back, using the coordinate as key and the corresponding value a dimension. Had been using a dictionary of named tuples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading name, coord_dim. Suggest coord_and_dim.

@rhattersley
Copy link
Member

ah the coordinate definition is a LimitedAttributeDict instance, which is not hashable. I'm taking a step back, using the coordinate as key and the corresponding value a dimension. Had been using a dictionary of named tuples.

Warning: As things stand, you can't use a Coord instance as a key in a dictionary!

>>> c = DimCoord(1, 'height')
>>> d = c.copy()
>>> c == d
True
>>> hash(c) == hash(d)
False

@cpelley
Copy link
Author

cpelley commented Jul 10, 2013

:( oh dear Iv been going round in circles. Ieft my idea earlier because the coordinate _as_defn() is not hashable
now looking back at this, the coordinate system is where the trouble lies since it has a dict in its eq special method. In order to make the coordinate system hashable I would have to turn this dictionary into a sorted tuple. There is likely a massive overhead associated with these changes though.

@rhattersley
Copy link
Member

From the Python docs:

hashable collection implementations require that a object’s hash value is immutable

So the real problem is that everything that one might use to define the identify of a Coord is mutable. (i.e. standard_name, long_name, var_name, units, attributes, coord_system.) And this is the case whether you refer to the Coord directly, or use a CoordDefn. I guess we can't use a dictionary after all.

@cpelley
Copy link
Author

cpelley commented Jul 10, 2013

:( aww it would have been so neat too. I was looking at the possibility of a two-way dictionary (dict subclass) for even more efficient lookup on the dim-coord assoc and everything.

@cpelley
Copy link
Author

cpelley commented Jul 10, 2013

in case a final implementation results in a changing of coord_dims, I'll reference #609 to keep it in mind

@cpelley
Copy link
Author

cpelley commented Jul 10, 2013

I would like to summarise the paths taken so far so reviewer(s) dont get lost in this!:

option1: cube._dim_coord_dim and cube._aux_coord_dim private methods which return a list of coordinate-dim pairs (used wherever required) - using the cube._aux_coords_and_dims and cube._dim_coords_and_dims attributes. - private API changes not favoured
option2: cache this locally in the function in question (ProtoCube._extract_coord_payload). - maybe? but does reference private attributes which is not favourable for maintenance
option3: BoundClass associating coordinate with dimension on cube. - coord-dim-cube parent-child two-way association not favoured
option4: Dictionary lookup for storing this information, resulting in minimum API changes and much more widespread and efficient coordinate-dim association (cube._dim_coords_and_dims = {coord_defn: tuple(coord, dim), ..., ..., }). - not possible as requires coordinate to be immutable

@rhattersley can I ask whether you have any idea whether we can produce a hashable result from the coordinate efficiently? or has this idea died to death? thanks

@cpelley
Copy link
Author

cpelley commented Jul 11, 2013

After discussion with @rhattersley the simplest possible solution with no architectural change! appears to work:
BEFORE - min 0:00:06.512437, max 0:00:06.749497
AFTER - min 0:00:05.942012, max 0:00:06.156421
~8.8% gain within merge

lib/iris/cube.py Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is now out of date. And the new code could do with a comment explaining why it exists: i.e. it's just an optimisation and makes no functional difference.

@rhattersley
Copy link
Member

~8.8% gain within merge

Most the benefit with none of the pain. Nice one @cpelley 😀

@cpelley
Copy link
Author

cpelley commented Jul 11, 2013

thanks @rhattersley got there in the end hehe

This commit offers to speedup the comparison of supplied coordinate with
that on the cube by way of checking if its the same object, then
comparing by way of metadata if object not found.  This in particular
was targetting CubeList merging.
@ghost ghost assigned rhattersley Jul 11, 2013
@rhattersley
Copy link
Member

On my machine this lops 10s off the time taken to run the test suite. Down from 110s to 100s. 😀

I'm happy to merge now. 👍 Last call for any objections! 😉

@rhattersley
Copy link
Member

Time's up! I'm merging.

rhattersley added a commit that referenced this pull request Jul 11, 2013
@rhattersley rhattersley merged commit 54745c2 into SciTools:master Jul 11, 2013
@cpelley cpelley deleted the merge_path branch July 15, 2013 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants