Add Parquet geospatial statistics utility#8414
Conversation
kylebarron
left a comment
There was a problem hiding this comment.
It would be great to have more widespread docstrings, but otherwise this looks really good so far
| arrow-schema = { workspace = true } | ||
| geo-traits = { version = "0.3" } | ||
| wkb = { version = "0.9" } | ||
|
|
||
| [dev-dependencies] | ||
| wkt = { version = "0.14" } |
There was a problem hiding this comment.
Do other crates have dependencies defined here? Or should we lift all dependencies up to the top-level Cargo.toml, even if they're only used by a single crate?
There was a problem hiding this comment.
Other crates seem to 🤷
Happy to put these anywhere!
There was a problem hiding this comment.
I think it is fine to put here, and if we end up with multiple versions in the workspace we can consolidate
alamb
left a comment
There was a problem hiding this comment.
Thank you @paleolimbot !
I had some small structural comments, but I don't really have enough experience with geometry to be able to detect logic errors or major mismatches in representation. I would (really) love to defer to @kylebarron for that level review
But from my perspective it would be fine to merge this PR and iterate as follow on PRs as well.
| arrow-schema = { workspace = true } | ||
| geo-traits = { version = "0.3" } | ||
| wkb = { version = "0.9" } | ||
|
|
||
| [dev-dependencies] | ||
| wkt = { version = "0.14" } |
There was a problem hiding this comment.
I think it is fine to put here, and if we end up with multiple versions in the workspace we can consolidate
|
|
||
| use crate::interval::{Interval, IntervalTrait, WraparoundInterval}; | ||
|
|
||
| /// Geometry bounder |
There was a problem hiding this comment.
This comment implies to me that this method will be needed in the parquet writer/encoder itself, to accumulate the appropriate statistics.
I think that is fine, but it will take a bit of plumbing to sort out, as we won't have anything similar for Variant (Variant doesn't define any special statistics, so the writer will just treat it as a normal struct / binary array),
There was a problem hiding this comment.
Yes, we will! (Perhaps we can wire that in at runtime as well)
| /// Parquet statistics with minimal modification. | ||
| #[derive(Debug)] | ||
| pub struct GeometryBounder { | ||
| x_left: Interval, |
There was a problem hiding this comment.
For someone familiar with geometry the meanings of x_left, x_mid and x_right are probably clear, but it might help someone who is not, like myself, if we could add a few more comments here
| Ok(()) | ||
| } | ||
|
|
||
| fn geometry_type(geom: &impl GeometryTrait<T = f64>) -> Result<i32, ArrowError> { |
There was a problem hiding this comment.
Can you add some reference about where these values came from? It seems maybe they are similar to the int32 values in #8225
It also seems strange to me that we are using i32 when we started with strongly typed enums (why not use enums again?)
There was a problem hiding this comment.
Or maybe this should be a method on GeometryTrait 🤔
There was a problem hiding this comment.
Probably the Wkb object could provide this. We could also compute it from the input bytes, but then we'd have to validate it which takes about as much code as this.
I don't really mind using enums but I don't think it helps us (#8225 can't depend on this because I think the idea is that this crate would be optional).
There was a problem hiding this comment.
It would be nice to have an enum here. I get that it could make inter-crate workings more complex, so we can revisit it in the future
a2a4473 to
a03090b
Compare
| /// (which adds some complexity to this implementation). | ||
| #[derive(Debug)] | ||
| pub struct GeometryBounder { | ||
| /// Union of all contiguous x intervals to the left of the wraparound midpoint |
|
@kylebarron any further comments or concerns about merging this PR? |
kylebarron
left a comment
There was a problem hiding this comment.
Thanks for all the docstrings! I haven't implemented this part of the spec before, so I also learned new things.
| out | ||
| } | ||
|
|
||
| /// Update this bounder with one WKB-encoded geometry |
There was a problem hiding this comment.
Question for the future: will it ever make sense to be able to update the bounder before values have been encoded to WKB? It makes sense that it's simplest at this step of the process to intercept the WKB, though it's a small amount of overhead to have to parse the unaligned floats here, right? Maybe that amount of overhead isn't worth optimizing
There was a problem hiding this comment.
I'm not sure! We can expose the geo-traits version as public if it comes up (or allow a caller to calculate the full GeoStatistics themselves).
| Ok(()) | ||
| } | ||
|
|
||
| fn geometry_type(geom: &impl GeometryTrait<T = f64>) -> Result<i32, ArrowError> { |
There was a problem hiding this comment.
It would be nice to have an enum here. I get that it could make inter-crate workings more complex, so we can revisit it in the future
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
|
Thanks @paleolimbot and @kylebarron 🚀 |
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Rationale for this change
The presence of relevant statistics when writing geometries and/or geographies is one of the primary motivations behind the GEOMETRY and/or GEOGRAPHY in Parquet. We'd like to make it easy for writers to provide them!
What changes are included in this PR?
This PR introduces
IntervalandWraparoundIntervalstructs that handle interval math, and aGeometryBounderthat iterates over input using the fantasticwkbcrate (viageo-traits).Are these changes tested?
Yes!
Are there any user-facing changes?
All public structures and functions are documented (although I am not sure what the final public API will be).