Add Parquet geospatial statistics utility #8414

Merged
alamb merged 18 commits into apache:main from paleolimbot:parquet-geospatial-bounding
Sep 28, 2025

Member

@paleolimbot paleolimbot commented Sep 22, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

The presence of relevant statistics when writing geometries and/or geographies is one of the primary motivations behind the GEOMETRY and GEOGRAPHY logical types in Parquet. We'd like to make it easy for writers to provide them!

What changes are included in this PR?

This PR introduces Interval and WraparoundInterval structs that handle interval math, and a GeometryBounder that iterates over input using the fantastic wkb crate (via geo-traits).
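As a rough illustration of the interval math described above, here is a minimal sketch. The field and method names (`lo`, `hi`, `update_value`, `merge`, `contains`) are assumptions made for this example and are not necessarily the API added by the PR. The wraparound rule shown (a value matches when it is `>= lo` OR `<= hi` whenever `lo > hi`) follows the convention used for wraparound bounding boxes.

```rust
/// A closed interval [lo, hi]. It is empty when lo > hi.
/// (Illustrative only; names are assumptions, not the PR's API.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Interval {
    pub lo: f64,
    pub hi: f64,
}

impl Interval {
    /// An interval containing no values.
    pub fn empty() -> Self {
        Self { lo: f64::INFINITY, hi: f64::NEG_INFINITY }
    }

    pub fn is_empty(&self) -> bool {
        self.lo > self.hi
    }

    /// Grow the interval just enough to contain `value`.
    pub fn update_value(&mut self, value: f64) {
        self.lo = self.lo.min(value);
        self.hi = self.hi.max(value);
    }

    /// Smallest interval containing both `self` and `other`.
    pub fn merge(&self, other: &Interval) -> Interval {
        Interval { lo: self.lo.min(other.lo), hi: self.hi.max(other.hi) }
    }
}

/// An interval that may wrap around: when lo > hi it stands for
/// every value v with v >= lo OR v <= hi.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WraparoundInterval {
    pub lo: f64,
    pub hi: f64,
}

impl WraparoundInterval {
    pub fn is_wraparound(&self) -> bool {
        self.lo > self.hi
    }

    pub fn contains(&self, value: f64) -> bool {
        if self.is_wraparound() {
            value >= self.lo || value <= self.hi
        } else {
            value >= self.lo && value <= self.hi
        }
    }
}
```

With this sketch, a `WraparoundInterval { lo: 170.0, hi: -170.0 }` would contain longitudes such as 175 and -175 but not 0.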

Are these changes tested?

Yes!

Are there any user-facing changes?

All public structures and functions are documented (although I am not sure what the final public API will be).

Member

@kylebarron kylebarron left a comment

It would be great to have more widespread docstrings, but otherwise this looks really good so far.

Comment on lines +34 to +39
arrow-schema = { workspace = true }
geo-traits = { version = "0.3" }
wkb = { version = "0.9" }

[dev-dependencies]
wkt = { version = "0.14" }
Member

Do other crates have dependencies defined here? Or should we lift all dependencies up to the top-level Cargo.toml, even if they're only used by a single crate?

Member Author

Other crates seem to 🤷

Happy to put these anywhere!

Contributor

I think it is fine to put these here, and if we end up with multiple versions in the workspace we can consolidate.

@paleolimbot paleolimbot changed the title [WIP] Draft Parquet geospatial statistics creation Add Parquet geospatial statistics utility Sep 24, 2025
@paleolimbot paleolimbot marked this pull request as ready for review September 24, 2025 04:11
Contributor

@alamb alamb left a comment

Thank you @paleolimbot !

I had some small structural comments, but I don't really have enough experience with geometry to be able to detect logic errors or major mismatches in representation. I would (really) love to defer to @kylebarron for that level of review.

But from my perspective it would be fine to merge this PR and iterate in follow-on PRs as well.


use crate::interval::{Interval, IntervalTrait, WraparoundInterval};

/// Geometry bounder
Contributor

This comment implies to me that this method will be needed in the parquet writer/encoder itself, to accumulate the appropriate statistics.

I think that is fine, but it will take a bit of plumbing to sort out, as we won't have anything similar for Variant (Variant doesn't define any special statistics, so the writer will just treat it as a normal struct / binary array).

Member Author

Yes, we will! (Perhaps we can wire that in at runtime as well)

/// Parquet statistics with minimal modification.
#[derive(Debug)]
pub struct GeometryBounder {
x_left: Interval,
Contributor

For someone familiar with geometry the meanings of x_left, x_mid and x_right are probably clear, but it might help someone who is not, like myself, if we could add a few more comments here

Ok(())
}

fn geometry_type(geom: &impl GeometryTrait<T = f64>) -> Result<i32, ArrowError> {
Contributor

Can you add some reference about where these values came from? It seems maybe they are similar to the int32 values in #8225

It also seems strange to me that we are using i32 when we started with strongly typed enums (why not use enums again?)

Contributor

Or maybe this should be a method on GeometryTrait 🤔

Member Author

Probably the Wkb object could provide this. We could also compute it from the input bytes, but then we'd have to validate it which takes about as much code as this.

I don't really mind using enums but I don't think it helps us (#8225 can't depend on this because I think the idea is that this crate would be optional).

Member

It would be nice to have an enum here. I get that it could make inter-crate workings more complex, so we can revisit it in the future
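For reference, a sketch of what an enum-based alternative to the raw `i32` could look like. The type and function names below (`GeometryKind`, `Dimensions`, `type_code`, `from_code`) are hypothetical and are not part of this PR or of geo-traits; the numeric values follow the ISO WKB geometry type codes (1 through 7 for the base types, plus 1000 for Z, 2000 for M, and 3000 for ZM).

```rust
/// Base geometry types with their ISO WKB codes.
/// (Hypothetical names; not part of the PR.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GeometryKind {
    Point = 1,
    LineString = 2,
    Polygon = 3,
    MultiPoint = 4,
    MultiLineString = 5,
    MultiPolygon = 6,
    GeometryCollection = 7,
}

/// Coordinate dimensions, expressed as the offset added to the base code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dimensions {
    Xy = 0,
    Xyz = 1000,
    Xym = 2000,
    Xyzm = 3000,
}

/// Combine a base type and dimensions into a single integer code.
pub fn type_code(kind: GeometryKind, dims: Dimensions) -> i32 {
    kind as i32 + dims as i32
}

/// Split an integer code back into its parts, rejecting unknown codes.
pub fn from_code(code: i32) -> Option<(GeometryKind, Dimensions)> {
    let dims = match code / 1000 {
        0 => Dimensions::Xy,
        1 => Dimensions::Xyz,
        2 => Dimensions::Xym,
        3 => Dimensions::Xyzm,
        _ => return None,
    };
    let kind = match code % 1000 {
        1 => GeometryKind::Point,
        2 => GeometryKind::LineString,
        3 => GeometryKind::Polygon,
        4 => GeometryKind::MultiPoint,
        5 => GeometryKind::MultiLineString,
        6 => GeometryKind::MultiPolygon,
        7 => GeometryKind::GeometryCollection,
        _ => return None,
    };
    Some((kind, dims))
}
```

Keeping a conversion to and from `i32` at the boundary would let an optional crate use the enum internally while other crates continue to exchange plain integers.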

@paleolimbot paleolimbot force-pushed the parquet-geospatial-bounding branch from a2a4473 to a03090b Compare September 26, 2025 14:38
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 26, 2025
/// (which adds some complexity to this implementation).
#[derive(Debug)]
pub struct GeometryBounder {
/// Union of all contiguous x intervals to the left of the wraparound midpoint
Contributor

❤️
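For readers unfamiliar with wraparound bounds, the following sketch shows one way the left, mid, and right x intervals could be tracked and combined. It is an assumption-laden illustration, not the PR's implementation: the `XBounds` type, its methods, and the width comparison used to choose between an ordinary and a wraparound bound are all hypothetical.

```rust
/// Hypothetical tracker for x bounds that may wrap around (not the PR's code).
/// Intervals are (lo, hi) pairs; lo > hi means "empty".
#[derive(Debug)]
pub struct XBounds {
    wrap: (f64, f64),  // full valid x range, e.g. (-180.0, 180.0)
    left: (f64, f64),  // extents entirely left of the midpoint
    mid: (f64, f64),   // extents that touch or cross the midpoint
    right: (f64, f64), // extents entirely right of the midpoint
}

const EMPTY: (f64, f64) = (f64::INFINITY, f64::NEG_INFINITY);

fn union(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
    (a.0.min(b.0), a.1.max(b.1))
}

impl XBounds {
    pub fn new(wrap: (f64, f64)) -> Self {
        Self { wrap, left: EMPTY, mid: EMPTY, right: EMPTY }
    }

    /// Record the x extent of one geometry.
    pub fn update(&mut self, x: (f64, f64)) {
        let midpoint = (self.wrap.0 + self.wrap.1) / 2.0;
        if x.1 < midpoint {
            self.left = union(self.left, x);
        } else if x.0 > midpoint {
            self.right = union(self.right, x);
        } else {
            self.mid = union(self.mid, x);
        }
    }

    /// Returns (xmin, xmax). For non-empty input, xmin > xmax signals a
    /// wraparound bound; with no input the empty pair is returned.
    pub fn finish(&self) -> (f64, f64) {
        let ordinary = union(union(self.left, self.mid), self.right);
        let is_empty = |i: (f64, f64)| i.0 > i.1;
        // Anything crossing the midpoint, or data on only one side,
        // rules out a wraparound bound.
        if !is_empty(self.mid) || is_empty(self.left) || is_empty(self.right) {
            return ordinary;
        }
        // Width covered by [right.lo, wrap.hi] plus [wrap.lo, left.hi].
        let wrapped_width = (self.wrap.1 - self.right.0) + (self.left.1 - self.wrap.0);
        if wrapped_width < ordinary.1 - ordinary.0 {
            (self.right.0, self.left.1)
        } else {
            ordinary
        }
    }
}
```

In this sketch, geometries clustered near both -180 and 180 produce a narrow wraparound bound instead of one spanning nearly the whole range.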

Contributor

alamb commented Sep 26, 2025

@kylebarron any further comments or concerns about merging this PR?

Member

@kylebarron kylebarron left a comment

Thanks for all the docstrings! I haven't implemented this part of the spec before, so I also learned new things.

out
}

/// Update this bounder with one WKB-encoded geometry
Member

Question for the future: will it ever make sense to be able to update the bounder before values have been encoded to WKB? It makes sense that it's simplest at this step of the process to intercept the WKB, though it's a small amount of overhead to have to parse the unaligned floats here, right? Maybe that amount of overhead isn't worth optimizing

Member Author

I'm not sure! We can expose the geo-traits version as public if it comes up (or allow a caller to calculate the full GeoStatistics themselves).
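To make the "unaligned floats" overhead concrete, here is a minimal sketch of reading the coordinates of a 2D WKB point by hand. It is not how the wkb crate or this PR does it; `read_wkb_point_xy` is a hypothetical helper that only handles the plain XY point layout (1 byte-order byte, a 4-byte type code equal to 1, then two 8-byte doubles).

```rust
/// Read (x, y) from a WKB-encoded 2D point, or None if the buffer is
/// not a well-formed XY point. (Hypothetical helper for illustration.)
pub fn read_wkb_point_xy(buf: &[u8]) -> Option<(f64, f64)> {
    // Byte 0 is the byte order flag: 0 = big endian, 1 = little endian.
    let little_endian = match *buf.first()? {
        0 => false,
        1 => true,
        _ => return None,
    };

    // Bytes 1..5 hold the geometry type; 1 means a 2D point.
    let mut type_bytes = [0u8; 4];
    type_bytes.copy_from_slice(buf.get(1..5)?);
    let geometry_type = if little_endian {
        u32::from_le_bytes(type_bytes)
    } else {
        u32::from_be_bytes(type_bytes)
    };
    if geometry_type != 1 {
        return None;
    }

    // The doubles start at offset 5, so they are not 8-byte aligned.
    // Copying into a local array avoids any unaligned pointer reads.
    let read_f64 = |start: usize| -> Option<f64> {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(buf.get(start..start + 8)?);
        Some(if little_endian {
            f64::from_le_bytes(bytes)
        } else {
            f64::from_be_bytes(bytes)
        })
    };

    Some((read_f64(5)?, read_f64(13)?))
}
```

The per-coordinate cost here is an 8-byte copy plus a possible byte swap, which is likely small next to the cost of encoding the values in the first place.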

paleolimbot and others added 2 commits September 27, 2025 21:16
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Contributor

alamb commented Sep 28, 2025

Thanks @paleolimbot and @kylebarron 🚀

@alamb alamb merged commit b6bea12 into apache:main Sep 28, 2025
30 checks passed
Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Add geospatial statistics creation support for GEOMETRY/GEOGRAPHY Parquet logical types

3 participants