Spatially sort output features #2

@brawer

Description

As per GeoParquet best practices it would be good to spatially sort the output features, for two reasons:

  • Typically, clients will restrict their parquet queries to a region of interest. Without spatial sorting, most parquet chunks contain data from locations all over the planet. Because of this lack of spatial correlation, clients have to decompress almost all chunks in the parquet file no matter their query bounding box, which is expensive. If the data were spatially sorted, queries would be (much) faster because they’d only have to decode the few parquet chunks that actually intersect the queried bounding box.
  • Also, spatial sorting will likely reduce the output file size. Nearby features tend to share tags like street and city names, so features within a single parquet chunk would be more likely to repeat the same tag values.

Currently, class GeoParquetWriter seems to emit features in the same order as they happen to be passed from libosmium. Consider extending the implementation of GeoParquetWriter to calculate the center lat/lon of each feature’s bounding box. Then, find the position of that point along a space-filling Hilbert curve, and use this number as a sort key for an external sort. There are several Python libraries for Hilbert curves, and likewise for external sorting.
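
To make the sort key concrete, here is a minimal sketch in plain Python (no library) of the textbook iterative mapping from a grid cell to its position along a Hilbert curve. The function names, the 16-bit grid order, and the simple linear lon/lat-to-grid mapping are all assumptions for illustration, not anything GeoParquetWriter currently has.

```python
def hilbert_index(x: int, y: int, order: int = 16) -> int:
    """Position of grid cell (x, y) along a Hilbert curve.

    x and y must be integers in [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def bbox_center(min_lon: float, min_lat: float,
                max_lon: float, max_lat: float) -> tuple[float, float]:
    """Center (lon, lat) of a bounding box."""
    return ((min_lon + max_lon) / 2.0, (min_lat + max_lat) / 2.0)


def hilbert_key(lon: float, lat: float, order: int = 16) -> int:
    """Sort key for a lon/lat point (hypothetical helper).

    Assumes a plain linear mapping of lon in [-180, 180] and
    lat in [-90, 90] onto a 2**order x 2**order grid.
    """
    n = 1 << order
    x = min(n - 1, max(0, int((lon + 180.0) / 360.0 * n)))
    y = min(n - 1, max(0, int((lat + 90.0) / 180.0 * n)))
    return hilbert_index(x, y, order)
```

A feature's key would then be `hilbert_key(*bbox_center(min_lon, min_lat, max_lon, max_lat))`. With order 16 the key fits in 32 bits, which keeps the sort records small.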

To check the difference, try gt sort hilbert from GeoParquet tools. Perhaps you could simply call this tool in a post-processing step, before uploading the layercake output. That said, bundling DuckDB just for this seems a little heavy; doing the sort yourself from Python seems easy enough.
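
For the do-it-yourself route, the external sort could look roughly like the sketch below, using only the standard library: sort fixed-size chunks in memory, spill each to a temporary file, then do a k-way merge with `heapq.merge`. The `(key, payload)` record shape, the `chunk_size` default, and the use of `pickle` for spilling are assumptions for illustration; a real implementation would likely spill in whatever form the writer already buffers.

```python
import heapq
import pickle
import tempfile
from typing import Any, Iterable, Iterator, Tuple

Record = Tuple[int, Any]  # (sort key, payload); hypothetical record shape


def _write_run(run: list) -> Any:
    """Spill one sorted chunk to an anonymous temporary file."""
    f = tempfile.TemporaryFile()
    for rec in run:
        pickle.dump(rec, f)
    f.seek(0)
    return f


def _read_run(f: Any) -> Iterator[Record]:
    """Stream records back from a spilled chunk, closing it when done."""
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        return
    finally:
        f.close()


def external_sort(records: Iterable[Record],
                  chunk_size: int = 100_000) -> Iterator[Record]:
    """Yield records ordered by key while holding at most chunk_size in memory."""
    runs = []
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            chunk.sort(key=lambda r: r[0])
            runs.append(_write_run(chunk))
            chunk = []
    if chunk:
        chunk.sort(key=lambda r: r[0])
        runs.append(_write_run(chunk))
    # k-way merge of the sorted runs; only the key is compared, never the payload.
    yield from heapq.merge(*(_read_run(f) for f in runs), key=lambda r: r[0])
```

Each spilled chunk keeps one file open during the merge, so for very large inputs `chunk_size` should be big enough to stay under the open-file limit.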
