This is more of a discussion than a detailed issue/feature request (maybe we could enable discussions on the repo?), but partitioning large datasets is a pretty common practice in the industry. Here are a few ideas to get the discussion going.
Most projects use something called hive partitioning; you can get an overview here: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html. We don't necessarily need to just "do what everyone else does", but there are a few ways we could approach this that might be beneficial.
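For illustration, hive partitioning encodes the partition keys in the directory names, so readers can prune whole directories from the path alone. A minimal sketch of what that could look like for us (the `release` and `cell` key names are placeholders, nothing we've settled on):

```python
import duckdb

# Hypothetical layout, partition keys encoded in directory names:
#   data/release=2024-01-01/cell=831f1bfffffffff/part-0.parquet
#   data/release=2024-01-01/cell=831f18fffffffff/part-0.parquet
#   ...

con = duckdb.connect()
# DuckDB surfaces the partition keys as columns, and a filter on a
# partition column skips every file outside the matching directories.
con.sql("""
    SELECT count(*)
    FROM read_parquet('data/*/*/*.parquet', hive_partitioning = true)
    WHERE cell = '831f1bfffffffff'
""").show()
```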
First, the top level is usually date-based. We could key this on release date, OSM changeset number, or something similar, but that's really only relevant to how we store the final dataset, not to what the code in this project does.
At the next layer, things get more interesting. My suggestion (open for discussion) is that we partition by a relatively low resolution H3 or S2 cell. I don't really have a strong preference for one or the other. H3 is sorta hip and genuinely better for some applications, but S2 is probably more understandable as everything is a "square" and most people and tooling are used to thinking in bboxes.
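To make the H3 option concrete, here's roughly what assigning a feature to its partition could look like, sketched with the v4 h3-py API; the resolution constant is an arbitrary placeholder for "relatively low resolution". An S2 equivalent would be much the same shape, e.g. `s2sphere.CellId.from_lat_lng(...).parent(level)`.

```python
import h3  # h3-py, v4 API

PARTITION_RES = 3  # placeholder; pick whatever "relatively low resolution" means for us

def partition_cell(lat: float, lng: float) -> str:
    """Return the H3 cell id (a 15-character hex string) used as the partition key."""
    return h3.latlng_to_cell(lat, lng, PARTITION_RES)
```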
Some secondary effects of this:
- It gets us halfway to #2 (Spatially sort output features); the files would probably compress better out of the box even when written in a suboptimal order. And since each file is smaller, a post-processing pass would be cheaper to run.
- The GeoParquetWriter base class would get a bit more complex, unfortunately: it would now have to manage dozens of writers, and it would potentially need to buffer a lot more items at a time before flushing a row group(!). See the sketch after this list.
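To make that writer complexity concrete, here's a rough sketch of the fan-out using pyarrow directly. This is not GeoParquetWriter's actual structure; the `cell` key, directory layout, and flush threshold are all assumptions for illustration:

```python
import os
from collections import defaultdict

import pyarrow as pa
import pyarrow.parquet as pq

ROW_GROUP_SIZE = 50_000  # hypothetical per-partition flush threshold

class PartitionedWriter:
    """Fan rows out to one ParquetWriter per H3/S2 cell.

    Rows are buffered per partition and flushed as a row group once a
    partition's buffer fills. The memory cost is the catch: with input
    in arbitrary spatial order, the worst case is holding
    (open partitions x ROW_GROUP_SIZE) rows at once.
    """

    def __init__(self, schema: pa.Schema, out_dir: str):
        self.schema = schema
        self.out_dir = out_dir
        self.writers: dict[str, pq.ParquetWriter] = {}
        self.buffers: dict[str, list[dict]] = defaultdict(list)

    def write(self, cell: str, row: dict) -> None:
        buf = self.buffers[cell]
        buf.append(row)
        if len(buf) >= ROW_GROUP_SIZE:
            self._flush(cell)

    def _flush(self, cell: str) -> None:
        buf = self.buffers[cell]
        if not buf:
            return
        if cell not in self.writers:
            # One hive-style directory per partition.
            part_dir = f"{self.out_dir}/cell={cell}"
            os.makedirs(part_dir, exist_ok=True)
            self.writers[cell] = pq.ParquetWriter(f"{part_dir}/part-0.parquet", self.schema)
        self.writers[cell].write_table(pa.Table.from_pylist(buf, schema=self.schema))
        buf.clear()

    def close(self) -> None:
        for cell in list(self.buffers):
            self._flush(cell)
        for writer in self.writers.values():
            writer.close()
```

If the buffering cost turns out to be a problem, one knob is capping the number of simultaneously open writers and closing the least recently used ones, at the price of multiple files per partition.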
Thoughts?