This is more of a discussion than a detailed issue/feature request (maybe we could enable discussions on the repo?), but partitioning large datasets is a pretty common practice in the industry. Here are a few ideas to get the discussion going.
Most projects use something called hive partitioning; you can get an overview here: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html. We don't necessarily need to just "do what everyone else does", but there are a few ways we could approach this that might be beneficial.
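For illustration, hive partitioning encodes the partition keys in the directory names, so readers can prune whole directories from the path alone. A minimal sketch of what that could look like for us (the `release` and `cell` key names are placeholders, nothing we've settled on):

```python
import duckdb

# Hypothetical layout, partition keys encoded in directory names:
#   data/release=2024-01-01/cell=831f1bfffffffff/part-0.parquet
#   data/release=2024-01-01/cell=831f18fffffffff/part-0.parquet
#   ...

con = duckdb.connect()
# DuckDB surfaces the partition keys as columns, and a filter on a
# partition column skips every file outside the matching directories.
con.sql("""
    SELECT count(*)
    FROM read_parquet('data/*/*/*.parquet', hive_partitioning = true)
    WHERE cell = '831f1bfffffffff'
""").show()
```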
First, the top level is usually date-based. We could key this on release date, OSM changeset number, or something similar, but that's really only relevant to how we store the final dataset, not to what the code in this project does.
At the next layer, things get more interesting. My suggestion (open for discussion) is that we partition by a relatively low resolution H3 or S2 cell. I don't really have a strong preference for one or the other. H3 is sorta hip and genuinely better for some applications, but S2 is probably more understandable as everything is a "square" and most people and tooling are used to thinking in bboxes.
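To make the H3 option concrete, here's roughly what assigning a feature to its partition could look like, sketched with the v4 h3-py API; the resolution constant is an arbitrary placeholder for "relatively low resolution". An S2 equivalent would be much the same shape, e.g. `s2sphere.CellId.from_lat_lng(...).parent(level)`.

```python
import h3  # h3-py, v4 API

PARTITION_RES = 3  # placeholder; pick whatever "relatively low resolution" means for us

def partition_cell(lat: float, lng: float) -> str:
    """Return the H3 cell id (a 15-character hex string) used as the partition key."""
    return h3.latlng_to_cell(lat, lng, PARTITION_RES)
```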
Some secondary effects of this:
- It gets us halfway to #2 (Spatially sort output features); the files would probably compress better out of the box even when written in a suboptimal order. And since each file is smaller, a post-processing pass would be cheaper to run.
- The GeoParquetWriter base class would get a bit more complex, unfortunately: it would now have to manage dozens of writers, and it would potentially need to buffer a lot more items at a time before flushing a row group(!). See the sketch after this list.
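To make that writer complexity concrete, here's a rough sketch of the fan-out using pyarrow directly. This is not GeoParquetWriter's actual structure; the `cell` key, directory layout, and flush threshold are all assumptions for illustration:

```python
import os
from collections import defaultdict

import pyarrow as pa
import pyarrow.parquet as pq

ROW_GROUP_SIZE = 50_000  # hypothetical per-partition flush threshold

class PartitionedWriter:
    """Fan rows out to one ParquetWriter per H3/S2 cell.

    Rows are buffered per partition and flushed as a row group once a
    partition's buffer fills. The memory cost is the catch: with input
    in arbitrary spatial order, the worst case is holding
    (open partitions x ROW_GROUP_SIZE) rows at once.
    """

    def __init__(self, schema: pa.Schema, out_dir: str):
        self.schema = schema
        self.out_dir = out_dir
        self.writers: dict[str, pq.ParquetWriter] = {}
        self.buffers: dict[str, list[dict]] = defaultdict(list)

    def write(self, cell: str, row: dict) -> None:
        buf = self.buffers[cell]
        buf.append(row)
        if len(buf) >= ROW_GROUP_SIZE:
            self._flush(cell)

    def _flush(self, cell: str) -> None:
        buf = self.buffers[cell]
        if not buf:
            return
        if cell not in self.writers:
            # One hive-style directory per partition.
            part_dir = f"{self.out_dir}/cell={cell}"
            os.makedirs(part_dir, exist_ok=True)
            self.writers[cell] = pq.ParquetWriter(f"{part_dir}/part-0.parquet", self.schema)
        self.writers[cell].write_table(pa.Table.from_pylist(buf, schema=self.schema))
        buf.clear()

    def close(self) -> None:
        for cell in list(self.buffers):
            self._flush(cell)
        for writer in self.writers.values():
            writer.close()
```

If the buffering cost turns out to be a problem, one knob is capping the number of simultaneously open writers and closing the least recently used ones, at the price of multiple files per partition.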
Thoughts?