Implemented Zarr / Xarray Catalog Provider for Multiple Tables#141
Implemented Zarr / Xarray Catalog Provider for Multiple Tables#141
Conversation
alxmrs
left a comment
There was a problem hiding this comment.
A high level design note: I am reading more about how catalog providers work, and I am starting to think that we may not need to subclass anything from the catalog providers at all. Check out this page of documentation. It says that constructors are not typically called directly. Furthermore, the more I read about what CatalogProviders are for, the more I am beginning to think that they may not be the appropriate choice for what we really aim to do, which is register multiple tables for a single Xarray Dataset when needed.
This issue is mainly on me, because I did not fully understand what Catalogs were for when I originally wrote up the issue. For this, I apologize. For now, what I ideally would like to see in the next revision is a simplification of the theory of this feature. For now, what if we simply altered the behavior of the XarrayContext.from_dataset() method so that it registered multiple tables directly? In this approach, we could reuse the core logic that you two have implemented here (grouping by dimensions, naming tables, registering tables, etc.) for making multiple tables from a single Xarray Dataset. In the new version of the method, we'd have to think of names for each of the new tables that we'd register (e.g. the dataset could be called era5 and the subtable underneath it could be lat_lon_time for a full path of era5.lat_lon_time. Maybe we could give users the ability to override this to call these dimensions era5.surface instead?).
Thanks for your hard work here, Evan and Yagna.
|
|
||
| import pyarrow as pa | ||
| import xarray as xr | ||
| import datafusion as dfn |
There was a problem hiding this comment.
A more typical import pattern is to import from datafusion:
from datafusion.catalog import SchemaProvider
We don't have a convention of import as dfn.
| """ | ||
| Group variables in the dataset based on shared dims |
There was a problem hiding this comment.
The style I prefer is that this all exists on one line and ends in a full stop.
| """ | |
| Group variables in the dataset based on shared dims | |
| """Group variables in the dataset based on shared dims. |
|
|
||
| return LazyArrowStreamTable(partition_pairs(), schema) | ||
|
|
||
| def group_vars_by_dims(ds): |
There was a problem hiding this comment.
Because this isn't a public API, I think this should be made private (i.e., prefix with a _).
There was a problem hiding this comment.
Let's also add type annotations to the inputs and outputs of this function.
| ("time", "lat", "lon"): ["temperature_2m", "wind_speed"], | ||
| ("time", "lat", "lon", "level"): ["pressure", "humidity"] |
There was a problem hiding this comment.
This is pretty close to a doctest. Can we format the comment like so? We don't have to actually run docstests: https://docs.python.org/3/library/doctest.html
| ("time", "lat", "lon"): ["temperature_2m", "wind_speed"], | ||
| ("time", "lat", "lon", "level"): ["pressure", "humidity"] | ||
| """ | ||
| groups = {} |
There was a problem hiding this comment.
🐑 (the sheep means a sheepish suggestion, i.e. optional): Consider using a defaultdict(list).
|
|
||
|
|
||
| def register_catalog_from_dataset( | ||
| ctx, ds, catalog_name="xarray", schema_name="data", chunks=None |
There was a problem hiding this comment.
Instead of xarray and data, I think we should have no defaults and insist that users specify the names. Maybe, the schema name could be generated by us.
| Main function. Takes an xarray dataset and registers it | ||
| with DataFusion so you can query it with SQL. |
There was a problem hiding this comment.
Please explain how we are creating multiple tables from a single Xarray dataset. This docstring will soon become part of our public API. Further, please follow this style guide for writing docstrings: https://google.github.io/styleguide/pyguide.html#383-functions-and-methods
There was a problem hiding this comment.
In fact, I believe this function should live as a new method in the XarrayContext in sql.py.
| def register_catalog_from_dataset( | ||
| self, ds, catalog_name="xarray", schema_name="data", chunks=None | ||
| ): | ||
| register_catalog_from_dataset(self, ds, catalog_name, schema_name, chunks) |
There was a problem hiding this comment.
Ah, I think the body of this help fn should just live in this method.
| """ | ||
| A regular DataFusion SessionContext but with an extra method | ||
| for registering xarray datasets. | ||
| """ |
There was a problem hiding this comment.
Please revert my docstring for now.
| "py-spy>=0.4.0", | ||
| "pyink>=24.10.1", | ||
| "maturin>=1.9.1", | ||
| "pre-commit>=4.5.1", |
fixes: #85