Implemented Zarr / Xarray Catalog Provider for Multiple Tables #141

Open
yc-606 wants to merge 3 commits into main from catalog-provider
Conversation

Collaborator

@yc-606 yc-606 commented Mar 5, 2026

fixes: #85

@yc-606 yc-606 requested a review from alxmrs March 6, 2026 02:18
Owner

@alxmrs alxmrs left a comment

A high-level design note: I have been reading more about how catalog providers work, and I am starting to think that we may not need to subclass anything from the catalog providers at all. Check out this page of documentation. It says that constructors are not typically called directly. Furthermore, the more I read about what CatalogProviders are for, the more I think they may not be the right choice for what we actually want, which is to register multiple tables for a single Xarray Dataset when needed.

This issue is mainly on me, because I did not fully understand what Catalogs were for when I originally wrote up the issue. For this, I apologize. What I would ideally like to see in the next revision is a simplification of the theory of this feature. For now, what if we simply altered the behavior of the XarrayContext.from_dataset() method so that it registered multiple tables directly? In this approach, we could reuse the core logic that you two have implemented here (grouping by dimensions, naming tables, registering tables, etc.) for making multiple tables from a single Xarray Dataset. In the new version of the method, we'd have to think of names for each of the new tables that we'd register. For example, the dataset could be called era5 and the subtable underneath it could be lat_lon_time, for a full path of era5.lat_lon_time. Maybe we could give users the ability to override this and call the table era5.surface instead?
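
A minimal sketch of the naming scheme floated above, assuming the grouping step yields tuples of dimension names. The helper `table_names` and its `overrides` parameter are hypothetical and not part of this PR; with a real Dataset, the dims could be read with `{name: var.dims for name, var in ds.data_vars.items()}`.

```python
from typing import Iterable, Mapping, Optional

Dims = tuple[str, ...]


def table_names(
    dataset_name: str,
    dim_groups: Iterable[Dims],
    overrides: Optional[Mapping[Dims, str]] = None,
) -> dict[Dims, str]:
    """Map each dimension group to a dotted table path (hypothetical helper)."""
    overrides = overrides or {}
    names = {}
    for dims in dim_groups:
        # Default suffix joins the dimension names: ("lat", "lon", "time") -> "lat_lon_time".
        suffix = overrides.get(dims, "_".join(dims))
        names[dims] = f"{dataset_name}.{suffix}"
    return names


groups = [("lat", "lon", "time"), ("lat", "lon", "time", "level")]
default_names = table_names("era5", groups)
custom_names = table_names("era5", groups, overrides={("lat", "lon", "time"): "surface"})
```

Whether the dotted path should map to a DataFusion schema plus table, or to a single table name, is left open in this sketch.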

Thanks for your hard work here, Evan and Yagna.


import pyarrow as pa
import xarray as xr
import datafusion as dfn
Owner

A more typical import pattern is to import from datafusion:

from datafusion.catalog import SchemaProvider

We don't have a convention of import as dfn.

Comment on lines +278 to +279
"""
Group variables in the dataset based on shared dims
Owner

The style I prefer is that this all exists on one line and ends in a full stop.

Suggested change
- """
- Group variables in the dataset based on shared dims
+ """Group variables in the dataset based on shared dims.


return LazyArrowStreamTable(partition_pairs(), schema)

def group_vars_by_dims(ds):
Owner

Because this isn't a public API, I think this should be made private (i.e., prefix with a _).

Owner

Let's also add type annotations to the inputs and outputs of this function.

Comment on lines +281 to +282
("time", "lat", "lon"): ["temperature_2m", "wind_speed"],
("time", "lat", "lon", "level"): ["pressure", "humidity"]
Owner

This is pretty close to a doctest. Can we format the comment like so? We don't have to actually run doctests: https://docs.python.org/3/library/doctest.html

("time", "lat", "lon"): ["temperature_2m", "wind_speed"],
("time", "lat", "lon", "level"): ["pressure", "humidity"]
"""
groups = {}
Owner

🐑 (the sheep means a sheepish suggestion, i.e. optional): Consider using a defaultdict(list).
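
As a sketch of the `defaultdict(list)` suggestion: the dictionary creates an empty list the first time a key is seen, so the loop needs no membership check. The variable names and sample data are illustrative, loosely following the docstring example in this PR.

```python
from collections import defaultdict

# Illustrative stand-in for the variable -> dims mapping taken from a Dataset.
var_dims = {
    "temperature_2m": ("time", "lat", "lon"),
    "wind_speed": ("time", "lat", "lon"),
    "pressure": ("time", "lat", "lon", "level"),
    "humidity": ("time", "lat", "lon", "level"),
}

groups = defaultdict(list)
for name, dims in var_dims.items():
    # A missing key is initialised to [] automatically before the append.
    groups[dims].append(name)

# Convert back to a plain dict so later lookups of unknown keys raise KeyError.
groups = dict(groups)
```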



def register_catalog_from_dataset(
ctx, ds, catalog_name="xarray", schema_name="data", chunks=None
Owner

Instead of xarray and data, I think we should have no defaults and insist that users specify the names. Maybe the schema name could be generated by us.
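
One way to read this suggestion, sketched with a stand-in function rather than the PR's real one: make `catalog_name` a required keyword-only argument and derive the schema name when the caller omits it. The `resolve_names` helper and the fallback rule (appending `_schema` to the catalog name) are assumptions for illustration only.

```python
from typing import Optional


def resolve_names(
    *, catalog_name: str, schema_name: Optional[str] = None
) -> tuple[str, str]:
    """Return (catalog, schema) names; hypothetical helper, not part of the PR."""
    if not catalog_name:
        raise ValueError("catalog_name must be a non-empty string")
    # Assumed rule: generate a schema name when the caller gives none.
    if schema_name is None:
        schema_name = f"{catalog_name}_schema"
    return catalog_name, schema_name


explicit = resolve_names(catalog_name="era5", schema_name="surface")
generated = resolve_names(catalog_name="era5")
```

Because `catalog_name` has no default, calling `resolve_names()` with no arguments fails with a `TypeError` instead of silently picking a name.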

Comment on lines +374 to +375
Main function. Takes an xarray dataset and registers it
with DataFusion so you can query it with SQL.
Owner

Please explain how we are creating multiple tables from a single Xarray dataset. This docstring will soon become part of our public API. Further, please follow this style guide for writing docstrings: https://google.github.io/styleguide/pyguide.html#383-functions-and-methods
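
As a rough illustration of the requested style, a Google-style docstring for this function could be shaped like the sketch below. The wording is a guess based on the review discussion (variables grouped by shared dimensions, one table per group); the body is stubbed out, and the parameter descriptions and return value are assumptions rather than the PR's documented behaviour.

```python
def register_catalog_from_dataset(ctx, ds, catalog_name, schema_name, chunks=None):
    """Registers an Xarray dataset as one table per dimension group.

    Variables that share the same dimensions are grouped together, and each
    group is registered as its own table so it can be queried with SQL.

    Args:
        ctx: The DataFusion session context to register tables on.
        ds: The Xarray dataset to register.
        catalog_name: Name of the catalog to create.
        schema_name: Name of the schema that holds the tables.
        chunks: Optional chunking specification for the dataset.

    Returns:
        None. (Assumed here; the PR does not state a return value.)
    """
    # Stub only: this sketch illustrates docstring layout, not behaviour.
    raise NotImplementedError("Docstring sketch only; see the PR for the real body.")
```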

Owner

In fact, I believe this function should live as a new method in the XarrayContext in sql.py.

    def register_catalog_from_dataset(
        self, ds, catalog_name="xarray", schema_name="data", chunks=None
    ):
        register_catalog_from_dataset(self, ds, catalog_name, schema_name, chunks)
Owner

Ah, I think the body of this helper function should just live in this method.

Comment on lines +9 to +12
"""
A regular DataFusion SessionContext but with an extra method
for registering xarray datasets.
"""
Owner

Please revert my docstring for now.

"py-spy>=0.4.0",
"pyink>=24.10.1",
"maturin>=1.9.1",
"pre-commit>=4.5.1",
Owner

This is a good idea, thanks.

Development

Successfully merging this pull request may close these issues.

Implementation idea: make a custom catalog provider

2 participants