Implemented Zarr / Xarray Catalog Provider for Multiple Tables by yc-606 · Pull Request #141 · alxmrs/xarray-sql

yc-606 · 2026-03-05T03:33:11Z

fixes: #85

alxmrs

A high level design note: I am reading more about how catalog providers work, and I am starting to think that we may not need to subclass anything from the catalog providers at all. Check out this page of documentation. It says that constructors are not typically called directly. Furthermore, the more I read about what CatalogProviders are for, the more I am beginning to think that they may not be the appropriate choice for what we really aim to do, which is register multiple tables for a single Xarray Dataset when needed.

This issue is mainly on me, because I did not fully understand what Catalogs were for when I originally wrote up the issue. For this, I apologize. For now, what I ideally would like to see in the next revision is a simplification of the theory of this feature. For now, what if we simply altered the behavior of the XarrayContext.from_dataset() method so that it registered multiple tables directly? In this approach, we could reuse the core logic that you two have implemented here (grouping by dimensions, naming tables, registering tables, etc.) for making multiple tables from a single Xarray Dataset. In the new version of the method, we'd have to think of names for each of the new tables that we'd register (e.g. the dataset could be called era5 and the subtable underneath it could be lat_lon_time for a full path of era5.lat_lon_time. Maybe we could give users the ability to override this to call these dimensions era5.surface instead?).

Thanks for your hard work here, Evan and Yagna.

alxmrs · 2026-03-07T19:06:26Z

xarray_sql/reader.py


 import pyarrow as pa
 import xarray as xr
+import datafusion as dfn


A more typical import pattern is to import from datafusion:

from datafusion.catalog import SchemaProvider

We don't have a convention of import as dfn.

alxmrs · 2026-03-07T19:07:04Z

xarray_sql/reader.py

+    """
+    Group variables in the dataset based on shared dims


The style I prefer is that this all exists on one line and ends in a full stop.

Suggested change

"""

Group variables in the dataset based on shared dims

"""Group variables in the dataset based on shared dims.

alxmrs · 2026-03-07T19:07:29Z

xarray_sql/reader.py


  return LazyArrowStreamTable(partition_pairs(), schema)
+
+  def group_vars_by_dims(ds):


Because this isn't a public API, I think this should be made private (i.e., prefix with a _).

Let's also add type annotations to the inputs and outputs of this function.

alxmrs · 2026-03-07T19:09:29Z

xarray_sql/reader.py

+    ("time", "lat", "lon"):          ["temperature_2m", "wind_speed"],
+    ("time", "lat", "lon", "level"): ["pressure", "humidity"]


This is pretty close to a doctest. Can we format the comment like so? We don't have to actually run docstests: https://docs.python.org/3/library/doctest.html

alxmrs · 2026-03-07T19:10:19Z

xarray_sql/reader.py

+    ("time", "lat", "lon"):          ["temperature_2m", "wind_speed"],
+    ("time", "lat", "lon", "level"): ["pressure", "humidity"]
+    """
+    groups = {}


🐑 (the sheep means a sheepish suggestion, i.e. optional): Consider using a defaultdict(list).

alxmrs · 2026-03-07T19:24:28Z

xarray_sql/reader.py

+
+
+def register_catalog_from_dataset(
+    ctx, ds, catalog_name="xarray", schema_name="data", chunks=None


Instead of xarray and data, I think we should have no defaults and insist that users specify the names. Maybe, the schema name could be generated by us.

alxmrs · 2026-03-07T19:26:04Z

xarray_sql/reader.py

+  Main function. Takes an xarray dataset and registers it
+  with DataFusion so you can query it with SQL.


Please explain how we are creating multiple tables from a single Xarray dataset. This docstring will soon become part of our public API. Further, please follow this style guide for writing docstrings: https://google.github.io/styleguide/pyguide.html#383-functions-and-methods

In fact, I believe this function should live as a new method in the XarrayContext in sql.py.

alxmrs · 2026-03-07T19:53:35Z

xarray_sql/sql.py

+  def register_catalog_from_dataset(
+      self, ds, catalog_name="xarray", schema_name="data", chunks=None
+  ):
+    register_catalog_from_dataset(self, ds, catalog_name, schema_name, chunks)


Ah, I think the body of this help fn should just live in this method.

alxmrs · 2026-03-07T19:54:02Z

xarray_sql/sql.py

+  """
+  A regular DataFusion SessionContext but with an extra method
+  for registering xarray datasets.
+  """


Please revert my docstring for now.

alxmrs · 2026-03-07T19:54:12Z

pyproject.toml

    "py-spy>=0.4.0",
    "pyink>=24.10.1",
    "maturin>=1.9.1",
+    "pre-commit>=4.5.1",


This is a good idea, thanks.

yc-606 added 3 commits March 2, 2026 09:26

add catalog provider for review

a6bfde3

Remove mistakenly added catalog files

5fed884

move catalog functionality into existing modules

bf03125

yc-606 requested a review from alxmrs March 6, 2026 02:18

alxmrs requested changes Mar 7, 2026

View reviewed changes

	"""
	Group variables in the dataset based on shared dims
	"""Group variables in the dataset based on shared dims.


		return LazyArrowStreamTable(partition_pairs(), schema)

		def group_vars_by_dims(ds):

		("time", "lat", "lon"): ["temperature_2m", "wind_speed"],
		("time", "lat", "lon", "level"): ["pressure", "humidity"]



		def register_catalog_from_dataset(
		ctx, ds, catalog_name="xarray", schema_name="data", chunks=None

		Main function. Takes an xarray dataset and registers it
		with DataFusion so you can query it with SQL.

Conversation

yc-606 commented Mar 5, 2026

Uh oh!

alxmrs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants