
Conversation

@abarciauskas-bgse

This PR adds a script and documentation to create an icechunk virtual store using lithops.

cc @jbusecke

@espg

espg commented Dec 9, 2025

@abarciauskas-bgse , can you update the gs://ismip6-icechunk bucket to allow anonymous access? Right now the following fails:

icechunk_storage = icechunk.gcs_storage(
    bucket="ismip6-icechunk",
    prefix="12-07-2025",
    anonymous=True
)

I tried testing on a JupyterHub instance and haven't had a chance yet to set up the Google Cloud SDK (gcloud) and provide credentials provisioned via from_env=True. But I anticipate that when we set this pipeline up, we'll want the access pattern to allow non-credentialed / anonymous access.
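
For reference, the credentialed path I had in mind is roughly this (untested sketch; assumes application-default credentials are already provisioned via gcloud):

import icechunk

# Untested sketch: pick up GCS credentials from the environment (e.g. gcloud
# application-default credentials) instead of using anonymous access.
icechunk_storage = icechunk.gcs_storage(
    bucket="ismip6-icechunk",
    prefix="12-07-2025",
    from_env=True,
)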

@abarciauskas-bgse
Author

@espg yes I agree, this anonymous access should work. I'm looking into the org-level policies that may be preventing this and should hopefully have it fixed soon!

@abarciauskas-bgse
Author

@espg could you try again now? A few of us have tested with anonymous=True and opening and reading from the store appears to be working via anonymous access.

@espg

espg commented Dec 11, 2025

Great, seems like it's set up to work publicly now! I was playing around with it a little bit, and it seems like this cell:

base_array.hvplot.image(
    x='x', 
    y='y', 
    clim=(vmin, vmax), 
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)

...is able to produce the map with the correct axis labels and colorbar, but the slider widget doesn't actually update anything when you select a different time slice. @abarciauskas-bgse do you know if this is an artifact of the Zarr virtual store, or is it a bug in HoloViews?

It also looks like there's some inconsistency in how datetimes are being parsed. My assumption is that some of the ISMIP contributors are encoding dates/times correctly (i.e., PISM) and some (CPOM_BISICLES, which is the default shown in the notebook) are using something that isn't getting cast to a proper datetime object. I think these sorts of mismatches are exactly the kind of thing @thomasteisberg had mentioned wanting to detect and flag. I'm guessing we can fix it in the virtualization layer declaratively... any thoughts on how to do that without it becoming a game of whack-a-mole? Can we check for proper time encoding and patch fixes automatically when we don't get a valid datetime array?
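
Very roughly, I'm imagining a check along these lines (untested sketch; it assumes each group is opened with decode_times=False so the raw encoding is still visible, and that anything that won't decode to a datetime-like array gets flagged for patching):

import numpy as np
import xarray as xr

def time_axis_decodes(ds_raw):
    # ds_raw: a group opened with decode_times=False, so the raw units/calendar
    # attributes and numeric values are still visible.
    try:
        decoded = xr.decode_cf(ds_raw[["time"]])
    except (ValueError, KeyError):
        return False
    # Accept numpy datetime64 or cftime object arrays; anything else
    # (e.g. bare floats of day-of-year) should be flagged and patched.
    dtype = decoded["time"].dtype
    return np.issubdtype(dtype, np.datetime64) or dtype == object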

@abarciauskas-bgse
Author

is able to produce the map with the correct axis labels and colorbar, but the slider widget doesn't actually update anything when you select a different time slice

Ah yes, apologies for not commenting on this one: I experienced the same issue (time slider not updating the plot) with the code and data source in https://github.com/englacial/ismip-indexing/blob/main/notebooks/working_with_ismip6_examples.ipynb. I haven't been able to track down the issue yet, but it's almost certainly not a result of the virtual store, since I experienced the same issue using the in-memory data tree in the working_with_ismip6_examples.ipynb notebook, i.e.:

delta_thickness.hvplot.image(x='x', y='y', clim=(-vmag, vmag), cmap='RdBu').opts(
        aspect='equal',
        title="Change in ice thickness relative to the first timestep",
        colorbar_opts={'title': 'Change in thickness (m)'},
    )

Can you say more about what you are seeing with respect to time encodings? I haven't looked deeply into the individual datasets, but I am using the fix_time_encodings helper function and storing the time coordinates as native Zarr chunks. I think that function only changes the time attributes, though, so I wouldn't be surprised if there are changes we want to make to the time values themselves.

@espg

espg commented Dec 11, 2025

It looks like the viewer not updating is some dependency issue that's specific to JupyterHub -- Thomas is able to get the slider to update when he runs it in VS Code, but both you and I apparently can't in Jupyter, even though all three of us are referencing the same uv file to build the environment.

For the time encoding issue, check the slider values we get here:

dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group='CPOM_BISICLES/expT71_08'
)

base_array = dt['lithk'].lithk
vmin, vmax = base_array.quantile([0.01, 0.99]).values

base_array.hvplot.image(
    x='x', 
    y='y', 
    clim=(vmin, vmax), 
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)

...vs. when we switch to PISM here:

dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group='AWI_PISM1/exp08'
)

base_array = dt['lithk'].lithk
vmin, vmax = base_array.quantile([0.01, 0.99]).values

base_array.hvplot.image(
    x='x', 
    y='y', 
    clim=(vmin, vmax), 
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)

The second case with PISM gives us proper datetimes in the slider (and in the xarray object), but CPOM_BISICLES has decided to encode time as something non-standard, apparently floats of day-of-year.
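
The raw encodings make the mismatch obvious if you open the same two groups with decode_times=False and compare the time attributes (rough sketch, reusing the session and group paths from above):

import xarray as xr

# decode_times=False exposes the on-disk encoding instead of whatever xarray
# manages (or fails) to decode; `session` is the icechunk session from above.
for group in ("CPOM_BISICLES/expT71_08", "AWI_PISM1/exp08"):
    raw = xr.open_datatree(
        session.store,
        engine="zarr",
        consolidated=False,
        group=group,
        decode_times=False,
    )
    time = raw["lithk"].ds["time"]
    print(group, time.dtype, time.attrs.get("units"), time.values[:3])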

@thomasteisberg

This is very cool! I think this is exactly the direction we want to be going.

fix_time_encodings is still a bit of a hack and it does not work for the model/experiment combination you happened to choose. I wasn't sure if this set of "fix" functions was going to be the right approach, but, if it is, I'm happy to invest more time in making sure that the fixes actually fix everything. Feel free to check out https://models.englacial.org/ for examples that do work.

For what it's worth, the sliders in both your code and my original notebook work for me. I use VS Code notebooks and run in the uv environment. Your notebook by default shows the base variable, which is bedrock elevation and doesn't evolve with time in any of the models. If you change it to something else (try lithk, land ice thickness, for example), then it works on my end -- but not for Shane on a JupyterHub instance. Not sure what's up with that, but, as you said, it's an unrelated issue.

My main question is about your note that loading the entire datatree in Xarray is not recommended. Naturally, I immediately tried to do exactly that -- and it's still running 12 minutes later. In theory, this doesn't seem like it should be slow. As you demonstrated, lazily loading the entire tree with Zarr is no problem. Why is this slow with Xarray and what would it take to fix?

Likely related: pydata/xarray#9511

In fact because it is possible to represent huge amounts of archival data with a single DataTree, someone will probably do something like attempt to represent the entire CMIP6 catalog as a DataTree and then complain after hitting a performance limit...

In our defense, the ISMIP6 catalog is a lot smaller than CMIP6 (<2 TB versus, I believe, >10 PB), but actually what Tom is describing is pretty much exactly what we want to do and doesn't inherently strike me as crazy. From our perspective, the reasons to want to enable loading the entire ISMIP6 catalog into one DataTree include:

  • Making it easy for users to understand what datasets are available without needing to interact with a separate library
  • Facilitating clean ways of writing transformations that can be lazily evaluated on different models/experiments

These things don't inherently require that we can load the whole catalog, but it certainly seems like the cleanest approach.

@espg

espg commented Dec 11, 2025

Possibly relevant discussion about this here. There is a flag to avoid / skip setting indices on dataset objects (which I'm guessing is what causes the hang) -- however, while that option is present for the datatree opener, I can't tell if it actually does anything for us, since we still hang when setting create_default_indexes=False.

My impression is that some of the issue here is that we aren't doing lazy loading (of coordinates, or data, or both) in xarray, but are doing lazy loading in zarr. @abarciauskas-bgse is this correct?

@jbusecke

Hey @espg I was just looking into this as well, and I think that you are definitely right. Xarray will load coordinates by default; Zarr will not (coordinates have no special meaning in Zarr and are basically 'just another array').
One compounding factor here is that each variable is stored with duplicated coordinates in its own group. Besides more general fixes and possible upstream improvements to opening deep group structures, I am working on restructuring the references into another icechunk store which has all variables of a given model/experiment merged into one group. From some preliminary experiments this could reduce the number of loaded variables by a factor of ~20, which I am hoping will make this particular data tree open in a more reasonable time.
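
Roughly what I have in mind, sketched against the current layout (not the actual pipeline code; the group and variable names are the ones used earlier in this thread):

import xarray as xr

# Current layout: each variable lives in its own subgroup with its own copy of
# x/y/time. Merging the children of one experiment into a single dataset means
# the coordinates are stored (and loaded) once instead of ~20 times.
exp = xr.open_datatree(
    session.store,          # read-only icechunk session from the earlier snippets
    engine="zarr",
    consolidated=False,
    group="CPOM_BISICLES/expT71_08",
)
merged = xr.merge(
    [child.ds for child in exp.children.values()],
    compat="override",      # coordinates are duplicated; keep the first copy
)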

I'll work more on that tomorrow and will post results.

@abarciauskas-bgse
Author

@espg @thomasteisberg I totally agree - it would be ideal if we could load the entire data tree using xarray.open_datatree.

And thanks for pointing out pydata/xarray#9511 and pydata/xarray#10579: it seems like an investigation into what is causing the slowness may be warranted before coming up with a more generalizable solution for deep datatrees.

I'm glad that @jbusecke is already investigating a solution with respect to reducing coordinates that could be helpful for this particular dataset, but we may also want to do a deeper dive into how to improve lazy loading performance for deep datatrees. What do you think, @thomasteisberg @espg @jbusecke @maxrjones? Tom N mentioned ASV...

Post-script: I was actually able to load the entire data tree. I waited around long enough to get an error about decode_times, and once I set that to False, I was able to load the entire datatree in 1 min 30 s. This actually surprised me, since this was on an AWS instance in us-west-1 while the data and icechunk store are in Google's us-west1. I thought it might be faster in Google Cloud in us-west1, but when I ran the same code on a Google Colab instance in us-west1 it took longer. This is obviously not a robust test, but it was interesting and gave me hope that some performance improvement might make for an acceptable user experience.

[Screenshots: datatree load timings from the AWS and Google Colab runs]

@maxrjones
Member

maxrjones commented Dec 12, 2025

Tom N mentioned ASV.

I've used asv for benchmarking (which will tell you if your code is slow), but not for profiling (which will tell you why your code is slow). There are lots of reasonably good tools out there for profiling - pyinstrument is a good starting point. I'd also start with some mental modeling to guide the profiling - how many requests and how much data transfer should be required to load your datatree? Based on that plus your network card, what is the "speed of light" for how fast you could possibly open the dataset? Are you approaching that speed?

You may also be able to get enough improvement from the available configuration options that upstream changes aren't needed. I would recommend trying to open the Zarr store with chunked_array_type="cubed" and setting zarr.config.set({'async.concurrency': 128}) to increase the concurrency used by Zarr and prevent competition between Zarr's and Dask's concurrency management. pydata/xarray#10327 also dramatically improves Zarr data loading - it's worth checking that a recent version of xarray is used and whether those improvements apply to open_datatree as well as open_dataset.
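
Concretely, something like this (a sketch; chunked_array_type="cubed" requires the cubed-xarray package and only applies when chunked loading is requested, and session is the icechunk session from the snippets above):

import xarray as xr
import zarr

# Give Zarr more concurrent requests so it isn't the bottleneck.
zarr.config.set({'async.concurrency': 128})

dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    chunks={},                    # request chunked (lazy) arrays...
    chunked_array_type="cubed",   # ...managed by cubed rather than dask
)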

@maxrjones
Member

This may be duplicative with prior comments and/or already checked, but it's also worth making sure that the native coordinates have large enough chunks (or shards, if used) that there aren't too many requests needed when loading coordinates.
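
A quick way to check (sketch, reusing the session and one of the group paths from the earlier snippets):

import zarr

# Inspect how the native coordinate arrays are chunked in one variable group;
# chunks much smaller than the array shape mean many extra requests per coordinate.
g = zarr.open_group(session.store, mode="r", path="CPOM_BISICLES/expT71_08/lithk")
for name in ("x", "y", "time"):
    arr = g[name]
    print(name, "shape:", arr.shape, "chunks:", arr.chunks)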

@abarciauskas-bgse
Author

abarciauskas-bgse commented Dec 12, 2025

@maxrjones thanks for the suggestions. I forgot to mention that I was already using a concurrency of 100. I added the chunked_array_type="cubed" argument and ensured I am running the latest version of xarray. Time to load the data tree on an AWS us-west-1 instance with 7.4 GB memory and 4 vCPUs was still about 90 seconds. 90 seconds actually seems acceptable to me, but I'm curious what others think is an acceptable load time. And I still think diagnostics on how time is spent would be very useful.

@jbusecke

90 seconds actually seems acceptable to me, but I'm curious what others think is an acceptable load time. And I still think diagnostics on how time is spent would be very useful.

I got about the same over wifi on my laptop, so we're probably not maxing out the I/O 😁. In general I think 90 s is acceptable-ish, but I am curious how much we can push this down with combined variables.

@espg

espg commented Dec 12, 2025

Excited to see what @jbusecke gets with his solution. Even aside from performance enhancements, merging coordinate systems among the experiment variables -- and treating them as variables in a single data array rather than separate data array objects -- seems like what we should be doing just to reduce cognitive load and make access cleaner / terser.

@abarciauskas-bgse I think that 90s is usable, but it's probably worth doing some profiling just to understand what and where the bottlenecks are for this. We can decide from there if it's worth the effort to address what's dominating the clock.

@maxrjones
Member

The Zarr store has 71609 nodes based on the script below. If we assume no metadata consolidation, 128-way concurrency, and a 100 ms cross-region RTT, the "speed of light" should be about 56 s for opening the full datatree (71609 requests / 128 concurrent ≈ 560 sequential round trips × 100 ms), which makes 90 s look alright but not great. However, from the icechunk docs, "consolidated metadata is unnecessary (and unsupported) in Icechunk. Icechunk already organizes the dataset metadata in a way that makes it very fast to fetch from storage." - we should look into this to understand how it impacts the speed of light.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "icechunk",
#     "zarr",
#     "rich"
# ]
# ///


import icechunk as ic
import zarr

zarr.config.set({'async.concurrency': 128})
storage = ic.gcs_storage(bucket="ismip6-icechunk", anonymous=True, prefix="12-07-2025")
repo = ic.Repository.open(storage)
session = repo.readonly_session("main")
g = zarr.open_group(session.store, mode="r")
print(g.nmembers(max_depth=None))

Here's a script for creating a flamegraph of the datatree opening step. FWIW this took 40s for me locally (yay, fiber 🚀)

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "icechunk",
#     "zarr",
#     "xarray",
#     "pyinstrument"
# ]
# ///


import icechunk as ic
import xarray as xr
import zarr
import warnings
from pyinstrument import Profiler

warnings.filterwarnings(
  "ignore",
  message="Numcodecs codecs are not in the Zarr version 3 specification*",
  category=UserWarning
)

zarr.config.set({'async.concurrency': 128})

storage = ic.gcs_storage(bucket="ismip6-icechunk", anonymous=True, prefix="12-07-2025")
repo = ic.Repository.open(storage)
session = repo.readonly_session("main")

with Profiler() as profiler:
    dt = xr.open_datatree(session.store, engine="zarr", create_default_indexes=False, decode_times=False)

profiler.write_html("profile.html")

@TomNicholas

Hello everyone 👋

On performance

Even without profiling, it seems highly likely that pydata/xarray#10579 is responsible for the slowdown here. An easy way to confirm this would be to create a test zarr store that has a large number of tiny (but still 1D) coordinate variables in different groups, and see how long it takes to open as a function of the number of nodes. I suspect you will see a graph like the one in this issue (that issue is for the opposite problem - saving a datatree with many nodes). Then it should be clear what xarray improvement is needed to speed this up (I bet it's adding concurrency).
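
A minimal version of that experiment might look like this (my sketch; plain local Zarr stores rather than Icechunk, just to isolate the xarray-side cost):

import time
import numpy as np
import xarray as xr

def build_store(path, n_groups):
    # One tiny 1D coordinate per group, mimicking the many-small-coords layout.
    # (Writing groups one at a time is slow, but fine for a one-off test.)
    for i in range(n_groups):
        ds = xr.Dataset(coords={"x": ("x", np.arange(10))})
        ds.to_zarr(path, group=f"group_{i:05d}", mode="w" if i == 0 else "a")

for n in (10, 100, 1000):
    path = f"many_groups_{n}.zarr"
    build_store(path, n)
    t0 = time.perf_counter()
    xr.open_datatree(path, engine="zarr", consolidated=False)
    print(f"{n} groups: {time.perf_counter() - t0:.2f} s")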

On schema

Stepping back, and reading about your goals as described by @jbusecke here and @thomasteisberg here, I can't help but wonder if a more datacollections-like schema would make more sense here. My understanding is that you:

  1. want users to be able to run 1 line of code which gives them an overview of all the different metadata in the store, e.g. group names, their structure, and variable names.
  2. want users to be able to quickly get a DataTree of whatever subset of that overall structure they care about, so that they can use map_over_subtree and other xarray primitives to define analysis across groups.

I feel like you could get both of those things by instead writing a flat (or even still nested) group structure, serializing information about that group structure into the top-level group, and helping users open that as their first call. Then you also provide a method to facilitate querying that index of groups (with pandas or SQL or both), and creating a DataTree out of only the results.

In other words, you do something like what @sharkinsspatial and Kyle did in https://github.com/developmentseed/zarr-datafusion-search-examples, but for MIP-like structures. You could think of this as "the contents of an intake catalog csv, but stored in the root group of a zarr store".

# library code
import pandas as pd

class MIPCollection:
    df: pd.DataFrame

    def __init__(self, store_url):
        # opens the root group of the store and populates a pandas dataframe
        # following a MIP-aware schema (or creates the datafusion context Kyle made)
        ...

# user code
cmip6_collection = MIPCollection(store_url)

# prints a dataframe of all group names with relevant metadata about the contents of each group
cmip6_collection.display()

# returns an xr.DataTree comprising just the groups the user is interested in
# (this might be bad SQL - I don't know SQL 🤷‍♂️)
# (could also be a dataframe API, or both)
dt = cmip6_collection.sql_query(
    "SELECT * FROM icechunk_data WHERE model = 'CESM'"
).to_datatree()

(This of course assumes that asking users to open/query with a more specialized open-source library rather than just xarray is acceptable to you.)

This would work with normal zarr, but the advantage of doing it with Icechunk would be transactions to help guarantee that if you change the data in the groups you always keep the metadata in the root group consistent.

DataTree performance work would still be worthwhile in this model, but presumably users aren't often trying to do analysis on the entire archive, so opening only a subset will be much faster even with no upstream optimizations.
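
For example (reusing the session from the snippets above), pulling a single model's subtree already avoids most of the cost:

import xarray as xr

# Open one model's subtree instead of the whole archive; only that group's
# metadata and coordinates need to be fetched.
dt_bisicles = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group="CPOM_BISICLES",
)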

@abarciauskas-bgse abarciauskas-bgse merged commit d78224a into main Dec 22, 2025
martham93 pushed a commit that referenced this pull request Dec 22, 2025
Prototype gcp lithops for combined variables