
Conversation

@abarciauskas-bgse

This PR adds a script and documentation to create an icechunk virtual store using lithops.

cc @jbusecke

@espg

espg commented Dec 9, 2025

@abarciauskas-bgse , can you update the gs://ismip6-icechunk bucket to allow anonymous access? Right now the following fails:

icechunk_storage = icechunk.gcs_storage(
    bucket="ismip6-icechunk",
    prefix="12-07-2025",
    anonymous=True
)

I tried testing on a JupyterHub instance and haven't had a chance yet to set up the Google Cloud SDK (gcloud) and provide credentials provisioned via from_env=True. But I anticipate that when we set this pipeline up, we'll want the access pattern to allow non-credentialed / anonymous access.
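
For reference, the credentialed path I had in mind is roughly this (untested sketch; assumes application-default credentials are already provisioned via gcloud):

import icechunk

# Untested sketch: pick up GCS credentials from the environment (e.g. gcloud
# application-default credentials) instead of using anonymous access.
icechunk_storage = icechunk.gcs_storage(
    bucket="ismip6-icechunk",
    prefix="12-07-2025",
    from_env=True,
)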

@abarciauskas-bgse
Author

@espg yes I agree, this anonymous access should work. I'm looking into the org-level policies that may be preventing this and should hopefully have it fixed soon!

@abarciauskas-bgse
Author

@espg could you try again now? A few of us have tested with anonymous=True and opening and reading from the store appears to be working via anonymous access.

@espg

espg commented Dec 11, 2025

Great, seems like it's set up to work publicly now! I was playing around with it a little bit, and it seems like this cell:

base_array.hvplot.image(
    x='x', 
    y='y', 
    clim=(vmin, vmax), 
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)

...is able to produce the map with the correct axis labels and colorbar, but the slider widget doesn't actually update anything when you select a different time slice. @abarciauskas-bgse do you know if this is an artifact of the Zarr virtual store, or is it a bug in HoloViews?

It also looks like there's some inconsistency in how datetimes are being parsed. My assumption is that some of the ISMIP contributors are encoding dates/times correctly (i.e., PISM) and some (CPOM_BISICLES, which is the default shown in the notebook) are using something that isn't getting cast to a proper datetime object. I think these sorts of mismatches are exactly the kind of thing @thomasteisberg had mentioned wanting to detect and flag. I'm guessing we can fix it in the virtualization layer declaratively... any thoughts on how to do that without it becoming a game of whack-a-mole? Can we check for proper time encoding and patch fixes automatically when we don't get a valid datetime array?
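
Very roughly, I'm imagining a check along these lines (untested sketch; it assumes each group is opened with decode_times=False so the raw encoding is still visible, and that anything that won't decode to a datetime-like array gets flagged for patching):

import numpy as np
import xarray as xr

def time_axis_decodes(ds_raw):
    # ds_raw: a group opened with decode_times=False, so the raw units/calendar
    # attributes and numeric values are still visible.
    try:
        decoded = xr.decode_cf(ds_raw[["time"]])
    except (ValueError, KeyError):
        return False
    # Accept numpy datetime64 or cftime object arrays; anything else
    # (e.g. bare floats of day-of-year) should be flagged and patched.
    dtype = decoded["time"].dtype
    return np.issubdtype(dtype, np.datetime64) or dtype == object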

@abarciauskas-bgse
Author

is able to produce the map with the correct axis labels and colorbar, but the slider widget doesn't actually update anything when you select a different time slice

Ah yes, apologies for not commenting on this one: I experienced the same issue (time slider not updating the plot) with the code and data source in https://github.com/englacial/ismip-indexing/blob/main/notebooks/working_with_ismip6_examples.ipynb. I haven't been able to track down the issue yet, but it's almost certainly not a result of the virtual store, since I experienced the same issue using the in-memory data tree in the working_with_ismip6_examples.ipynb notebook, i.e.:

delta_thickness.hvplot.image(x='x', y='y', clim=(-vmag, vmag), cmap='RdBu').opts(
        aspect='equal',
        title="Change in ice thickness relative to the first timestep",
        colorbar_opts={'title': 'Change in thickness (m)'},
    )

Can you say more about what you are seeing with respect to time encodings? I haven't looked deeply into the individual datasets, but I am using the fix_time_encodings helper function and storing the time coordinates as native Zarr chunks. I think that function only changes the time attributes, though, so I wouldn't be surprised if there are changes we want to make to the time values themselves.

@espg

espg commented Dec 11, 2025

It looks like the viewer not updating is some dependency issue that's specific to JupyterHub -- Thomas is able to get the slider to update when he runs it in VS Code, but both you and I apparently can't in Jupyter, even though all three of us are referencing the same uv file to build the environment.

For the time encoding issue, check the slider values we get here:

dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group='CPOM_BISICLES/expT71_08'
)

base_array = dt['lithk'].lithk
vmin, vmax = base_array.quantile([0.01, 0.99]).values

base_array.hvplot.image(
    x='x', 
    y='y', 
    clim=(vmin, vmax), 
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)

...vs. when we switch to PISM here:

dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group='AWI_PISM1/exp08'
)

base_array = dt['lithk'].lithk
vmin, vmax = base_array.quantile([0.01, 0.99]).values

base_array.hvplot.image(
    x='x', 
    y='y', 
    clim=(vmin, vmax), 
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)

The second case with PISM gives us proper datetimes in the slider (and in the xarray object), but CPOM_BISICLES has decided to encode time as something non-standard, apparently floats of day-of-year.
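
The raw encodings make the mismatch obvious if you open the same two groups with decode_times=False and compare the time attributes (rough sketch, reusing the session and group paths from above):

import xarray as xr

# decode_times=False exposes the on-disk encoding instead of whatever xarray
# manages (or fails) to decode; `session` is the icechunk session from above.
for group in ("CPOM_BISICLES/expT71_08", "AWI_PISM1/exp08"):
    raw = xr.open_datatree(
        session.store,
        engine="zarr",
        consolidated=False,
        group=group,
        decode_times=False,
    )
    time = raw["lithk"].ds["time"]
    print(group, time.dtype, time.attrs.get("units"), time.values[:3])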

@thomasteisberg

This is very cool! I think this is exactly the direction we want to be going.

fix_time_encodings is still a bit of a hack and it does not work for the model/experiment combination you happened to choose. I wasn't sure if this set of "fix" functions was going to be the right approach, but, if it is, I'm happy to invest more time in making sure that the fixes actually fix everything. Feel free to check out https://models.englacial.org/ for examples that do work.

For what it's worth, the sliders in both your code and my original notebook work for me. I use VS Code notebooks and run in the uv environment. Your notebook by default shows the base variable, which is bedrock elevation and doesn't evolve with time in any of the models. If you change it to something else (try lithk, land ice thickness, for example), then it works on my end -- but not for Shane on a JupyterHub instance. Not sure what's up with that, but, as you said, it's an unrelated issue.

My main question is about your note that loading the entire datatree in Xarray is not recommended. Naturally, I immediately tried to do exactly that -- and it's still running 12 minutes later. In theory, this doesn't seem like it should be slow. As you demonstrated, lazily loading the entire tree with Zarr is no problem. Why is this slow with Xarray and what would it take to fix?

Likely related: pydata/xarray#9511

In fact because it is possible to represent huge amounts of archival data with a single DataTree, someone will probably do something like attempt to represent the entire CMIP6 catalog as a DataTree and then complain after hitting a performance limit...

In our defense, the ISMIP6 catalog is a lot smaller than CMIP6 (<2 TB versus, I believe, >10 PB), but actually what Tom is describing is pretty much exactly what we want to do and doesn't inherently strike me as crazy. From our perspective, the reasons to want to enable loading the entire ISMIP6 catalog into one DataTree include:

  • Making it easy for users to understand what datasets are available without needing to interact with a separate library
  • Facilitating clean ways of writing transformations that can be lazily evaluated on different models/experiments

These things don't inherently require that we can load the whole catalog, but it certainly seems like the cleanest approach.

@espg

espg commented Dec 11, 2025

Possibly relevant discussion about this here. There is a flag to avoid / skip setting indices on dataset objects (which I'm guessing is what causes the hang) -- however, while that option is present for the datatree opener, I can't tell if it actually does anything for us, since we still hang when setting create_default_indexes=False.

My impression is that some of the issue here is that we aren't doing lazy loading (of coordinates, or data, or both) in xarray, but are doing lazy loading in zarr. @abarciauskas-bgse is this correct?

@jbusecke

Hey @espg I was just looking into this as well, and I think that you are definitely right. Xarray will load coordinates by default; Zarr will not (coordinates have no special meaning in Zarr and are basically 'just another array').
One compounding factor here is that each variable is stored with duplicated coordinates in its own group. Besides more general fixes and possible upstream improvements to opening deep group structures, I am working on restructuring the references into another icechunk store which has all variables of a given model/experiment merged into one group. From some preliminary experiments this could reduce the number of loaded variables by a factor of ~20, which I am hoping will make this particular data tree open in a more reasonable time.
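
Roughly what I have in mind, sketched against the current layout (not the actual pipeline code; the group and variable names are the ones used earlier in this thread):

import xarray as xr

# Current layout: each variable lives in its own subgroup with its own copy of
# x/y/time. Merging the children of one experiment into a single dataset means
# the coordinates are stored (and loaded) once instead of ~20 times.
exp = xr.open_datatree(
    session.store,          # read-only icechunk session from the earlier snippets
    engine="zarr",
    consolidated=False,
    group="CPOM_BISICLES/expT71_08",
)
merged = xr.merge(
    [child.ds for child in exp.children.values()],
    compat="override",      # coordinates are duplicated; keep the first copy
)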

I'll work more on that tomorrow and will post results.

@abarciauskas-bgse
Author

@espg @thomasteisberg I totally agree - it would be ideal if we could load the entire data tree using xarray.open_datatree.

And thanks for pointing out pydata/xarray#9511 and pydata/xarray#10579: it seems like an investigation into what is causing the slowness may be warranted before coming up with a more generalizable solution for deep datatrees.

I'm glad that @jbusecke is already investigating a solution with respect to reducing coordinates that could be helpful for this particular dataset, but we may also want to do a deeper dive into how to improve lazy loading performance for deep datatrees. What do you think, @thomasteisberg @espg @jbusecke @maxrjones? Tom N mentioned ASV...

Post-script: I was actually able to load the entire data tree. I waited around long enough to get an error about decode_times, and once I set that to False, I was able to load the entire datatree in 1 min 30 s. This actually surprised me, since this was on an AWS instance in us-west-1 while the data and icechunk store are in Google's us-west1. I thought it might be faster in Google Cloud in us-west1, but when I ran the same code on a Google Colab instance in us-west1 it took longer. This is obviously not a robust test, but it was interesting and gave me hope that some performance improvement might make for an acceptable user experience.

[Screenshots: datatree load timings from the AWS and Google Colab runs]

@maxrjones
Member

maxrjones commented Dec 12, 2025

Tom N mentioned ASV.

I've used asv for benchmarking (which will tell you if your code is slow), but not for profiling (which will tell you why your code is slow). There are lots of reasonably good tools out there for profiling - pyinstrument is a good starting point. I'd also start with some mental modeling to guide the profiling - how many requests and how much data transfer should be required to load your datatree? Based on that plus your network card, what is the "speed of light" for how fast you could possibly open the dataset? Are you approaching that speed?

You may also be able to get enough improvement from the available configuration options that upstream changes aren't needed. I would recommend trying to open the Zarr store with chunked_array_type="cubed" and setting zarr.config.set({'async.concurrency': 128}) to increase the concurrency used by Zarr and prevent competition between Zarr's and Dask's concurrency management. pydata/xarray#10327 also dramatically improves Zarr data loading - it's worth checking that a recent version of xarray is used and whether those improvements apply to open_datatree as well as open_dataset.
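
Concretely, something like this (a sketch; chunked_array_type="cubed" requires the cubed-xarray package and only applies when chunked loading is requested, and session is the icechunk session from the snippets above):

import xarray as xr
import zarr

# Give Zarr more concurrent requests so it isn't the bottleneck.
zarr.config.set({'async.concurrency': 128})

dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    chunks={},                    # request chunked (lazy) arrays...
    chunked_array_type="cubed",   # ...managed by cubed rather than dask
)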

@maxrjones
Member

This may be duplicative with prior comments and/or already checked, but it's also worth making sure that the native coordinates have large enough chunks (or shards, if used) that there aren't too many requests needed when loading coordinates.
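
A quick way to check (sketch, reusing the session and one of the group paths from the earlier snippets):

import zarr

# Inspect how the native coordinate arrays are chunked in one variable group;
# chunks much smaller than the array shape mean many extra requests per coordinate.
g = zarr.open_group(session.store, mode="r", path="CPOM_BISICLES/expT71_08/lithk")
for name in ("x", "y", "time"):
    arr = g[name]
    print(name, "shape:", arr.shape, "chunks:", arr.chunks)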

@abarciauskas-bgse
Author

abarciauskas-bgse commented Dec 12, 2025

@maxrjones thanks for the suggestions. I forgot to mention that I was already using a concurrency of 100. I added the chunked_array_type="cubed" argument and ensured I am running the latest version of xarray. Time to load the data tree on an AWS us-west-1 instance with 7.4 GB memory and 4 vCPUs was still about 90 seconds. 90 seconds actually seems acceptable to me, but I'm curious what others think is an acceptable load time. And I still think diagnostics on how time is spent would be very useful.

@jbusecke

90 seconds actually seems acceptable to me, but I'm curious what others think is an acceptable load time. And I still think diagnostics on how time is spent would be very useful.

I got about the same over wifi on my laptop, so we're probably not maxing out the I/O 😁. In general I think 90 s is acceptable-ish, but I am curious how much we can push this down with combined variables.

@espg

espg commented Dec 12, 2025

Excited to see what @jbusecke gets with his solution. Even aside from performance enhancements, merging coordinate systems among the experiment variables -- and treating them as variables in a single data array rather than separate data array objects -- seems like what we should be doing just to reduce cognitive load and make access cleaner / terser.

@abarciauskas-bgse I think that 90s is usable, but it's probably worth doing some profiling just to understand what and where the bottlenecks are for this. We can decide from there if it's worth the effort to address what's dominating the clock.

@maxrjones
Member

The Zarr store has 71609 nodes based on the script below. If we assume no metadata consolidation, 128-way concurrency, and a 100 ms cross-region RTT, the "speed of light" should be about 56 s for opening the full datatree (71609 requests / 128 concurrent ≈ 560 sequential round trips × 100 ms), which makes 90 s look alright but not great. However, from the icechunk docs, "consolidated metadata is unnecessary (and unsupported) in Icechunk. Icechunk already organizes the dataset metadata in a way that makes it very fast to fetch from storage." - we should look into this to understand how it impacts the speed of light.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "icechunk",
#     "zarr",
#     "rich"
# ]
# ///


import icechunk as ic
import zarr

zarr.config.set({'async.concurrency': 128})
storage = ic.gcs_storage(bucket="ismip6-icechunk", anonymous=True, prefix="12-07-2025")
repo = ic.Repository.open(storage)
session = repo.readonly_session("main")
g = zarr.open_group(session.store, mode="r")
print(g.nmembers(max_depth=None))

Here's a script for creating a flamegraph of the datatree opening step. FWIW this took 40s for me locally (yay, fiber 🚀)

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "icechunk",
#     "zarr",
#     "xarray",
#     "pyinstrument"
# ]
# ///


import icechunk as ic
import xarray as xr
import zarr
import warnings
from pyinstrument import Profiler

warnings.filterwarnings(
  "ignore",
  message="Numcodecs codecs are not in the Zarr version 3 specification*",
  category=UserWarning
)

zarr.config.set({'async.concurrency': 128})

storage = ic.gcs_storage(bucket="ismip6-icechunk", anonymous=True, prefix="12-07-2025")
repo = ic.Repository.open(storage)
session = repo.readonly_session("main")

with Profiler() as profiler:
    dt = xr.open_datatree(session.store, engine="zarr", create_default_indexes=False, decode_times=False)

profiler.write_html("profile.html")

@TomNicholas

Hello everyone 👋

On performance

Even without profiling, it seems highly likely that pydata/xarray#10579 is responsible for the slowdown here. An easy way to confirm this would be to create a test zarr store that has a large number of tiny (but still 1D) coordinate variables in different groups, and see how long it takes to open as a function of the number of nodes. I suspect you will see a graph like the one in this issue (that issue is for the opposite problem - saving a datatree with many nodes). Then it should be clear what xarray improvement is needed to speed this up (I bet it's adding concurrency).
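
A minimal version of that experiment might look like this (my sketch; plain local Zarr stores rather than Icechunk, just to isolate the xarray-side cost):

import time
import numpy as np
import xarray as xr

def build_store(path, n_groups):
    # One tiny 1D coordinate per group, mimicking the many-small-coords layout.
    # (Writing groups one at a time is slow, but fine for a one-off test.)
    for i in range(n_groups):
        ds = xr.Dataset(coords={"x": ("x", np.arange(10))})
        ds.to_zarr(path, group=f"group_{i:05d}", mode="w" if i == 0 else "a")

for n in (10, 100, 1000):
    path = f"many_groups_{n}.zarr"
    build_store(path, n)
    t0 = time.perf_counter()
    xr.open_datatree(path, engine="zarr", consolidated=False)
    print(f"{n} groups: {time.perf_counter() - t0:.2f} s")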

On schema

Stepping back, and reading about your goals as described by @jbusecke here and @thomasteisberg here, I can't help but wonder if a more datacollections-like schema would make more sense here. My understanding is that you:

  1. want users to be able to run 1 line of code which gives them an overview of all the different metadata in the store, e.g. group names, their structure, and variable names.
  2. want users to be able to quickly get a DataTree of whatever subset of that overall structure they care about, so that they can use map_over_subtree and other xarray primitives to define analysis across groups.

I feel like you could get both of those things by instead writing a flat (or even still nested) group structure, serializing information about that group structure into the top-level group, and helping users open that as their first call. Then you also provide a method to facilitate querying that index of groups (with pandas or SQL or both), and creating a DataTree out of only the results.

In other words, you do something like what @sharkinsspatial and Kyle did in https://github.com/developmentseed/zarr-datafusion-search-examples, but for MIP-like structures. You could think of this as "the contents of an intake catalog csv, but stored in the root group of a zarr store".

# library code
import pandas as pd

class MIPCollection:
    df: pd.DataFrame

    def __init__(self, store_url):
        # opens the root group of the store and populates a pandas dataframe
        # following a MIP-aware schema (or creates the datafusion context Kyle made)
        ...

# user code
cmip6_collection = MIPCollection(store_url)

# prints a dataframe of all group names with relevant metadata about the contents of each group
cmip6_collection.display()

# returns an xr.DataTree comprising just the groups the user is interested in
# (this might be bad SQL - I don't know SQL 🤷‍♂️)
# (could also be a dataframe API, or both)
dt = cmip6_collection.sql_query(
    "SELECT * FROM icechunk_data WHERE model = 'CESM'"
).to_datatree()

(This of course assumes that asking users to open/query with a more specialized open-source library rather than just xarray is acceptable to you.)

This would work with normal zarr, but the advantage of doing it with Icechunk would be transactions to help guarantee that if you change the data in the groups you always keep the metadata in the root group consistent.

DataTree performance work would still be worthwhile in this model, but presumably users aren't often trying to do analysis on the entire archive, so opening only a subset will be much faster even with no upstream optimizations.
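
For example (reusing the session from the snippets above), pulling a single model's subtree already avoids most of the cost:

import xarray as xr

# Open one model's subtree instead of the whole archive; only that group's
# metadata and coordinates need to be fetched.
dt_bisicles = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group="CPOM_BISICLES",
)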

@abarciauskas-bgse abarciauskas-bgse merged commit d78224a into main Dec 22, 2025
martham93 pushed a commit that referenced this pull request Dec 22, 2025
Prototype gcp lithops for combined variables