Virtualize with lithops #1
Conversation
|
@abarciauskas-bgse, can you update the gs://ismip6-icechunk bucket to allow anonymous access? Right now the following fails:

```python
icechunk_storage = icechunk.gcs_storage(
    bucket="ismip6-icechunk",
    prefix="12-07-2025",
    anonymous=True
)
```

I tried testing on a JupyterHub instance and haven't had a chance to set up the Google Cloud SDK (gcloud) there, so I don't have gcloud credentials provisioned.
|
@espg yes I agree, this anonymous access should work. I'm looking into the org-level policies that may be preventing this and should hopefully have it fixed soon! |
|
@espg could you try again now? A few of us have tested with |
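For concreteness, a minimal sketch of the kind of anonymous-access check being described, reusing the storage configuration from earlier in the thread (the repo/session calls mirror the profiling script later in this conversation):

```python
import icechunk

# Anonymous (unauthenticated) read access to the public bucket
storage = icechunk.gcs_storage(
    bucket="ismip6-icechunk",
    prefix="12-07-2025",
    anonymous=True,
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")   # read-only session on the main branch
print(session.store)                      # zarr-compatible store, ready for xr.open_datatree
```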
|
Great, seems like it's set up to work publicly now! I was playing around with it a little bit, and it seems like this cell:

```python
base_array.hvplot.image(
    x='x',
    y='y',
    clim=(vmin, vmax),
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)
```

...is able to produce the map with the correct axis labels and colorbar, but the slider widget doesn't actually update anything when you select a different time slice. @abarciauskas-bgse do you know if this is an artifact of the zarr virtual store, or is this a bug in holoviews?

It also looks like there's some inconsistency in how datetimes are being parsed. My assumption is that some of the ISMIP contributors are encoding date/times correctly (i.e., PISM) and some (CPOM_BISICLES, which is the default shown in the notebook) are using something that isn't getting cast to a correct datetime object. I think these sorts of mismatches are the type of thing @thomasteisberg had mentioned wanting to detect and flag. I'm guessing we can fix it in the virtualization layer declaratively... any thoughts on how to do that without it becoming a game of whack-a-mole? Can we check for proper time encoding and automatically patch fixes when we don't get a valid datetime array?
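One possible shape for such a check (a sketch only, not something the repo currently does): open each group with decoding turned off and flag any time variable whose units attribute doesn't follow the CF "&lt;units&gt; since &lt;reference date&gt;" pattern. The group path is whatever node actually holds the time coordinate.

```python
import xarray as xr

def check_time_encoding(store, group):
    """Report whether a group's raw time units look CF-decodable.

    Returns the raw units/calendar attrs so non-standard encodings can be
    flagged (or patched) before virtualization.
    """
    ds = xr.open_dataset(store, engine="zarr", group=group,
                         consolidated=False, decode_times=False)
    time = ds["time"]
    units = str(time.attrs.get("units", ""))
    calendar = time.attrs.get("calendar", "standard")
    looks_cf = " since " in units  # e.g. "days since 2015-01-01"
    return {"group": group, "units": units, "calendar": calendar, "cf_decodable": looks_cf}
```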
Ah yes, apologies for not commenting on this one: I experienced the same issue (time slider not updating the plot) with the code and data source in https://github.com/englacial/ismip-indexing/blob/main/notebooks/working_with_ismip6_examples.ipynb. I haven't been able to track down the issue yet, but it's almost certainly not caused by the virtual store, since I hit the same issue using the in-memory data tree in the working_with_ismip6_examples.ipynb notebook, i.e.:

```python
delta_thickness.hvplot.image(x='x', y='y', clim=(-vmag, vmag), cmap='RdBu').opts(
    aspect='equal',
    title="Change in ice thickness relative to the first timestep",
    colorbar_opts={'title': 'Change in thickness (m)'},
)
```

Can you say more about what you are seeing with respect to the time encodings? I haven't looked deeply into the individual datasets but am using the
|
It looks like the viewer not updating is some dependency issue that's specific to JupyterHub -- Thomas is able to get the slider to update when he runs it in VSCode, but both you and I apparently can't in Jupyter, even though all three of us are referencing the same uv file to build the environment. For the time encoding issue, check the slider values we get here:

```python
dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group='CPOM_BISICLES/expT71_08'
)
base_array = dt['lithk'].lithk
vmin, vmax = base_array.quantile([0.01, 0.99]).values
base_array.hvplot.image(
    x='x',
    y='y',
    clim=(vmin, vmax),
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)
```

...vs. when we switch to PISM here:

```python
dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    group='AWI_PISM1/exp08'
)
base_array = dt['lithk'].lithk
vmin, vmax = base_array.quantile([0.01, 0.99]).values
base_array.hvplot.image(
    x='x',
    y='y',
    clim=(vmin, vmax),
    cmap='RdBu',
    groupby='time',
    widget_location='bottom',
    aspect='equal',
    title=(" ").join(base_array.attrs['standard_name'].split("_"))
)
```

The second case with PISM gives us proper datetimes in the slider (and in the xarray object), but CPOM_BISICLES has decided to encode time as something non-standard, like floats of day-of-year.
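A quick way to see the difference side by side (a sketch, reusing the session and the same access pattern as the plotting code above): open both groups with `decode_times=False` and print the raw time encodings.

```python
import xarray as xr

for group in ["CPOM_BISICLES/expT71_08", "AWI_PISM1/exp08"]:
    node = xr.open_datatree(session.store, engine="zarr", consolidated=False,
                            group=group, decode_times=False)
    t = node["lithk"].lithk["time"]  # same access pattern as the plotting cells above
    # PISM should report CF-style units ("days since ..."); BISICLES apparently does not
    print(group, t.dtype, t.attrs.get("units"), t.attrs.get("calendar"))
```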
|
This is very cool! I think this is exactly the direction we want to be going.

For what it's worth, the sliders in both your code and my original notebook work for me. I use VSCode notebooks and run in the uv environment. Your notebook by default shows the

My main question is about your note that loading the entire datatree in Xarray is not recommended. Naturally, I immediately tried to do exactly that -- and it's still running 12 minutes later. In theory, this doesn't seem like it should be slow. As you demonstrated, lazily loading the entire tree with Zarr is no problem. Why is this slow with Xarray, and what would it take to fix it? Likely related: pydata/xarray#9511
In our defense, the ISMIP6 catalog is a lot smaller than CMIP6 (<2 TB versus, I believe, >10 PB), but actually what Tom is describing is pretty much exactly what we want to do and doesn't inherently strike me as crazy. From our perspective, the reasons to want to enable loading the entire ISMIP6 catalog into one DataTree include:
These things don't inherently require that we can load the whole catalog, but it certainly seems like the cleanest approach. |
|
Possibly relevant discussion about this here. There is a flag to avoid / skip setting indices on

My impression is that some of the issue here is that we aren't doing lazy loading (of coordinates, or data, or both) in xarray, but are doing lazy loading in zarr. @abarciauskas-bgse is this correct?
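Presumably the flag being alluded to is `create_default_indexes` (it appears in the profiling script later in this thread); a sketch of using it:

```python
import xarray as xr

# Skip building in-memory pandas indexes for every coordinate in every group;
# coordinate data then stays as lazy zarr-backed arrays until it is actually used.
dt = xr.open_datatree(
    session.store,
    engine="zarr",
    consolidated=False,
    create_default_indexes=False,
)
```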
|
Hey @espg I was just looking into this as well, and I think that you are definitely right. Xarray will load coordinates by default; zarr will not (coordinates don't have any special meaning in zarr and are basically 'just another array'). I'll work more on that tomorrow and will post results.
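A small illustration of the two default behaviors being described (a sketch, not a benchmark; the group path reuses the node from the plotting code above):

```python
import zarr
import xarray as xr

# zarr: coordinates have no special meaning; they're just arrays, and no chunk
# data is fetched until you actually index one
root = zarr.open_group(store=session.store, mode="r")
lithk_group = root["CPOM_BISICLES/expT71_08/lithk"]   # nothing read yet

# xarray: by default, dimension coordinates (x, y, time, ...) are read eagerly
# when the dataset is opened, in order to build in-memory pandas indexes
ds = xr.open_dataset(session.store, engine="zarr", consolidated=False,
                     group="CPOM_BISICLES/expT71_08/lithk", decode_times=False)
```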
|
@espg @thomasteisberg I totally agree - it would be ideal if we could load the entire data tree using xarray.open_datatree. And thanks for pointing out pydata/xarray#9511 and pydata/xarray#10579: it seems like an investigation into what is causing the slowness may be warranted before coming up with a more generalizable solution for deep datatrees. I'm glad that @jbusecke is already investigating a solution with respect to reducing coordinates that could be helpful for this particular dataset, but we might also consider doing a deeper dive into how to improve lazy loading performance for deep datatrees - what do you think @thomasteisberg @espg @jbusecke @maxrjones? Tom N mentioned ASV...

Post-script: I was actually able to load the entire data tree. I waited around long enough to get an error about decode_times, and once I set that to
|
I've used asv for benchmarking (which will tell you if your code is slow), but not for profiling (which will tell you why your code is slow). There are lots of reasonably good tools out there for profiling - pyinstrument is a good starting point. I'd also start with some mental modeling to guide the profiling: how many requests and how much data transfer should be required to load your datatree? Based on that plus your network card, what is the "speed of light" for how fast you could possibly open the dataset? Are you approaching that speed?

You may also be able to get enough improvement from the available configuration options to not have to worry about upstream changes. I would recommend trying to open the zarr store with
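One of the configuration knobs that shows up later in the thread is zarr's async request concurrency; a minimal sketch of bumping it before opening the store:

```python
import zarr

# Allow more concurrent object-store requests when fetching metadata and chunks
zarr.config.set({"async.concurrency": 128})
```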
|
This may be duplicative with prior comments and/or already checked, but it's also worth making sure that the native coordinates have large enough chunks (or shards, if used) that there aren't too many requests needed when loading coordinates. |
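A rough way to sanity-check this (a sketch, assuming the read-only icechunk session from earlier in the thread): walk a group with zarr and report how many chunks each 1D, coordinate-like array is split into.

```python
import zarr

def report_coord_chunks(group, prefix=""):
    """Print chunking info for 1D (coordinate-like) arrays in a zarr group tree."""
    for name, arr in group.arrays():
        if arr.ndim == 1:
            nchunks = -(-arr.shape[0] // arr.chunks[0])  # ceiling division
            print(f"{prefix}{name}: shape={arr.shape}, chunks={arr.chunks}, n_chunks={nchunks}")
    for name, sub in group.groups():
        report_coord_chunks(sub, prefix=f"{prefix}{name}/")

root = zarr.open_group(store=session.store, mode="r")
report_coord_chunks(root["CPOM_BISICLES/expT71_08"])
```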
|
@maxrjones thanks for the suggestions. I forgot to mention that I was already using a concurrency of 100. I added the |
I got about the same over wifi on my laptop, so we're probably not maxing out the IO 😁. In general I think 90s is acceptable-ish, but I am curious how much we can push this down with combined variables.
|
Excited to see what @jbusecke gets with his solution. Even aside from performance enhancements, merging coordinate systems among the experiment variables -- and treating them as variables in a single data array rather than separate data array objects -- seems like what we should be doing just to reduce cognitive load and make access cleaner / terser.

@abarciauskas-bgse I think that 90s is usable, but it's probably worth doing some profiling just to understand what and where the bottlenecks are. We can decide from there whether it's worth the effort to address what's dominating the clock.
|
The zarr store has 71609 nodes based on the script below. If we assume no metadata consolidation, 128-way concurrency, and a 100 ms cross-region RTT, the "speed of light" should be about 56 s for opening the full datatree, which makes 90 s look alright but not great. However, from the icechunk docs, "consolidated metadata is unnecessary (and unsupported) in Icechunk. Icechunk already organizes the dataset metadata in a way that makes it very fast to fetch from storage." - we should look into this to understand how it impacts the speed of light.

Here's a script for creating a flamegraph of the datatree opening step. FWIW this took 40 s for me locally (yay, fiber 🚀):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "icechunk",
#     "zarr",
#     "xarray",
#     "pyinstrument"
# ]
# ///
import warnings

import icechunk as ic
import xarray as xr
import zarr
from pyinstrument import Profiler

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning
)

zarr.config.set({'async.concurrency': 128})

storage = ic.gcs_storage(bucket="ismip6-icechunk", anonymous=True, prefix="12-07-2025")
repo = ic.Repository.open(storage)
session = repo.readonly_session("main")

with Profiler() as profiler:
    dt = xr.open_datatree(session.store, engine="zarr", create_default_indexes=False, decode_times=False)

profiler.write_html("profile.html")
```
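For reference, here is how the 56 s "speed of light" figure above falls out of the stated assumptions (one metadata request per node, requests issued 128 at a time, 100 ms per round trip) - a back-of-the-envelope sketch:

```python
n_nodes = 71_609        # groups + arrays reported by the script above
concurrency = 128       # zarr async.concurrency
rtt = 0.100             # seconds per cross-region round trip

batches = -(-n_nodes // concurrency)    # ceil(71609 / 128) = 560 sequential waves of requests
print(f"lower bound: {batches * rtt:.0f} s")   # ~56 s to open the full datatree
```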
|
Hello everyone 👋

**On performance**

Even without profiling, it seems highly likely that pydata/xarray#10579 is responsible for the slowdown here. An easy way to confirm this would be to create a test zarr store that has a large number of tiny (but still 1D) coordinate variables in different groups, and see how long it takes to open as a function of the number of nodes (see the sketch at the end of this comment). I suspect you will see a graph like the one in this issue (that issue is for the opposite problem - saving a datatree with many nodes). Then it should be clear what xarray improvement is needed to speed this up (I bet it's adding concurrency).

**On schema**

Stepping back, and reading about your goals as described by @jbusecke here and @thomasteisberg here, I can't help but wonder if a more data-collections-like schema would make more sense here. My understanding is that you:
I feel like you could get both of those things by instead writing a flat (or even still nested) group structure, serializing information about that group structure into the top-level group, and helping users open that as their first call. Then you also provide a method to facilitate querying that index of groups (with pandas or SQL or both), and creating a

In other words, you do something like what @sharkinsspatial and Kyle did in https://github.com/developmentseed/zarr-datafusion-search-examples, but for MIP-like structures. You could think of this as "the contents of an intake catalog csv, but stored in the root group of a zarr store".

```python
# library code
class MIPCollection:
    df: pd.DataFrame

    def __init__(self, store_url):
        # opens the root group of the store and populates a pandas dataframe
        # following a MIP-aware schema
        # (or creates the datafusion context Kyle made)
        ...

# user code
cmip6_collection = MIPCollection(store_url)

# prints a dataframe of all group names with relevant metadata about the contents of each group
cmip6_collection.display()

# returns an xr.DataTree comprising just the groups the user is interested in
# (this might be bad SQL - I don't know SQL 🤷‍♂️)
# (could also be a dataframe API, or both)
dt = cmip6_collection.sql_query(
    "SELECT * FROM icechunk_data WHERE model == 'CESM'",
).to_datatree()
```

(This of course assumes that asking users to open/query with a more specialized open-source library rather than just xarray is acceptable to you.) This would work with normal zarr, but the advantage of doing it with Icechunk would be transactions to help guarantee that if you change the data in the groups, you always keep the metadata in the root group consistent.

DataTree performance work would still be worthwhile in this model, but presumably users aren't often trying to do analysis on the entire archive, so opening only a subset will be much faster even with no upstream optimizations.
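A sketch of the synthetic benchmark described under "On performance" above (assumptions: a local filesystem store, one tiny 1D coordinate per group, group count as the only variable; the shape of the curve is what matters, not absolute times):

```python
import time
import numpy as np
import xarray as xr

def make_store(path, n_groups):
    """Write a zarr store with many groups, each holding one tiny 1D coordinate."""
    for i in range(n_groups):
        ds = xr.Dataset(coords={"time": ("time", np.arange(10, dtype="float64"))})
        ds.to_zarr(path, group=f"group_{i}", mode="a")

for n in [10, 100, 1000]:
    path = f"bench_{n}.zarr"
    make_store(path, n)
    start = time.perf_counter()
    xr.open_datatree(path, engine="zarr", consolidated=False)
    print(n, "groups:", round(time.perf_counter() - start, 2), "s")
```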
Prototype gcp lithops for combined variables


This PR adds a script and documentation to create an icechunk virtual store using lithops.
cc @jbusecke
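The script in this PR is the authoritative version; purely as an illustration of the overall pattern, here is a rough sketch of fanning per-file virtualization out with lithops and committing the references to icechunk. The `list_of_netcdf_urls` input, the concatenation along time, and the virtualizarr accessor are assumptions for illustration (and virtualizarr's API differs between versions), not taken from this PR.

```python
import lithops
import xarray as xr
import icechunk
from virtualizarr import open_virtual_dataset  # call signature varies across virtualizarr versions

def virtualize_one(url):
    # Build a virtual (reference-only) dataset for a single source NetCDF file
    return open_virtual_dataset(url)

# Fan the per-file work out to serverless workers, then gather the results
fexec = lithops.FunctionExecutor()                        # backend/runtime come from the lithops config
futures = fexec.map(virtualize_one, list_of_netcdf_urls)  # list_of_netcdf_urls: hypothetical input list
virtual_datasets = fexec.get_result(futures)

# Combine the virtual datasets (how depends on the dataset layout; concat along
# time is just an example) and write the references into the icechunk repo
combined = xr.concat(virtual_datasets, dim="time")
storage = icechunk.gcs_storage(bucket="ismip6-icechunk", prefix="12-07-2025")
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")
combined.virtualize.to_icechunk(session.store)  # accessor name varies by virtualizarr version
session.commit("add virtual references")
```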