Update contains_cftime_datetimes to avoid loading entire variable array#7494
Update contains_cftime_datetimes to avoid loading entire variable array#7494dcherian merged 23 commits intopydata:mainfrom
Conversation
|
Thanks for the PR. However, does that actually make a difference? To me it looks like Lines 1779 to 1780 in b451558 |
This isn't actually the line of code that's causing the performance bottleneck, it's the access to import numpy as np
import xarray as xr
str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds = ds.chunk(x=10000)
ds['str_array'] = ds.str_array.astype('O') # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')
%time xr.open_zarr('str_array.zarr') |
|
@agoodm, what you think of this version? Using xr.Variable directly seems a little easier to work with than trying to guess which type of array (cupy, dask, pint, backendarray, etc) is in the variable. |
|
@Illviljan I gave your update a quick test, it seems to work well enough and still maintains the performance improvement. It looks fine to me though I guess it looks like you still need to fix this failing mypy stuff now? |
mathause
left a comment
There was a problem hiding this comment.
I thought da.data passes the array along as is - but you can learn something everyday. Thanks @Illviljan for taking over and sorry @agoodm for not properly reading the issue...
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
xarray/core/common.py
Outdated
| if var.dtype == np.dtype("O") and var.size > 0: | ||
| first_idx = (0,) * var.ndim | ||
| sample = var[first_idx] | ||
| return isinstance(sample.to_numpy().item(), cftime.datetime) |
There was a problem hiding this comment.
very clean. It'd be nice to add some sort of test like DuckBackendArrayWrapper in https://github.com/pydata/xarray/pull/6874/files . __getitem__ should raise if it will return more than one value.
There was a problem hiding this comment.
Thanks for taking a look at this. I am a little confused for what you are suggesting here. Are you looking for a simple test in test_variable.py that applies the same logic in this block to extract the very first element via Variable.__getitem__ here and check that it returns one value, a more general contains_cftime_datetimes test, or both?
There was a problem hiding this comment.
Sotry that was a bit complicated and intended for IIlviljan.
I pushed a commit with a test. I also changed the code to account for those lazily indexed backend arrays explicitly.
remove _variable_contains_cftime_datetimes
headtr1ck
left a comment
There was a problem hiding this comment.
LGTM.
Some simple test of the new functionality would be nice.
If you are really motivated, you can think about adding an asv benchmark here: https://github.com/pydata/xarray/blob/main/asv_bench/benchmarks/dataset_io.py
|
Seems like we're passing a DataArray instead of a Variable somewhere. |
|
I think that's in the tests themselves xarray/xarray/tests/test_coding_times.py Line 778 in 6531b57 |
|
Thanks @agoodm this work prompted a bunch of internal cleanup! |
|
Thanks @Illviljan and @dcherian for helping to see this through. |
* main: Preserve `base` and `loffset` arguments in `resample` (pydata#7444) ignore the `pkg_resources` deprecation warning (pydata#7594) Update contains_cftime_datetimes to avoid loading entire variable array (pydata#7494) Support first, last with dask arrays (pydata#7562) update the docs environment (pydata#7442) Add xCDAT to list of Xarray related projects (pydata#7579) [pre-commit.ci] pre-commit autoupdate (pydata#7565) fix nczarr when libnetcdf>4.8.1 (pydata#7575) use numpys SupportsDtype (pydata#7521)
whats-new.rstThis PR greatly improves the performance for opening datasets with large arrays of object type (typically string arrays) since
contains_cftime_datetimeswas triggering the entire array to be read from the file just to check the very first element in the entire array.@Illviljan continuing our discussion from the issue thread, I did try to pass in
var._datato_contains_cftime_datetimes, but I had a lot of trouble finding a way to generalize how to index the first array element. The best I could do wasvar._data.array.get_array(), but I don't thinkget_arrayis implemented for every backend. So for now I am leaving my original proposed solution.