Update contains_cftime_datetimes to avoid loading entire variable array by agoodm · Pull Request #7494 · pydata/xarray

agoodm · 2023-01-30T21:54:35Z

Closes #7484 (Opening datasets with large object dtype arrays is very slow)
User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR greatly improves the performance of opening datasets with large arrays of object dtype (typically string arrays). Previously, contains_cftime_datetimes triggered a read of the entire array from the file just to check the very first element.

@Illviljan continuing our discussion from the issue thread, I did try to pass in var._data to _contains_cftime_datetimes, but I had a lot of trouble finding a way to generalize how to index the first array element. The best I could do was var._data.array.get_array(), but I don't think get_array is implemented for every backend. So for now I am leaving my original proposed solution.
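To make the cost difference concrete, here is a minimal sketch that uses only NumPy. `CountingLazyArray` is a hypothetical stand-in for a lazily loaded backend array, not an xarray class. It counts how many elements are "read" when the whole array is materialized first, compared with indexing the first element directly.

```python
import numpy as np


class CountingLazyArray:
    """Hypothetical stand-in for a lazily loaded backend array.

    It counts how many elements are "read from disk" so that two access
    patterns can be compared. This is not an xarray class.
    """

    def __init__(self, values):
        self._values = np.asarray(values, dtype=object)
        self.shape = self._values.shape
        self.ndim = self._values.ndim
        self.dtype = self._values.dtype
        self.elements_read = 0

    def __getitem__(self, key):
        # Only the requested elements are read.
        result = self._values[key]
        self.elements_read += np.size(result)
        return result

    def __array__(self, dtype=None, copy=None):
        # Converting to a NumPy array reads every element.
        self.elements_read += self._values.size
        return self._values


lazy = CountingLazyArray([str(i) for i in range(1000)])

# Pattern discussed in this PR: materialize the array, then take one element.
sample_eager = np.asarray(lazy).flat[0]
eager_reads = lazy.elements_read

lazy.elements_read = 0

# Alternative: index the first element before converting anything.
sample_lazy = lazy[(0,) * lazy.ndim]
lazy_reads = lazy.elements_read

print(eager_reads, lazy_reads)
```

Both patterns return the same sample, but the first reads at least every element of the array while the second reads exactly one. For a real backend array, the first pattern means pulling the whole variable from the file.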

mathause · 2023-01-31T12:06:00Z

Thanks for the PR. However, does that actually make a difference? To me it looks like _contains_cftime_datetimes also only considers one element of the array.

xarray/xarray/core/common.py

Lines 1779 to 1780 in b451558

    
    if array.dtype == np.dtype("O") and array.size > 0:
        sample = np.asarray(array).flat[0]

agoodm · 2023-01-31T12:22:02Z


This isn't actually the line of code that's causing the performance bottleneck. The problem is the access to var.data in the function call, as I explained in the issue thread. You can verify this yourself by running this simple example before and after applying the changes in this PR:

import numpy as np
import xarray as xr

str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds = ds.chunk(x=10000)
ds['str_array'] = ds.str_array.astype('O') # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')

%time xr.open_zarr('str_array.zarr')


Illviljan · 2023-01-31T22:49:24Z

@agoodm, what do you think of this version? Using xr.Variable directly seems a little easier to work with than trying to guess which type of array (cupy, dask, pint, backendarray, etc.) is in the variable.

agoodm · 2023-01-31T23:17:38Z

@Illviljan I gave your update a quick test. It seems to work well enough and still maintains the performance improvement. It looks fine to me, though I guess you still need to fix the failing mypy checks now?

mathause

I thought da.data passes the array along as is, but you can learn something every day. Thanks @Illviljan for taking over, and sorry @agoodm for not properly reading the issue...


dcherian · 2023-02-03T16:21:36Z

xarray/core/common.py

+    if var.dtype == np.dtype("O") and var.size > 0:
+        first_idx = (0,) * var.ndim
+        sample = var[first_idx]
+        return isinstance(sample.to_numpy().item(), cftime.datetime)


Very clean. It'd be nice to add some sort of test like DuckBackendArrayWrapper in https://github.com/pydata/xarray/pull/6874/files. __getitem__ should raise if it would return more than one value.

Thanks for taking a look at this. I am a little confused about what you are suggesting here. Are you looking for a simple test in test_variable.py that applies the same logic as this block to extract the very first element via Variable.__getitem__ and checks that it returns one value, a more general contains_cftime_datetimes test, or both?

Sorry, that was a bit complicated and intended for Illviljan.

I pushed a commit with a test. I also changed the code to account for those lazily indexed backend arrays explicitly.
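The kind of test suggested above could look roughly like the following sketch. It is not the test that was committed. `SingleValueOnlyArray` and `first_element_is_instance` are hypothetical names, and the check uses `str` rather than `cftime.datetime` so that only NumPy is needed. The wrapper raises on any access that would return more than one value, so the check passes only if it really touches a single element.

```python
import numpy as np


class SingleValueOnlyArray:
    """Hypothetical test double that refuses access to more than one value."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self.shape = self._values.shape
        self.ndim = self._values.ndim
        self.dtype = self._values.dtype
        self.size = self._values.size

    def __getitem__(self, key):
        result = self._values[key]
        if np.size(result) > 1:
            raise IndexError("indexing would return more than one value")
        return result

    def __array__(self, dtype=None, copy=None):
        # Loading the full array is exactly what the check must avoid.
        raise RuntimeError("full array load requested")


def first_element_is_instance(array, expected_type):
    """Check the type of the first element without touching the rest."""
    if array.dtype == np.dtype("O") and array.size > 0:
        # A tuple of zeros selects the first element for any number of dimensions.
        first_idx = (0,) * array.ndim
        sample = array[first_idx]
        return isinstance(sample, expected_type)
    return False


guarded = SingleValueOnlyArray(np.array([["a", "b"], ["c", "d"]], dtype=object))
print(first_element_is_instance(guarded, str))
```

If the check were rewritten to call `np.asarray(guarded)` or to slice a whole row, the wrapper would raise and the test would fail. That is the regression this guard is meant to catch.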


headtr1ck

LGTM.

Some simple test of the new functionality would be nice.

If you are really motivated, you can think about adding an asv benchmark here: https://github.com/pydata/xarray/blob/main/asv_bench/benchmarks/dataset_io.py


dcherian · 2023-03-01T04:30:54Z

Seems like we're passing a DataArray instead of a Variable somewhere.

mathause · 2023-03-01T11:55:41Z

I think that's in the tests themselves:

xarray/xarray/tests/test_coding_times.py

Line 778 in 6531b57

@pytest.fixture()
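As an illustration of why passing a DataArray where a Variable is expected fails, here is a toy sketch. `ToyVariable`, `ToyDataArray`, and `raw_data` are hypothetical stand-ins, not xarray code. It assumes only that the wrapper exposes its variable through a `.variable` attribute and has no `_data` attribute of its own.

```python
class ToyVariable:
    """Hypothetical stand-in for a Variable: owns the raw `_data`."""

    def __init__(self, data):
        self._data = data


class ToyDataArray:
    """Hypothetical stand-in for a DataArray: wraps a variable, has no `_data`."""

    def __init__(self, variable):
        self.variable = variable


def raw_data(obj):
    """Return the underlying `_data`, accepting either toy type."""
    # Unwrap DataArray-like objects; Variable-like objects pass through as is.
    var = getattr(obj, "variable", obj)
    return var._data


var = ToyVariable([1, 2, 3])
arr = ToyDataArray(var)
print(raw_data(var) is raw_data(arr))
```

Accessing `arr._data` directly would raise an AttributeError, which is why a helper that reads `_data` has to receive the variable itself or unwrap its argument first.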


dcherian · 2023-03-06T16:52:13Z

Thanks @agoodm, this work prompted a bunch of internal cleanup!

agoodm · 2023-03-07T16:22:21Z

Thanks @Illviljan and @dcherian for helping to see this through.

* main:
  Preserve `base` and `loffset` arguments in `resample` (pydata#7444)
  ignore the `pkg_resources` deprecation warning (pydata#7594)
  Update contains_cftime_datetimes to avoid loading entire variable array (pydata#7494)
  Support first, last with dask arrays (pydata#7562)
  update the docs environment (pydata#7442)
  Add xCDAT to list of Xarray related projects (pydata#7579)
  [pre-commit.ci] pre-commit autoupdate (pydata#7565)
  fix nczarr when libnetcdf>4.8.1 (pydata#7575)
  use numpys SupportsDtype (pydata#7521)

agoodm added 2 commits January 30, 2023 15:40

Update contains_cftime_datetimes to avoid loading entire variable array

1ab7295

Update whats-new.rst

5132446

agoodm and others added 3 commits January 31, 2023 15:51

Merge branch 'main' into improve_cftime_check_performance

2bbf98a

Convert arrays to variable instead for better control

cb70040

Merge branch 'improve_cftime_check_performance' of https://github.com…

222c14a

…/agoodm/xarray into pr/7494

fix mypy?

02edc59

Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 31, 2023

Update common.py

b5b7bae

mathause reviewed Feb 1, 2023


Update xarray/core/common.py

b191a51

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

dcherian reviewed Feb 3, 2023


dcherian reviewed Feb 3, 2023


agoodm added 2 commits February 14, 2023 14:40

Merge branch 'main' into improve_cftime_check_performance

34cb0e0

Update common.py

3c09948

remove _variable_contains_cftime_datetimes

headtr1ck reviewed Feb 26, 2023


dcherian added 3 commits February 28, 2023 20:44

Avoid creating variable.

55c3f3d

Add test

ded5dd1

minimize diff

06a8706

Illviljan reviewed Mar 1, 2023


dcherian and others added 5 commits March 2, 2023 09:49

Merge branch 'main' into improve_cftime_check_performance

440880b

Update tests.

74485fd

address comment

5d677f1

Fix test

b387995

Fix whats-new

c075245

dcherian reviewed Mar 2, 2023


Merge branch 'main' into improve_cftime_check_performance

a88354c

dcherian added the plan to merge Final call for comments label Mar 2, 2023

dcherian added 2 commits March 2, 2023 15:17

Fix more tests

f9e7a2e

More fixes

6fe2c25

dcherian removed the plan to merge Final call for comments label Mar 2, 2023

dcherian closed this Mar 3, 2023

dcherian reopened this Mar 3, 2023

github-actions bot added the topic-cftime label Mar 3, 2023

dcherian and others added 2 commits March 3, 2023 16:16

fix iris tests

9c07dcb

Merge branch 'main' into improve_cftime_check_performance

62ce6e1

dcherian requested a review from Illviljan March 5, 2023 05:40

dcherian added the plan to merge Final call for comments label Mar 5, 2023

Illviljan approved these changes Mar 5, 2023


dcherian merged commit 798f4d4 into pydata:main Mar 7, 2023

agoodm deleted the improve_cftime_check_performance branch March 7, 2023 16:22

Illviljan mentioned this pull request Mar 19, 2023

encode_cf_variable triggers AttributeError: 'DataArray' object has no attribute '_data' #7645

Closed

4 tasks

spencerkclark mentioned this pull request Jul 9, 2023

DataArray throws error in contains_cftime_datetimes() -> _data vs data (with fix?) #7966

Closed

4 tasks
