Handle NetCDF variable length strings (and other VLen types)#6340
pp-mo merged 33 commits into SciTools:main
Would be better to force all Variable Length types to be lazy, but can't ascertain this information from the CFAuxiliaryCoordinateVariable instance.
* upstream/main: (98 commits)
  [pre-commit.ci] pre-commit autoupdate (SciTools#6335)
  SPEC 0: drop py310 and support py313 (SciTools#6195)
  Better benchmarking Python version handling (SciTools#6329)
  Move loading and combine code into their own submodules. (SciTools#6321)
  Bump scitools/workflows from 2025.02.1 to 2025.02.2 (SciTools#6327)
  replaced reference from build to python build (SciTools#6324)
  [pre-commit.ci] pre-commit autoupdate (SciTools#6315)
  Cache Dask arrays created from `NetCDFDataProxy`s to speed up loading files with multiple variables (SciTools#6252)
  Bump scitools/workflows from 2025.02.0 to 2025.02.1 (SciTools#6313)
  [pre-commit.ci] pre-commit autoupdate (SciTools#6310)
  Bump scitools/workflows from 2025.01.5 to 2025.02.0 (SciTools#6306)
  Updated environment lockfiles (SciTools#6301)
  Improve speed of loading small NetCDF files (SciTools#6229)
  [pre-commit.ci] pre-commit autoupdate (SciTools#6298)
  Use cube chunks for weights in aggregations with smart weights (SciTools#6288)
  Updated environment lockfiles (SciTools#6296)
  Bump scitools/workflows from 2025.01.4 to 2025.01.5 (SciTools#6300)
  Bump scitools/workflows from 2025.01.3 to 2025.01.4 (SciTools#6295)
  Lazy rectilinear interpolator (SciTools#6084)
  Revert "Fix broken link. (SciTools#6246)" (SciTools#6297)
  ...
variable length "str" case.
storage is netCDF (which for this module is true, but for mock testing is not)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6340      +/-   ##
==========================================
- Coverage   89.80%   89.80%   -0.01%
==========================================
  Files          90       90
  Lines       23576    23589      +13
  Branches     4398     4402       +4
==========================================
+ Hits        21172    21183      +11
  Misses       1662     1662
- Partials      742      744       +2
The problems with
BTW I think it's really important that people can get what they want out of these cases by taking control of the variable chunking, which currently doesn't work since it hits the 'itemsize' line and crashes. I think it's rather cool that we have the drop on xarray here: we can specify chunking on a per-variable basis, but they don't
Use 'safe-access' version of netCDF4.VLType
for more information, see https://pre-commit.ci
Yes - to a degree. Users can now pass down custom chunk parameters for variable length arrays, but as the "ragged array size" is not defined by an actual dimension we still have no way of passing this down to Dask. We can hint at the mean ragged array length to aid the decision for lazy loading, but that's about the extent of it. Your chunks will always be bigger in memory than you expect for variable length arrays, as the chunking only works over the explicit known dimensions. This does at least allow us to load variable length string data now in a way that is consistent with how other variable length data was being handled.
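To illustrate the point about chunks being bigger than expected, here is a minimal pure-Python sketch. It is not Iris or Dask code: the `chunk_nbytes` helper, the 8-byte element size and the random row lengths are all assumptions made for illustration. It compares the size you would guess from the explicit dimension alone, the actual payload of a ragged chunk, and an estimate scaled by a mean-length hint.

```python
import random

ITEMSIZE = 8  # assumed bytes per element (e.g. a 64-bit float)


def chunk_nbytes(ragged_rows, itemsize=ITEMSIZE):
    """Payload size of a chunk of ragged rows (ignores Python object overhead)."""
    return sum(len(row) for row in ragged_rows) * itemsize


random.seed(0)
# 100 entries along the single explicit dimension; each entry is a
# variable-length row whose length is only known once the data is read.
rows = [[0.0] * random.randint(1, 20) for _ in range(100)]

chunk = rows[:25]  # a chunk of 25 entries along the explicit dimension
naive = len(chunk) * ITEMSIZE  # guess that treats each entry as one value
actual = chunk_nbytes(chunk)  # what the chunk really holds
mean_len = sum(len(r) for r in rows) / len(rows)
hinted = len(chunk) * ITEMSIZE * mean_len  # guess scaled by a mean-length hint

print(f"naive={naive} actual={actual} hinted={hinted:.0f}")
```

The naive figure can only ever underestimate, since every row holds at least one value; a mean-length hint narrows the gap but cannot make the per-chunk size exact.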
I totally agree with that -- the chunks of course don't have known sizes.
The docs should already be up-to-date:
However, the doctest GHA is failing due to the RTD environment having an older version of iris-sample-data (2.5.1).
Oh perhaps it could be..
OK - have updated these and doctests are succeeding now.
@ukmo-ccbunney testing all good now, I'm really keen to get this in!
pp-mo left a comment
Just one tiny further suggestion ...
…ciTools#6340)" This reverts commit 0ae0d49.
🚀 Pull Request
Description
NetCDF provides a "variable length" data type (`VLen`) that can be used to store arrays where one dimension size is unknown/variable.

This variable length datatype is used when creating a new variable with the `str` type (crucially this is different from a `NC_CHAR` type).

Iris fails to load `VLen` `str` types as it cannot determine the itemsize of a `str` type when trying to calculate whether the array is big enough to lazy load.

Other `VLen` types load fine (such as floats, chars, ints, etc), but the calculation for the total array size will be incorrect as the size of the variable length array dimension is not known until the data has been read from disk.

This PR makes the following changes:

* … `str` type, a default itemsize of 4 bytes is used (sufficient for storing a single Unicode character)
* … `str` types is not a thing in netCDF4, so I have made it the same as the 'S1' type (a null byte).
* … `CHUNK_CONTROL` context manager using a special `_vl_hint` dimension name.

Fixes: #6149
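
The size calculation described above can be sketched in plain Python. This is a hedged illustration only, not the code from this PR: the function name `estimate_nbytes`, its parameters and the `DEFAULT_STR_ITEMSIZE` constant are hypothetical. It shows a 4-byte fallback when the item size is unknown (as for `str`) and an optional mean-length hint multiplying the size of the explicit dimensions.

```python
from math import prod

# Hypothetical constant: the description says 4 bytes is assumed per `str` item.
DEFAULT_STR_ITEMSIZE = 4


def estimate_nbytes(shape, itemsize=None, vl_hint=1):
    """Rough size estimate for a (possibly variable-length) variable.

    shape    -- sizes of the explicit, known dimensions
    itemsize -- bytes per element, or None when it cannot be determined
    vl_hint  -- assumed mean length of the variable-length dimension
    """
    if itemsize is None:
        # Fall back rather than fail when the type has no fixed item size.
        itemsize = DEFAULT_STR_ITEMSIZE
    return prod(shape) * itemsize * vl_hint


# A 1000 x 50 variable of strings, hinted at ~10 characters each.
print(estimate_nbytes((1000, 50), itemsize=None, vl_hint=10))
```

An estimate like this is only used to decide whether a variable is large enough to be worth loading lazily; the true size remains unknown until the data has been read from disk.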