Make small AuxCoords (including scalar) non-lazy, for efficiency#5069
pp-mo wants to merge 1 commit into SciTools:main
The test results are encouraging: nothing unexpected.
At present, a NetcdfDataProxy contains

Plus "sizeof" a NetCDFDataProxy object itself is ~56.

However, the question of how much memory a Python object "costs" has, it seems generally agreed,

Also, look at this ...

So, it looks like anything beyond a Python primitive object is generally rather hard to assess for memory consumption.

So instead: using very simplistic measures, based on "tracemalloc", as explained in #4883

This example is intended to mimic an absolutely minimal (1-point) lazy array, like that within a scalar coordinate.
The above "measures" 2000 of them --> average of ~4883.3 bytes.
It seems a lot, doesn't it ?!?
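For anyone wanting to repeat this kind of estimate, a minimal sketch of a tracemalloc-based measurement might look like the following. This is not the snippet from #4883: the helper `average_bytes_per_object` and the dict-based stand-in object are hypothetical, and a real check would pass a factory that builds a 1-point lazy array over a proxy object instead.

```python
import tracemalloc


def average_bytes_per_object(factory, count=2000):
    """Estimate the average traced memory per object made by ``factory``.

    ``factory`` is called with an integer index and must return a new object.
    The figure includes the slot each object takes in the holding list, so it
    is a rough upper estimate rather than an exact object size.
    """
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    # Keep every object alive in a list, so nothing is freed before measuring.
    objects = [factory(i) for i in range(count)]
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / len(objects)


# Hypothetical stand-in for "a small proxy-like object": some numbers and strings.
def make_standin(i):
    return {"shape": (), "dtype": "float64", "path": "file.nc", "index": i}


print(f"average bytes per object: {average_bytes_per_object(make_standin):.1f}")
```

Because tracemalloc only counts allocations made while tracing is active, anything the factory shares or caches between calls is spread across all the objects, which is one reason such figures are only indicative.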
Replaced by #5229
See #5053.
The tiny fix here seems to show that the slowness is largely due to creating lots of tiny lazy coords (for scalar coords).
This really speeds up the testcase for netcdf load: from ~150 sec to ~4 sec.
(The testcase is a file with ~200 variables, all of which have 2 scalar coords.)
Here we make any small AuxCoords real: we fetch the variable data immediately to make the new coord, instead of giving it a "dask.Array wrapping a NetcdfDataProxy referring to a file variable".
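As a rough illustration of that decision (not the code in this PR), the size check could be sketched as below. Everything here is hypothetical: `SMALL_POINTS_THRESHOLD` and its value, `coord_points`, and the `FakeVariableProxy` / `DeferredArray` classes, which are plain-Python stand-ins for the file proxy and the lazy array so that the example runs without Iris or Dask.

```python
from math import prod

# Hypothetical threshold: coords with this many points or fewer are loaded at once.
SMALL_POINTS_THRESHOLD = 100


class FakeVariableProxy:
    """Stand-in for a proxy that reads a file variable when indexed."""

    def __init__(self, values, shape):
        self._values = list(values)
        self.shape = shape
        self.fetch_count = 0  # counts how often the "file" is read

    def __getitem__(self, keys):
        self.fetch_count += 1
        return list(self._values)


class DeferredArray:
    """Stand-in for a lazy array: holds the proxy and reads only on demand."""

    def __init__(self, proxy):
        self.proxy = proxy

    def compute(self):
        return self.proxy[...]


def coord_points(proxy, threshold=SMALL_POINTS_THRESHOLD):
    """Return real data for small variables, a deferred wrapper otherwise."""
    n_points = prod(proxy.shape)  # prod(()) == 1, so scalars count as 1 point
    if n_points <= threshold:
        return proxy[...]  # fetch immediately: the coord becomes real
    return DeferredArray(proxy)  # stay lazy for anything larger


scalar = FakeVariableProxy([273.15], shape=())
large = FakeVariableProxy(range(5000), shape=(5000,))
print(type(coord_points(scalar)).__name__, type(coord_points(large)).__name__)
```

The trade-off is that the small variables are read from the file at load time rather than on first use, in exchange for not building a lazy wrapper for each of them.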
In practice, if the size threshold is set right, this approach should save memory too.
But it's not trivial to determine what a typical minimal overhead for a "dask.Array wrapping a NetcdfDataProxy" actually is.
At least the NetcdfDataProxy object is pretty small + simple, containing some numbers + a couple of strings. The Dask array object is probably more costly.
TBD