diff --git a/doc/recipe/preprocessor.rst b/doc/recipe/preprocessor.rst index 6a5b72e673..88ba4c9ff7 100644 --- a/doc/recipe/preprocessor.rst +++ b/doc/recipe/preprocessor.rst @@ -456,17 +456,17 @@ Missing values masks -------------------- Missing (masked) values can be a nuisance especially when dealing with -multimodel ensembles and having to compute multimodel statistics; different +multi-model ensembles and having to compute multi-model statistics; different numbers of missing data from dataset to dataset may introduce biases and -artificially assign more weight to the datasets that have less missing -data. This is handled in ESMValTool via the missing values masks: two types of -such masks are available, one for the multimodel case and another for the -single model case. +artificially assign more weight to the datasets that have less missing data. +This is handled in ESMValTool via the missing values masks: two types of such +masks are available, one for the multi-model case and another for the single +model case. -The multimodel missing values mask (``mask_fillvalues``) is a preprocessor step +The multi-model missing values mask (``mask_fillvalues``) is a preprocessor step that usually comes after all the single-model steps (regridding, area selection etc) have been performed; in a nutshell, it combines missing values masks from -individual models into a multimodel missing values mask; the individual model +individual models into a multi-model missing values mask; the individual model masks are built according to common criteria: the user chooses a time window in which missing data points are counted, and if the number of missing data points relative to the number of total data points in a window is less than a chosen @@ -492,11 +492,11 @@ See also :func:`esmvalcore.preprocessor.mask_fillvalues`. 
Common mask for multiple models ------------------------------- -It is possible to use ``mask_fillvalues`` to create a combined multimodel -mask (all the masks from all the analyzed models combined into a single -mask); for that purpose setting the ``threshold_fraction`` to 0 will not -discard any time windows, essentially keeping the original model masks and -combining them into a single mask; here is an example: +It is possible to use ``mask_fillvalues`` to create a combined multi-model mask +(all the masks from all the analyzed models combined into a single mask); for +that purpose setting the ``threshold_fraction`` to 0 will not discard any time +windows, essentially keeping the original model masks and combining them into a +single mask; here is an example: .. code-block:: yaml @@ -530,13 +530,12 @@ Horizontal regridding Regridding is necessary when various datasets are available on a variety of `lat-lon` grids and they need to be brought together on a common grid (for -various statistical operations e.g. multimodel statistics or for e.g. direct +various statistical operations e.g. multi-model statistics or for e.g. direct inter-comparison or comparison with observational datasets). Regridding is conceptually a very similar process to interpolation (in fact, the regridder engine uses interpolation and extrapolation, with various schemes). The primary difference is that interpolation is based on sample data points, while -regridding is based on the horizontal grid of another cube (the reference -grid). +regridding is based on the horizontal grid of another cube (the reference grid). 
The underlying regridding mechanism in ESMValTool uses the `cube.regrid() `_ @@ -651,28 +650,28 @@ Multi-model statistics ====================== Computing multi-model statistics is an integral part of model analysis and evaluation: individual models display a variety of biases depending on model -set-up, initial conditions, forcings and implementation; comparing model data -to observational data, these biases have a significantly lower statistical -impact when using a multi-model ensemble. ESMValTool has the capability of -computing a number of multi-model statistical measures: using the preprocessor -module ``multi_model_statistics`` will enable the user to ask for either a -multi-model ``mean``, ``median``, ``max``, ``min``, ``std``, and / or -``pXX.YY`` with a set of argument parameters passed to -``multi_model_statistics``. Percentiles can be specified like ``p1.5`` or -``p95``. The decimal point will be replaced by a dash in the output file. - -Note that current multimodel statistics in ESMValTool are local (not global), -and are computed along the time axis. As such, can be computed across a common -overlap in time (by specifying ``span: overlap`` argument) or across the full -length in time of each model (by specifying ``span: full`` argument). +set-up, initial conditions, forcings and implementation; comparing model data to +observational data, these biases have a significantly lower statistical impact +when using a multi-model ensemble. ESMValTool has the capability of computing a +number of multi-model statistical measures: using the preprocessor module +``multi_model_statistics`` will enable the user to ask for either a multi-model +``mean``, ``median``, ``max``, ``min``, ``std``, and / or ``pXX.YY`` with a set +of argument parameters passed to ``multi_model_statistics``. Percentiles can be +specified like ``p1.5`` or ``p95``. The decimal point will be replaced by a dash +in the output file. 
Restrictive computation is also available by excluding any set of models that the user will not want to include in the statistics (by setting ``exclude: [excluded models list]`` argument). The implementation has a few restrictions -that apply to the input data: model datasets must have consistent shapes, and -from a statistical point of view, this is needed since weights are not yet -implemented; also higher dimensional data is not supported (i.e. anything with -dimensionality higher than four: time, vertical axis, two horizontal axes). +that apply to the input data: model datasets must have consistent shapes, apart +from the time dimension; and cubes with more than four dimensions (time, +vertical axis, two horizontal axes) are not supported. + +Input datasets may have different time coordinates. Statistics can be computed +across overlapping times only (``span: overlap``) or across the full time span +of the combined models (``span: full``). The preprocessor sets a common time +coordinate on all datasets. As the number of days in a year may vary between +calendars, (sub-)daily data with different calendars are not supported. Input datasets may have different time coordinates. The multi-model statistics preprocessor sets a common time coordinate on all datasets. As the number of @@ -681,7 +680,7 @@ days in a year may vary between calendars, (sub-)daily data are not supported. .. code-block:: yaml preprocessors: - multimodel_preprocessor: + multi_model_preprocessor: multi_model_statistics: span: overlap statistics: [mean, median] @@ -702,14 +701,12 @@ entry contains the resulting cube with the requested statistic operations. .. note:: - Note that the multimodel array operations, albeit performed in - per-time/per-horizontal level loops to save memory, could, however, be - rather memory-intensive (since they are not performed lazily as - yet). 
The Section on :ref:`Memory use` details the memory intake
-   for different run scenarios, but as a thumb rule, for the multimodel
-   preprocessor, the expected maximum memory intake could be approximated as
-   the number of datasets multiplied by the average size in memory for one
-   dataset.
+   The multi-model array operations can be rather memory-intensive (since they
+   are not performed lazily as yet). The Section on :ref:`Memory use` details
+   the memory intake for different run scenarios but, as a rule of thumb, for
+   the multi-model preprocessor, the expected maximum memory intake could be
+   approximated as the number of datasets multiplied by the average size in
+   memory for one dataset.
 
 .. _time operations:
 
@@ -1512,14 +1509,14 @@ In the most general case, we can set upper limits on the maximum memory the
 analysis will require:
 
-``Ms = (R + N) x F_eff - F_eff`` - when no multimodel analysis is performed;
+``Ms = (R + N) x F_eff - F_eff`` - when no multi-model analysis is performed;
 
-``Mm = (2R + N) x F_eff - 2F_eff`` - when multimodel analysis is performed;
+``Mm = (2R + N) x F_eff - 2F_eff`` - when multi-model analysis is performed;
 
 where
 
 * ``Ms``: maximum memory for non-multimodel module
-* ``Mm``: maximum memory for multimodel module
+* ``Mm``: maximum memory for multi-model module
 * ``R``: computational efficiency of module; `R` is typically 2-3
 * ``N``: number of datasets
 * ``F_eff``: average size of data per dataset where ``F_eff = e x f x F``
@@ -1538,7 +1535,7 @@ where
 ``Mm = 1.5 x (N - 2)`` GB
 
 As a rule of thumb, the maximum required memory at a certain time for
-multimodel analysis could be estimated by multiplying the number of datasets by
+multi-model analysis could be estimated by multiplying the number of datasets by
 the average file size of all the datasets; this memory intake is high but also
 assumes that all data is fully realized in memory; this aspect will gradually
 change and the amount of realized data will decrease with the increase of
diff
--git a/esmvalcore/preprocessor/_multimodel.py b/esmvalcore/preprocessor/_multimodel.py
index 750768195a..680496aa08 100644
--- a/esmvalcore/preprocessor/_multimodel.py
+++ b/esmvalcore/preprocessor/_multimodel.py
@@ -1,16 +1,14 @@
-"""multimodel statistics.
+"""Statistics across cubes.
 
-Functions for multi-model operations
-supports a multitude of multimodel statistics
-computations; the only requisite is the ingested
-cubes have (TIME-LAT-LON) or (TIME-PLEV-LAT-LON)
-dimensions; and obviously consistent units.
+This module contains functions to compute statistics across multiple cubes or
+products.
 
-It operates on different (time) spans:
-- full: computes stats on full dataset time;
-- overlap: computes common time overlap between datasets;
+Wrapper functions separate esmvalcore internals, operating on products, from
+generalized functions that operate on iris cubes.
 """
+import copy
 import logging
 import re
 from datetime import datetime
+from functools import partial
@@ -20,6 +17,7 @@
 import iris
 import numpy as np
 import scipy
+from iris.experimental.equalise_cubes import equalise_attributes
 
 logger = logging.getLogger(__name__)
@@ -295,66 +293,47 @@ def _assemble_data(cubes, statistic, span='overlap'):
     return stats_cube
 
-def multi_model_statistics(products, span, statistics, output_products=None):
-    """Compute multi-model statistics.
+def multicube_statistics(cubes, statistics, span):
+    """Compute statistics across input cubes.
 
-    Multimodel statistics computed along the time axis. Can be
-    computed across a common overlap in time (set span: overlap)
-    or across the full length in time of each model (set span: full).
-    Restrictive computation is also available by excluding any set of
-    models that the user will not want to include in the statistics
-    (set exclude: [excluded models list]).
+    This function deals with non-homogeneous cubes by taking the time union
+    (span='full') or intersection (span='overlap'), and extending or subsetting
+    the cubes as necessary.
+    Apart from the time coordinate, cubes must have
+    consistent shapes.
 
-    Restrictions needed by the input data:
-    - model datasets must have consistent shapes,
-    - higher dimensional data is not supported (ie dims higher than four:
-      time, vertical axis, two horizontal axes).
+    This function operates directly on numpy (masked) arrays and rebuilds the
+    resulting cubes from scratch. Therefore, it is not suitable for lazy
+    evaluation. This function is restricted to maximum four-dimensional data:
+    time, vertical axis, two horizontal axes.
 
     Parameters
     ----------
-    products: list
-        list of data products or cubes to be used in multimodel stat
-        computation;
-        cube attribute of product is the data cube for computing the stats.
+    cubes: list
+        list of cubes across which the statistics will be computed;
+    statistics: list
+        statistical metrics to be computed. Available options: mean, median,
+        max, min, std, or pXX.YY (for percentile XX.YY; decimal part optional).
     span: str
         overlap or full; if overlap, statistics are computed on the common
         time span; if full, statistics are computed on full time spans,
         ignoring missing data.
-    output_products: dict
-        dictionary of output products. MUST be specified if products are NOT
-        cubes
-    statistics: list of str
-        list of statistical measure(s) to be computed. Available options:
-        mean, median, max, min, std, or pXX.YY (for percentile XX.YY; decimal
-        part optional).
 
     Returns
     -------
-    set or dict or list
-        `set` of data products if `output_products` is given
-        `dict` of cubes if `output_products` is not given
-        `list` of input cubes if there is no overlap between cubes when
-        using `span='overlap'`
+    dict
+        dictionary of statistics cubes with statistics' names as keys.
 
     Raises
     ------
     ValueError
         If span is neither overlap nor full.
""" - logger.debug('Multimodel statistics: computing: %s', statistics) - if len(products) < 2: - logger.info("Single dataset in list: will not compute statistics.") - return products - if output_products: - cubes = [cube for product in products for cube in product.cubes] - statistic_products = set() - else: - cubes = products - statistic_products = {} + logger.debug('Multicube statistics: computing: %s', statistics) # Reset time coordinates and make cubes share the same calendar _unify_time_coordinates(cubes) + # Check whether input is valid if span == 'overlap': # check if we have any time overlap times = [cube.coord('time').points for cube in cubes] @@ -362,7 +341,7 @@ def multi_model_statistics(products, span, statistics, output_products=None): if len(overlap) <= 1: logger.info("Time overlap between cubes is none or a single point." "check datasets: will not compute statistics.") - return products + return cubes logger.debug("Using common time overlap between " "datasets to compute statistics.") elif span == 'full': @@ -372,22 +351,156 @@ def multi_model_statistics(products, span, statistics, output_products=None): "Unexpected value for span {}, choose from 'overlap', 'full'". format(span)) + # Compute statistics + statistics_cubes = {} for statistic in statistics: - # Compute statistic statistic_cube = _assemble_data(cubes, statistic, span) + statistics_cubes[statistic] = statistic_cube + + return statistics_cubes + + +def multicube_statistics_iris(cubes, statistics): + """Compute statistics across input cubes. + + Like multicube_statistics, but operates directly on Iris cubes. Cubes are + merged and subsequently collapsed along a new auxiliary coordinate. + Inconsistent attributes will be removed. + + This method uses iris' built in functions, exposing the operators in + iris.analysis and supporting lazy evaluation. Input cubes must have + consistent shapes. 
+
+    Note: some of the operators in iris.analysis require additional
+    arguments, such as percentiles or weights. These operators are
+    currently not supported.
+
+    Parameters
+    ----------
+    cubes: list
+        list of cubes to be used in multimodel stat computation;
+    statistics: list
+        statistical measures to be computed. Choose from the operators
+        listed in the iris.analysis package.
+
+    Returns
+    -------
+    dict
+        dictionary of statistics cubes with statistics' names as keys.
+    """
+    operators = vars(iris.analysis)
+
+    cubes = copy.deepcopy(cubes)
+
+    for i, cube in enumerate(cubes):
+        concat_dim = iris.coords.AuxCoord(i, var_name='ens')
+        cube.add_aux_coord(concat_dim)
+
+    equalise_attributes(cubes)
+
+    cubes = iris.cube.CubeList(cubes)
+    cube = cubes.merge_cube()
+
+    statistics_cubes = {}
+    for statistic in statistics:
+        try:
+            operator = operators[statistic.upper()]
+        except KeyError as err:
+            raise ValueError(
+                f'Statistic `{statistic}` not supported by '
+                '`multicube_statistics_iris`. Choose a supported operator '
+                'from the `iris.analysis` package.') from err
+
+        # this will always return a masked array
+        statistic_cube = cube.collapsed('ens', operator)
+        statistics_cubes[statistic] = statistic_cube
+
+    return statistics_cubes
-        if output_products:
-            # Add to output product and log provenance
-            statistic_product = output_products[statistic]
-            statistic_product.cubes = [statistic_cube]
-            for product in products:
-                statistic_product.wasderivedfrom(product)
-            logger.info("Generated %s", statistic_product)
-            statistic_products.add(statistic_product)
-        else:
-            statistic_products[statistic] = statistic_cube
-    if output_products:
-        products |= statistic_products
-        return products
-    return statistic_products
+def _multiproduct_statistics(products,
+                             statistics,
+                             output_products,
+                             span='overlap',
+                             engine='esmvalcore'):
+    """Compute statistics across ESMValCore products.
+ + Extract cubes from products, calculate the statistics across cubes and + assign the resulting output cubes to the output_products. + + This function separates the ESMValCore internals (products) from the actual + statistics function that operates on Iris cubes. + + Parameters + ---------- + products: list + list of PreprocessorFile's + statistics: list + list of strings describing the statistics that will be computed + output_products: dict + dict of PreprocessorFile's with statistic names as keys. + span: str + 'full' or 'overlap', whether to calculate the statistics on the time + union ('full') or time intersection ('overlap') of the cubes. + engine: str + 'iris' or 'esmvalcore', which function to use to compute the + statistics. Iris is more efficient and exposes more operators, + but requires highly homogeneous/compatible cubes. + + Returns + ------- + set + set of PreprocessorFiles containing the computed statistics cubes. + """ + if engine == 'iris': + aggregator = multicube_statistics_iris + else: + aggregator = partial(multicube_statistics, span=span) + + # Extract cubes from products + cubes = [cube for product in products for cube in product.cubes] + + # Compute statistics + if len(cubes) < 2: + logger.info('Found only 1 cube; no statistics computed for %r', + list(products)[0]) + statistics_cubes = {statistic: cubes[0] for statistic in statistics} + else: + statistics_cubes = aggregator(cubes=cubes, statistics=statistics) + + # Add cubes to output products and log provenance + statistics_products = set() + for statistic, cube in statistics_cubes.items(): + statistics_product = output_products[statistic] + statistics_product.cubes = [cube] + + for product in products: + statistics_product.wasderivedfrom(product) + + logger.info("Generated %s", statistics_product) + statistics_products.add(statistics_product) + + return statistics_products + + +def multi_model_statistics(products: set, + statistics: list, + output_products: set, + span: str = 
'overlap', + engine: str = 'esmvalcore'): + """Entry point for ESMValCore's multi-model statistics preprocessor. + + Compute statistics across cubes from different models. + + See Also + -------- + multicube_statistics : core statistics function. + """ + statistics_products = _multiproduct_statistics( + products=products, + statistics=statistics, + output_products=output_products, + span=span, + engine=engine, + ) + + return statistics_products diff --git a/tests/sample_data/multimodel_statistics/test_multimodel.py b/tests/sample_data/multimodel_statistics/test_multimodel.py index a527a9b913..9f6551675a 100644 --- a/tests/sample_data/multimodel_statistics/test_multimodel.py +++ b/tests/sample_data/multimodel_statistics/test_multimodel.py @@ -1,6 +1,7 @@ """Test using sample data for :func:`esmvalcore.preprocessor._multimodel`.""" import pickle +from functools import partial from itertools import groupby from pathlib import Path @@ -8,7 +9,11 @@ import numpy as np import pytest -from esmvalcore.preprocessor import extract_time, multi_model_statistics +from esmvalcore.preprocessor import extract_time +from esmvalcore.preprocessor._multimodel import ( + multicube_statistics, + multicube_statistics_iris, +) esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data") @@ -28,6 +33,11 @@ SPAN_PARAMS = ('overlap', 'full') +ENGINE_PARAMS = ( + 'esmvalcore', + pytest.param('iris', marks=pytest.mark.xfail), +) + def assert_array_almost_equal(this, other): """Assert that array `this` almost equals array `other`.""" @@ -118,18 +128,23 @@ def calendar(cube): return cube_dict -def multimodel_test(cubes, span, statistic): +def multimodel_test(cubes, statistic, span, engine=None): """Run multimodel test with some simple checks.""" statistics = [statistic] - result = multi_model_statistics(cubes, span=span, statistics=statistics) + if engine == 'iris': + aggregator = multicube_statistics_iris + else: + aggregator = partial(multicube_statistics, span=span) + + result = 
aggregator(cubes, statistics=statistics) assert isinstance(result, dict) assert statistic in result return result -def multimodel_regression_test(cubes, span, name): +def multimodel_regression_test(cubes, span, name, engine): """Run multimodel regression test. This test will fail if the input data or multimodel code changed. To @@ -139,7 +154,10 @@ def multimodel_regression_test(cubes, span, name): are being written. """ statistic = 'mean' - result = multimodel_test(cubes, span=span, statistic=statistic) + result = multimodel_test(cubes, + statistic=statistic, + span=span, + engine=engine) result_cube = result[statistic] filename = Path(__file__).with_name(f'{name}-{span}-{statistic}.nc') @@ -164,8 +182,9 @@ def multimodel_regression_test(cubes, span, name): @pytest.mark.use_sample_data +@pytest.mark.parametrize('engine', ENGINE_PARAMS) @pytest.mark.parametrize('span', SPAN_PARAMS) -def test_multimodel_regression_month(timeseries_cubes_month, span): +def test_multimodel_regression_month(timeseries_cubes_month, span, engine): """Test statistic.""" cubes = timeseries_cubes_month name = 'timeseries_monthly' @@ -173,13 +192,16 @@ def test_multimodel_regression_month(timeseries_cubes_month, span): name=name, span=span, cubes=cubes, + engine=engine, ) @pytest.mark.use_sample_data +@pytest.mark.parametrize('engine', ENGINE_PARAMS) @pytest.mark.parametrize('calendar', CALENDAR_PARAMS) @pytest.mark.parametrize('span', SPAN_PARAMS) -def test_multimodel_regression_day(timeseries_cubes_day, span, calendar): +def test_multimodel_regression_day(timeseries_cubes_day, span, calendar, + engine): """Test statistic.""" cubes = timeseries_cubes_day[calendar] name = f'timeseries_daily_{calendar}' @@ -187,6 +209,7 @@ def test_multimodel_regression_day(timeseries_cubes_day, span, calendar): name=name, span=span, cubes=cubes, + engine=engine, ) diff --git a/tests/unit/preprocessor/_multimodel/test_multimodel.py b/tests/unit/preprocessor/_multimodel/test_multimodel.py index 
42b28a9525..fa17ddf4d1 100644 --- a/tests/unit/preprocessor/_multimodel/test_multimodel.py +++ b/tests/unit/preprocessor/_multimodel/test_multimodel.py @@ -4,20 +4,24 @@ import iris import numpy as np +import pytest from cf_units import Unit import tests -from esmvalcore.preprocessor import multi_model_statistics -from esmvalcore.preprocessor._multimodel import (_assemble_data, - _compute_statistic, - _get_time_slice, _plev_fix, - _put_in_cube, - _unify_time_coordinates) +from esmvalcore.preprocessor._multimodel import ( + _assemble_data, + _compute_statistic, + _get_time_slice, + _plev_fix, + _put_in_cube, + _unify_time_coordinates, + multicube_statistics, + multicube_statistics_iris, +) class Test(tests.Test): """Test class for preprocessor/_multimodel.py.""" - def setUp(self): """Prepare tests.""" # Make various time arrays @@ -92,7 +96,9 @@ def test_compute_statistic(self): def test_compute_full_statistic_mon_cube(self): data = [self.cube1, self.cube2] - stats = multi_model_statistics(data, 'full', ['mean']) + stats = multicube_statistics(cubes=data, + statistics=['mean'], + span='full') expected_full_mean = np.ma.ones((5, 3, 2, 2)) expected_full_mean.mask = np.ones((5, 3, 2, 2)) expected_full_mean.mask[1] = False @@ -100,7 +106,9 @@ def test_compute_full_statistic_mon_cube(self): def test_compute_full_statistic_yr_cube(self): data = [self.cube4, self.cube5] - stats = multi_model_statistics(data, 'full', ['mean']) + stats = multicube_statistics(cubes=data, + statistics=['mean'], + span='full') expected_full_mean = np.ma.ones((4, 3, 2, 2)) expected_full_mean.mask = np.zeros((4, 3, 2, 2)) expected_full_mean.mask[2:4] = True @@ -108,16 +116,43 @@ def test_compute_full_statistic_yr_cube(self): def test_compute_overlap_statistic_mon_cube(self): data = [self.cube1, self.cube1] - stats = multi_model_statistics(data, 'overlap', ['mean']) + stats = multicube_statistics(cubes=data, + statistics=['mean'], + span='overlap') expected_ovlap_mean = np.ma.ones((2, 3, 2, 2)) 
         self.assert_array_equal(stats['mean'].data, expected_ovlap_mean)
 
     def test_compute_overlap_statistic_yr_cube(self):
         data = [self.cube4, self.cube4]
-        stats = multi_model_statistics(data, 'overlap', ['mean'])
+        stats = multicube_statistics(cubes=data,
+                                     statistics=['mean'],
+                                     span='overlap')
         expected_ovlap_mean = np.ma.ones((2, 3, 2, 2))
         self.assert_array_equal(stats['mean'].data, expected_ovlap_mean)
 
+    def test_multicube_statistics_fail(self):
+        data = [self.cube1, self.cube1 * 2.0]
+        with pytest.raises(ValueError):
+            multicube_statistics(data,
+                                 span='overlap',
+                                 statistics=['non-existent'])
+
+    def test_multicube_statistics_iris(self):
+        data = [self.cube1, self.cube1 * 2.0]
+        statistics = ['mean', 'min', 'max']
+        stats = multicube_statistics_iris(data, statistics=statistics)
+        expected_mean = np.ma.ones((2, 3, 2, 2)) * 1.5
+        expected_min = np.ma.ones((2, 3, 2, 2)) * 1.0
+        expected_max = np.ma.ones((2, 3, 2, 2)) * 2.0
+        self.assert_array_equal(stats['mean'].data, expected_mean)
+        self.assert_array_equal(stats['min'].data, expected_min)
+        self.assert_array_equal(stats['max'].data, expected_max)
+
+    def test_multicube_statistics_iris_fail(self):
+        data = [self.cube1, self.cube1 * 2.0]
+        with pytest.raises(ValueError):
+            multicube_statistics_iris(data, statistics=['non-existent'])
+
     def test_compute_std(self):
         """Test statistic."""
         data = [self.cube1.data[0], self.cube2.data[0] * 2]
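
Reviewer note: the `span` semantics that this patch documents (intersection of time points for `span='overlap'`, union for `span='full'` with missing points ignored) can be illustrated without iris at all. The sketch below is not ESMValCore code; the dict-of-series representation and the function names `span_times` and `multiseries_mean` are invented purely for illustration.

```python
from statistics import mean


def span_times(series, span):
    """Return the time points to evaluate: the intersection of all models'
    time axes for 'overlap', their union for 'full'."""
    time_sets = [set(s) for s in series]
    if span == 'overlap':
        times = set.intersection(*time_sets)
    elif span == 'full':
        times = set.union(*time_sets)
    else:
        raise ValueError(f"span must be 'overlap' or 'full', got {span!r}")
    return sorted(times)


def multiseries_mean(series, span='overlap'):
    """Mean across models at each time point; with span='full', models
    lacking a time point are simply left out (mimicking masked values)."""
    return {
        t: mean(s[t] for s in series if t in s)
        for t in span_times(series, span)
    }


# Two toy "models" with partly overlapping time axes
model_a = {2000: 1.0, 2001: 2.0, 2002: 3.0}
model_b = {2001: 4.0, 2002: 5.0, 2003: 6.0}

# Only the shared years 2001 and 2002 contribute with span='overlap';
# with span='full', 2000 and 2003 fall back to the single available model.
print(multiseries_mean([model_a, model_b], span='overlap'))
print(multiseries_mean([model_a, model_b], span='full'))
```

The real preprocessor additionally has to unify calendars and rebuild cube metadata, which is exactly the non-lazy work `multicube_statistics` performs.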