Description
Here is the design for the new way of organizing our configuration files that we decided on at the last workshop. We will use a simplified version of the Dask configuration system. That means that the configuration will consist of nested dictionaries that are organized per component of the tool. A default configuration is shipped with ESMValCore and ESMValTool, and it will be updated with the configuration specified by the user through one or more YAML files in a user-specified directory.
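To illustrate what "updating the default configuration with the user configuration" could mean for nested dictionaries, here is a minimal sketch in Python. It only assumes that later dictionaries take precedence and that nested sections are merged key by key, similar in spirit to how Dask merges its configuration. The function name `merge_configs` and the example values are hypothetical and not part of ESMValCore.

```python
import copy


def _update(target, source):
    """Recursively update ``target`` in place with the values from ``source``."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are sections: merge them key by key.
            _update(target[key], value)
        else:
            # Scalars, lists, and new sections replace whatever was there.
            target[key] = copy.deepcopy(value)


def merge_configs(*configs):
    """Merge nested dictionaries; later ones take precedence (hypothetical helper)."""
    result = {}
    for config in configs:
        _update(result, config)
    return result


# Illustrative values only, not the real defaults.
defaults = {
    "output_dir": "~/esmvaltool_output",
    "max_parallel_tasks": None,
    "esgf": {"search_esgf": "never", "download_dir": "~/climate_data"},
}
user = {
    "max_parallel_tasks": 1,
    "esgf": {"search_esgf": "always"},
}

config = merge_configs(defaults, user)
print(config["esgf"])
```

With this approach a user file only needs to contain the keys that differ from the defaults; everything else in the same section is kept.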
The configuration will be stored in arbitrarily named files in a directory, e.g. ~/.config/esmvaltool. Users can decide whether they want to use multiple files or keep everything in one file. This directory will be configurable from the command line, through an environment variable, and possibly from the Python API as well. Having a configuration that is merged from multiple files into a single dictionary, as Dask does, will make it easy for us to provide a command like esmvaltool config that creates relevant example configuration files for the user, instead of a single large configuration file with commented-out details.
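As a rough sketch of how the configuration directory could be located, the Python snippet below gives precedence to a command line value, then an environment variable, then the default directory, and returns the YAML files in a deterministic (sorted) order. The environment variable name `ESMVALTOOL_CONFIG_DIR`, the function name `find_config_files`, and the precedence order are assumptions made for illustration; none of them has been decided.

```python
import os
from pathlib import Path

DEFAULT_CONFIG_DIR = Path("~/.config/esmvaltool")
CONFIG_DIR_ENV_VAR = "ESMVALTOOL_CONFIG_DIR"  # hypothetical name


def find_config_files(cli_config_dir=None, environ=None):
    """Return the YAML files in the configuration directory, sorted by name.

    Assumed precedence: command line argument, then environment variable,
    then the default directory.
    """
    environ = os.environ if environ is None else environ
    config_dir = (
        cli_config_dir
        or environ.get(CONFIG_DIR_ENV_VAR)
        or DEFAULT_CONFIG_DIR
    )
    config_dir = Path(config_dir).expanduser()
    if not config_dir.is_dir():
        return []
    # Sorting makes the order in which files are merged reproducible.
    return sorted(
        path
        for path in config_dir.iterdir()
        if path.suffix in {".yml", ".yaml"}
    )
```

The resulting list of files could then be parsed and merged one after the other into a single dictionary.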
For a smooth transition, we will keep supporting the existing configuration key: value pairs in config-user.yml, but add new ones as well.
Extensive example
Below is an example of what the future configuration file(s) could look like. Note that this is very extensive in order to show all possibilities; real users would very likely need something much smaller.
config.yml (from current config-user.yml)
output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: null
log_level: info
remove_preproc_dir: true
# This could be replaced by the section `data` under `projects` (see below) in the future
# config_developer_file: null
# rootpath:
#   default: ~/climate_data
# drs:
#   CMIP3: ESGF
#   CMIP5: ESGF
#   CMIP6: ESGF
#   CORDEX: ESGF
#   obs4MIPs: ESGF
dask.yml (from current dask.yml, see #2040 and #2369)
dask:
  client:
  run: compute  # Start the `compute` cluster defined below
  clusters:
    local:
      type: distributed.LocalCluster
      n_workers: 2
      threads_per_worker: 2
      memory_limit: 4GiB
    compute:
      type: dask_jobqueue.SLURMCluster
      queue: compute
      account: bk1088
      cores: 64
      memory: 4GiB
      processes: 32
      interface: ib0
      local_directory: "/scratch/b/b381141/dask-tmp"
      n_workers: 32
    basic:
      type: default
      scheduler: threaded
      num_workers: 4
    debug:
      type: default
      scheduler: single-threaded
esgf-pyclient.yml (from current config-user.yml and esgf-pyclient.yml)
esgf:
  search_esgf: when_missing
  download_dir: ~/climate_data
  search_connection:
    expire_after: 2592000  # the number of seconds in a month
    urls:
      - 'https://esg-dn1.nsc.liu.se/esg-search'
      - 'https://esgf.ceda.ac.uk/esg-search'
      - 'https://esgf-data.dkrz.de/esg-search'
      - 'https://esgf-node.llnl.gov/esg-search'
  logon:
    hostname: "esgf-data.dkrz.de"
    username: "cookiemonster"
    password: "Welcome01"
data-dkrz.yml
This would replace rootpath and drs in config-user.yml and the related input_dir and input_file in config-developer.yml. This has the advantage that all information is available in one place, making it easier to understand. See #1894 for previous discussion. The format is also extensible, to add support for e.g. intake-esm or intake-esgf (see next example).
projects:
  CMIP6:
    data:
      CMIP6-local:
        type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
data-intake.yml (example of how intake-esm could be configured #31)
projects:
  CMIP6:
    data:
      CMIP6-intake-esm:
        type: esmvalcore.intake.IntakeDataSource
        file: '/pool/data/Catalogs/levante-cmip6.json'
        facets:
          # mapping from recipe facets to intake-esm catalog facets
          activity: activity_id
          dataset: source_id
          ensemble: member_id
          exp: experiment_id
          grid: grid_label
          institute: institution_id
          mip: table_id
          short_name: variable_id
          version: version
projects.yml (would replace the CMOR-table-related and output_file settings in config-developer.yml)
projects:
  CMIP6:
    cmor_table:
      strict: true
      type: 'CMIP6'
    output_file: '{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}_{grid}'
  XYZ_project:
    # Example of a custom project
    cmor_table:
      strict: false
      type: CMIP6
      path: /path/to/CMOR_table/
      default_table_prefix: XYZ_
    output_file: '{project}_{dataset}_{short_name}'
extra_facets.yml (see esmvalcore/config/extra_facets for defaults)
projects:
  CMIP5:
    extra_facets:
      'ACCESS1-0':
        '*':
          '*':
            institute: ['CSIRO-BOM']
      'ACCESS1-3':
        '*':
          '*':
            institute: ['CSIRO-BOM']
      'bcc-csm1-1':
        '*':
          '*':
            institute: ['BCC']
references.yml (see config-references.yml for current defaults, this would finally make it easier to avoid #28)
references:
  citation_dir: ~/ESMValTool/esmvaltool/references
  authors:
    andela_bouwe:
      name: Andela, Bouwe
      institute: NLeSC, Netherlands
      email: b.andela@esciencecenter.nl
      orcid: https://orcid.org/0000-0001-9005-8940
      github: bouweandela
    schlund_manuel:
      name: Schlund, Manuel
      institute: DLR, Germany
      email: manuel.schlund@dlr.de
      orcid: https://orcid.org/0000-0001-5251-0158
      github: schlunma
esmvaltool.yml (from current config-user.yml with potential future diagnostics package specification)
diagnostics:
  package: esmvaltool
  package_path: ~/ESMValTool
  output_file_type: png
Simple example
This example shows what the current configuration on my laptop would look like in the new format.
config.yml
output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: 1
data-esgf.yml
esgf:
  download_dir: ~/climate_data
  search_esgf: always
  search_connection:
    expire_after: 864000  # 10 days
    urls:
      - 'https://esg-dn1.nsc.liu.se/esg-search'
      - 'https://esgf-data.dkrz.de/esg-search'
projects:
  CMIP6:
    data:
      CMIP6-ESGF:
        path: ~/climate_data
        dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
  CMIP5:
    data:
      CMIP5-ESGF:
        path: ~/climate_data
        dirname: '{project.lower}/{product}/{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
  obs4MIPs:
    data:
      obs4MIPs-ESGF:
        path: ~/climate_data
        dirname: '{project}/{dataset}/{version}'
        filename: '{short_name}_*.nc'
data-obs.yml
projects:
  native6:
    data:
      native6-local:
        path: ~/climate_data
        dirname: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
        filename: '*.nc'
dask.yml
dask:
  run: local
  clusters:
    local:
      type: distributed.LocalCluster
      n_workers: 2
      threads_per_worker: 2
      memory_limit: 4GiB
    basic:
      type: default
      scheduler: threaded
      num_workers: 2
    debug:
      type: default
      scheduler: single-threaded
Compute cluster example
On a compute cluster, e.g. Levante, the simple example above would be extended with an extra data sources file:
data-levante.yml
projects:
  CMIP6:
    data:
      CMIP6-levante:
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
  CMIP5:
    data:
      CMIP5-levante:
        path: /work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ
        dirname: '{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}/{short_name}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
  native6:
    data:
      native6-levante:
        path: /work/bd0854/DATA/ESMValTool2/RAWOBS
        dirname: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
        filename: '*.nc'
About replacing the rootpath, drs, input_dir, and input_filename settings
I realize that the way of specifying rootpath/dirname/filename in the above examples looks more complicated than what we currently have. What I like about it is that it is explicit and simple: there is no longer a need to find out about the 'hidden' config-developer.yml file to understand what this is actually doing, and there is no longer a lot of magic going on (is this setting a string or a list, what does default mean?). I think that will benefit new users. See also #1894 (comment) for previous discussions on the topic.
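To make the "explicit and simple" point concrete, here is a small Python sketch of how a path, dirname, and filename from a data source entry could be combined with the facets of a dataset into a single glob pattern. This is not how ESMValCore does it today; the function name `build_glob_pattern` and the facet values are hypothetical, and facets that are not given (such as the version here) are assumed to become a `*` wildcard. Special forms like `{project.lower}` are deliberately not handled in this sketch.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_glob_pattern(path, dirname, filename, facets):
    """Fill the dirname and filename templates with facets (hypothetical helper).

    Facets that are missing from ``facets`` are replaced by ``*`` so that
    the result can be used as a glob pattern.
    """
    values = defaultdict(lambda: "*", facets)
    template = dirname + "/" + filename
    return str(PurePosixPath(path) / template.format_map(values))


# Illustrative facets for a single variable; the version is left open.
facets = {
    "activity": "CMIP",
    "institute": "MPI-M",
    "dataset": "MPI-ESM1-2-LR",
    "exp": "historical",
    "ensemble": "r1i1p1f1",
    "mip": "Amon",
    "short_name": "tas",
    "grid": "gn",
}
pattern = build_glob_pattern(
    path="/work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ",
    dirname="{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}",
    filename="{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc",
    facets=facets,
)
print(pattern)
```

Everything needed to understand where the files are expected is visible in the three settings of the data source itself, without consulting a second configuration file.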
Timeline for implementation
To set the expectations: this design is intended as a long-term strategy that can give guidance when making smaller improvements to the tool, not something that can immediately be implemented. Currently, no member of the @ESMValGroup/technical-lead-development-team has a funded proposal in which a large task like this could be taken on.
Ideas welcome
@ESMValGroup/esmvaltool-developmentteam If you have ideas on how to make this better, please share them in a comment below or at one of the community meetings.