New configuration file plan

Here is the design for the new way of organizing our configuration files that we decided on at the last workshop. We will use a simplified version of the [Dask configuration system](https://docs.dask.org/en/stable/configuration.html). That means that the configuration will consist of nested dictionaries that are organized per component of the tool. There is a default configuration shipped with ESMValCore and ESMValTool, which will be updated with the configuration specified by the user through one or more YAML files in a user-specified directory.

The configuration will be stored in arbitrarily named files in a directory, e.g. `~/.config/esmvaltool`. Users can decide whether they want to use multiple files or keep everything in one file. This directory will be configurable from the command line, through an environment variable, and possibly from the Python API as well. Having a configuration that is merged from multiple files into a single dictionary, like Dask has, will make it easy for us to provide a command like `esmvaltool config` that creates _relevant_ example configuration files for the user, instead of a single large configuration file with commented-out details.
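To illustrate the merging idea, here is a minimal sketch of how defaults could be updated with the contents of several user files. It assumes the YAML files have already been parsed into dictionaries; the function name `merge` and the example values are hypothetical and not part of ESMValCore.

```python
import copy


def merge(base, update):
    """Return a new dict in which `update` is recursively merged into `base`."""
    result = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge them key by key.
            result[key] = merge(result[key], value)
        else:
            # Scalars, lists, and new keys simply replace the old value.
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical defaults shipped with the tool.
defaults = {
    "output_dir": "~/esmvaltool_output",
    "esgf": {"search_esgf": "never", "download_dir": "~/climate_data"},
}
# Hypothetical contents of two user files, already parsed from YAML.
user_files = [
    {"esgf": {"search_esgf": "when_missing"}},
    {"max_parallel_tasks": 1},
]

config = defaults
for user_config in user_files:
    config = merge(config, user_config)
```

With this approach, later files override earlier ones per key, while untouched keys such as `esgf.download_dir` keep their default value.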

For a smooth transition, we will keep supporting the existing configuration key: value pairs in config-user.yml, but add new ones as well.

## Extensive example

Below is an example of what the future configuration file(s) could look like. Note that this is very extensive to show all possibilities; real users would very likely need something much smaller.

config.yml (from current [config-user.yml](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/quickstart/configure.html#user-configuration-file))
```yaml
output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: null
log_level: info
remove_preproc_dir: true

# This could be replaced by the section `data` under `projects` (see below) in the future
# config_developer_file: null
# rootpath:
#   default: ~/climate_data
# drs:
#   CMIP3: ESGF
#   CMIP5: ESGF
#   CMIP6: ESGF
#   CORDEX: ESGF
#   obs4MIPs: ESGF
```

dask.yml (from current [dask.yml](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/quickstart/configure.html#dask-distributed-configuration), see #2040 and #2369)
```yaml
dask:
  client:
  run: compute  # Start the `compute` cluster defined below
  clusters:
    local:
      type: distributed.LocalCluster
      n_workers: 2
      threads_per_worker: 2
      memory_limit: 4GiB
    compute:
      type: dask_jobqueue.SLURMCluster
      queue: compute
      account: bk1088
      cores: 64
      memory: 4GiB
      processes: 32
      interface: ib0
      local_directory: "/scratch/b/b381141/dask-tmp"
      n_workers: 32
    basic:
      type: default
      scheduler: threaded
      num_workers: 4
    debug:
      type: default
      scheduler: single-threaded
```
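As a rough sketch of how such a `dask` section might be interpreted: the `run` key names one entry under `clusters`, its `type` identifies what to start, and the remaining keys become keyword arguments. The helper names `select_cluster` and `split_type` are hypothetical, and nothing is imported or started here.

```python
def select_cluster(dask_config):
    """Return the type and keyword arguments of the cluster named by `run`."""
    name = dask_config["run"]
    settings = dict(dask_config["clusters"][name])  # copy, keep config intact
    cluster_type = settings.pop("type")
    return cluster_type, settings


def split_type(cluster_type):
    """Split e.g. 'distributed.LocalCluster' into module and class name."""
    module_name, _, class_name = cluster_type.rpartition(".")
    return module_name, class_name


# A subset of the configuration above, as it would look after parsing.
dask_config = {
    "run": "local",
    "clusters": {
        "local": {
            "type": "distributed.LocalCluster",
            "n_workers": 2,
            "threads_per_worker": 2,
            "memory_limit": "4GiB",
        },
        "debug": {"type": "default", "scheduler": "single-threaded"},
    },
}

cluster_type, kwargs = select_cluster(dask_config)
module_name, class_name = split_type(cluster_type)
```

An actual implementation could then import `module_name`, look up `class_name`, and call it with `kwargs`; how the `default` type is handled is left open here.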

esgf-pyclient.yml (from current config-user.yml and [esgf-pyclient.yml](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/quickstart/configure.html#esgf-configuration))
```yaml
esgf:
  search_esgf: when_missing
  download_dir: ~/climate_data
  search_connection:
    expire_after: 2592000  # the number of seconds in a month
    urls:
      - 'https://esg-dn1.nsc.liu.se/esg-search'
      - 'https://esgf.ceda.ac.uk/esg-search'
      - 'https://esgf-data.dkrz.de/esg-search'
      - 'https://esgf-node.llnl.gov/esg-search'
  logon:
    hostname: "esgf-data.dkrz.de"
    username: "cookiemonster"
    password: "Welcome01"
```

data-dkrz.yml

This would replace `rootpath` and `drs` in [config-user.yml](https://github.com/ESMValGroup/ESMValCore/blob/20943c1b7f1da83827a6dd9bb663832372f368ea/esmvalcore/config-user.yml#L94-L110) and the related `input_dir` and `input_file` in [config-developer.yml](https://github.com/ESMValGroup/ESMValCore/blob/20943c1b7f1da83827a6dd9bb663832372f368ea/esmvalcore/config-developer.yml#L32-L40). This has the advantage that all information is available in one place, making it easier to understand. See https://github.com/ESMValGroup/ESMValCore/pull/1894 for previous discussion. The format is also extensible, to add support for e.g. intake-esm or intake-esgf (see next example).

```yaml
projects:
  CMIP6:
    data:
      CMIP6-local:
        type: esmvalcore.local.LocalDataSource  # this could be omitted for local data
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
```
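To make the role of `path`, `dirname`, and `filename` concrete, the sketch below fills the templates with facets and produces a glob pattern. It assumes that facets missing from the recipe become a `*` wildcard; the helper names and the facet values are hypothetical illustrations, not the ESMValCore implementation.

```python
from pathlib import PurePosixPath


class WildcardFacets(dict):
    """Facet mapping that yields '*' for any facet that is not specified."""

    def __missing__(self, key):
        return "*"


def glob_pattern(source, facets):
    """Combine path, dirname, and filename of a data source into one pattern."""
    values = WildcardFacets(facets)
    dirname = source["dirname"].format_map(values)
    filename = source["filename"].format_map(values)
    return str(PurePosixPath(source["path"]) / dirname / filename)


source = {
    "path": "/work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ",
    "dirname": "{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}",
    "filename": "{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc",
}
# Hypothetical facets from a recipe; `version` is left out on purpose.
facets = {
    "activity": "CMIP",
    "institute": "MPI-M",
    "dataset": "MPI-ESM1-2-LR",
    "exp": "historical",
    "ensemble": "r1i1p1f1",
    "mip": "Amon",
    "short_name": "tas",
    "grid": "gn",
}

pattern = glob_pattern(source, facets)
```

The resulting pattern could be passed to a globbing function to find the matching files. Template fields with attribute access, such as `{project.lower}`, would need extra handling that this sketch does not cover.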

data-intake.yml (example of how intake-esm could be configured #31)
```yaml
projects:
  CMIP6:
    data:
      CMIP6-intake-esm:
        type: esmvalcore.intake.IntakeDataSource
        file: '/pool/data/Catalogs/levante-cmip6.json'
        facets:
          # mapping from recipe facets to intake-esm catalog facets
          activity: activity_id
          dataset: source_id
          ensemble: member_id
          exp: experiment_id
          grid: grid_label
          institute: institution_id
          mip: table_id
          short_name: variable_id
          version: version
```
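The `facets` mapping above could be applied roughly as follows: recipe facet names are translated to catalog column names before the catalog is queried. This is a sketch only; `to_catalog_query` and the example values are hypothetical, and no intake-esm call is made.

```python
def to_catalog_query(recipe_facets, facet_map):
    """Translate recipe facets into catalog facets, dropping unmapped ones."""
    return {
        facet_map[name]: value
        for name, value in recipe_facets.items()
        if name in facet_map
    }


# Part of the mapping from the configuration above.
facet_map = {
    "dataset": "source_id",
    "exp": "experiment_id",
    "mip": "table_id",
    "short_name": "variable_id",
}
# Hypothetical facets from a recipe; `timerange` has no catalog equivalent.
recipe_facets = {
    "dataset": "MPI-ESM1-2-LR",
    "exp": "historical",
    "mip": "Amon",
    "short_name": "tas",
    "timerange": "2000/2010",
}

query = to_catalog_query(recipe_facets, facet_map)
```

The resulting dictionary has the shape that a catalog search would typically expect as keyword arguments.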

projects.yml (would replace the CMOR table related and `output_file` settings in [config-developer.yml](https://github.com/ESMValGroup/ESMValCore/blob/20943c1b7f1da83827a6dd9bb663832372f368ea/esmvalcore/config-developer.yml#L108-L121))
```yaml
projects:
  CMIP6:
    cmor_table:
      strict: true
      type: 'CMIP6'
    output_file: '{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}_{grid}'
  XYZ_project:
    # Example of a custom project
    cmor_table:
      strict: false
      type: CMIP6
      path: /path/to/CMOR_table/
      default_table_prefix: XYZ_
    output_file: '{project}_{dataset}_{short_name}'
```

extra_facets.yml (see [esmvalcore/config/extra_facets](https://github.com/ESMValGroup/ESMValCore/tree/main/esmvalcore/config/extra_facets) for defaults)
```yaml
projects:
  CMIP5:
    extra_facets:
      'ACCESS1-0':
        '*':
          '*':
            institute: ['CSIRO-BOM']
      'ACCESS1-3':
        '*':
          '*':
            institute: ['CSIRO-BOM']
      'bcc-csm1-1':
        '*':
          '*':
            institute: ['BCC']
```
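A possible way to look up extra facets with wildcards is sketched below. It assumes that the three nesting levels are dataset, MIP table, and variable short name, and that matching uses shell-style patterns; the function `get_extra_facets` here is a hypothetical illustration.

```python
from fnmatch import fnmatchcase


def get_extra_facets(extra_facets, dataset, mip, short_name):
    """Collect the extra facets of all entries whose patterns match."""
    result = {}
    for dataset_pattern, mips in extra_facets.items():
        if not fnmatchcase(dataset, dataset_pattern):
            continue
        for mip_pattern, variables in mips.items():
            if not fnmatchcase(mip, mip_pattern):
                continue
            for variable_pattern, facets in variables.items():
                if fnmatchcase(short_name, variable_pattern):
                    result.update(facets)
    return result


# The `extra_facets` section for CMIP5 from the configuration above.
extra_facets = {
    "ACCESS1-0": {"*": {"*": {"institute": ["CSIRO-BOM"]}}},
    "ACCESS1-3": {"*": {"*": {"institute": ["CSIRO-BOM"]}}},
    "bcc-csm1-1": {"*": {"*": {"institute": ["BCC"]}}},
}

access_facets = get_extra_facets(extra_facets, "ACCESS1-0", "Amon", "tas")
unknown_facets = get_extra_facets(extra_facets, "some-model", "Amon", "tas")
```

Datasets without a matching entry simply get no extra facets.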

references.yml (see [config-references.yml](https://github.com/ESMValGroup/ESMValTool/blob/main/esmvaltool/config-references.yml) for current defaults, this would finally make it easier to avoid #28)
```yaml
references:
  citation_dir: ~/ESMValTool/esmvaltool/references
  authors:
    andela_bouwe:
      name: Andela, Bouwe
      institute: NLeSC, Netherlands
      email: b.andela@esciencecenter.nl
      orcid: https://orcid.org/0000-0001-9005-8940
      github: bouweandela
    schlund_manuel:
      name: Schlund, Manuel
      institute: DLR, Germany
      email: manuel.schlund@dlr.de
      orcid: https://orcid.org/0000-0001-5251-0158
      github: schlunma
```

esmvaltool.yml (from current config-user.yml with potential future diagnostics package specification)
```yaml
diagnostics:
  package: esmvaltool
  package_path: ~/ESMValTool
  output_file_type: png
```

## Simple example

This example shows what the current configuration on my laptop would look like in the new format.

config.yml
```yaml
output_dir: ~/esmvaltool_output
auxiliary_data_dir: ~/auxiliary_data
max_parallel_tasks: 1
```

data-esgf.yml
```yaml
esgf:
  download_dir: ~/climate_data
  search_esgf: always
  search_connection:
    expire_after: 864000 # 10 days
    urls:
      - 'https://esg-dn1.nsc.liu.se/esg-search'
      - 'https://esgf-data.dkrz.de/esg-search'

projects:
  CMIP6:
    data:
      CMIP6-ESGF:
        path: ~/climate_data
        dirname: '{project}/{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
  CMIP5:
    data:
      CMIP5-ESGF:
        path: ~/climate_data
        dirname: '{project.lower}/{product}/{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
  obs4MIPs:
    data:
      obs4MIPs-ESGF:
        path: ~/climate_data
        dirname: '{project}/{dataset}/{version}'
        filename: '{short_name}_*.nc'
```

data-obs.yml
```yaml
projects:
  native6:
    data:
      native6-local:
        path: ~/climate_data
        dirname: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
        filename: '*.nc'
```

dask.yml
```yaml
dask:
  run: local
  clusters:
    local:
      type: distributed.LocalCluster
      n_workers: 2
      threads_per_worker: 2
      memory_limit: 4GiB
    basic:
      type: default
      scheduler: threaded
      num_workers: 2
    debug:
      type: default
      scheduler: single-threaded
```

## Compute cluster example

On a compute cluster, e.g. Levante, the simple example above would be extended with an extra data sources file:

data-levante.yml
```yaml
projects:
  CMIP6:
    data:
      CMIP6-levante:
        path: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
        dirname: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
  CMIP5:
    data:
      CMIP5-levante:
        path: /work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ
        dirname: '{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}/{short_name}'
        filename: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
  native6:
    data:
      native6-levante:
        path: /work/bd0854/DATA/ESMValTool2/RAWOBS
        dirname: 'Tier{tier}/{dataset}/{version}/{frequency}/{short_name}'
        filename: '*.nc'
```

## About replacing the `rootpath`, `drs`, `input_dir`, and `input_file` settings

I realize that the way to specify rootpath/dirname/filename in the above examples looks more complicated than what we currently have. What I like about it is that it is explicit and simple: there is no longer a need to find out about the 'hidden' config-developer.yml file to understand what this is actually doing, and the complicating magic goes away (is this setting a string or a list, what does `default` mean?). I think that will benefit new users. See also https://github.com/ESMValGroup/ESMValCore/pull/1894#issuecomment-1428667217 for previous discussions on the topic.

## Timeline for implementation

To set the expectations: this design is intended as a long-term strategy that can give guidance when making smaller improvements to the tool, not something that can immediately be implemented. Currently, no member of the @ESMValGroup/technical-lead-development-team has a funded proposal in which a large task like this could be taken on.

## Ideas welcome

@ESMValGroup/esmvaltool-developmentteam If you have ideas on how to make this better, please share them in a comment below or at one of the community meetings.
