From c35f3a04721bf0b9b474bd4df5aa8e8da863ad32 Mon Sep 17 00:00:00 2001 From: juacrumar Date: Thu, 30 Mar 2023 17:40:58 +0200 Subject: [PATCH 1/4] squash first draft of commondata documentation as starting point first draft of the documented new commondata format add definition of the version key add explanation for the variants and the theory Apply suggestions from code review Co-authored-by: Felix Hekhorn Update new-commondata.rst update docs Update doc/sphinx/source/data/new-commondata.rst Update doc/sphinx/source/data/new-commondata.rst Update doc/sphinx/source/data/new-commondata.rst Update new-commondata.rst update docs with the definition of the old:new mapping Update doc/sphinx/source/data/new-commondata.rst --- ...ntion.md => dataset-naming-convention.rst} | 55 ++-- doc/sphinx/source/data/new-commondata.rst | 234 ++++++++++++++++++ 2 files changed, 261 insertions(+), 28 deletions(-) rename doc/sphinx/source/data/{dataset-naming-convention.md => dataset-naming-convention.rst} (55%) create mode 100644 doc/sphinx/source/data/new-commondata.rst diff --git a/doc/sphinx/source/data/dataset-naming-convention.md b/doc/sphinx/source/data/dataset-naming-convention.rst similarity index 55% rename from doc/sphinx/source/data/dataset-naming-convention.md rename to doc/sphinx/source/data/dataset-naming-convention.rst index 38cd86f5e3..daed57803d 100644 --- a/doc/sphinx/source/data/dataset-naming-convention.md +++ b/doc/sphinx/source/data/dataset-naming-convention.rst @@ -3,58 +3,57 @@ NNPDF's dataset naming convention ================================= Each dataset implemented in NNPDF must have a unique name, which is a string -constructed following this [Backus–Naur form]: +constructed following this [Backus–Naur form]:: -``` - ::= "_" - | "_" "_" - | "_" "_" - | "_" "_" "_" + ::= "_" + | "_" "_" + | "_" "_" + | "_" "_" "_" - ::= "ATLAS" | "BCDMS" | "CHORUS" | "CMS" | "E605" | "E866" - | "E906" | "EMC" | "HERA" | "LHCB" | "NMC" | "NNPDF" | "NUTEV" + ::= "ATLAS" | 
"BCDMS" | "CHORUS" | "CMS" | "E605" | "E866" + | "E906" | "EMC" | "HERA" | "LHCB" | "NMC" | "NNPDF" | "NUTEV" - ::= "1JET" | "2JET" | "CC" | "DY" | "H" | "HVBF" | "INTEG" | "NC" - | "POS" | "TTB" | "WM" | "WMWP" | "WP" | "WPZ" | "ZPT" + ::= "1JET" | "2JET" | "CC" | "DY" | "H" | "HVBF" | "INTEG" | "NC" + | "POS" | "TTB" | "WM" | "WMWP" | "WP" | "WPZ" | "ZPT" - ::= TODO + ::= TODO - ::= TODO + ::= TODO - ::= "GEV" | "TEV" + ::= "GEV" | "TEV" - ::= - | "_" - | "_" "_" - | "_" "_" "_" + ::= + | "_" + | "_" "_" + | "_" "_" "_" -``` Experiments =========== -- [`ATLAS`](https://home.cern/science/experiments/atlas): A Large Toroidal +- `ATLAS `_: A Large Toroidal Aparatus - BCDMS: TODO - CHORUS: TODO -- [`CMS`](https://home.cern/science/experiments/cms): Compact Muon Solenoid +- `CMS `_: Compact Muon Solenoid - E605: TODO - E866: TODO - E906: TODO - EMC: TODO -- [`HERA`](https://dphep.web.cern.ch/accelerators/hera): Hadron Elektron Ring +- `HERA `: Hadron Elektron Ring Anlage. While technically speaking this is an accelerator, this string is used for the combined analyses of H1 and ZEUS. -- [`LHCB`](https://home.cern/science/experiments/lhcb): +- `LHCB `_: - NMC: TODO -- [`NNPDF`](https://nnpdf.mi.infn.it/): This experiment name is used for two +- `NNPDF `_: This experiment name is used for two purposes: - 1. for auxiliary datasets needed in the PDF fit, for instance `INTEG` and - `POS` - 2. for predictions used in NNPDF papers to study the impact of PDFs in - processes not included in its PDF fit + + 1. for auxiliary datasets needed in the PDF fit, for instance `INTEG` and `POS` + 2. 
for predictions used in NNPDF papers to study the impact of PDFs in processes not included in its PDF fit - NUTEV: TODO + + Processes ========= @@ -81,4 +80,4 @@ Processes - `ZPT`: production of two same-flavor opposite-sign leptons with non-zero total transverse momentum (Z-boson pt spectrum) -[Backus–Naur form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) +`Backus–Naur form `_ diff --git a/doc/sphinx/source/data/new-commondata.rst b/doc/sphinx/source/data/new-commondata.rst new file mode 100644 index 0000000000..ef549a0c9d --- /dev/null +++ b/doc/sphinx/source/data/new-commondata.rst @@ -0,0 +1,234 @@ +Naming convention and organization of the datasets +-------------------------------------------------- + +All datasets in the new data format follow the exact same naming convention:: + + __{_}_ + +The data is contained in folders, each folder containing one single hepdata publication. +In all cases one can reconstruct the name of the folder by separating the observable name on the last ``_``, i.e., the folder will always be named:: + + __{_} + +Where all observables contained in one hepdata entry are separated by their observable name. + +Each folder will contain one single metadata file named ``metadata.yaml`` which defines all observables implemented for a given dataset. + +In order to keep backward compatibility and ease the comparison between new and old commondata, the ``buildmaster/dataset_names.yml`` file keeps a mapping of the datasets implemented in both formats. +When a ``legacy`` variant is available, the usage of the old name automatically enables such variants. The format of this mapping is as follow (which enables using variants): + +.. 
code-block:: yaml + + old_name_1: new_name_1 + old_name_2: + dataset: new_name_2 + variant: this_particular_variant + + +Metadata Format +--------------- + +This ``metadata.yaml`` file contains a first portion of general information which might be shared by several sets and a list of ``implemented_observables`` which define the separate observables. + + +.. code-block:: yaml + + setname: "EXPERIMENT_PROCESS_ENERGY{_EXTRA}" + + version: 1 + version_comment: "Initial implementation" + + # References + arXiv: + url: "" + iNSPIRE: + url: "https://inspirehep.net/literature/302822" + hepdata: + url: "https://www.hepdata.net/record/ins302822" + version: 1 + + nnpdf_metadata: + nnpdf31_process: "PROCESS" + experiment: "EXPERIMENT_NAME" + + implemented_observables: + - observable_name: "OBS" + observable: + description: "Description of the observable" + label: "Latex label for the observable" + units: "[u]" + ndata: n_of_datapoints + tables: [n, j, k] # (optional) corresponding tables in the hepdata entry + npoints: [n, j, k] # (optional) number of points per table + process_type: INC # for instance, INC, JET, DIJET, etc + + # Plotting information (for instance, the kinematics variable could be pt, mt, q2) + plotting: + dataset_label: "Label to be used in reports" + kinematics_override: identity + x_scale: log + plot_x: var_1 + figure_by: + - var_2 + + kinematic_coverage: [var_1, var_2, var_3] + + kinematics: + variables: + var_1: {description: "Description of var", label: "latex", units: "u"} + var_2: {description: "Description of var", label: "latex", units: "u"} + var_3: {description: "Description of var", label: "latex", units: "u"} + file: kinematics.yaml + + data_central: data.yaml + data_uncertainties: + - uncertainties.yaml + - uncertainties_2.yaml + + # Having variants is optional + # variants can overwrite the data_uncertainties + variants: + different_errors: + data_uncertainties: + - uncertainties.yaml + - uncertainties_3.yaml + + # The theory field is always 
optional + theory: + FK_tables: + - - DYE605 + operation: 'null' + + + + +Versioning +~~~~~~~~~~ + +The initial version of a dataset should be set to ``version: 1``. +Any change on a dataset should be *always* accompanied of a version bump and a ``version_comment`` explaining the update. +This will allow to keep an exact tracking of all changes to every dataset even if they change over time. + +Variants +~~~~~~~~ + +In some occasions we might want to maintain two variations of the same observable. +For instance, we might have two incompatible sources of uncertainties. In such case a variant can be added. +The syntax of the ``variants`` is. + +Theory +~~~~~~ + +The theory field defines how predictions for the dataset are to be computed. +It includes two entries: + +- ``FK_tables``: this is a list of lists which defines the FK Tables to be loaded. The outermost list are the operands (in case an operation is needed to recover the observable, more on that below). The innermost list are the grids that are to be concatenated in order to form the operands. +- ``operaton``: operation to be applied in order to compute the observable + +Example: + +.. code-block:: yaml + theory: + FK_tables: + - - Z_contribution + - Wp_contribution + - Wm_total + - - total_xs + operation: 'ratio' + +In this case the ``fktables`` for the Z, W+ and W- contributions will be concatenated (the dataset might include predictions for all three contributions). +After that, the final observable will be computed by taking the ratio of the concatenation of all those observables and the total cross section (``total_xs``). + + +.. code-block:: yaml + + data_uncertainties: + - uncertainties.yaml + + variants: + name_of_the_variant: + data_uncertainties: + - uncertainties.yaml + - extra_uncertainties.yaml + another_variant: + data_uncertainties: + - different_uncertainties.yaml + + +When loading this dataset with no variant only the ``uncertainties.yaml`` file will be read. 
+Instead, when choosing ``variant: name_of_the_variant``, both ``uncertainties.yaml`` and ``extra_uncertainties.yaml`` will be loaded. +Note that if we want to substitute the default set of uncertainties we just need to not include it in the variant (as done in ``another_variant``). + + +Data +---- + +The format of the data is a ``yaml`` file with an entry ```data_central``` which is a list for all values for all bins. + +.. code-block:: yaml + + data_central: + - val1 + - val2 + - val3 + +Uncertainties +------------- + +The uncertainties are (also) ``.yaml`` files. +Note that in the ``metadata.yaml`` the ``data_uncertainties`` entry is given as a list. +When using more than one uncertainty file they will be concatenated. +This allows the user the flexibility of creating variants where only a subset of the uncertainties are modified. + +The format of the uncertainty files is of two fields, a ``definitions`` field that contains metadata about all the uncertainties (their name, their treatment (``ADD`` or ``MULT``) and their type) and a second field ``bins`` which is a list of mappings with as many entries as the `data_central` with the named uncertainties. + +Note that, regardless of their treatment type, the uncertainties should always be written as absolute values and not relative to the data values. + +.. code-block:: yaml + + definitions: + stat: + description: + treatment: + type: + error_name: + description: + treatment: + type: + error_name_2: + description: + treatment: + type: + bins: + - stat: + error_name: + error_name_2: + +Kinematics: +----------- +The kinematics file follow a convention very similar to the uncertainties file, where the ``definitions`` field is skipped since that information is already contained in the parent ``metadata.yaml`` file. + +Therefore, we have a list of ``bins`` (of the same size as the list for `data_central`) and for each entry we have the information of all the variables. + +.. 
code-block:: yaml + + bins: + - var_1: + min: 0 + max: 1 + mid: 0.5 + var_2: + min: 0 + max: 1 + mid: 0.5 + +Plotting +~~~~~~~~ + +The ``plotting`` section defines the plotting style inside ``validphys``. +In previous implementations there were per-process options that defined plotting options for family of processes. +In the commondata format defined in this page every plotting option must be defined in the ``plotting`` section of each observable. + +Internally within ``validphys`` only 3 kinematic variables are taken into account. The 3 selected variables (and their order) is defined by ``plotting::kinematic_coverage``. + +The name of the variables (which in this example are `var_1`, `var_2`, `var_3`) need to be the same in the plotting and kinematics. From af1be35341d3d93c7102edc06cd7dedd54c94584 Mon Sep 17 00:00:00 2001 From: juacrumar Date: Sun, 3 Mar 2024 12:51:13 +0100 Subject: [PATCH 2/4] add documentation for the new commondata format ; remove documentation for the old format --- conda-recipe/meta.yaml | 2 +- doc/sphinx/source/conf.py | 2 +- doc/sphinx/source/data/commondata.rst | 355 ++++++++++++++++++ doc/sphinx/source/data/data-config.rst | 32 +- .../source/data/dataset-naming-convention.rst | 35 +- .../source/data/example-fk-preamble.rst | 212 ----------- doc/sphinx/source/data/exp-data-files.rst | 245 ------------ .../source/data/fk-config-variables.rst | 18 - doc/sphinx/source/data/index.rst | 7 +- doc/sphinx/source/data/intro.rst | 44 +-- doc/sphinx/source/data/new-commondata.rst | 234 ------------ doc/sphinx/source/data/plotting-format.rst | 268 +++++++++++++ doc/sphinx/source/data/plotting_format.md | 303 --------------- doc/sphinx/source/external-code/apfelcomb.md | 59 --- doc/sphinx/source/external-code/index.rst | 1 - doc/sphinx/source/tutorials/apfelcomb.md | 300 --------------- doc/sphinx/source/tutorials/index.rst | 2 - pyproject.toml | 2 +- 18 files changed, 668 insertions(+), 1453 deletions(-) create mode 100644 
doc/sphinx/source/data/commondata.rst delete mode 100644 doc/sphinx/source/data/example-fk-preamble.rst delete mode 100644 doc/sphinx/source/data/exp-data-files.rst delete mode 100644 doc/sphinx/source/data/fk-config-variables.rst delete mode 100644 doc/sphinx/source/data/new-commondata.rst create mode 100644 doc/sphinx/source/data/plotting-format.rst delete mode 100644 doc/sphinx/source/data/plotting_format.md delete mode 100644 doc/sphinx/source/external-code/apfelcomb.md delete mode 100644 doc/sphinx/source/tutorials/apfelcomb.md diff --git a/conda-recipe/meta.yaml b/conda-recipe/meta.yaml index 163a88fa78..3e4a615a98 100644 --- a/conda-recipe/meta.yaml +++ b/conda-recipe/meta.yaml @@ -50,7 +50,7 @@ requirements: - requests - prompt_toolkit - validobj - - sphinx >=4.0.2 # documentation. Needs pinning becasue https://github.com/sphinx-doc/sphinx/issues/9216 + - sphinx >=5.0.2,<6 # documentation. Needs pinning temporarily due to markdown - recommonmark - sphinx_rtd_theme >0.5 - sphinxcontrib-bibtex diff --git a/doc/sphinx/source/conf.py b/doc/sphinx/source/conf.py index a155d04c18..f6caf9cac3 100644 --- a/doc/sphinx/source/conf.py +++ b/doc/sphinx/source/conf.py @@ -85,7 +85,7 @@ # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. -language = None +language = "en" # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. diff --git a/doc/sphinx/source/data/commondata.rst b/doc/sphinx/source/data/commondata.rst new file mode 100644 index 0000000000..0ea37f666d --- /dev/null +++ b/doc/sphinx/source/data/commondata.rst @@ -0,0 +1,355 @@ +.. _commondata: + +======================= +Experimental data files +======================= + +Data made available by experimental collaborations comes in a variety of +formats. 
For use in a fitting code, this data must be converted into a common +format that contains all the required information for use in PDF fitting. +Existing formats commonly used by the community, such as in `HepData `_, +are generally unsuitable, principally because they often do not fully describe the +breakdown of systematic uncertainties. Therefore, over several years an NNPDF +standard data format has been iteratively developed, now denoted ``CommonData``. + +This documentation describes the ``CommonData`` format +used in NNPDF starting from code version 4.0.10 and compatible with releases beyond 4.0. + + +Naming convention and organization of the datasets +-------------------------------------------------- + +All datasets in the new data format follow the exact same naming convention:: + + _ + +where the setname is defined by:: + + __{_} + +The naming convention for the set names is defined in the :ref:`naming convention documentation`. + +Each ```` defines a folder in which the data is contained. +While the separation of data in different folders can be arbitrary, +a folder cannot contain more than one hepdata entry +or datasets that mix different processes, energies or experiments. +For historical reasons and for backwards compatibility the special energy ``NOTFIXED`` is used +for datasets where more than one center of mass energy is used. +When in doubt, it is preferable to utilize two different folders. +The ```` string is free and can be used to disambiguate. + +The data downloaded or parsed from hepdata or other sources is kept in the +``/`` folder and is not installed with the rest of the code. +Each folder must contain a ``/metadata.yaml`` file, described below, which defines +all datasets implemented within the folder. +Only ``.yaml`` files are allowed to be installed together with the ``nnpdf`` code. 
+ +In order to keep backward compatibility and allow the reproducibility of the 4.0 family of fits +a ``dataset_names.yml`` file keeps a mapping of the datasets that were used in 4.0. +When using the old names in a runcard, ``validphys`` will automatically translate +them using this file. +The format of this mapping is as follows: + +.. code-block:: yaml + + old_name_1: + dataset: new_name_1 + variant: legacy + + +CommonData Metadata specification +--------------------------------- + +The ``metadata.yaml`` file unambiguously defines the datasets implemented within a folder. +The general structure is a first portion of general information (references, name of the set) +and a list of ``implemented_observables`` which define separate datasets. + + +Shared information +================== + + +.. code-block:: yaml + + setname: "EXPERIMENT_PROCESS_ENERGY{_EXTRA}" + + version: 1 + version_comment: "A comment about this version" + + # References + arXiv: + url: "https://arxiv.org/abs/XYZ.ABC" + iNSPIRE: + url: "https://inspirehep.net/literature/XYZ" + hepdata: + url: "https://www.hepdata.net/record/insXYZ" + version: 1 + + nnpdf_metadata: + nnpdf31_process: "PROCESS" + experiment: "EXPERIMENT_NAME" + + implemented_observables: + - observable_metadata_1 + - observable_metadata_2 + + +The header of the ``metadata.yaml`` file contains information shared among different datasets. + +Setname +~~~~~~~ + +Corresponds to the name of the set and must be equal to the name of the folder. It acts as a sanity check. + +Versioning +~~~~~~~~~~ + +The initial version of a dataset should be set to ``version: 1``. +Any change to a dataset should *always* be accompanied by a version bump and a ``version_comment`` explaining the update. +This allows exact tracking of all changes to every dataset even if they change over time due to bugs, updates in hepdata, etc. + +References +~~~~~~~~~~ + +References to the original source of the data. +This can be ``arXiv``, ``iNSPIRE`` or ``hepdata``. 
+All information must be provided unless it is explicitly missing. + +nnpdf_metadata +~~~~~~~~~~~~~~ + +Grouping information used internally by ``validphys`` up to NNPDF4.0. +It accepts the key ``experiment``, which should in general coincide +with the ``EXPERIMENT`` key in the ````, and the key ``nnpdf31_process``, +which is the process grouping information used in the 3.1 and 4.0 MHOU papers. + +Observable specific information +=============================== + +Within a ``metadata.yaml`` we can find one or more implemented datasets. +These correspond to different observables of a single measurement. +For instance, the LHCB publication of Z rapidity measurements at 13 TeV +(``setname: LHCB_Z0_13TEV``) contains two observables: Z decay into two electrons +and Z decay into two muons. +This set contains two datasets: ``LHCB_Z0_13TEV_DIELECTRON-Y`` and ``LHCB_Z0_13TEV_DIMUON-Y``. + +In the following we describe the metadata corresponding to the observable within the ``metadata.yaml`` file. + + +.. 
code-block:: yaml + + implemented_observables: + - observable_name: "DIMUON-Y" + process_type: "EWK_RAP" + tables: [5] + ndata: 18 + observable: + description: "Differential cross-section of Z-->µµ as a function of Z-rapidity" + label: '$d\sigma / d|y|$' + units: "[fb]" + kinematics: + file: kinematics_dimuon.yaml + variables: + y: {description: "Z boson rapidity", label: "$y$", units: ""} + M2: {description: "Z boson Mass", label: "$M^2$", units: "$GeV^2$"} + sqrts: {description: "Center of Mass Energy", label: '$\sqrt{s}$', units: "$GeV$"} + kinematic_coverage: [y, M2, sqrts] + data_central: data_dimuon.yaml + data_uncertainties: + - uncertainties_dimuon.yaml + variants: + example_variant: + data_uncertainties: + - uncertainties_different_treatment.yaml + theory: + FK_tables: + - - LHCB_DY_13TEV_DIMUON + operation: 'null' + conversion_factor: 1000.0 + # Plotting information + plotting: + dataset_label: "LHCb $Z\\to µµ$" + plot_x: y + y_label: '$d\sigma_{Z}/dy$ (fb)' + +``observable_name`` +~~~~~~~~~~~~~~~~~~~ +The observable name is used to construct the full name of the dataset ``_``. +It must be unique within a set and contain no ``_`` (as it could lead to confusion). + +``process_type`` +~~~~~~~~~~~~~~~~ +One of the processes defined in the ``process_options`` module at +``validphys/src/validphys2/process_options.py``. +This is used internally by ``validphys`` to describe the combination of observable +and process in various plots, to check that the kinematic variables utilized by the +dataset are sensible and to generate derived plots such as the ``x-q2`` kinematic coverage plots. + +``tables`` +~~~~~~~~~~ +Tables from the hepdata entries that have been used to construct the dataset. + +``ndata`` +~~~~~~~~~ +Number of datapoints in the dataset. +While this quantity could be derived from the data itself, +many other pieces (crucially backwards compatibility with cuts and theories) require +the number of datapoints to be set in stone. 
+If an update requires changing the number of datapoints, +it should be added as a separate observable. + +``observable`` +~~~~~~~~~~~~~~ +This is a dictionary with the entries ``description``, ``label`` and ``units``. +All entries must be LaTeX-compilable as they are used by various plotting routines in ``validphys``. + +``kinematics::file`` +~~~~~~~~~~~~~~~~~~~~ +A reference to a ``.yaml`` file containing all kinematic information. +The file contains a list of ``ndata`` ``bins``, where each bin +includes information about all variables. +When ``mid`` is not given, it will be automatically filled with the midpoint between ``min`` and ``max``. +Only ``mid`` is used for cuts, while ``min`` and ``max`` may be used for plotting routines. + +.. code-block:: yaml + + bins: + - var_1: + min: 0 + max: 1 + mid: 0.5 + var_2: + min: 0 + max: 1 + mid: 0.5 + +``kinematics::variables`` +~~~~~~~~~~~~~~~~~~~~~~~~~ +Metadata for each of the variables contained in the ``kinematics::file``. +The accepted entries are ``description``, ``label`` and ``units``. +LaTeX syntax is accepted and encouraged since these entries will be used by plotting routines. + +.. code-block:: yaml + + variables: + var_1: {description: "my var 1", label: "$m$", units: "GeV"} + + +``kinematic_coverage`` +~~~~~~~~~~~~~~~~~~~~~~ +A list of the variables within the kinematics file. + + +``data_central`` +~~~~~~~~~~~~~~~~ +A reference to a ``yaml`` file containing the measurement central data. +The format of the data is a ``yaml`` file with an entry ``data_central`` which +lists the values for all bins. + +.. code-block:: yaml + + data_central: + - val1 + - val2 + - val3 + +``data_uncertainties`` +~~~~~~~~~~~~~~~~~~~~~~ +A list of ``.yaml`` files containing the uncertainty information for the measurement. +When using more than one uncertainty file they will be concatenated. +This allows the user the flexibility of creating variants +where only a subset of the uncertainties are modified. 
+ +The uncertainty files contain two fields: a ``definitions`` field that contains +metadata about all the uncertainties (name, treatment (``ADD`` or ``MULT``) and type) +and a second field ``bins`` which is a list of mappings with ``ndata`` entries +with the named uncertainties. + +Note that, regardless of their treatment, uncertainties should always be written as absolute values +and not relative to the data values. If the data is updated, the uncertainties should be updated too. + +.. code-block:: yaml + + definitions: + stat: + description: + treatment: + type: + error_name: + description: + treatment: + type: + error_name_2: + description: + treatment: + type: + bins: + - stat: + error_name: + error_name_2: + + + + +``variants`` +~~~~~~~~~~~~ + +On some occasions we might want to maintain two variations of the same observable. +For instance, we might have two incompatible sources of uncertainties. In such a case a variant can be added. +These variants can overwrite certain keys if necessary. +When a variant is used, the key under the variant will be used instead of the key defined in the observable. + +A ``variant`` can only overwrite the entries ``data_central``, ``theory`` and ``data_uncertainties``. +Example: + +.. code-block:: yaml + + data_uncertainties: + - uncertainties.yaml + + variants: + name_of_the_variant: + data_uncertainties: + - uncertainties.yaml + - extra_uncertainties.yaml + another_variant: + data_central: different_data.yaml + data_uncertainties: + - different_uncertainties.yaml + +When loading this dataset with no variant only the ``uncertainties.yaml`` file will be read. +Instead, when choosing ``variant: name_of_the_variant``, both ``uncertainties.yaml`` and ``extra_uncertainties.yaml`` will be loaded. +If we select ``variant: another_variant`` both the ``data_uncertainties`` and the ``data_central`` keys will be substituted. 
+Note that if we want to substitute the default set of uncertainties we just need to not include it in the variant (as done in ``another_variant``). + +``theory`` +~~~~~~~~~~ + +The theory field defines how predictions for the dataset are to be computed. +It includes two entries: + +- ``FK_tables``: this is a list of lists which defines the FK Tables to be loaded. The outermost list contains the operands (in case an operation is needed to recover the observable, more on that below). The innermost lists contain the grids that are to be concatenated in order to form the operands. +- ``operation``: operation to be applied in order to compute the observable + +Example: + +.. code-block:: yaml + + theory: + FK_tables: + - - Z_contribution + - Wp_contribution + - Wm_total + - - total_xs + operation: 'ratio' + +In this case the ``fktables`` for the Z, W+ and W- contributions will be concatenated (the dataset might include predictions for all three contributions). +After that, the final observable will be computed by taking the ratio of the concatenation of all those observables and the total cross section (``total_xs``). + +``plotting`` +~~~~~~~~~~~~ + +The ``plotting`` section defines the plotting style inside ``validphys`` +and is described in detail in :ref:`plotting-format`. + +Note that the names of the variables need to be the same in the plotting and kinematics. 
diff --git a/doc/sphinx/source/data/data-config.rst b/doc/sphinx/source/data/data-config.rst index 86c6d65539..9e8c909fd2 100644 --- a/doc/sphinx/source/data/data-config.rst +++ b/doc/sphinx/source/data/data-config.rst @@ -22,25 +22,10 @@ located in the ``nnpdf`` git repository at ``validphys/src/validphys2/datafiles/commondata`` where a separate ``CommonData`` file is stored for each *Dataset* with the -filename format - - ``DATA_.dat`` - -Information on the treatment of systematic uncertainties, provided in -``SYSTYPE`` files, is located in the subdirectory - - ``commondata/systypes`` +filename format described in :ref:`dataset-naming-convention`. +The data is installed as part of the python package of ``nnpdf``, +all data files to be installed must have a ``.yaml`` extension. -Here several ``SYSTYPE`` files may be supplied for each *Dataset*. The -various options are enumerated by suffix to the filename. The filename format -for ``SYSTYPE`` files is therefore - - ``SYSTYPE__.dat`` - -Where the default systematic ID is **DEFAULT**. As an example, consider -the first ``SYSTYPE`` file for the D0ZRAP *Dataset*: - - ``SYSTYPE_D0ZRAP_DEFAULT.dat`` Theory lookup table =================== @@ -78,25 +63,20 @@ contains the following directory structure | ``theory_X/`` | ``-cfactor/`` - | ``-compound/`` | ``-fastkernel/`` Inside the directory ``theory_X/cfactor/`` are stored ``CFACTOR`` files with the filename format - ``CF__.dat`` + ``CF__.dat`` where ```` is a three-letter designation for the source of the C-factor -(e.g. EWK or QCD) and ```` is the typical *Dataset* designation. -The directory ``theory_X/compound/`` contains the ``COMPOUND`` files -described earlier, this time with the filename format - - ``FK_-COMPOUND.dat`` +(e.g. EWK or QCD) and ```` is the FK-Table to which it should be applied. 
Finally the ``FK`` tables themselves are stored in ``theory_X/fastkernel/`` with the filename format - ``FK_.dat`` + ``.pineappl.lz4`` Naturally, all of the FastKernel and C-factor files within the directory ``theory_X/`` have been determined with the theoretical parameters specified in diff --git a/doc/sphinx/source/data/dataset-naming-convention.rst b/doc/sphinx/source/data/dataset-naming-convention.rst index daed57803d..87c43c3278 100644 --- a/doc/sphinx/source/data/dataset-naming-convention.rst +++ b/doc/sphinx/source/data/dataset-naming-convention.rst @@ -1,3 +1,6 @@ +.. _dataset-naming-convention: + + ================================= NNPDF's dataset naming convention ================================= @@ -5,27 +8,21 @@ NNPDF's dataset naming convention Each dataset implemented in NNPDF must have a unique name, which is a string constructed following this [Backus–Naur form]:: - ::= "_" - | "_" "_" - | "_" "_" - | "_" "_" "_" - - ::= "ATLAS" | "BCDMS" | "CHORUS" | "CMS" | "E605" | "E866" - | "E906" | "EMC" | "HERA" | "LHCB" | "NMC" | "NNPDF" | "NUTEV" + ::= "_" "_" + | "_" "_" "_" - ::= "1JET" | "2JET" | "CC" | "DY" | "H" | "HVBF" | "INTEG" | "NC" - | "POS" | "TTB" | "WM" | "WMWP" | "WP" | "WPZ" | "ZPT" + ::= "_" - ::= TODO + ::= "ATLAS" | "BCDMS" | "CDF" | "CHORUS" | "CMS" | "D0" | "DYE605" | "DYE866" + | "DYE906" | "EMC" | "H1" | "HERA" | "LHCB" | "NMC" | "NNPDF" | "NUTEV" | "SLAC" + | "ZEUS" - ::= TODO + ::= "1JET" | "2JET" | "CC" | "DY" | "INTEG" | "NC" | "PH" | "POS" | "SINGLETOP" + | "TTBAR" | "WCHARM" | "WJ" | "WMWP" | "WP" | "WPWM" | "Z0" | "Z0J" - ::= "GEV" | "TEV" + ::= | "P" | "NOTFIXED" - ::= - | "_" - | "_" "_" - | "_" "_" "_" + ::= Experiments @@ -60,7 +57,7 @@ Processes - `1JET`: single-jet inclusive production - `2JET`: dijet production - `CC`: DIS charged-current -- `DY`: lepton-pair production (neutral current off-shell Drell–Yan) +- `Z0`: lepton-pair production (neutral current off-shell Drell–Yan) - `H`: on-shell Higgs-boson production - 
`HVBF`: production of an on-shell Higgs-boson with two jets (vector-boson fusion) @@ -69,7 +66,7 @@ Processes - `NC`: DIS neutral-current - `POS`: auxiliary dataset for positivity constraints; only valid for `NNPDF` experiment -- `TTB`: top–anti-top production +- `TTBAR`: top–anti-top production - `WM`: production of a single negatively-charged lepton (charged current off-shell Drell–Yan) - `WMWP`: production of two opposite-sign different flavor leptons (W-diboson @@ -77,7 +74,7 @@ Processes - `WP`: production of a single positively-charged lepton (charged current off-shell Drell–Yan) - `WPZ`: production of three leptons (WZ-diboson production) -- `ZPT`: production of two same-flavor opposite-sign leptons with non-zero +- `Z0PT`: production of two same-flavor opposite-sign leptons with non-zero total transverse momentum (Z-boson pt spectrum) `Backus–Naur form `_ diff --git a/doc/sphinx/source/data/example-fk-preamble.rst b/doc/sphinx/source/data/example-fk-preamble.rst deleted file mode 100644 index a5736644c3..0000000000 --- a/doc/sphinx/source/data/example-fk-preamble.rst +++ /dev/null @@ -1,212 +0,0 @@ -.. 
_example_fk_preamble: - -======================== -Example: ``FK`` preamble -======================== - -DIS preamble - BCDMSD -===================== - - | {GridDesc___________________________________________________ - | ------------------------------- - | FK_BCDMSD.dat - | ------------------------------- - | _VersionInfo________________________________________________ - | *APFEL: 2.6.1 - | *libnnpdf: 1.1.0b - | _GridInfo___________________________________________________ - | *HADRONIC: 0 - | *NDATA: 254 - | *NX: 50 - | *SETNAME: BCDMSD - | {FlavourMap_________________________________________________ - | 0 1 1 0 0 0 0 0 0 0 1 1 0 0 - | _TheoryInfo_________________________________________________ - | *DAMP: 1 - | *FNS: FONLL-C - | *GF: 1.16638e-05 - | *HQ: MSBAR - | *IC: 0 - | *MP: 0.938 - | *MW: 80.398 - | *MZ: 91.1876 - | *MaxNfAs: 5 - | *MaxNfPdf: 5 - | *ModEv: TRN - | *NfFF: 5 - | *PTO: 2 - | *Q0: 1 - | *QED: 0 - | *Qedref: 1.777 - | *Qmb: 4.18 - | *Qmc: 3 - | *Qmt: 162.7 - | *Qref: 91.2 - | *SIN2TW: 0.23126 - | *SxOrd: LL - | *SxRes: 0 - | *TMC: 1 - | *TheoryID: 7 - | *XIF: 1 - | *XIR: 1 - | *alphaqed: 0.00749625 - | *alphas: 0.118 - | *mb: 4.18 - | *mc: 0.986 - | *mt: 162.7 - | {xGrid______________________________________________________ - | 6.9265888619991195e-02 - | 7.7677574001058236e-02 - | 8.6760599033455912e-02 - | 9.6515727077269992e-02 - | 1.0693847246838524e-01 - | 1.1801962180968653e-01 - | 1.2974586013120251e-01 - | 1.4210045166737728e-01 - | 1.5506393063634324e-01 - | 1.6861476611854062e-01 - | 1.8272997502743873e-01 - | 1.9738566676226815e-01 - | 2.1255751145471796e-01 - | 2.2822113029454361e-01 - | 2.4435241115381084e-01 - | 2.6092775579054239e-01 - | 2.7792426659347097e-01 - | 2.9531988146590743e-01 - | 3.1309346535041777e-01 - | 3.3122486633420206e-01 - | 3.4969494345265562e-01 - | 3.6848557237516494e-01 - | 3.8757963421332448e-01 - | 4.0696099179998674e-01 - | 4.2661445698241623e-01 - | 4.4652575176931059e-01 - | 4.6668146557197077e-01 - | 
4.8706901027900074e-01 - | 5.0767657449372061e-01 - | 5.2849307792672917e-01 - | 5.4950812667484750e-01 - | 5.7071196990123374e-01 - | 5.9209545827198862e-01 - | 6.1365000437161166e-01 - | 6.3536754522794392e-01 - | 6.5724050700057512e-01 - | 6.7926177183385794e-01 - | 7.0142464683629069e-01 - | 7.2372283512038826e-01 - | 7.4615040881848282e-01 - | 7.6870178397770295e-01 - | 7.9137169723166323e-01 - | 8.1415518414141896e-01 - | 8.3704755910070550e-01 - | 8.6004439670038091e-01 - | 8.8314151445118372e-01 - | 9.0633495676848319e-01 - | 9.2962098012648797e-01 - | 9.5299603929602150e-01 - | 9.7645677458414570e-01 - | {FastKernel_________________________________________________ - -Hadronic preamble - CDFR2KT -=========================== - - | {GridDesc___________________________________________________ - | ----------------------------------------------------------- - | FK_CDFR2KT.dat - | ----------------------------------------------------------- - | _VersionInfo________________________________________________ - | *APFEL: 2.6.1 - | *libnnpdf: 1.1.0b - | {Readme_____________________________________________________ - | *********************************************************************** - | ExpName: CDFR2KT - | Author: FastNLO authors - | Date: 2010 - | CodesUsed: NLOjet++/FastNLO (scenario fnt2004 from FastNLO webpage) - | AdditionalInfo: incl. 
jets, kT algo D=0.7 - | *********************************************************************** - | _GridInfo___________________________________________________ - | *HADRONIC: 1 - | *NDATA: 76 - | *NX: 30 - | *SETNAME: CDFR2KT - | {FlavourMap_________________________________________________ - | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - | 0 1 1 1 0 0 1 0 0 0 1 1 0 0 - | 0 1 1 1 0 0 1 0 0 0 1 1 0 0 - | 0 1 1 1 0 1 1 0 0 0 0 1 0 0 - | 0 0 0 0 1 0 0 0 0 0 0 0 0 0 - | 0 0 0 1 0 1 1 0 0 0 1 0 0 0 - | 0 1 1 1 0 1 1 0 0 0 0 1 0 0 - | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - | 0 0 0 0 0 0 0 0 0 1 0 0 0 0 - | 0 1 1 0 0 1 0 0 0 0 1 1 0 0 - | 0 1 1 1 0 0 1 0 0 0 1 1 0 0 - | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - | _TheoryInfo_________________________________________________ - | *DAMP: 1 - | *FNS: FONLL-C - | *GF: 1.16638e-05 - | *HQ: MSBAR - | *IC: 0 - | *MP: 0.938 - | *MW: 80.398 - | *MZ: 91.1876 - | *MaxNfAs: 5 - | *MaxNfPdf: 5 - | *ModEv: TRN - | *NfFF: 5 - | *PTO: 2 - | *Q0: 1 - | *QED: 0 - | *Qedref: 1.777 - | *Qmb: 4.18 - | *Qmc: 3 - | *Qmt: 162.7 - | *Qref: 91.2 - | *SIN2TW: 0.23126 - | *SxOrd: LL - | *SxRes: 0 - | *TMC: 1 - | *TheoryID: 7 - | *XIF: 1 - | *XIR: 1 - | *alphaqed: 0.00749625 - | *alphas: 0.118 - | *mb: 4.18 - | *mc: 0.986 - | *mt: 162.7 - | {xGrid______________________________________________________ - | 4.0941945000024672e-03 - | 5.9356426849003037e-03 - | 8.5647477735742213e-03 - | 1.2278230204351056e-02 - | 1.7448602544560710e-02 - | 2.4515641282009264e-02 - | 3.3957625320032526e-02 - | 4.6241012256902900e-02 - | 6.1757804939792604e-02 - | 8.0769759935090835e-02 - | 1.0337895878919207e-01 - | 1.2953267418094364e-01 - | 1.5905525671030885e-01 - | 1.9169158055350388e-01 - | 2.2714813737177489e-01 - | 2.6512436628283159e-01 - | 3.0533281023729242e-01 - | 3.4750997595899380e-01 - | 3.9142071068612511e-01 - | 4.3685860760309952e-01 - | 4.8364426537988547e-01 - | 5.3162257521672562e-01 - | 5.8065972288573631e-01 - | 
6.3064027352226959e-01 - | 6.8146451295139832e-01 - | 7.3304610913825119e-01 - | 7.8531009886079706e-01 - | 8.3819117643580765e-01 - | 8.9163224991215573e-01 - | 9.4558322764065939e-01 - | {FastKernel_________________________________________________ diff --git a/doc/sphinx/source/data/exp-data-files.rst b/doc/sphinx/source/data/exp-data-files.rst deleted file mode 100644 index 016523c96a..0000000000 --- a/doc/sphinx/source/data/exp-data-files.rst +++ /dev/null @@ -1,245 +0,0 @@ -.. _exp_data_files: - -======================= -Experimental data files -======================= - -Data made available by experimental collaborations comes in a variety of -formats. For use in a fitting code, this data must be converted into a common -format that contains all the required information for use in PDF fitting. -Existing formats commonly used by the community, such as in `HepData `_, -are generally unsuitable. Principally as they often do not fully describe the -breakdown of systematic uncertainties. Therefore over several years an NNPDF -standard data format has been iteratively developed, now denoted -``CommonData``. In addition to the ``CommonData`` files themselves, in the -``nnpdf++`` project the user has the ability to vary the treatment of individual -systematic errors by use of parameter files denoted ``SYSTYPE`` files. In this -section we shall detail the specifications of these two files. - -In principle, the file specification and classes described in this section are -independent of the ``nnpdf++`` project and may be generated by whatever means -the user sees fit. In practice, the ``CommonData`` and ``SYSTYPE`` files -are generated by the ``buildmaster`` project of ``nnpdf++`` from the raw -experimental data files. - -.. _process_type_label: - -Process types and kinematics -============================ - -Before going into the file formats, we shall summarise the identifying features -used for data in the ``nnpdf++`` code. 
- -Each data point has an associated *process type* string. This can be -specified by the user, but **must** begin with the appropriate identifying -base process type. Additionally for each data point three kinematic values are -given, the *process type* being primarily to identify the nature of these -values. Typically the first kinematic variable is the principal differential -quantity used in the measurement. The second kinematic variable defines the -scale of the process. The third is generally the centre-of-mass energy of the -process, or inelasticity in the case of DIS. The allowed basic process types, -and their corresponding three kinematic variables are outlined below. - -* **DIS** - Deep inelastic scattering measurements: :math:`(x,Q^2,y)` -* **DYP** - Fixed-target Drell-Yan measurements: :math:`(y,M^2,\sqrt{s})` -* **JET** - Jet production: :math:`(\eta,p_T^2,\sqrt{s})` -* **DIJET** - Dijet production: :math:`(\eta,m_{12},\sqrt{s})` -* **PHT** - Photon production: :math:`(\eta_\gamma,E_{T,\gamma}^2,\sqrt{s})` -* **INC** - A total inclusive cross-section: :math:`(0,\mu^2,\sqrt{s})` -* **EWK\_RAP** - Collider electroweak rapidity distribution: :math:`(\eta/y,M^2,\sqrt{s})` -* **EWK\_PT** - Collider electroweak :math:`p_T` distribution: :math:`(p_T,M^2,\sqrt{s})` -* **EWK\_PTRAP** - Collider electroweak :math:`p_T, y` distribution: :math:`(\eta/y, p_T^2,\sqrt{s})` -* **EWK\_MLL** - Collider electroweak lepton-pair mass distribution: :math:`(M_{ll},M_{ll}^2,\sqrt{s})` -* **EWJ\_(J)RAP** - Collider electroweak + jet boson(jet) rapidity distribution: :math:`(\eta/y,M^2,\sqrt{s})` -* **EWJ\_(J)PT** - Collider electroweak + jet boson(jet) :math:`p_T` distribution: :math:`(p_T,M^2,\sqrt{s})` -* **EWJ\_(J)PTRAP** - Collider electroweak + jet boson(jet) :math:`p_T, y` distribution: :math:`(\eta/y, p_T^2,\sqrt{s})` -* **EWJ\_MLL** - Collider electroweak+jet lepton-pair mass distribution: :math:`(M_{ll},M_{ll}^2,\sqrt{s})` -* **HQP\_YQQ** - Heavy diquark system 
rapidity :math:`(y^{QQ},\mu^2,\sqrt{s})` -* **HQP\_MQQ** - Heavy diquark system mass :math:`(M^{QQ},\mu^2,\sqrt{s})` -* **HQP\_PTQQ** - Heavy diquark system :math:`p_T` :math:`(p_T^{QQ},\mu^2,\sqrt{s})` -* **HQP\_YQ** - Heavy quark rapidity :math:`(y^Q,\mu^2,\sqrt{s})` -* **HQP\_PTQ** - Heavy quark :math:`p_T` :math:`(p_T^Q,\mu^2,\sqrt{s})` -* **HIG\_RAP** - Higgs boson rapidity distribution :math:`(y,M_H^2,\sqrt{s})` - -As examples of *process type* strings, consider **EWK\_RAP** for a -collider :math:`W` boson asymmetry measurement binned in rapidity, and -**DIS\_F2P** for the :math:`F_2^p` structure function in DIS. The user is free to -choose something identifying for the second segment of the process type, the -important feature being the basic process type. However, users are encouraged to -only use this freedom when absolutely necessary (such as when used in -combination with APFEL). - -One special case is that of :math:`W` boson lepton asymmetry measurements, which being -cross-section asymmetries may occasionally have negative data points. Therefore -asymmetry measurements must have the final tag **ASY** to ensure that -artificial data generation permits negative data values. An example -*process type* string would be **EWK\_RAP\_ASY**. - -Notes for the future --------------------- - -In the future it would be nice to have a more flexible treatment of the -kinematic variables, both in their number and labelling. - -``CommonData`` file format -============================== - -Each experimental *Dataset* has its own ``CommonData`` file. -``CommonData`` files contain the bulk of the experimental information used in the -``nnpdf++`` project, with the only other experimental data files controlling -the treatment and correlation of systematic errors. Each ``CommonData`` file -is a plaintext file whose layout is described in the following. 
- -The first line begins with the *Dataset* name, the number of systematic -errors, and the number of data points in the set, whitespace separated. For -example, for the ATLAS 2010 jet measurement the first line of the file reads: - - ATLASR04JETS36PB 91 90 - -Which demonstrates that the set *name* is 'ATLASR04JETS36PB', that there -are 91 sources of systematic uncertainty, 90 data points, one associated ``FK`` -table, and that the ``FK`` table corresponds to a proton initial state. As -another example, consider the NMCPD *Dataset*: - - NMCPD 5 211 - -Here there are 5 sources of systematic uncertainty and 211 data points. -Following this, each line specifies the details of a single data point. The first -value being the data point index :math:`1< i_{\text{dat}} \leq N_{\mathrm{dat}}`, -followed by the *process type* string as outlined above, and the three -kinematic variables in order. These are followed by the value of the -experimental data point itself, and the value of the statistical uncertainty -associated with it (absolute value). Finally the systematic uncertainties are -specified. The layout per data point is therefore - - :math:`i_{\mathrm{dat}}` *ProcessType* :math:`\text{kin}_1 \text{kin}_2 \text{kin}_3` data\_value stat\_error :math:`[..` systematics :math:`..]` - -For example, in the case of a DIS data point from the BCDMSD *Dataset*: - - 1 DIS\_F2D 7.0e-02 8.75e+00 5.666e-01 3.6575e-01 6.43e-03 :math:`[..` systematics :math:`..]` - -In these lines the systematic uncertainties are laid out as so. For each -uncertainty, additive and multiplicative versions are given. The additive -uncertainty is given by absolute value, and the multiplicative as a percentage -of the data value (that is, relative error multiplied by 100). The systematics -string is formed by the sequence of :math:`N_{\text{sys}}` pairs of systematic -uncertainties: - - :math:`[..` systematics :math:`..] 
= \sigma^{\mathrm{add}}_0 \quad \sigma^{\mathrm{mul}}_0\quad \sigma^{\mathrm{add}}_1 \quad \sigma^{\mathrm{mul}}_1 \quad....\quad \sigma^{\mathrm{add}}_n \quad\sigma^{\mathrm{mul}}_n` - -where :math:`\sigma^{\mathrm{add}}_i` and :math:`\sigma^{\mathrm{mul}}_i` are the additive -and multiplicative versions respectively of the systematic uncertainty arising -from the :math:`i\text{th}` source. While it may seem at first that the multiplicative error -is spurious given the presence of the additive error and data central value, -this may not be the case. For example, in a closure test scenario, the data -central values may have been replaced in the ``CommonData`` file by -theoretical predictions. Therefore if you wish to use a covariance matrix -generated with the original multiplicative uncertainties via the :math:`t_0` method, -you must also store the original multiplicative (percentage) error. For -flexibility and ease of I/O this is therefore done in the ``CommonData`` file -itself. - -For a *Dataset* with :math:`N_{\text{dat}}` data points and :math:`N_{\text{sys}}` -sources of systematic uncertainty, the total ``CommonData`` file should -therefore be :math:`N_{\text{dat}}+1` lines long. Its first line contains the set -parameters, and every subsequent line should consist of the description of a -single data point. Each data point line should therefore contain :math:`7 + -2N_{\text{sys}}` columns. - -``SYSTYPE`` file format -======================= - -The explicit presentation of the systematic uncertainties in the -``CommonData`` file allows for a great deal of flexibility in the treatment of -these errors. Specifically, whether they should be treated as additive or -multiplicative uncertainties, and how they are correlated, both within the -*Dataset* and within a larger *Experiment*. A specification for how -the systematic uncertainties should be treated is provided by a ``SYSTYPE`` -file. 
As there is not always an unambiguous method for the treatment of these -uncertainties, these information is kept outside the (unambiguous) -``CommonData`` file. Several options for this treatment are often provided in the -form of multiple ``SYSTYPE`` files which may be selected between in the fit. - -Each ``SYSTYPE`` file begins with a line specifying the total number of -systematics. Naturally this must match with the :math:`N_{\text{sys}}` variable -specified in the associated ``CommonData`` file. This is presented as a single -integer. For example, in the case of the BCDMSD ``SYSTYPE`` files, the first line is - - 8 - -as there are :math:`N_{\text{sys}}=8` sources of systematic uncertainty for this -*Dataset*. Following this line there are :math:`N_{\text{sys}}` lines describing each -source of systematic uncertainty. For each source two parameters are provided, -the *uncertainty treatment* and the *uncertainty description*. These -are laid out for each systematic as: - - :math:`i_{\text{sys}}` [*uncertainty treatment*] [*uncertainty description*] - -where :math:`1< i_{\text{sys}} \leq N_{\mathrm{sys}}` enumerates each systematic. The -*uncertainty treatment* determines whether the uncertainty should be -treated as additive, multiplicative, or in cases where the choice is unclear, as -randomised on a replica by replica basis. These choices are selected by using -the strings **ADD**, **MULT**, or **RAND**. The *uncertainty -description* specifies how the systematic is to be correlated with other -data points. There are three special cases for the *uncertainty -description*, specified by the strings **CORR**, **UNCORR**, -**THEORYCORR**, **THEORYUNCORR** and **SKIP**. The first two -specify whether the systematic is fully correlated **only** within the -*Dataset* (**CORR**), or whether the systematic is totally -uncorrelated (**UNCORR**). 
The **THEORY** descriptor is used to -describe theoretical systematics due to e.g missing NNLO corrections, which are -treated as either **CORR** or **UNCORR** according to their suffix, -but are not included in the generation of artificial replicas (their only -contribution is to the fitting error function). If the user wishes to correlate -a specific uncertainty between multiple *Datasets* within an -*Experiment*, then they should use a custom *uncertainty description*. -When building a covariance matrix for an *Experiment*, the ``nnpdf++`` -code checks for matches between the *uncertainty descriptions* of -systematics of its constituent *Datasets*. If a match is found, the code -will correlate those systematics over the relevant datasets. The **SKIP** -descriptor removes the systematic from the covariance matrices for debugging -purposes. - -As an example, let us consider an NNPDF2.3 standard ``SYSTYPE`` for the BCDMSD -*Dataset*: - - | 8 - | 1 ADD BCDMSFB - | 2 ADD BCDMSFS - | 3 ADD BCDMSFR - | 4 MULT BCDMSNORM - | 5 MULT BCDMSRELNORMTARGET - | 6 MULT CORR - | 7 MULT CORR - | 8 MULT CORR - -Here the first five systematics have custom *uncertainty descriptions*, -thereby allowing them to be cross-correlated with other *Datasets* in a -larger *Experiment*. Systematics six to eight are specified as being fully -correlated, but only within the BCDMSD *Dataset*. Additionally note that -the first three systematics are specified as additive, and the remainder are -multiplicative. 
If we compare now to the equivalent ``SYSTYPE`` file for the -BCDMSP *Dataset*: - - | 11 - | 1 ADD BCDMSFB - | 2 ADD BCDMSFS - | 3 ADD BCDMSFR - | 4 MULT BCDMSNORM - | 5 MULT BCDMSRELNORMTARGET - | 6 MULT CORR - | 7 MULT CORR - | 8 MULT CORR - | 9 MULT CORR - | 10 MULT CORR - | 11 MULT CORR - -it is clear that the first five systematics are the same as in the BCDMSD -*Dataset*, and therefore should the two sets be combined into a common -*Experiment*, the code will cross-correlate them appropriately. The -combination of ``SYSTYPE`` and ``CommonData`` is quite flexible. As stated -previously, once generated from the original raw experimental data, the -``CommonData`` file is fixed and should not be altered apart from for the purpose -of correcting errors. In practice the full details on the systematic correlation -and their treatment is often not precisely specified. This system allows for the -safe variation of these parameters for testing purposes. diff --git a/doc/sphinx/source/data/fk-config-variables.rst b/doc/sphinx/source/data/fk-config-variables.rst deleted file mode 100644 index 142ba92b19..0000000000 --- a/doc/sphinx/source/data/fk-config-variables.rst +++ /dev/null @@ -1,18 +0,0 @@ -.. _fk_config_variables: - -============================== -``FK`` configuration variables -============================== - -Table specifying the required elements of the GridInfo ``FK`` header -segment. The Key column specifies the exact format of the Key in the K-V pair -used in the GridInfo segment. 
- -======== ======= ====================== ================================== -Key Type Description Comments -======== ======= ====================== ================================== -SETNAME String *SetName* N/A -HADRONIC Boolean Hadronic flag 0 or 1 -NDATA Integer :math:`N_{\text{dat}}` Number of data points -NX Integer :math:`N_x` Number of :math:`x`-points in grid -======== ======= ====================== ================================== diff --git a/doc/sphinx/source/data/index.rst b/doc/sphinx/source/data/index.rst index d18ed4b31d..3483436b5f 100644 --- a/doc/sphinx/source/data/index.rst +++ b/doc/sphinx/source/data/index.rst @@ -8,10 +8,9 @@ namely data files and the corresponding files containing theoretical predictions :maxdepth: 1 ./intro - ./exp-data-files + ./commondata + ./dataset-naming-convention ./th-data-files ./data-config - ./fk-config-variables ./example-cfactor-file - ./example-fk-preamble - ./plotting_format + ./plotting-format diff --git a/doc/sphinx/source/data/intro.rst b/doc/sphinx/source/data/intro.rst index 30428a450c..7fdddc0a6f 100644 --- a/doc/sphinx/source/data/intro.rst +++ b/doc/sphinx/source/data/intro.rst @@ -2,19 +2,20 @@ Introduction ============ -In the ``nnpdf++`` project, data files used by the code may be grouped into +In the ``nnpdf`` project, data files used by the code may be grouped into two categories, theory and experiment. Experimental data and the information -pertaining to the treatment of systematic errors are held in ``CommonData`` -and ``SYSTYPE`` files. ``FK`` tables, ``COMPOUND`` and ``CFACTOR`` files +pertaining to the treatment of systematic errors are held in the ``CommonData`` files. +``FK`` tables, and ``CFACTOR`` files store the precomputed information for use when calculating theoretical -predictions corresponding to information held in the equivalent ``CommonData`` -file. 
In this section the file formats and naming conventions for these files +predictions corresponding to information held in the equivalent ``CommonData``. +In this section the file formats and naming conventions for these files will be detailed, along with the directory structure employed by the -``nnpdf++`` code. +``nnpdf`` code. -For NNPDF3.1 and later fits, a considerably larger number of theory options will -be explored than in previous determinations. In NNPDF3.0 the main theory -variations used were perturbative order, value of the strong coupling and the +For NNPDF4.0 and later fits, a considerably larger number of theory options will +be explored than in previous determinations. +The current theory documentation only refers to 4.0 and previous fits and is thus outdated. +In NNPDF3.0 the main theory variations used were perturbative order, value of the strong coupling and the number of active flavours in the VFNS. For NNPDF3.1 and later, it has been necessary to accommodate variations in additional parameters, such as treatments of the heavy quark mass (pole vs MS-bar), scale variations, intrinsic charm, resummation @@ -24,9 +25,9 @@ here. This section will begin by detailing the specifications for the file formats used by the code, first with the experimental data file formats and layouts in -:ref:`exp_data_files` and secondly with the file formats used for +:ref:`commondata` and secondly with the file formats used for theoretical predictions in :ref:`th_data_files`. Finally the organisation of -these files within the ``nnpdf++`` structure will be described in +these files within the ``nnpdf`` structure will be described in :ref:`org_data_files`. Important definitions @@ -39,10 +40,10 @@ terminological points to note. ------------------------- When referring to a collection of data points two words are used in the -``nnpdf++`` code which have specific meanings. *Dataset* refers to the result +``nnpdf`` code which have specific meanings. 
*Dataset* refers to the result of a specific measurement, typically associated with a single experimental paper -and corresponds to the *DataSet* class in the ``nnpdf++`` code. -*Experiment* refers to a collection of *Datasets* which are associated +and corresponds to the *DataSet* class in the ``nnpdf`` code. +*Experiment* refers to a collection of *Datasets* which might be associated by experimental cross-correlations. For example, the ATLAS 2010 R=0.4 inclusive jet measurement and the ATLAS 2011 high-mass Drell-Yan measurement are both examples of *Datasets* as used in the NNPDF3.0 analysis. Both of these @@ -50,19 +51,8 @@ datasets are grouped into the ATLAS *Experiment* as they have systematic uncertainties that are cross-correlated with each other. In this document, when using these terms in this sense, they will be italicised for clarity. -Note however that the concept of an *Experiment* is being phased out in the NNPDF -code. For more information on this see :ref:`data_specification`. - -*Dataset* and *Experiment* names --------------------------------- - -When referred to, the *Dataset* and *Experiment* names refer to the -short identifying string used in the code for each *Dataset* and -*Experiment*. For example, the *Dataset* name for the aforementioned -ATLAS 2010 inclusive jet measurement with R=0.4 is ATLASR04JETS36PB. - -New dataset naming conventions ------------------------------- +Dataset naming conventions +-------------------------- See :ref:`dataset_naming_convention` for a definition of how datasets should be named. 
diff --git a/doc/sphinx/source/data/new-commondata.rst b/doc/sphinx/source/data/new-commondata.rst deleted file mode 100644 index ef549a0c9d..0000000000 --- a/doc/sphinx/source/data/new-commondata.rst +++ /dev/null @@ -1,234 +0,0 @@ -Naming convention and organization of the datasets --------------------------------------------------- - -All datasets in the new data format follow the exact same naming convention:: - - __{_}_ - -The data is contained in folders, each folder containing one single hepdata publication. -In all cases one can reconstruct the name of the folder by separating the observable name on the last ``_``, i.e., the folder will always be named:: - - __{_} - -Where all observables contained in one hepdata entry are separated by their observable name. - -Each folder will contain one single metadata file named ``metadata.yaml`` which defines all observables implemented for a given dataset. - -In order to keep backward compatibility and ease the comparison between new and old commondata, the ``buildmaster/dataset_names.yml`` file keeps a mapping of the datasets implemented in both formats. -When a ``legacy`` variant is available, the usage of the old name automatically enables such variants. The format of this mapping is as follow (which enables using variants): - -.. code-block:: yaml - - old_name_1: new_name_1 - old_name_2: - dataset: new_name_2 - variant: this_particular_variant - - -Metadata Format ---------------- - -This ``metadata.yaml`` file contains a first portion of general information which might be shared by several sets and a list of ``implemented_observables`` which define the separate observables. - - -.. 
code-block:: yaml - - setname: "EXPERIMENT_PROCESS_ENERGY{_EXTRA}" - - version: 1 - version_comment: "Initial implementation" - - # References - arXiv: - url: "" - iNSPIRE: - url: "https://inspirehep.net/literature/302822" - hepdata: - url: "https://www.hepdata.net/record/ins302822" - version: 1 - - nnpdf_metadata: - nnpdf31_process: "PROCESS" - experiment: "EXPERIMENT_NAME" - - implemented_observables: - - observable_name: "OBS" - observable: - description: "Description of the observable" - label: "Latex label for the observable" - units: "[u]" - ndata: n_of_datapoints - tables: [n, j, k] # (optional) corresponding tables in the hepdata entry - npoints: [n, j, k] # (optional) number of points per table - process_type: INC # for instance, INC, JET, DIJET, etc - - # Plotting information (for instance, the kinematics variable could be pt, mt, q2) - plotting: - dataset_label: "Label to be used in reports" - kinematics_override: identity - x_scale: log - plot_x: var_1 - figure_by: - - var_2 - - kinematic_coverage: [var_1, var_2, var_3] - - kinematics: - variables: - var_1: {description: "Description of var", label: "latex", units: "u"} - var_2: {description: "Description of var", label: "latex", units: "u"} - var_3: {description: "Description of var", label: "latex", units: "u"} - file: kinematics.yaml - - data_central: data.yaml - data_uncertainties: - - uncertainties.yaml - - uncertainties_2.yaml - - # Having variants is optional - # variants can overwrite the data_uncertainties - variants: - different_errors: - data_uncertainties: - - uncertainties.yaml - - uncertainties_3.yaml - - # The theory field is always optional - theory: - FK_tables: - - - DYE605 - operation: 'null' - - - - -Versioning -~~~~~~~~~~ - -The initial version of a dataset should be set to ``version: 1``. -Any change on a dataset should be *always* accompanied of a version bump and a ``version_comment`` explaining the update. 
-This will allow to keep an exact tracking of all changes to every dataset even if they change over time. - -Variants -~~~~~~~~ - -In some occasions we might want to maintain two variations of the same observable. -For instance, we might have two incompatible sources of uncertainties. In such case a variant can be added. -The syntax of the ``variants`` is. - -Theory -~~~~~~ - -The theory field defines how predictions for the dataset are to be computed. -It includes two entries: - -- ``FK_tables``: this is a list of lists which defines the FK Tables to be loaded. The outermost list are the operands (in case an operation is needed to recover the observable, more on that below). The innermost list are the grids that are to be concatenated in order to form the operands. -- ``operaton``: operation to be applied in order to compute the observable - -Example: - -.. code-block:: yaml - theory: - FK_tables: - - - Z_contribution - - Wp_contribution - - Wm_total - - - total_xs - operation: 'ratio' - -In this case the ``fktables`` for the Z, W+ and W- contributions will be concatenated (the dataset might include predictions for all three contributions). -After that, the final observable will be computed by taking the ratio of the concatenation of all those observables and the total cross section (``total_xs``). - - -.. code-block:: yaml - - data_uncertainties: - - uncertainties.yaml - - variants: - name_of_the_variant: - data_uncertainties: - - uncertainties.yaml - - extra_uncertainties.yaml - another_variant: - data_uncertainties: - - different_uncertainties.yaml - - -When loading this dataset with no variant only the ``uncertainties.yaml`` file will be read. -Instead, when choosing ``variant: name_of_the_variant``, both ``uncertainties.yaml`` and ``extra_uncertainties.yaml`` will be loaded. -Note that if we want to substitute the default set of uncertainties we just need to not include it in the variant (as done in ``another_variant``). 
- - -Data ----- - -The format of the data is a ``yaml`` file with an entry ```data_central``` which is a list for all values for all bins. - -.. code-block:: yaml - - data_central: - - val1 - - val2 - - val3 - -Uncertainties -------------- - -The uncertainties are (also) ``.yaml`` files. -Note that in the ``metadata.yaml`` the ``data_uncertainties`` entry is given as a list. -When using more than one uncertainty file they will be concatenated. -This allows the user the flexibility of creating variants where only a subset of the uncertainties are modified. - -The format of the uncertainty files is of two fields, a ``definitions`` field that contains metadata about all the uncertainties (their name, their treatment (``ADD`` or ``MULT``) and their type) and a second field ``bins`` which is a list of mappings with as many entries as the `data_central` with the named uncertainties. - -Note that, regardless of their treatment type, the uncertainties should always be written as absolute values and not relative to the data values. - -.. code-block:: yaml - - definitions: - stat: - description: - treatment: - type: - error_name: - description: - treatment: - type: - error_name_2: - description: - treatment: - type: - bins: - - stat: - error_name: - error_name_2: - -Kinematics: ------------ -The kinematics file follow a convention very similar to the uncertainties file, where the ``definitions`` field is skipped since that information is already contained in the parent ``metadata.yaml`` file. - -Therefore, we have a list of ``bins`` (of the same size as the list for `data_central`) and for each entry we have the information of all the variables. - -.. code-block:: yaml - - bins: - - var_1: - min: 0 - max: 1 - mid: 0.5 - var_2: - min: 0 - max: 1 - mid: 0.5 - -Plotting -~~~~~~~~ - -The ``plotting`` section defines the plotting style inside ``validphys``. -In previous implementations there were per-process options that defined plotting options for family of processes. 
-In the commondata format defined in this page every plotting option must be defined in the ``plotting`` section of each observable. - -Internally within ``validphys`` only 3 kinematic variables are taken into account. The 3 selected variables (and their order) is defined by ``plotting::kinematic_coverage``. - -The name of the variables (which in this example are `var_1`, `var_2`, `var_3`) need to be the same in the plotting and kinematics. diff --git a/doc/sphinx/source/data/plotting-format.rst b/doc/sphinx/source/data/plotting-format.rst new file mode 100644 index 0000000000..62e31c4df0 --- /dev/null +++ b/doc/sphinx/source/data/plotting-format.rst @@ -0,0 +1,268 @@ +.. _plotting-format: + +=============== +Plotting format +=============== + +The ``plotting`` dictionary within the metadata of a dataset +defines a set of options that are used for analysis +and representation purposes, particularly to determine how datasets +should be represented in plots. + +.. warning:: the information in this page is not up to date + +Format +====== + +The plotting file specifies the variable in which the data +is to be plotted (in the *x* axis) as well as the variables +in which the data will be split in different lines in the +same figure or in different figures. The possible variables +('*kinematic labels*') are described below. + +The format also allows the control of several plotting properties, such +as whether to use log scale, or the axes labels. + +Kinematic labels +================ + +.. note:: very outdated information that only applies to legacy data + +When a dataset has been ported from the old implementation and thus +it has no well defined kinematic variables (but instead just k1, k2, k3) +the default kinematic variables are inferred from the *process type* +declared in the commondata files (more specifically from +a substring). Currently they are: + +.. 
code-block:: python + + 'DIS': ('$x$', '$Q^2 (GeV^2)$', '$y$'), + 'DYP': ('$y$', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWJ_JPT': ('$p_T (GeV)$', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWJ_JRAP': ('$\\eta/y$', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWJ_MLL': ('$M_{ll} (GeV)$', '$M_{ll}^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWJ_PT': ('$p_T (GeV)$', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWJ_PTRAP': ('$\\eta/y$', '$p_T^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWJ_RAP': ('$\\eta/y$', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWK_MLL': ('$M_{ll} (GeV)$', '$M_{ll}^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWK_PT': ('$p_T$ (GeV)', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWK_PTRAP': ('$\\eta/y$', '$p_T^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'EWK_RAP': ('$\\eta/y$', '$M^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'HIG_RAP': ('$y$', '$M_H^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'HQP_MQQ': ('$M^{QQ} (GeV)$', '$\\mu^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'HQP_PTQ': ('$p_T^Q (GeV)$', '$\\mu^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'HQP_PTQQ': ('$p_T^{QQ} (GeV)$', '$\\mu^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'HQP_YQ': ('$y^Q$', '$\\mu^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'HQP_YQQ': ('$y^{QQ} (GeV)$', '$\\mu^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'INC': ('$0$', '$\\mu^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'JET': ('$\\eta$', '$p_T^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'PHT': ('$\\eta_\\gamma$', '$E_{T,\\gamma}^2 (GeV^2)$', '$\\sqrt{s} (GeV)$'), + 'SIA': ('$z$', '$Q^2 (GeV^2)$', '$y$') + + +The three kinematic variables are referred to as `k1`, `k2` and `k3` +in the plotting files. For example, for DIS processes, `k1` refers to `x`, +`k2` to `Q^2`, and `k3` to `y`. + +These kinematic values can be overridden by some transformation of +them. For that purpose, it is possible to define +a `kinematics_override` key.
The value must be a class defined +in: `validphys2/src/validphys/plotoptions/kintransforms.py` + +The class must have a `__call__` method that takes three parameters: +`(k1, k2, k3)` as defined in the dataset implementation, and returns +three new values `(k1', k2', k3')` which are the "transformed" +kinematical variables, which will be used for plotting purposes every +time the kinematic variables `k1`, `k2` and `k3` are referred to. +Additionally, the class must implement a `new_labels` method, that +takes the old labels and returns the new ones, and an `xq2map` +function that takes the kinematic variables and returns a tuple of (x, +Q²) with some approximate values. An example of such a transform is: + +.. code-block:: python + + class dis_sqrt_scale: + def __call__(self, k1, k2, k3): + ecm = sqrt(k2/(k1*k3)) + return k1, sqrt(k2), ceil(ecm) + + def new_labels(self, *old_labels): + return ('$x$', '$Q$ (GeV)', r'$\sqrt{s} (GeV)$') + + def xq2map(self, k1, k2, k3, **extra_labels): + return k1, k2*k2 + + +Additional labels +================= +Additional labels can be specified by declaring an **extra_labels** +key in the plotting file, and specifying for each new label a value +for each point in the dataset. + +For example: + +.. code-block:: yaml + + extra_labels: + idat2bin: [0, 0, 0, 0, 0, 0, 0, 0, 100, 100, 100, 100, 100, 200, 200, 200, 300, 300, 300, 400, 400, 400, 500, 500, 600, 600, 700, 700, 800, 800, 900, 1000, 1000, 1100] + +defines one label where the values for each of the datapoints are +given in the list. Note that the name of the extra_label (in this case +`idat2bin`) is completely arbitrary, and will be used for plotting +purposes (LaTeX math syntax is allowed as well). However, adding labels +manually for each point can be tedious. This should only be reserved +for information that cannot be recovered from the kinematics as +defined in the CommonData file.
Instead, new labels can be generated +programmatically: every function defined in `validphys2/src/validphys/plotoptions/labelers.py` +is a valid label. These functions take as keyword arguments the +(possibly transformed) kinematical variables, as well as any extra +label declared in the plotting file. For example, one might declare: + +.. code-block:: python + + def high_xq(k1, k2, k3, **kwargs): + return (k1 > 1e-2) & (k2 > 1000) + + +Note that it is convenient to always declare the `**kwargs` +parameter so that the code doesn't crash when the function is called +with extra arguments. Similarly to the kinematics transforms, it is +possible to decorate them with a `@label` describing a nicer LaTeX +label than the function name. For example: + +.. code-block:: python + + @label(r"$I(x>10^{-2})\times I(Q > 1000 GeV)$") + def high_xq(k1, k2, k3, **kwargs): + return (k1 > 1e-2) & (k2 > 1000) + + +Plotting and grouping +===================== + +The variable in which the data is plotted is simply +declared as + +.. code-block:: yaml + + x: