Analysis JSON schema#878

Closed
mhabedan wants to merge 15 commits into
HEPData:mainfrom
mhabedan:analysisSchema

Conversation

@mhabedan
Collaborator

@mhabedan mhabedan commented May 2, 2025

Following up on a discussion of the OpenMAPP project, this PR adds a JSON schema to HEPData that defines the format of the input "analysis" JSONs, currently used by Rivet, MadAnalysis5, SModelS, CheckMATE, and Combine.

An example JSON file for a tool would then look like this:

{
    "tool": "SModelS",
    "version": "3.0.0",
    "url_templates": {
        "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{path}",
        "val_url": "https://smodels.github.io/docs/Validation#{name}_ul"
    },
    "analyses" : [
      {
        "inspire_id": 1795076,
        "signature_type": "prompt",
        "pretty_name": "di-top resonance",
        "implementations": [
          {
            "name" : "ATLAS-EXOT-2018-48",
            "path": "13TeV/ATLAS/{name}/"
          }
        ]
      }
    ],
    "implementations_license": {
        "name": "cc-by-4.0",
        "url": "https://creativecommons.org/licenses/by/4.0"
    }
}
Advantages of this JSON schema over the currently used format:
  • Self-descriptiveness: the new JSON format includes information about the tool and the tool version it's valid for. It also allows tools to include very rough human-readable information (instead of just bare identifiers).
  • Redundancy reduction: the new JSON format allows URLs to be codified such that the URL stem doesn't have to be repeated. This makes it more compact, more human-readable and easier to maintain.
  • Standardisation: If we agree on one format now for everyone, HEPData won't need to handle the JSON interfaces of the tools on a case-by-case basis any more.
  • Opportunity: With OpenMAPP, we now have the person power and the mandate to establish this kind of standard. So we should aim to foresee future needs now and establish a format that we don't have to change any time soon.
Changes to the version discussed here:
  • Renamed templates -> url_templates and used snake case throughout.
  • Changed type of "analyses" to be a list instead of a dictionary. That makes it possible to spell out explicitly that 1795076 is the inspire ID instead of an unexplained dictionary key.
  • Added a field "implementations" that is a list of name-path pairs. I assume those entries are always paired and that we do not want X different names and Y different paths that can be mixed-and-matched?
  • Added an optional field "implementations_license" to give a license if it's supposed to be something different from CC0.
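
One way a consumer might expand the URL templates from the example above is sketched here. This is only an illustration under assumptions: the helper name build_url is hypothetical, and the substitution order (first {name} into "path", then {path} and {name} into the template) is a guess at the intended semantics, not HEPData's actual parsing code.

```python
def build_url(template, implementation):
    """Expand one URL template for one implementation entry (hypothetical helper)."""
    name = implementation["name"]
    # Assumption: "path" may be absent; fall back to an empty string in that case.
    path = implementation.get("path", "").replace("{name}", name)
    # Plain string replacement is used instead of str.format so that any other
    # braces in the template are left untouched.
    return template.replace("{path}", path).replace("{name}", name)


url_templates = {
    "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{path}",
    "val_url": "https://smodels.github.io/docs/Validation#{name}_ul",
}
implementation = {"name": "ATLAS-EXOT-2018-48", "path": "13TeV/ATLAS/{name}/"}

main_url = build_url(url_templates["main_url"], implementation)
val_url = build_url(url_templates["val_url"], implementation)
print(main_url)
print(val_url)
```

With this reading, the URL stem appears once in "url_templates" and each implementation only carries the part that differs.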

Using a JSON schema then also has the advantage that everyone can validate their JSON files against the schema following steps similar to this script.

Dear authors of Rivet, MadAnalysis5, SModelS, CheckMATE, and Combine: Does this JSON format work for your tools?
@GraemeWatt: Any further comments from HEPData's perspective?

Comment thread hepdata/templates/analysis_schema.json
@lenzip

lenzip commented May 5, 2025

Hello,
This is Piergiulio Lenzi (@lenzip) for Combine.
In the case of CMS Combine cards the url pointing to the cards is a doi.
How would one implement it here? Like with
"main_url": "https://doi.org/"
And then
"path": "10.17181/c2948-e8875"
Would we need to split this in such a way that both name and path are provided? I believe so.

The current json:

{
    "2705044": [
        "10.17181/z0382-yz736"
    ]
}

would then become, to a minimum:

{
    "tool": "Combine",
    "version": "10.2.X",
    "url_templates": {
        "main_url": "https://doi.org/",
    },
    "analyses" : [
      {
        "inspire_id": 2705044,
        "signature_type": "???????",
        "pretty_name": "Search for supersymmetry in final states with disappearing tracks",
        "implementations": [
          {
            "name" : "c2948-e8875",
            "path": "10.17181/{name}/"
          }
        ]
      }
    ],
    "implementations_license": {
        "name": "cc-by-4.0",
        "url": "https://creativecommons.org/licenses/by/4.0"
    }
}

If this is valid json according to the schema, I think we are fine, and it is not difficult to change from our side.
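
For illustration, the conversion from the current layout to the proposed one could be scripted roughly as follows. This is a sketch under assumptions: the function name convert_legacy is hypothetical, only the fields common to both layouts are filled in, and the "{name}" placeholder in the main_url is an assumption about how the DOI would be appended. Whether this subset of fields satisfies the schema is not checked here.

```python
import json


def convert_legacy(legacy, tool, version, main_url):
    """Convert {"<inspire_id>": ["<name>", ...]} into the proposed layout."""
    analyses = []
    for inspire_id, names in legacy.items():
        analyses.append({
            # The old format uses the inspire ID as a string key.
            "inspire_id": int(inspire_id),
            "implementations": [{"name": name} for name in names],
        })
    return {
        "tool": tool,
        "version": version,
        "url_templates": {"main_url": main_url},
        "analyses": analyses,
    }


legacy = {"2705044": ["10.17181/z0382-yz736"]}
# Assumption: the DOI is substituted for {name} in the template.
converted = convert_legacy(legacy, "Combine", "10.2.X", "https://doi.org/{name}")
print(json.dumps(converted, indent=2))
```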

@mhabedan
Collaborator Author

mhabedan commented May 5, 2025

Hi @lenzip,

Thanks for the quick feedback! Indeed, that would be a valid JSON (apart from a superfluous comma in the "main_url" line). The "path" field in the "implementations" wouldn't even have to contain the "{name}" bit. So you could either use

          {
            "name" : "<some name used for the analysis within Combine>",
            "path": "10.17181/c2948-e8875"
          }

or, as the "path" field is not mandatory,

          {
            "name" : "10.17181/c2948-e8875"
          }

Similarly, not all fields you mentioned would have to be used (see here for the fields that are required). A minimal version could be

{
  "tool": "Combine",
  "version": "10.2.X",
  "url_templates": {
      "main_url": "https://doi.org/"
  },
  "analyses" : [
    {
      "inspire_id": 2705044,
      "implementations": [
        {
          "name" : "10.17181/c2948-e8875"
        }
      ]
    }
  ]
}

if that makes more sense for you.

But overall: Great to hear that this would work for you! The "implementations" field would be a bit of a misnomer though if the format should also work for Combine. Don't really have a better name at the moment though. "codes" maybe?

@lenzip

lenzip commented May 5, 2025

Hello @mhabedan ,
Sorry for jumping late into the discussion, but for my understanding, what is the use case for more than one implementation of an analysis within the same tool?
Thanks
Giulio

@mhabedan
Collaborator Author

mhabedan commented May 5, 2025

Hi @lenzip!

Very reasonable question. The use case mostly comes from MadAnalysis (and LLPrecasting). They have two recasting approaches for detector emulation which are treated completely differently. Some analyses have therefore been recast by different teams, or more generally have two implementations with two DOIs. Hopefully, allowing for multiple implementations in the JSON schema gives enough flexibility for all relevant needs without adding too much complication to the syntax.

@mhabedan
Collaborator Author

mhabedan commented May 6, 2025

I added a required "tool_type" field as per @GraemeWatt's suggestion so HEPData knows whether the tool is a "Simplified analysis" or "Statistical model" (given that the schema seems to work for CMS Combine as well). Names are obviously open for discussion but I'd suggest using an enum so the "tool_type" values are categorised.

So the example above would now be

{
    "tool": "SModelS",
    "version": "3.0.0",
    "tool_type": "Simplified analysis",
    "url_templates": {
        "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{path}",
        "val_url": "https://smodels.github.io/docs/Validation#{name}_ul"
    },
    "analyses" : [
      {
        "inspire_id": 1795076,
        "signature_type": "prompt",
        "pretty_name": "di-top resonance",
        "implementations": [
          {
            "name" : "ATLAS-EXOT-2018-48",
            "path": "13TeV/ATLAS/{name}/"
          }
        ]
      }
    ],
    "implementations_license": {
        "name": "cc-by-4.0",
        "url": "https://creativecommons.org/licenses/by/4.0"
    }
}

@GraemeWatt
Member

I've merged and deployed a PR #881 that adds support for HackAnalysis and also adds a description for Rivet, MadAnalysis, SModelS and CheckMATE. The ANALYSES_ENDPOINTS in the config is now:

hepdata/hepdata/config.py

Lines 322 to 368 in 3a4eae4

ANALYSES_ENDPOINTS = {
    'rivet': {
        'endpoint_url': 'https://cedar-tools.web.cern.ch/rivet/analyses.json',
        'url_template': 'http://rivet.hepforge.org/analyses/{0}',
        'description': 'Rivet analysis'
    },
    'MadAnalysis': {
        'endpoint_url': 'https://madanalysis.irmp.ucl.ac.be/raw-attachment/wiki/MA5SandBox/analyses.json',
        'url_template': 'https://doi.org/{0}',
        'description': 'MadAnalysis 5 analysis'
    },
    'SModelS': {
        'endpoint_url': 'https://zenodo.org/records/13952092/files/smodels-analyses.hepdata.json?download=1',
        'url_template': '{0}',
        'description': 'SModelS analysis',
        'subscribe_user_id': 7766
    },
    'CheckMATE': {
        'endpoint_url': 'https://checkmate.hepforge.org/AnalysesList/analyses.json',
        'url_template': '{0}',
        'description': 'CheckMATE analysis',
        'subscribe_user_id': 6977
    },
    'HackAnalysis': {
        'endpoint_url': 'https://goodsell.pages.in2p3.fr/hackanalysis/json/HackAnalysis_HEPData.json',
        'url_template': '{0}',
        'description': 'HackAnalysis analysis',
        'subscribe_user_id': 7919,
        'license': {
            'name': 'gnu-gpl-3.0',
            'url': 'https://www.gnu.org/licenses/gpl-3.0.html'
        },
    },
    'Combine': {
        'endpoint_url': 'https://cms-public-likelihoods-list.web.cern.ch/artifacts/output.json',
        'url_template': 'https://doi.org/{0}',
        'description': 'Statistical models',
        'license': {
            'name': 'cc-by-4.0',
            'url': 'https://creativecommons.org/licenses/by/4.0'
        },
    },
    #'ufo': {},
    #'xfitter': {},
    #'applgrid': {},
    #'fastnlo': {},
}

The description is mainly for display in the resource landing pages. I'm not sure that we need tool_type in the new JSON schema, but maybe there could be a string field like implementations_description instead (taking the same value for all analyses within a given framework).

The analyses schema should be compatible with the existing additional_resources_schema.json. In particular, I'm not sure how to impose that the location (i.e. the analysis code URL) has a maximum length of 256 characters if it is built from multiple fields in the new JSON schema.

@mhabedan
Collaborator Author

mhabedan commented May 8, 2025

Hi @GraemeWatt,

Re. description: My idea was for tool_type to do what you have implemented in your example as description. So

  • You don't want description to be broad categories but a specific name for each tool? Are the tools supposed to supply that field themselves or does HEPData want to fill it in?
  • Would there be any benefit of having both, description and implementations_description? Or shouldn't it be enough to have either?

Re. location: Yes, you're right that the 256 character limit is difficult to enforce if we expect to format the path and name fields with one another. Is it important though to enforce the character limit? Why do we have to match the additional_resources_schema.json?

@GraemeWatt
Member

GraemeWatt commented May 8, 2025

The analysis links are stored in the database as DataResource objects via the update_analyses function, therefore the fields need to match the database table. External links can also be specified in the additional_resources field of the submitted submission.yaml file, which are also stored in the database as DataResource objects, therefore the restrictions imposed in additional_resources_schema.json match the restrictions of the database model. We don't currently check that the resource URL has a maximum length of 256 characters, but maybe this check could be made in the Python code rather than the JSON schema. It has not been a practical issue so far.

For the first HEPData record with a Combine link (https://www.hepdata.net/record/ins2796231?version=2), CMS added the link manually by uploading a revised submission.yaml file containing:

additional_resources:
- description: Statistical models
  location: https://doi.org/10.17181/bp9fx-6qs64
  license:
   name: cc-by-4.0
   url: https://creativecommons.org/licenses/by/4.0

When automating the procedure (PR #847 to address issue #846) I added 'description': 'Statistical models' to the config.py file to match the existing link that was added manually. Later I realised that it would make sense to add a description for the other analysis frameworks (PR #881), mainly so that the resource landing pages (example) have some text describing the link (e.g. it is not obvious from the DOI that the link is a MadAnalysis 5 analysis), but also because the description is a required field in additional_resources_schema.json. Another motivation was that with the new INSPIRE Data collection, which harvests HEPData metadata, they display external links from HEPData under "links" (example), showing the description if it is present (e.g. for Combine), but otherwise just the URL (e.g. for SModelS).

We can either leave description in the config.py file or move it to the new analysis JSON file as something like implementations_description (renamed to avoid confusion with the description used to describe each field in the JSON schema). It should be free text and not an enum, but it should match the values in the config.py file for the currently implemented frameworks. You can also keep tool_type if you want to distinguish statistical models from analysis code, but the distinction is not required in the current HEPData code.
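
To make the mapping concrete, here is a rough sketch of how the top-level fields of an analysis JSON file could be turned into an entry shaped like the additional_resources example above. It assumes the field ends up being called implementations_description; the helper name to_additional_resource is hypothetical and this is not the update_analyses code.

```python
def to_additional_resource(doc, location):
    """Build an additional_resources-style entry (hypothetical helper)."""
    resource = {
        "description": doc["implementations_description"],
        "location": location,
    }
    # The license is optional in the analysis JSON; copy it only if given.
    if "implementations_license" in doc:
        resource["license"] = dict(doc["implementations_license"])
    return resource


doc = {
    "tool": "Combine",
    "implementations_description": "Statistical models",
    "implementations_license": {
        "name": "cc-by-4.0",
        "url": "https://creativecommons.org/licenses/by/4.0",
    },
}
resource = to_additional_resource(doc, "https://doi.org/10.17181/bp9fx-6qs64")
print(resource)
```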

@WolfgangWaltenberger

Hi, sorry for the late reply. I am implementing things on the SModelS side, and I think it all fits us. Just a quick question:
for the SModelS analyses, what would be an appropriate description of "tool_type"? It's not a statistical model, not a simplified analysis, it's something like an "analysis result interpreted in the context of simplified model".
Maybe "sms_analysis_result"? Is that too cumbersome or detailed? "analysis_result", then?

Wolfgang

@mhabedan
Collaborator Author

Hi @GraemeWatt,

Thanks for the detailed explanation! I understand now why you want to match the additional_resources_schema.json. We could try to limit the path and name fields to 128 characters. That's not a full protection against going longer than 256 characters when we combine the two but should make problems even more unlikely.

I wasn't fully aware how the description field was used by HEPData but I do see the advantage now. Having a description for the link, both in HEPData and in inspireHEP, is really helpful! In that case I'd suggest to move the description indeed to the new analysis JSON file and rename it implementations_description. That makes it more descriptive and takes the load of handling cases off HEPData's config.py.

Hi @WolfgangWaltenberger,
Thanks for your input! After the discussion above, you'd have to supply implementations_description and it wouldn't necessarily have to be as short as possible, but as informative as possible (@GraemeWatt, please correct me if I'm wrong). So something like "SMS analysis results" or "SModelS analysis" might be best?

@GraemeWatt
Member

GraemeWatt commented May 12, 2025

Summarising a couple of points from my previous message:

  • Checking that the resource URL has a maximum length of 256 characters could be made in the Python code rather than the JSON schema.
  • Values of the new implementations_description for currently implemented frameworks should match the description values currently given in the config.py file (e.g. "SModelS analysis") to avoid needing to update the descriptions currently stored in the database (unless there is a strong reason to change them).
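
A length check in Python could be as small as the sketch below. The function name location_fits is hypothetical and nothing here is taken from the HEPData code base; the example also shows why capping "path" and "name" at 128 characters each cannot guarantee the limit once a URL stem is added.

```python
MAX_LOCATION_LENGTH = 256  # limit quoted in the discussion above


def location_fits(url, max_length=MAX_LOCATION_LENGTH):
    """Return True if the fully expanded URL respects the length limit."""
    return len(url) <= max_length


stem = "https://doi.org/"
ok_url = stem + "10.17181/bp9fx-6qs64"
# Two 128-character fields plus any non-empty stem exceed 256 characters.
long_url = stem + "p" * 128 + "n" * 128

print(location_fits(ok_url))    # True
print(location_fits(long_url))  # False
```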

@WolfgangWaltenberger

smodels-analyses.json

Alright attached is what our current version would look like. Tell me in case you want something changed.
Cheers

Wolfgang

@mhabedan
Collaborator Author

mhabedan commented May 13, 2025

Hi @WolfgangWaltenberger,
Yes, that would mostly work. Three small comments:

  • "implementations": [, should be "implementations": [
  • After the discussion above, tool_type has been renamed to implementations_description, which as of 503ff8f is also reflected in the JSON schema. So you don't need to supply tool_type in addition to implementations_description.
  • There are a number of additional fields (created, url_templates/val_url, url_templates/publication, ...) that aren't required but you're certainly aware of that.

For future reference, the example I gave above would be

{
    "tool": "SModelS",
    "version": "3.0.0",
    "implementations_description": "SModelS analysis",
    "url_templates": {
        "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{path}",
        "val_url": "https://smodels.github.io/docs/Validation#{name}_ul"
    },
    "analyses" : [
      {
        "inspire_id": 1795076,
        "signature_type": "prompt",
        "pretty_name": "di-top resonance",
        "implementations": [
          {
            "name" : "ATLAS-EXOT-2018-48",
            "path": "13TeV/ATLAS/{name}/"
          }
        ]
      }
    ],
    "implementations_license": {
        "name": "cc-by-4.0",
        "url": "https://creativecommons.org/licenses/by/4.0"
    }
}

as of 503ff8f.

@WolfgangWaltenberger

Hi @WolfgangWaltenberger, Yes, that would mostly work. Three small comments:

  • "implementations": [, should be "implementations": [

Ha, eluded me, thanks!

  • After the discussion above, tool_type has been renamed to implementations_description, which as of 503ff8f is also reflected in the JSON schema. So you don't need to supply tool_type in addition to implementations_description.

Right, will take out.

  • There are a number of additional fields (created, url_templates/val_url, url_templates/publication, ...) that aren't required but you're certainly aware of that.

Right, would leave in for our own "gain". That is not a problem, right?

Wolfgang

@mhabedan
Collaborator Author

Hi!

After thinking about it and discussing with Andy, I'm convinced that adding the date_created field is a very useful addition to the standard.

So the example from above would now be

{
    "tool": "SModelS",
    "version": "3.0.0",
    "date_created": "2018-11-13T20:20:39+00:00",
    "implementations_description": "SModelS analysis",
    "url_templates": {
        "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{path}",
        "val_url": "https://smodels.github.io/docs/Validation#{name}_ul"
    },
    "analyses" : [
      {
        "inspire_id": 1795076,
        "signature_type": "prompt",
        "pretty_name": "di-top resonance",
        "implementations": [
          {
            "name" : "ATLAS-EXOT-2018-48",
            "path": "13TeV/ATLAS/{name}/"
          }
        ]
      }
    ],
    "implementations_license": {
        "name": "cc-by-4.0",
        "url": "https://creativecommons.org/licenses/by/4.0"
    }
}

as of 0d972a9.

WolfgangWaltenberger added a commit to SModelS/smodels-utils that referenced this pull request Jun 24, 2025
@WolfgangWaltenberger

from datetime import datetime, timezone
from zoneinfo import ZoneInfo
# now = datetime.now(timezone.utc)
now = datetime.now(ZoneInfo("Europe/Vienna"))
timestamp = now.isoformat()
self.f.write (f' "date_created": "{timestamp}",\n' )

I assume you don't care much, but I am oscillating between UTC and the timezone of the JSON file production :)

Wolfgang

@agbuckley

I checked and dateutils was happy to parse any variant, including pure date with the time info discarded... so I don't think we need to overspecify that. The main use-case will be to know roughly what era a given file dates from, so we know easily if it's ancient and needs to be updated/ignored!
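
As a small illustration of why the choice of offset does not matter for that use case, the sketch below parses a few variants and normalises them to UTC. It deliberately uses the standard library's datetime.fromisoformat rather than the parser mentioned above, and it assumes that a timestamp without an offset (including a pure date) should be read as UTC; the function name parse_date_created is hypothetical.

```python
from datetime import datetime, timezone


def parse_date_created(value):
    """Parse an ISO 8601 string and normalise it to UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without an offset (e.g. pure dates) are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


utc_stamp = parse_date_created("2018-11-13T20:20:39+00:00")
vienna_stamp = parse_date_created("2018-11-13T21:20:39+01:00")
date_only = parse_date_created("2018-11-13")

print(utc_stamp == vienna_stamp)  # True
print(date_only.isoformat())      # 2018-11-13T00:00:00+00:00
```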

@mhabedan
Collaborator Author

As per @GraemeWatt's suggestion, the standard includes a new schema_version field now. The current schema version that should be given there is 1.0.0.
A minimal example now looks like this

{
  "schema_version" : "1.0.0",
  "tool": "SModelS",
  "version": "3.0.0",
  "date_created": "2018-11-13T20:20:39+00:00",
  "implementations_description": "SModelS analysis",
  "url_templates": {
    "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{name}"
  },
  "analyses" : [
    {
      "inspire_id": 1795076,
      "implementations": [
        {
          "name" : "ATLAS-EXOT-2018-48"
        }
      ]
    }
  ]
}

Furthermore, by popular demand, I've added a readme which describes in detail all fields that are required or defined, gives usage examples and instructions on how to test against the standard.

From my point of view, the standard is finalised now.
As soon as #886 as a first test case is merged, I'll include a schema checker in this PR and then it's good to go.
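
Until a proper checker is in place, a tool author could run a quick sanity check along the following lines. This is a hand-written stand-in and not JSON Schema validation: the list of required fields is an assumption inferred from the minimal example above, the function name missing_fields is hypothetical, and the authoritative definition remains the schema file and the readme.

```python
# Assumption: the fields shown in the minimal example are the required ones.
REQUIRED_TOP_LEVEL = (
    "schema_version", "tool", "version", "date_created",
    "implementations_description", "url_templates", "analyses",
)


def missing_fields(doc):
    """Return dotted paths of assumed-required fields that are absent."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in doc]
    if "url_templates" in doc and "main_url" not in doc["url_templates"]:
        missing.append("url_templates.main_url")
    for i, analysis in enumerate(doc.get("analyses", [])):
        if "inspire_id" not in analysis:
            missing.append(f"analyses[{i}].inspire_id")
        implementations = analysis.get("implementations")
        if not implementations:
            missing.append(f"analyses[{i}].implementations")
            continue
        for j, implementation in enumerate(implementations):
            if "name" not in implementation:
                missing.append(f"analyses[{i}].implementations[{j}].name")
    return missing


minimal = {
    "schema_version": "1.0.0",
    "tool": "SModelS",
    "version": "3.0.0",
    "date_created": "2018-11-13T20:20:39+00:00",
    "implementations_description": "SModelS analysis",
    "url_templates": {
        "main_url": "https://github.com/SModelS/smodels-database-release/tree/main/{name}"
    },
    "analyses": [
        {"inspire_id": 1795076, "implementations": [{"name": "ATLAS-EXOT-2018-48"}]}
    ],
}
print(missing_fields(minimal))  # []
```

A check like this only reports absent keys; types, formats and unknown fields would still need a real validator run against the schema.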

Collaborator Author

@mhabedan mhabedan left a comment

Following @GraemeWatt's suggestion, I've pulled the logic to parse the new analyses JSON files from #886 to this PR. I've also added the schema validation. At the moment, on failure, the old JSON file handling is invoked. That should be changed to a harder enforcement of the new JSON format once all tools have made the transition.

I've added a couple more discussion points directly in the code. Let me know what you think!

Comment thread hepdata/modules/records/utils/analyses.py
Comment thread hepdata/modules/records/utils/analyses.py
@GraemeWatt
Member

Following @GraemeWatt's suggestion, I've pulled the logic to parse the new analyses JSON files from #886 to this PR.

I didn't actually suggest moving the parsing code to this PR. My comment (in yesterday's email) was:

By the way, I think it would make sense to merge "Analysis JSON schema" (#878) first, then "Add GAMBIT analysis JSON" (#886) could add a validation check against the schema.

I just wanted #878 containing the JSON schema to be merged first, so that #886 could use it for validation. But it doesn't matter too much which PR contains the parsing code, so keep it here if you prefer.

I think you can now check for the presence of schema_version in the analysis JSON file to decide whether the JSON file is in the new format. We can also define a JSON schema (call it version 0.1.0, say) for the old JSON format and use it for validation.
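
A minimal sketch of that check, assuming the old format is always a plain mapping from inspire ID to a list of names with no schema_version key (the function name detect_format is hypothetical):

```python
def detect_format(doc):
    """Return "new" if the document declares a schema_version, else "legacy"."""
    if isinstance(doc, dict) and "schema_version" in doc:
        return "new"
    return "legacy"


new_doc = {"schema_version": "1.0.0", "tool": "SModelS", "analyses": []}
old_doc = {"2705044": ["10.17181/z0382-yz736"]}

print(detect_format(new_doc))  # new
print(detect_format(old_doc))  # legacy
```

The result could then select which schema to validate against, instead of falling back to the old handling only after validation fails.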

@mhabedan mhabedan mentioned this pull request Sep 2, 2025
@GraemeWatt
Member

Closing this PR without merging as it has now been replaced by a new PR #906.

@mhabedan
Collaborator Author

mhabedan commented Sep 9, 2025

As of yesterday, #906 is merged and the new analyses JSON schema is live! 🎉 All tools are now encouraged to move to the new schema. (Apart from SModelS who thankfully already did that so we could use them as a test case.)

@mhabedan mhabedan deleted the analysisSchema branch September 22, 2025 16:02
5 participants