Add support for HS3 and SimpleAnalysis resource files #925

Merged
GraemeWatt merged 7 commits into main from simpleanalysis-hs3 on Oct 17, 2025

Conversation

@GraemeWatt
Member

@GraemeWatt GraemeWatt commented Oct 15, 2025

  • Remove identification of HistFactory files by case-insensitive trigger words ("histfactory", "pyhf", "likelihoods", "workspaces") in the description. An explicit type: HistFactory is now required.
  • Identify new HS3 files via type: HS3 or by the string HS3 in the description (closes #921, "Highlight HS3 similar to pyHF").
  • Identify SimpleAnalysis files via type: SimpleAnalysis or by the string SimpleAnalysis in the description (closes #864, "Adding “analysis:SimpleAnalysis” search query").
  • Repurpose add_histfactory_analyses.py to find existing SimpleAnalysis and HS3 files by running hepdata fix add-analyses -a SimpleAnalysis and hepdata fix add-analyses -a HS3 in the production environment.
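
The identification rules listed above can be sketched roughly as follows. This is only an illustration, not the code in hepdata/modules/records/utils/common.py: the function name classify_resource, the constant values, the case-sensitive match and the rule that an explicit type takes precedence over the description are all assumptions.

```python
from typing import Optional

# Assumed values; the PR introduces HS3_FILE_TYPE and SIMPLEANALYSIS_FILE_TYPE
# in hepdata/config.py, but the strings used here are a guess.
HISTFACTORY_FILE_TYPE = "HistFactory"
HS3_FILE_TYPE = "HS3"
SIMPLEANALYSIS_FILE_TYPE = "SimpleAnalysis"


def classify_resource(file_type: Optional[str], description: Optional[str]) -> Optional[str]:
    """Return the analysis type of a resource file, or None if it is not tagged.

    Hypothetical helper for illustration only.
    """
    file_type = (file_type or "").strip()
    description = description or ""

    # Assumption: an explicit type always wins over the description.
    if file_type in (HISTFACTORY_FILE_TYPE, HS3_FILE_TYPE, SIMPLEANALYSIS_FILE_TYPE):
        return file_type

    # HS3 and SimpleAnalysis may also be recognised from the description
    # (assumed here to be a case-sensitive substring match).
    for name in (HS3_FILE_TYPE, SIMPLEANALYSIS_FILE_TYPE):
        if name in description:
            return name

    # HistFactory is no longer inferred from trigger words in the description.
    return None
```

With these assumptions, a description that only mentions "pyhf" or "likelihoods" is no longer tagged, while one containing the exact string HS3 is.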

* Remove identification of HistFactory by trigger words in description.
* Repurpose add_histfactory_analyses.py to find SimpleAnalysis files.
Contributor

Copilot AI left a comment

Pull Request Overview

Adds explicit support for HS3 and SimpleAnalysis resource files and removes implicit HistFactory detection by description keywords.

  • Require explicit type: HistFactory or HS3 for those formats; no longer infer HistFactory from description/filename.
  • Add SimpleAnalysis detection via description term "SimpleAnalysis" or explicit type, update indexing to surface HS3 and SimpleAnalysis in analyses, and repurpose the fix script to tag existing SimpleAnalysis resources.
  • Update tests and search help documentation accordingly.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/submission_test.py | Updates tests to validate SimpleAnalysis detection and explicit HS3/HistFactory typing. |
| tests/search_test.py | Adjusts search tests to include SimpleAnalysis resource type and uses SITE_URL for analysis URLs. |
| hepdata/version.py | Bumps version string. |
| hepdata/modules/search/templates/hepdata_search/modals/search_help.html | Adds HS3 and SimpleAnalysis examples to search help. |
| hepdata/modules/records/utils/common.py | Removes HistFactory term-based detection; adds SimpleAnalysis detection and explicit handling for HS3. |
| hepdata/ext/opensearch/document_enhancers.py | Indexes HS3 and SimpleAnalysis in the analyses field similarly to HistFactory. |
| hepdata/config.py | Introduces HS3_FILE_TYPE and SIMPLEANALYSIS_FILE_TYPE constants. |
| fixes/add_simpleanalysis_analyses.py | Repurposes the fix command to tag SimpleAnalysis resources and reindex. |

Comment thread hepdata/ext/opensearch/document_enhancers.py
@mhabedan
Collaborator

Hi @GraemeWatt,

Thanks a lot for this, looks good to me!

I just have one question regarding the SimpleAnalysis files: the mechanism in this PR relies on the SimpleAnalysis snippets that have been uploaded to HEPData as part of a record. I assume that's not the complete list of public SimpleAnalysis codes though. Would it make sense to have SimpleAnalysis expose an analysis JSON file similar to Rivet, MadAnalysis etc.? How would the overlaps between HEPData files and the external files from the SimpleAnalysis server be treated?
Apologies for coming with that question only now, I wasn't aware of #864.

* Follow suggestion made by Copilot review.
@coveralls

coveralls commented Oct 15, 2025

Coverage: 84.445% (-0.02%) from 84.463% when pulling b55a335 on simpleanalysis-hs3 into f133226 on main.

@GraemeWatt
Member Author

@mhabedan : good point, this should be clarified before this PR is merged. Currently, the highlighted analyses are either hosted externally and specified in an analyses JSON file (e.g. SModelS) or hosted directly on HEPData (e.g. HistFactory). It would be difficult to combine the two approaches without duplication (or we'd need to write more code to filter out duplicates), so I would prefer to choose only one approach for SimpleAnalysis. There is a significant number of SimpleAnalysis code snippets already hosted on HEPData: a search query resources:SimpleAnalysis returns 27 records. Issue #864 mentions the additional resource files, so I would prefer that option. Do you want to contact the responsible people within ATLAS to clarify, or shall I reply to Judita's original email from March and copy you in?

By the way, I realised I can generalise the add_simpleanalysis_analyses.py file to work also for HS3 just by looking for the string HS3 in the description of the resource files. The list of HEPData records with HS3 files can currently be obtained with a search query resources:HS3 returning 8 records.

@mhabedan
Collaborator

If it's an either-or to simplify the bookkeeping that HEPData has to do, I think I personally would prefer to link another JSON file directly from SimpleAnalysis. That would hopefully allow better coverage of what is actually available without relying on analysers to upload their code to HEPData (evidently difficult as 52/79 codes are still missing) as well as being able to always refer to the latest version.

I've nonetheless contacted the SimpleAnalysis authors to ask their opinion as well.

Nice idea about extending the search-description to HS3. I just hope that keyword will prove less ambiguous than it turned out to be for pyHF.

* Rename as add_analyses.py and modify to work also for HS3.
* Rename as is_analysis and modify to work also for HS3.
* Allow case where some analyses included locally and some via endpoint.
@GraemeWatt GraemeWatt requested a review from Copilot October 16, 2025 21:37
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread hepdata/modules/records/utils/common.py
Comment thread fixes/add_analyses.py
* Allow case where some analyses included locally and some via endpoint.
@GraemeWatt
Member Author

@mhabedan : I can see that it would be useful if both local files and external links (defined in a JSON file) were tagged as SimpleAnalysis and I think it should be possible after minor changes made in commits 4a0db94 and 8e0a111 (not yet fully tested). So I could merge this PR now and add the SimpleAnalysis links for local files, then we could add the links to GitLab provided via a JSON file at a later date if desired? It probably doesn't matter too much if there are two SimpleAnalysis links (one local and one to GitLab) for some HEPData records. Maybe the SimpleAnalysis JSON file could only include SimpleAnalysis links where there is not a local file already on HEPData to avoid duplication, although for uses outside HEPData you would probably want the SimpleAnalysis JSON file to be complete.
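
A rough sketch of the mixed case described above, where some analyses come from local files and some from an external JSON endpoint. It does not reproduce commits 4a0db94 or 8e0a111: the function merge_analyses and the dictionary shape (keys "type" and "analysis") are hypothetical, and the example URLs are placeholders.

```python
from typing import Dict, List

# Hypothetical shape of one entry: {"type": "SimpleAnalysis", "analysis": "<URL>"}
Analysis = Dict[str, str]


def merge_analyses(local: List[Analysis], external: List[Analysis]) -> List[Analysis]:
    """Combine local and endpoint-provided analyses, dropping exact duplicates.

    Two entries count as duplicates only if both the type and the URL match,
    so a local file and a GitLab link for the same record are both kept.
    """
    merged: List[Analysis] = []
    seen = set()
    for entry in local + external:
        key = (entry["type"], entry["analysis"])
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    return merged


# Placeholder URLs for illustration only.
local_files = [{"type": "SimpleAnalysis", "analysis": "https://example.org/local/ANA.cxx"}]
endpoint = [
    {"type": "SimpleAnalysis", "analysis": "https://example.org/gitlab/ANA.cxx"},
    {"type": "SimpleAnalysis", "analysis": "https://example.org/local/ANA.cxx"},
]
combined = merge_analyses(local_files, endpoint)
```

Under this scheme a record can end up with two SimpleAnalysis links (one local, one external), which matches the tolerance for duplication expressed above.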

I can update hepdata_lib and the submission docs after this PR is deployed, but since uploaders might not immediately know they need to specify type: HS3 when including HS3 files, I added identification via a string HS3 in the description of the resource file. Similarly, SimpleAnalysis files are identified with either type: SimpleAnalysis or a string SimpleAnalysis in the description of the resource file. I looked through the 27 HEPData records found from a search query resources:SimpleAnalysis and the 8 HEPData records found from resources:HS3. Tagging via SimpleAnalysis or HS3 in the description should mostly work with some isolated exceptions:

  • https://www.hepdata.net/record/ins2845789 has a link to https://gitlab.cern.ch/atlas-sa/simple-analysis/-/blob/master/SimpleAnalysisCodes/src/ANA-HIGP-2024-32.cxx?ref_type=heads instead of including the .cxx file locally. This won't be tagged as SimpleAnalysis by the current code since it is an external link not a local file. The same record has a .tar.gz file with likelihoods labelled as type: HistFactory although the description contains HS^3 (not HS3), so this will be tagged as HistFactory not HS3 by the current code.

  • https://www.hepdata.net/record/ins2829504 has two HS3 JSON files attached to the first data table not to the whole submission. The analyses tagging code only works for resource files attached to the whole submission, therefore these files won't be tagged as HS3 despite having the string HS3 in the description. They will be findable via a search query resources:HS3 but not analysis:HS3.

  • https://www.hepdata.net/record/ins2745375 has two JSON files with type: HistFactory and a Python file with description "Run likelihood minimization using HS3 json file" which will be tagged as HS3 despite not being a JSON file itself.

These few exceptions could be cleaned up by either re-uploading in a better format or by manually editing the relevant database fields, but it probably does not matter too much if there are a few imperfections.

@GraemeWatt
Member Author

@mhabedan : although the mixed case of local files and external links (both tagged as SimpleAnalysis) with possible duplication should now be supported by the code in this PR, I could remove the tagging of local files if you think it's simpler. Then after processing a future SimpleAnalysis JSON file (to be provided), the "SimpleAnalysis" badge would only apply to the external links and not to the local files. A search query analysis:SimpleAnalysis would return only records with external links, but resources:SimpleAnalysis would return both local files and external links that have the string SimpleAnalysis in the description. Let me know if you prefer this option and I'll modify this PR so that it only makes changes to support HS3 (stored locally).

@mhabedan
Collaborator

Hi @GraemeWatt!

Fantastic, thanks for all this!

SimpleAnalysis will indeed expose a JSON file to HEPData soon, I'll create a PR as soon as that happens. No reason to delay this PR though.
I don't have a strong preference between duplicating the SimpleAnalysis badge and splitting into analysis:SimpleAnalysis (1 hit) and resources:SimpleAnalysis (up to 2 hits). So I'm happy with keeping it as it is now in this PR.

Re. the accuracy of the keyword tagging: thanks for having a look at this so quickly! I'll get in touch with the HEPData record creators as soon as this PR is merged and ask them to improve the labels or use the new HS3 tag.

@GraemeWatt GraemeWatt merged commit 033fbc0 into main Oct 17, 2025
7 checks passed
@GraemeWatt GraemeWatt deleted the simpleanalysis-hs3 branch October 17, 2025 09:13
@GraemeWatt
Member Author

This PR is now deployed in production on hepdata.net and I ran the commands to find existing SimpleAnalysis and HS3 files. Now a search query analysis:SimpleAnalysis returns 27 records and a search query analysis:HS3 returns 6 records. The difference of 2 records with respect to a search query resources:HS3 (8 records) is explained in my last comment.
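
For reference, the search queries quoted above could be turned into URLs along these lines. The search path https://www.hepdata.net/search/ with a q parameter is an assumption here, and search_url is a hypothetical helper, not part of HEPData.

```python
from urllib.parse import urlencode

# Assumed search endpoint; check the HEPData search help before relying on it.
SEARCH_URL = "https://www.hepdata.net/search/"


def search_url(field: str, value: str) -> str:
    """Build a search URL for a query such as analysis:HS3 or resources:HS3."""
    return f"{SEARCH_URL}?{urlencode({'q': f'{field}:{value}'})}"


analysis_query = search_url("analysis", "HS3")    # tagged analyses only
resources_query = search_url("resources", "HS3")  # any matching resource
```

The analysis: form only matches resources that were tagged as analyses, whereas resources: also matches resources that merely mention the string, which is why the two counts can differ.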

Actually, https://www.hepdata.net/record/ins2845789 does have a SimpleAnalysis tag because my update command just looks for the string SimpleAnalysis in the description, but it wouldn't have a SimpleAnalysis tag for a new upload where the processing code is slightly different if the resource starts with http. This resource will be deleted when the future SimpleAnalysis JSON file is processed, unless the URL matches a URL specified by the SimpleAnalysis JSON file.

I'll update hepdata_lib and the submission docs next week.

@GraemeWatt
Member Author

I made some manual database updates to address the anomalous three records mentioned in my previous comment:

@mhabedan
Collaborator

Thanks a lot!

Two comments:

  • With this setup, https://www.hepdata.net/record/ins2829504 still doesn't show up when searching "analysis:HS3" as you explained in your comment. But that's a bit unintuitive, isn't it? Wouldn't it be better to mark the whole record as type: HS3?
  • Upon inspection, the "HistFactory" files in https://www.hepdata.net/record/ins2745375 are actually also HS3 files. The Python file mainly gives instructions how to run these files. Should the type of these JSON files and thereby of the record be corrected?

@GraemeWatt
Member Author

It's only individual resources that have a type not a whole record. The only resource attached to the whole record is the ATLAS analysis web page, which shouldn't be tagged as HS3. We only implemented the resource tagging (and DOI minting) code for resources attached to a whole record, not for resources attached to individual tables. That would have been more complicated and there wasn't a clear use case for it. The problem is that uploaders keep making slight deviations from the expected/documented behaviour for no good reason. The submission docs already mention the "first document of the submission.yaml file", but I'll try to clarify further. In this case, I don't see an obvious reason why HS3 files should be attached to the first "Measured cross-sections" table and not to the whole record. I can't easily correct the database, but I can write to the original submitters and ask them to make a new upload.

>   • Upon inspection, the "HistFactory" files in https://www.hepdata.net/record/ins2745375 are actually also HS3 files. The Python file mainly gives instructions how to run these files. Should the type of these JSON files and thereby of the record be corrected?

OK, I've changed the type from HistFactory to HS3 for these two files and reindexed, but the descriptions are misleading ("in the HistFactory JSON format"). And to repeat my comment above, only individual resources (not whole records) have a type. The indexing procedure then looks for resources attached to a whole record (not individual tables) with a particular type (e.g. HS3) and adds these resources to analyses associated with that record.
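
The indexing behaviour described above can be illustrated with a minimal sketch. It is not the code in hepdata/ext/opensearch/document_enhancers.py: the record layout (a "resources" list on the record and on each table), the field names and the function collect_analyses are all assumptions for illustration.

```python
from typing import Dict, List

# Types that are surfaced as analyses, as named in this PR.
ANALYSIS_TYPES = ("HistFactory", "HS3", "SimpleAnalysis")


def collect_analyses(record: Dict) -> List[Dict[str, str]]:
    """Collect analyses from resources attached to the whole record only."""
    analyses = []
    for resource in record.get("resources", []):
        if resource.get("type") in ANALYSIS_TYPES:
            analyses.append({"type": resource["type"], "analysis": resource["url"]})
    # Resources under record["tables"][i]["resources"] are deliberately ignored.
    return analyses


# Hypothetical record: the HS3 file is attached to the first table, not the record.
table_level = {
    "resources": [{"type": "html", "url": "https://example.org/analysis-page"}],
    "tables": [
        {"name": "Table 1",
         "resources": [{"type": "HS3", "url": "https://example.org/model.json"}]}
    ],
}
# Same files after a new upload that moves the HS3 file to the whole record.
record_level = {
    "resources": table_level["resources"] + table_level["tables"][0]["resources"],
    "tables": [{"name": "Table 1", "resources": []}],
}

before = collect_analyses(table_level)
after = collect_analyses(record_level)
```

In this sketch the first layout yields no analyses, while the second yields one HS3 entry, mirroring why a new upload was requested for the affected record.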

@mhabedan
Collaborator

> It's only individual resources that have a type not a whole record. The only resource attached to the whole record is the ATLAS analysis web page, which shouldn't be tagged as HS3. We only implemented the resource tagging (and DOI minting) code for resources attached to a whole record, not for resources attached to individual tables. That would have been more complicated and there wasn't a clear use case for it. The problem is that uploaders keep making slight deviations from the expected/documented behaviour for no good reason. The submission docs already mention the "first document of the submission.yaml file", but I'll try to clarify further. In this case, I don't see an obvious reason why HS3 files should be attached to the first "Measured cross-sections" table and not to the whole record. I can't easily correct the database, but I can write to the original submitters and ask them to make a new upload.

Agreed, I think it's a mistake rather than a feature. I can also reach out to the submitters internally, if you prefer.

> >   • Upon inspection, the "HistFactory" files in https://www.hepdata.net/record/ins2745375 are actually also HS3 files. The Python file mainly gives instructions how to run these files. Should the type of these JSON files and thereby of the record be corrected?
>
> OK, I've changed the type from HistFactory to HS3 for these two files and reindexed, but the descriptions are misleading ("in the HistFactory JSON format"). And to repeat my comment above, only individual resources (not whole records) have a type. The indexing procedure then looks for resources attached to a whole record (not individual tables) with a particular type (e.g. HS3) and adds these resources to analyses associated with that record.

Thanks for the clarification and for updating the type of the files!

@GraemeWatt
Member Author

GraemeWatt commented Oct 21, 2025

@GraemeWatt
Member Author

From the list of HEPData records with HS3 files I found that https://www.hepdata.net/record/ins2689657 and https://www.hepdata.net/record/ins2616326 had attached HS3 files misidentified as HistFactory because the descriptions contained the trigger word "likelihoods" via the URL https://opendata.atlas.cern/docs/tutresearch/public_likelihoods/, so I changed the type of these files from HistFactory to HS3. Now the search analysis:HS3 returns 10 records.
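
The misidentification described above follows from how a case-insensitive substring match behaves. The sketch below uses the four trigger words listed in the PR description; the function matched_trigger_words and the example description are hypothetical, and the removed HEPData code may have differed in detail.

```python
from typing import List

# Trigger words formerly used to infer HistFactory files (from the PR description).
HISTFACTORY_TERMS = ("histfactory", "pyhf", "likelihoods", "workspaces")


def matched_trigger_words(description: str) -> List[str]:
    """Return the trigger words found anywhere in the description, ignoring case."""
    lowered = description.lower()
    return [term for term in HISTFACTORY_TERMS if term in lowered]


# Made-up description; only the URL is taken from the comment above.
description = (
    "HS3 JSON file; see "
    "https://opendata.atlas.cern/docs/tutresearch/public_likelihoods/ for usage"
)
hits = matched_trigger_words(description)
```

Here the only hit is "likelihoods", and it comes from the URL path rather than from the prose, which is the kind of false positive that requiring an explicit type: HistFactory avoids.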

@GraemeWatt
Member Author

  • Written to submitters (Coordinator/Uploader/Reviewer) of https://www.hepdata.net/record/ins2829504 via the "Ask a Question" interface to request a new upload with HS3 JSON files moved from the first table to the whole submission.

Now updated in https://www.hepdata.net/record/ins2829504?version=2

@GraemeWatt
Member Author

@cburgard : as mentioned at the end of my email yesterday, the three JSON files attached to https://www.hepdata.net/record/ins2905253 have type: HistFactory in the submission.yaml file without the string “HS3” in the description, but inspection of the JSON files reveals that they are in HS3 format not HistFactory. That record was finalised on 2025-09-23 before we added the support for type: HS3 in this PR. I've edited the database to change the type of the three JSON files from HistFactory to HS3. Let me know if you find any other records that need to be updated.

Development

Successfully merging this pull request may close these issues:

  • Highlight HS3 similar to pyHF (#921)
  • Adding “analysis:SimpleAnalysis” search query (#864)

4 participants