Extend importer module to allow bulk import from Rivet #958

Merged
GraemeWatt merged 5 commits into main from copilot/extend-importer-module-bulk-import on May 11, 2026

Copilot AI commented Mar 17, 2026

The importer module was hardcoded to fetch INSPIRE IDs and submission files exclusively from hepdata.net, and always assigned user ID 1 as the Coordinator. This blocked bulk import of ~780 Rivet analyses hosted at an alternate web location.

Changes

api.py

  • get_inspire_ids: new ids_url parameter — when set, fetches the INSPIRE ID list directly from that URL (expects a JSON array of integers, e.g. inspire.json) instead of constructing the HEPData /search/ids endpoint. n_latest still applies client-side; last_updated is ignored when ids_url is used.
  • _download_file: new files_url parameter — when set, downloads from {files_url}/ins{inspire_id}.tar.gz instead of {base_url}/download/submission/ins{inspire_id}/original.
  • _import_record / import_records: new coordinator_id (default 1) and files_url parameters, replacing the hardcoded admin_user_id = 1.
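
The two download URL patterns and the client-side n_latest handling described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the helper names build_download_url and parse_inspire_ids are hypothetical, and keeping the first n_latest entries of the list is an assumption about how the truncation is applied.

```python
import json


def build_download_url(inspire_id, base_url="https://hepdata.net", files_url=None):
    """Return the archive URL for one record (hypothetical helper)."""
    if files_url:
        # Alternate location: a plain web directory of ins<ID>.tar.gz files.
        return "{0}/ins{1}.tar.gz".format(files_url, inspire_id)
    # Default: the HEPData download endpoint for the original submission.
    return "{0}/download/submission/ins{1}/original".format(base_url, inspire_id)


def parse_inspire_ids(payload, n_latest=None):
    """Parse the body of an ids_url response (hypothetical helper)."""
    ids = json.loads(payload)
    if not isinstance(ids, list) or not all(isinstance(i, int) for i in ids):
        raise ValueError("expected a JSON array of integers")
    if n_latest and n_latest > 0:
        # Assumption: keep the first n_latest entries of the list as served.
        ids = ids[:n_latest]
    return ids
```

With files_url set to https://example.com/hepdata, record 2705058 resolves to https://example.com/hepdata/ins2705058.tar.gz; without it, the URL falls back to the hepdata.net download endpoint.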

cli.py

  • import-records: adds --coordinator-id/-c, --files-url/-f
  • bulk-import-records: adds --ids-url, --files-url/-f, --coordinator-id/-c

Example — bulk import from a Rivet mirror:

hepdata importer bulk-import-records \
  --ids-url https://example.com/hepdata/inspire.json \
  --files-url https://example.com/hepdata \
  --coordinator-id 42
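
The mirror in this example has to serve an inspire.json file alongside the ins{inspire_id}.tar.gz archives. One way to produce that file is to derive the IDs from the archive names, as in the sketch below. The function names, the sorted output order, and the idea of generating the file this way are assumptions for illustration; the PR itself does not say how the list is produced.

```python
import json
import re
from pathlib import Path

# Matches archive names such as ins2705058.tar.gz and captures the INSPIRE ID.
TARBALL_PATTERN = re.compile(r"^ins(\d+)\.tar\.gz$")


def collect_inspire_ids(directory):
    """Return the sorted INSPIRE IDs found in archive names (hypothetical helper)."""
    ids = []
    for path in Path(directory).iterdir():
        match = TARBALL_PATTERN.match(path.name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def write_inspire_json(directory, filename="inspire.json"):
    """Write the ID list as a JSON array of integers next to the archives."""
    target = Path(directory) / filename
    target.write_text(json.dumps(collect_inspire_ids(directory)))
    return target
```

Files that do not match the ins{inspire_id}.tar.gz pattern are ignored, so the generated inspire.json can live in the same directory as the archives.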

Tests

  • Extended test_get_inspire_ids to cover ids_url success, ids_url + n_latest, and ids_url 404.
  • Updated test_import_records call-signature assertions and added a case for non-default coordinator_id/files_url propagation.
  • Added files_url download-error case to test_import_record.
Original prompt

This section details on the original issue you should resolve

<issue_title>records: extend importer module to allow bulk import from Rivet</issue_title>
<issue_description>The importer module (CLI) was written to import records from hepdata.net to a developer's local instance. It uses a list of INSPIRE IDs given at https://www.hepdata.net/search/ids?inspire_ids=true and it downloads files using a URL pattern url = "{0}/download/submission/ins{1}/original".format(base_url, inspire_id) where base_url = 'https://hepdata.net'.

The importer module should be extended to get the list of INSPIRE IDs and the download files from an alternate location, for example, a simple web directory with the INSPIRE IDs contained in the name of the files. It should also be possible to create records with any user assigned as the Coordinator (rather than just admin_user_id = 1). The ability to import only a subset of the complete list of INSPIRE IDs would be useful.

These changes should be carefully tested locally and on the QA system before importing to the production instance. Such an extension would be a quicker way of importing the 780 records obtained from Rivet than using the normal submission web interface.

See also discussion with @20DM in HEPData/hepdata_lib#229.

A list of the Rivet analyses can be seen at https://gitlab.com/hepcedar/rivet/-/issues/485 .</issue_description>

<agent_instructions>Most relevant information is given in the first and last comments.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@GraemeWatt @20DM : thanks, that's great! I'll look into modifying the [`importer`](https://github.com/HEPData/hepdata/blob/main/hepdata/modules/records/importer/api.py) module soon.

I picked a random submission (ins2705058.tar.gz) and uploaded it to my Sandbox. A few (optional) comments for your consideration:

  1. You give http://rivet.hepforge.org/analyses#BESIII_2023_I2705058 as an additional resource. This is not strictly necessary (see submission docs) since the link will automatically be added after the record is finalised from the nightly harvesting of the analyses.json file. Moreover, the automatic link added will be http://rivet.hepforge.org/analyses/BESIII_2023_I2705058 with a / instead of a #. So if you want to keep the Rivet analysis in the submission.yaml file, better to use a link with a / instead of a #, or just remove it completely.
  2. The comment has a weird markup that is not rendered by HEPData. It looks like you are taking this from the journal abstract given by the INSPIRE record (JSON). The INSPIRE JSON also provides the arXiv abstract (second item of abstracts) that uses LaTeX markup and can be rendered by HEPData. HEPData uses the arXiv abstract from INSPIRE if possible (code). Since HEPData already stores the paper abstract (although it is only displayed if there is no comment), I don't think you need to duplicate it in the comment. So I would just use the additional information "NUMERICAL VALUES HAVE BEEN DIGITISED FROM THE PAPER." as the comment or omit the comment completely if there is nothing to add. (Another possibility is to use the Description from the Rivet .info file as the comment, but in this case it contains Beam energy must be specified as analysis option "ENERGY" when rivet-merging samples. which is not relevant to the HEPData record.)
  3. It looks like Tables 1 and 2 share a common independent variable axis, so it would make sense to combine them into one table with two dependent variables, then the "Custom Rivet identifier" would not need to be given since the YODA export would give the correct identifiers automatically. Of course, I realise that some compromises need to be made in the interest of automation, and so the best overall encoding for 780 submissions is going to be different than if each submission was prepared separately.</comment_new>
    <comment_new>@GraemeWatt
    Thanks for making the changes to the tarballs. I haven't started looking at this yet, since I didn't see that it was particularly urgent, but I'll try to look into it within the next couple of months.</comment_new>
    <comment_new>@GraemeWatt
    The links given in the [previous comment](https://github.com/HEPData/hepdata/...

Copilot AI changed the title from "[WIP] Extend importer module to allow bulk import from Rivet" to "Extend importer module to allow bulk import from Rivet" on Mar 17, 2026
Copilot AI requested a review from GraemeWatt March 17, 2026 17:17
Co-authored-by: GraemeWatt <11544204+GraemeWatt@users.noreply.github.com>
@GraemeWatt GraemeWatt force-pushed the copilot/extend-importer-module-bulk-import branch from 5a67acc to cf561e5 Compare May 10, 2026 18:18

codecov Bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.05%. Comparing base (ccda75a) to head (4ed6cdf).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| hepdata/modules/records/importer/api.py | 95.83% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #958      +/-   ##
==========================================
- Coverage   86.06%   86.05%   -0.01%     
==========================================
  Files          59       59              
  Lines        5755     5759       +4     
==========================================
+ Hits         4953     4956       +3     
- Misses        802      803       +1     

coveralls commented May 10, 2026

Coverage Report for CI Build 25670381315

Coverage decreased (-0.03%) to 86.039%

Details

  • Coverage decreased (-0.03%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 5 coverage regressions across 2 files.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

5 previously-covered lines in 2 files lost coverage.

| File | Lines Losing Coverage | Coverage |
|---|---|---|
| modules/records/importer/api.py | 3 | 97.97% |
| modules/theme/views.py | 2 | 85.37% |

Coverage Stats

Relevant Lines: 5759
Covered Lines: 4955
Line Coverage: 86.04%
Coverage Strength: 0.86 hits per line

* allow_old_schema should be False for new validated submissions.
* Change old_inspire_id in test from 944937 (valid) to 214970 (invalid).
@GraemeWatt GraemeWatt marked this pull request as ready for review May 11, 2026 12:22
@GraemeWatt GraemeWatt requested review from Copilot May 11, 2026 12:22

Copilot AI left a comment

Pull request overview

Extends the records importer module so bulk imports can pull INSPIRE ID lists and submission archives from alternate URLs (e.g. a Rivet mirror), and so imported records can be assigned to a configurable coordinator user instead of always user ID 1.

Changes:

  • Added ids_url support to get_inspire_ids() and files_url/coordinator_id/allow_old_schema plumbing through import_records() → _import_record() → _download_file().
  • Extended the CLI (hepdata importer import-records / bulk-import-records) with --ids-url, --files-url, --coordinator-id, and --allow-old-schema options.
  • Updated/extended importer tests for the new parameters and propagation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| hepdata/modules/records/importer/api.py | Adds alternate ID-list URL support and configurable coordinator + mirror download URL support during import. |
| hepdata/cli.py | Exposes new importer parameters via CLI flags for single/bulk import commands. |
| tests/importer_test.py | Adds test cases for ids_url, files_url error handling, and new call signatures/parameter propagation. |
| tests/conftest.py | Updates the _download_file test wrapper to accept/forward files_url. |
| tests/admin_index_test.py | Adds a fixed sleep before searching the OpenSearch index (likely to address refresh timing). |
| hepdata/version.py | Version bump. |

Comment thread tests/admin_index_test.py Outdated
Comment thread hepdata/modules/records/importer/api.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@GraemeWatt GraemeWatt merged commit 22e48d4 into main May 11, 2026
9 of 12 checks passed
@GraemeWatt GraemeWatt deleted the copilot/extend-importer-module-bulk-import branch May 11, 2026 14:06

Development

Successfully merging this pull request may close these issues.

records: extend importer module to allow bulk import from Rivet

4 participants