The goal of this GROBID module is to recognize any software mentions in scholar textual documents, publisher XML and PDF. It uses as training data the Softcite Dataset developed by James Howison Lab at the University of Texas at Austin. This annotated corpus and the present software text mining component have been developed supported by a grant from the Alfred P. Sloan foundation to improve credit for research software.
Code with paper: the following article is available in CC-BY:
Patrice Lopez, Caifan Du, Johanna Cohoon, Karthik Ram, and James Howison. 2021.
Mining Software Entities in Scientific Literature: Document-level NER for an Extremely Imbalance and Large-scale Task.
In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM ’21),
November 1–5, 2021, QLD, Australia. https://doi.org/10.1145/3459637.3481936
[Best Applied Research Paper Award runner-up]
For more recent evaluations and a description of a use case in production to monitor Open Science in France, see:
Aricia Bassinet, Laetitia Bracco, Anne L'Hôte, Eric Jeangirard, Patrice Lopez, et al. 2023.
Large-scale Machine-Learning analysis of scientific PDF for monitoring the production and the openness of research data and software in France. ⟨hal-04121339v3⟩
https://hal.science/hal-04121339v3
As the other GROBID models, the module relies only on state-of-the-art machine learning. The tool can use linear CRF (via Wapiti JNI integration) or Deep Learning model such as BiLSTM-CRF, ELMo or fine-tuned transformers BERT, e.g. SciBERT and LinkBERT (via DeLFT JNI integration) and any combination of them.
Thanks to its integration in the GROBID framework, the software mention extraction on scholar PDF is:
-
structure-aware: the extraction is realized on the relevant textual zones, skipping for instance figure content, headnotes, formulas, bibliographical reference section, etc. and exploiting the knowledge of inline reference markers, section and paragraph boundaries, etc. for the textual zones
-
robust: text stream in PDF is recovered and cleaned with various additional process going beyond traditional pdf-to-text low level PDF extraction tool, for instance line number removal, de-hyphenation, unicode character combination, multi-column support, handling of page/figure breaks, unicode normalization, removal of manuscript line numbers, etc.
-
combined with bibliographical reference recognition: the bibliographical reference markers possibly used in combination with the software mention are recognized, attached to the software mention when possible, matched with the full bibliographical reference in the bibliographical section of the article and disambiguated against CrossRef and Unpaywall
-
combined with PDF coordinates: in case the input is a PDF, the bounding boxes of the extracted software mentions, software attributes and attached bibliographical references in the original PDF are provided, in order to create augmented interative PDF to visualize and interact directly with software mentions on the PDF, see the console demo
-
combined with entity disambiguation: extracted software names are disambiguated in context against software entities in Wikidata via entity-fishing
-
combined with software usage, creation and sharing characterizations: based on the different mentions of a software in a document, each mention is classified to estimate if the software is likely used in the research work, a creation part of the research work and if it is publicly shared. A consolidation step then provides at document-level these characterization for each mentioned software.
-
scaling: as we want to scale to the complete scientific corpus, the process is optimized in runtime and memory usage. We are able to process entirely around 2 PDF per second with the CRF model (including PDF processing and structuring, extractions, bibliographical reference disambiguation against crossref and entity disambiguation against WikiData) on one low/medium cost Ubuntu server, Intel i7-4790 (4 CPU), 4.00 GHz with 16 GB memory. Around 0.5 PDF per second is processed when using the fine-tuned SciBERT model, the best performing model - an additional GPU is however necessary when using Deep Learning models and runtime, depending on the DL architecture of choice.
Latest performance (accuracy and runtime) can be found in the most recent cited publication above, and more model comparisons below.
A public demo of the service is available at the following address: https://cloud.science-miner.com/software/
The web console allows you to test the processing of text or of a full scholar PDF. The component is developed targeting complete PDF, so the output of a PDF processing will be richer (attachment, parsing and DOI-matching of the bibliographical references appearing with a software mention, coordinates in the PDF of the mentions, document level propagation of mentions). The console displays extracted mentions directly on the PDF pages (via PDF.js), with infobox describing when possible Wikidata entity linking and full reference metadata (with Open Access links when found via Unpaywall).
This demo is only provided for test, without any guaranties regarding the service quality and availability. If you plan to use this component at scale or in production, you need to install it locally (see how to deploy a docker image).
Note: The demo run with the CRF model to reduce the computational load, as the server is used for other demos and has no GPU (for cost reasons). For significantly more accurate results (see the benchmarking), sciBERT/LinkBERT models are required, the Docker image being the easiest way to achieve this (fine-tuned transformer models are included and used by default in the image).
For sampling, training and evaluation of the sequence labeling model and additional attribute attachment mechanisms, we use the Softcite dataset, a gold standard manually annotated corpus of 4,971 scholar articles, available on Zenodo (version 2.0):
More details on the Softcite dataset can be found in the following publication:
Du, C, Cohoon, J, Lopez, P, Howison, J. Softcite dataset: A dataset of software mentions
in biomedical and economic research publications. J Assoc Inf Sci Technol. (JASIST) 2021; 1–15.
https://doi.org/10.1002/asi.24454
The latest version of the dataset is maintained on the following GitHub repository: https://github.com/softcite/softcite_dataset_v2
Original development was carried out at https://github.com/howisonlab/softcite-dataset
It is recommended to use the Docker image for running the service. The best Deep Learning models are included and are used by default by this image. To use a Docker image via docker HUB, pull the image (around 11GB) as follow:
docker pull grobid/software-mentions:0.8.1After pulling or building the Docker image, you can now run the software-mentions service as a container:
> docker run --rm --gpus all -it --ulimit core=0 -p 8060:8060 grobid/software-mentions:0.8.1The build image includes the automatic support of GPU when available on the host machine via the parameter --gpus all (with automatic recognition of the CUDA version). The support of GPU is only available on Linux host machine. If no GPU are available on your host machine, just remove the --gpus all parameter, but usage of GPU is recommended for best runtime:
> docker run --rm -it --ulimit core=0 -p 8060:8060 grobid/software-mentions:0.8.1To specify to use only certain GPUs (see the nvidia container toolkit user guide for more details):
> docker run --rm --gpus '"device=1,2"' -it --init --ulimit core=0 -p 8060:8060 grobid/software-mentions:0.8.1Note that starting for convenience the container with option --ulimit core=0 avoids having possible core dumped inside the container, which can happen overwise due to the very rare crash of the PDF parsing C++ component. Starting the container with parameter -it allows to interact with the docker process, which is of limited use here, except conveniently stopping the docker container with control-c.
The software-mentions service is available at the default host/port localhost:8060, but it is possible to map the port at launch time of the container as follow:
> docker run --rm --gpus all -it --init --ulimit core=0 -p 8080:8060 grobid/software-mentions:0.8.1In this image, the best deep learning models are used by default. The selection of models can be modified, for example to use faster models or requiring less GPU memory. To modify the configuration without rebuilding the image - for instance rather use the CRF model, it is possible to mount a modified config file at launch as follow:
> docker run --rm --gpus all -it --init --ulimit core=0 -p 8060:8060 -v /home/lopez/grobid/software-mentions/resources/config/config.yml:/opt/grobid/software-mentions/resources/config/config.yml:ro grobid/software-mentions:0.8.1As an alternative, a docker image for the software-mentions service can be built with the project Dockerfile to match the current master version. The complete process is as follow:
- copy the
Dockerfile.softwareat the root of the GROBID installation:
~/grobid/software-mentions$ cp ./Dockerfile.software ..- from the GROBID root installation (
grobid/), launch the docker build:
> docker build -t grobid/software-mentions:0.8.2-SNAPSHOT --build-arg GROBID_VERSION=0.8.2-SNAPSHOT --file Dockerfile.software .Building the Docker image takes several minutes: installing GROBID, software-mentions, a complete Python Deep Learning environment based on DeLFT and deep learning models downloaded from the internet (one fine-tuned model with a BERT layer has a size of around 400 MB). The resulting image is thus very large, around 8GB, due to the deep learning resources and models.
The easiest way to deploy and run the service is to use the Docker image, see previous section. If you're courageous or would like to contribute to the development, this section presents the install and build process.
Building the module requires JDK 1.8 or higher (tested up to Java 15). First install and build the latest development version of GROBID as explained by the documentation, together with DeLFT for Deep Learning model support. An installation of Pub2TEI is also necessary to process a variety of publisher XML formats (including for example JATS).
Under the installed and built grobid/ directory, clone the present module software-mentions (it will appear as sibling sub-project to grobid-core, grobid-trainer, etc.):
cd grobid/
git clone https://github.com/softcite/software-mentions
cd software-mentionsCopy the provided pre-trained models into the standard grobid-home path:
./gradlew copyModels Install larger models (fine-tuned transformers, currently the best performing one, total 1.5 GB size and too large to be stored in the GitHub repo), they need to be downloaded and installed with the following command:
./gradlew installModelsTry compiling everything with:
./gradlew clean install Run some test:
./gradlew testTo start the service:
./gradlew runJavascript demo/console web app is then accessible at http://localhost:8060. From the console and the RESTfull services tab, you can process chunk of text (select ProcessText) or process a complete PDF document (select Annotate PDF document).
When processing text, it is possible to examine the JSON output of the service with the Response tab:
When processing the PDF of a scientific article, the tool will also identify bibliographical reference markers and, when possible, attach the full parsed bibliographical reference to the identified software entity. In addition, bibliographical references can be resolved via biblio-glutton, providing a unique DOI, and optionally additional identifiers like PubMed ID, PMC ID, etc. and a link to the Open Access full text of the reference work when available (via Unpaywall).
Software entity linking against Wikidata is realized by entity-fishing and provides when possible Wikidata ID and English Wikipedia page ID. The web console allows to interact and view the entity information in the infobox:
To exploit the Softcite software mention recognition service efficiently (concurrent calls) and robustly, a Python client is available here.
If you want to process a directory of PDF and/or XML documents, this is the best and simplest solution: deploy a Dokcer image of the server and use this client.
A tutorial is available at https://github.com/softcite/tutorials/blob/master/process_all_of_plos.md describing how to process the "All of PLOS" collection, step by step. You can apply the same approach for any collection of XML or PDF scientific articles.
The resulting software mention extractions include many attributes and information. These extractions follow the JSON format documented on this page.
This dataset is the result of the extraction of software mentions from the set of publications of the CORD-19 corpus (https://allenai.org/data/cord-19) by the Softcite software recognizer using SciBERT fine-tuned model: https://zenodo.org/record/5235661
Identify the software mentions in text and optionally disambiguate the extracted software mentions against Wikidata.
| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| GET, POST | multipart/form-data |
application/json |
text |
required | the text to be processed |
disambiguate |
optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |
Response status codes:
| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicate an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |
A 503 error normally means that all the threads available to Softcite service are currently used for processing concurrent requests. The client need to re-send the query after a wait time that will allow the server to free some threads. The wait time depends on the service and the capacities of the server, we suggest 1 seconds for the processSoftwareText service.
Using curl POST/GET requests with some text:
curl -X POST -d "text=We test GROBID (version 0.7.1)." localhost:8060/service/processSoftwareTextcurl -GET --data-urlencode "text=We test GROBID (version 0.7.1)." localhost:8060/service/processSoftwareTextwhich should return this:
{
"application": "software-mentions",
"version": "0.7.1",
"date": "2022-09-10T07:02+0000",
"mentions": [{
"software-name": {
"rawForm": "GROBID",
"normalizedForm": "GROBID",
"offsetStart": 8,
"offsetEnd": 14
},
"type": "software",
"version": {
"rawForm": "0.7.1",
"normalizedForm": "0.7.1",
"offsetStart": 24,
"offsetEnd": 29
},
"context": "We test GROBID (version 0.7.1).",
"mentionContextAttributes": {
"used": {
"value": true,
"score": 0.9999960660934448
},
"created": {
"value": false,
"score": 2.384185791015625E-7
},
"shared": {
"value": false,
"score": 1.1920928955078125E-7
}
},
"documentContextAttributes": {
"used": {
"value": true,
"score": 0.9999960660934448
},
"created": {
"value": false,
"score": 2.384185791015625E-7
},
"shared": {
"value": false,
"score": 1.1920928955078125E-7
}
}
}],
"runtime": 242
}Runtimes are expressed in milliseconds.
| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| POST | multipart/form-data |
application/json |
input |
required | PDF file to be processed |
disambiguate |
optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |
Response status codes:
| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicate an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |
A 503 error normally means that all the threads available to Softcite service are currently used for processing concurrent requests. The client need to re-send the query after a wait time that will allow the server to free some threads. The wait time depends on the service and the capacities of the server, we suggest 2 seconds for the annotateSoftwarePDF service or 3 seconds when disambiguation is also requested.
Using curl POST request with a PDF file:
curl --form input=@./src/test/resources/PMC1636350.pdf --form disambiguate=1 localhost:8060/service/annotateSoftwarePDFFor PDF, each entity will be associated with a list of bounding box coordinates relative to the PDF, see here for more explanation about the coordinate system.
In addition, the response will contain the bibliographical reference information associated to a software mention when found. The bibliographical information are provided in XML TEI (similar format as GROBID).
The softcite software mention service can extract software mentions with sentence context information from a variety of publisher XML formats, including not only JATS, but also a dozen of mainstream publisher native XML (Elsevier, Nature, ScholarOne, Wiley, etc.). See Pub2TEI for the list of supported formats. Each call with an XML file (non TEI XML) will involve a transformation of the XML file into a TEI XML file, which will slow down the overall process. This additional time (a few seconds) is due to the loading and compilation of the style sheets that need to be performed for every calls.
| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| POST | multipart/form-data |
application/json |
input |
required | XML file to be processed |
disambiguate |
optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |
Response status codes:
| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicate an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |
A 503 error normally means that all the threads available to Softcite service are currently used for processing concurrent requests. The client need to re-send the query after a wait time that will allow the server to free some threads. The wait time depends on the service and the capacities of the server, we suggest 2 seconds for the extractSoftwareXML service or 3 seconds when disambiguation is also requested.
Using curl POST request with a XML file:
curl --form input=@./src/test/resources/PMC3130168.xml --form disambiguate=1 localhost:8060/service/annotateSoftwareXMLThe softcite software mention service will extracts software mentions with sentence context information from TEI XML files directly, without then the need of further transformation as for the other publisher XML formats (see above). The process will thus be much faster and should preferably used if possible.
| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| POST | multipart/form-data |
application/json |
input |
required | TEI XML file to be processed |
disambiguate |
optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |
Response status codes:
| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicate an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |
A 503 error normally means that all the threads available to Softcite service are currently used for processing concurrent requests. The client need to re-send the query after a wait time that will allow the server to free some threads. The wait time depends on the service and the capacities of the server, we suggest 2 seconds for the extractSoftwareXML service or 3 seconds when disambiguation is also requested.
Using curl POST request with a XML file:
curl --form input=@./src/test/resources/PMC3130168.tei.xml --form disambiguate=1 localhost:8060/service/annotateSoftwareTEIThe service check /service/isalive will return true/false whether the service is up and running.
Return the concept service base URL derived from the configured entity-fishing host/port (or a public default if not configured). This helps the frontend decide whether to call the external service directly or use the backend proxy when CORS blocks direct calls.
| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| GET | - | application/json | - | - | Returns { "conceptBaseUrl": "..." } |
Response status codes:
| HTTP Status code | reason |
|---|---|
| 200 | Successful operation |
| 500 | Internal service error |
Proxy a concept lookup to the URL derived from entityFishingHost/Port. Use this from the browser to avoid CORS issues. If host/port are not set, it falls back to https://cloud.science-miner.com/nerd/service/kb/concept.
| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| GET | - | application/json | identifier (path) |
required | Concept identifier (Wikipedia page id) |
lang (query) |
optional | Language code (e.g., en) |
Response status codes:
| HTTP Status code | reason |
|---|---|
| 200 | Successful operation |
| 400 | Missing/invalid parameters |
| 502 | Upstream concept service error |
The software-mention module inherits the configuration of GROBID.
The configuration parameters specific to the software-mention module can be modified in the file resources/config/config.yml:
- entityFishingHost / entityFishingPort: Host and port for the entity-fishing service used for concept data. The concept endpoint is derived as:
- If port is 443 (or 8443) or the host already includes an https scheme, the base is https; otherwise http.
- Final concept base URL used by the UI/proxy:
<scheme>://<host>[:port-if-in-host]/nerd/service/kb/concept - If these are not set, the system falls back to the public endpoint
https://cloud.science-miner.com/nerd/service/kb/concept.



