This repo contains the submission of the group Gut Instincts in the GutBrainIE CLEF 2025 challenge, part of the BioASQ CLEF Lab 2025. The challenge focuses on extracting structured information from biomedical abstracts related to the gut microbiota and its connections with Parkinson's disease and mental health. The goal is to develop Information Extraction (IE) systems to support experts in understanding the gut-brain interplay.
The challenge is divided into two main subtasks:
- Named Entity Recognition (NER): Identifying and classifying specific text spans into predefined categories.
- Relation Extraction (RE): Determining whether a specific relation holds between two identified entities.
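To make the two subtasks concrete, the sketch below shows what a NER span and an RE triple could look like for one sentence. The field and label names are illustrative assumptions, not the official annotation schema of the challenge.

```python
abstract = "Gut microbiota alterations have been reported in Parkinson's disease."

# NER: labelled character spans (start inclusive, end exclusive).
# Label names are placeholders, not the official tag set.
entities = [
    {"start": 0, "end": 14, "label": "MICROBIOME"},  # "Gut microbiota"
    {"start": 49, "end": 68, "label": "DISEASE"},    # "Parkinson's disease"
]

# RE: a typed link between two recognised entities (indices into `entities`).
relations = [{"head": 0, "tail": 1, "label": "ASSOCIATED_WITH"}]

for ent in entities:
    print(abstract[ent["start"]:ent["end"]], "->", ent["label"])
```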
To reproduce our results, follow these steps:
- Download the Data:
The official challenge data is not included in this repository. Download it and place it in the data/ directory, preserving the original folder structure.
- Prepare the Environment
Follow the guide in Setup to create an environment, activate the environment, and install all dependencies.
- Preprocess and Prepare the Training Data
Choose an existing training configuration from the training_configs/ directory, or create a new one based on the template located at training_configs/_template.yaml. Then run the following command, replacing PATH_TO_TRAINING_CONFIG with the path to the chosen configuration file:
```
python src/preprocessing/create_datasets.py --config PATH_TO_TRAINING_CONFIG
```

Based on the settings in the training configuration, the preprocessing script will load the specified training datasets, apply corrections and cleaning steps, optionally remove HTML content, tokenize the data using the appropriate tokenizer for the specified model, and save the processed data in the appropriate location for subsequent training.
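For orientation, a training configuration might resemble the sketch below. Every field name here is an illustrative assumption; the authoritative template is `training_configs/_template.yaml`.

```yaml
# Illustrative sketch only -- field names are assumptions, not the repo's schema.
model_name: microsoft/deberta-v3-base   # tokenizer is derived from this
datasets:
  - data/Articles/json_format/train
remove_html: true
epochs: 10
batch_size: 16
learning_rate: 2.0e-5
```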
- Train the Models
Once the datasets have been prepared, start the training process using the same training configuration. Run the following command, replacing PATH_TO_TRAINING_CONFIG with the path to the chosen configuration file:
```
python src/training/run_training.py --config PATH_TO_TRAINING_CONFIG
```

This script will load the preprocessed data, initialize the specified model architecture, and begin training according to the parameters defined in the training configuration file (such as the number of epochs, batch size, and learning rate schedule). Progress, metrics, and the best-performing model (based on its F1_micro score) will be saved to the models/ directory.
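For reference, the micro-averaged F1 used for model selection pools true-positive, false-positive, and false-negative counts across all classes before computing precision and recall (unlike macro-F1, which averages per-class scores). A minimal sketch with made-up counts:

```python
def micro_f1(counts):
    """counts: {class_label: (tp, fp, fn)} -- pooled before computing P/R."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy per-class counts (illustrative numbers only)
score = micro_f1({"DISEASE": (8, 2, 1), "MICROBIOME": (5, 1, 3)})
print(round(score, 3))  # -> 0.788
```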
- Create Predictions
Once the models have been trained, the inference process can be started to generate predictions.
NER inference: To run inference with the NER models, execute the following command, replacing PATH_TO_TRAINING_CONFIG with the path to the chosen configuration file:
```
python src/inference/ner_inference.py --config PATH_TO_TRAINING_CONFIG
```

The results will be saved to `data_inference_results`.
NER ensemble inference: To perform ensemble inference with NER models, run the command below, replacing PATH_TO_INFERENCE_CONFIG with the path to the NER inference configuration file. A template configuration can be found at inference_configs/_template_ner_ensemble_inference.yaml.
```
python src/inference/ner_ensemble_inference.py --config PATH_TO_INFERENCE_CONFIG
```

The results will be saved to `data_inference_results`.
RE inference: To perform inference with an RE model, use the script located at src/inference/re_inference.py.
For pipeline-based inference, where an RE model is applied to predictions generated by a NER model, use src/inference/pipeline.py. In this case, the following needs to be specified:
- the path to the NER predictions,
- the path to the data (from the `data/Articles/json_format` folder) used to create the NER predictions,
- the path to the folder containing the training configurations for the RE models to be used in the pipeline.
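As a sketch, a pipeline invocation could look like the following. The flag names are assumptions made for illustration; consult `src/inference/pipeline.py` for the script's actual command-line interface.

```shell
# Flag names below are illustrative assumptions, not the script's verified interface.
python src/inference/pipeline.py \
    --ner_predictions PATH_TO_NER_PREDICTIONS \
    --articles data/Articles/json_format \
    --re_config_dir PATH_TO_RE_TRAINING_CONFIGS
```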
RE ensemble inference: To produce ensemble inference with RE models, run the command below, replacing PATH_TO_INFERENCE_CONFIG with the path to the RE inference configuration file. A template configuration can be found at inference_configs/_template_re_relation_ensemble_inference.yaml.
```
python src/inference/re_ensemble_inference.py --config PATH_TO_INFERENCE_CONFIG
```

- All training was conducted on a computational cluster with GPU resources. Training on local machines may take significantly longer or may not be feasible, depending on hardware.
- If you encounter issues with missing packages, ensure your environment matches the versions specified in `pyproject.toml`.
It is recommended to use a virtual environment to avoid dependency conflicts.
Windows:

```
python -m venv env
env\Scripts\activate
```

Linux/MacOS:

```
python3 -m venv env
source env/bin/activate
```

To deactivate the environment:

```
deactivate
```

Install the necessary dependencies as specified in `pyproject.toml`:

```
pip install -e .
```

This project is licensed under the MIT License. See the LICENSE file for details.