This repository contains the source code for pretraining BERT-based biomedical entity representation models on UMLS synonyms and concept graphs. The model was published at the CLEF 2023 conference; for pretraining details, please see our paper.
We release two GEBERT versions that use GraphSAGE and GAT graph encoders, respectively. The checkpoints can be accessed via HuggingFace:
GEBERT with a GAT graph encoder:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("andorei/gebert_eng_gat")
model = AutoModel.from_pretrained("andorei/gebert_eng_gat")
```
GEBERT with a GraphSAGE graph encoder:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("andorei/gebert_eng_graphsage")
model = AutoModel.from_pretrained("andorei/gebert_eng_graphsage")
```
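Once a checkpoint is loaded, entity names can be encoded into fixed-size vectors. The sketch below uses mean pooling over token embeddings as one common choice for bi-encoder entity models (CLS pooling is an alternative; check the paper for the pooling used in evaluation). The helper names are ours, not part of the repository:

```python
import torch


def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def embed_names(names, model_name="andorei/gebert_eng_gat"):
    """Encode a batch of entity names with a GEBERT checkpoint (sketch)."""
    from transformers import AutoTokenizer, AutoModel  # deferred import

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    batch = tokenizer(names, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])
```

The resulting vectors can be compared with cosine similarity to link entity mentions to UMLS concept names.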
To train GEBERT, we used Python 3.10. Required packages are listed in the requirements.txt file. PyTorch Geometric depends on torch-cluster, torch-scatter, and torch-sparse, so we recommend installing them before torch-geometric.
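The recommended install order could look like the following (a sketch: the companion packages ship prebuilt wheels matched to specific PyTorch/CUDA versions, so consult the PyTorch Geometric installation docs for the exact commands for your setup):

```shell
# Install PyTorch first, then the compiled companion packages,
# then torch-geometric itself, then the remaining requirements.
pip install torch
pip install torch-scatter torch-sparse torch-cluster
pip install torch-geometric
pip install -r requirements.txt
```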
To train a model, you first need to download a UMLS release. The original GEBERT paper used the 2020AB version.
To train a GEBERT model, two data components are required:
- A list of synonymous concept name pairs;
- A description of the UMLS concept graph.
To obtain both the synonyms and the graph description, run the create_positive_triplets_dataset.py script, pointing it at your UMLS files:
```shell
python gebert/data/create_positive_triplets_dataset.py \
    --mrconso "${UMLS_DIR}/MRCONSO.RRF" \
    --mrrel "${UMLS_DIR}/MRREL.RRF" \
    --langs "ENG" \
    --output_dir "${GRAPH_DATA_DIR}"
```

For example training scripts, see graphsage_gebert_train_example.sh and gat_gebert_train_example.sh. To enable or disable multi-GPU training, add or remove the "--parallel" flag.
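For illustration, the synonym-pair side of this preprocessing amounts to grouping MRCONSO strings by concept identifier (CUI). The sketch below follows the pipe-delimited MRCONSO.RRF layout (CUI at field 0, language code at field 1, concept string at field 14); the helper function is hypothetical and not the repository's implementation:

```python
from collections import defaultdict
from itertools import combinations


def synonym_pairs(mrconso_lines, langs=("ENG",)):
    """Group MRCONSO concept strings by CUI and emit synonymous name pairs."""
    names = defaultdict(set)
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        cui, lat, name = fields[0], fields[1], fields[14]
        if lat in langs and name:
            names[cui].add(name)
    pairs = []
    for cui in sorted(names):
        # Every pair of distinct names under one CUI is a positive example.
        for a, b in combinations(sorted(names[cui]), 2):
            pairs.append((cui, a, b))
    return pairs
```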
For evaluation, we adopted the evaluation code and data from https://github.com/insilicomedicine/Fair-Evaluation-BERT.
```bibtex
@inproceedings{sakhovskiy2023gebert,
  author    = "Sakhovskiy, Andrey
               and Semenova, Natalia
               and Kadurin, Artur
               and Tutubalina, Elena",
  title     = "Graph-Enriched Biomedical Entity Representation Transformer",
  booktitle = "Experimental IR Meets Multilinguality, Multimodality, and Interaction",
  year      = "2023",
  publisher = "Springer Nature Switzerland",
  address   = "Cham",
  pages     = "109--120",
  isbn      = "978-3-031-42448-9"
}
```