GEBERT: Graph-Enriched Biomedical Entity Representation Transformer

This repository contains the source code for pretraining BERT-based biomedical entity representation models on UMLS synonyms and concept graphs. The model was published at the CLEF 2023 conference; for pre-training details, please see our paper.

Pre-trained models

We release two GEBERT versions that use GraphSAGE and GAT graph encoders, respectively. The checkpoints can be accessed via HuggingFace:

GAT-GEBERT:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("andorei/gebert_eng_gat")
model = AutoModel.from_pretrained("andorei/gebert_eng_gat")

GraphSAGE-GEBERT:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("andorei/gebert_eng_graphsage")
model = AutoModel.from_pretrained("andorei/gebert_eng_graphsage")
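Once loaded, entity names can be embedded with the standard transformers forward pass. The sketch below is a minimal example, assuming [CLS]-token pooling; check the paper for the exact pooling used in evaluation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed_entities(names, model_name="andorei/gebert_eng_gat", batch_size=64):
    """Encode a list of entity names into dense vectors with a GEBERT checkpoint.

    Pooling via the [CLS] token is an assumption; the paper's evaluation
    protocol may differ.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    chunks = []
    with torch.no_grad():
        for i in range(0, len(names), batch_size):
            batch = tokenizer(names[i:i + batch_size], padding=True,
                              truncation=True, max_length=32,
                              return_tensors="pt")
            # Take the hidden state of the first ([CLS]) token as the embedding
            chunks.append(model(**batch).last_hidden_state[:, 0])
    return torch.cat(chunks)
```

The resulting vectors can then be compared with cosine similarity to rank candidate concept names for a mention.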

Dependencies

We used Python 3.10 to train GEBERT. The required packages are listed in the requirements.txt file. PyTorch Geometric depends on torch-cluster, torch-scatter, and torch-sparse, so we recommend installing them before installing torch-geometric.

Data

To train a model, you need to download a UMLS release. In the original GEBERT paper, we used the 2020AB version.

To train a GEBERT model, two data components are required:

  • A list of synonymous concept name pairs;
  • A UMLS graph description.

To obtain both the synonym pairs and the graph description, run the create_positive_triplets_dataset.py script with the appropriate paths and arguments:

python gebert/data/create_positive_triplets_dataset.py \
--mrconso "${UMLS_DIR}/MRCONSO.RRF" \
--mrrel "${UMLS_DIR}/MRREL.RRF" \
--langs "ENG" \
--output_dir $GRAPH_DATA_DIR 
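For reference, MRCONSO.RRF is pipe-delimited, with the concept identifier (CUI) in column 0, the language (LAT) in column 1, and the concept name string (STR) in column 14. The snippet below is a hypothetical illustration of how synonym pairs can be extracted from it, not the repository script:

```python
from collections import defaultdict
from itertools import combinations

def synonym_pairs(mrconso_lines, lang="ENG"):
    """Group MRCONSO.RRF rows by CUI and yield synonymous name pairs.

    MRCONSO is pipe-delimited: CUI is column 0, LAT column 1, STR column 14.
    This is an illustrative sketch, not the actual GEBERT preprocessing code.
    """
    names = defaultdict(set)
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        if fields[1] == lang:
            names[fields[0]].add(fields[14])
    for cui, strs in names.items():
        # Every pair of distinct names of the same concept is a positive pair
        for a, b in combinations(sorted(strs), 2):
            yield cui, a, b
```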

GEBERT pre-training

For example training scripts, see graphsage_gebert_train_example.sh and gat_gebert_train_example.sh. To enable or disable multi-GPU training, add or remove the "--parallel" flag.

Evaluation

For evaluation, we adopted the evaluation code and data from https://github.com/insilicomedicine/Fair-Evaluation-BERT.
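That protocol ranks candidate concept names by embedding similarity to each mention. A minimal 1-nearest-neighbour linking sketch (function and variable names are hypothetical, not part of the evaluation code):

```python
import torch
import torch.nn.functional as F

def link_mentions(mention_emb, concept_emb, concept_cuis, k=1):
    """Rank concepts for each mention by cosine similarity (k-NN linking)."""
    m = F.normalize(mention_emb, dim=-1)
    c = F.normalize(concept_emb, dim=-1)
    scores = m @ c.T                       # (num_mentions, num_concepts)
    top = scores.topk(k, dim=-1).indices   # indices of the k best concepts
    return [[concept_cuis[j] for j in row] for row in top.tolist()]
```

A prediction counts as correct when the gold CUI appears among the top-k returned concepts (accuracy@k).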

Citation

@inproceedings{sakhovskiy2023gebert,
author="Sakhovskiy, Andrey
and Semenova, Natalia
and Kadurin, Artur
and Tutubalina, Elena",
title="Graph-Enriched Biomedical Entity Representation Transformer",
booktitle="Experimental IR Meets Multilinguality, Multimodality, and Interaction",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="109--120",
isbn="978-3-031-42448-9"
}
