cshan-github/TCM_NER_datasets

TCMNER_datasets

The TCM NER dataset used in the paper "MC-TCMNER: A Multi-Modal Fusion Model Combining Contrast Learning Method for Traditional Chinese Medicine NER".

Prescript

Because no large-scale TCMNER dataset is currently publicly available, we collected and annotated one ourselves and used it to evaluate the model's performance. We first used web scraping to extract disease-related information from the website, including brief descriptions of diseases, symptoms, causes, treatment methods, and other details. We then cleaned the extracted text by removing duplicate sentences and non-Chinese characters. Four clinical experts from China (chief physicians) used the YEDDA annotation tool to annotate the text individually for four types of information using BIO tagging. The dataset contains 387,465 Chinese characters in total and covers five entity types: Symptoms, Causes, Herbs, Preparations (already prepared medicines), and Effects.
