am202/deepdeid-github

UF HOBI NLP GROUP

NLP package for de-identification of EHR notes

implementation details:

  • linux env
  • python 2.7
  • Tensorflow
  • BiLSTM-CRF architecture similar to that of Lample et al.
  • supports discrete features represented as embeddings
  • supports fine-tuning from a pre-trained model

what the code does:

  • performs an NER task to detect PHIs in EHR notes
  • replaces the detected PHIs with de-identified information
  • fine-tunes the base model for different institutions
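As an illustration of the replacement step, the minimal sketch below substitutes detected PHI spans with bracketed type placeholders in the MIMIC-III style ([**PHI_TYPE**]) used by the released versions. The function name replace_phi, the (start, end, phi_type) span layout, the keep_types parameter, and the NAME and DATE labels are assumptions made for illustration; they are not the package's actual API.

```python
def replace_phi(text, spans, keep_types=frozenset()):
    """Replace PHI character spans with [**PHI_TYPE**] placeholders.

    text       -- the note text
    spans      -- iterable of (start, end, phi_type), assumed non-overlapping
    keep_types -- hypothetical set of PHI types to leave untouched
    """
    pieces = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        # Copy the text between the previous span and this one unchanged.
        pieces.append(text[cursor:start])
        if phi_type in keep_types:
            pieces.append(text[start:end])
        else:
            pieces.append("[**{}**]".format(phi_type))
        cursor = end
    # Copy whatever follows the last span.
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "John Smith visited on 01/02/2019."
detected = [(0, 10, "NAME"), (22, 32, "DATE")]
print(replace_phi(note, detected))
# prints: [**NAME**] visited on [**DATE**].
```

Passing keep_types={"DATE"} in this sketch would leave the date in place while still replacing the name.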

How to use

  • run a task using experiment.sh in the shell directory as an example
  • the output is brat formatted, so the results can be visualized using the brat tool (https://brat.nlplab.org/)
  • the application is packaged as a docker image (please see README-docker.md for usage)
  • please build the docker image according to the current development version
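For readers unfamiliar with the brat format, the sketch below shows how text-bound annotations are written in brat standoff (.ann) files: one line per entity, with an ID, the type and character offsets, and the covered text, separated by tabs. The function name to_brat_ann and the span layout are hypothetical and do not come from this package; spans that cross line breaks need extra handling in brat and are not covered here.

```python
def to_brat_ann(text, spans):
    """Format (start, end, phi_type) spans as brat text-bound annotation lines."""
    lines = []
    for index, (start, end, phi_type) in enumerate(sorted(spans), 1):
        # brat layout: T<id> TAB <type> <start> <end> TAB <covered text>
        lines.append(
            "T{}\t{} {} {}\t{}".format(index, phi_type, start, end, text[start:end])
        )
    return "\n".join(lines)


note = "John Smith visited on 01/02/2019."
print(to_brat_ann(note, [(0, 10, "NAME"), (22, 32, "DATE")]))
```

Saving the note as a .txt file and the returned string as a .ann file with the same base name lets brat display the two together.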

Production Model Directory

Only one model (for UF IDR) is shipped in the docker image to reduce the image size.

  • model_1: rnner_kb model
  • model_2: rnner model
  • model_3: rnner_attn model
  • model_4: rnner_kb_attn model

requirements

  • disk space of at least 3 times the size of the input data
  • tensorflow==1.7.0
  • spacy==2.0.16
  • ftfy==4.4.3
  • spacy language model: en_core_web_md

install python packages using

    pip install -r requirements.txt

and install the spacy language model using

    python -m spacy download en_core_web_md

Development Version:

  • v2.3.0

Current progress:

  • evaluation (0.0.5)
  • batch evaluation (0.1.0) (released)
  • use discrete knowledge features as embeddings
  • PHIs are replaced with MIMIC-III style: [**PHI_TYPE**] (0.1.1) (released)
  • pure python implementation done: app.py is the entry point for running the whole application; the *.sh scripts now only serve as the app entry point for docker (1.1.1)
  • all string operations use unicode (1.2.1)
  • global logging setup (1.2.2)
  • use a global config to control file locations, logging, and other shared parameters (1.2.3)
  • rule-based matching function for web URLs and emails (new rules can be integrated in rule_based_phi_detection.py) (1.2.4)
  • code efficiency improvement: tokenization, normalization, and BIO-feature generation are merged into one in-memory pass during preprocessing, without intermediate IO (1.3.4)
  • skip empty files and exclude those from the results (1.3.5)
  • predict_batch saves one IO operation; add multi-processing to bio_output (1.4.6)
  • support for new models: rnner, rnner_kb and rnner_kb_conv (1.5.6)
  • update production model: use models newly trained on 500 UF notes; use MODEL_FLAG to control which model (rnner vs. rnner_kb) to use (1.5.7)
  • add new rnner packages: rnner_attn and rnner_kb_attn, both of which implement the self-attention mechanism (1.5.8)
  • add an API that takes a configuration of the PHI types to keep during PHI replacement, so that users can control which PHI types they do or do not want to remove (1.6.8); also add a usage readme explaining the parameters that users can tune
  • improve robustness by processing the work iteratively in chunks (1.7.8)
  • release rnner_attention model instead of rnner model (v1.7.8)
  • release new model trained with location_other, other, location pobox tags; tested training preprocessing without problem (v1.8.0)
  • v1.8.1 fix encoding
  • v1.8.2 fix rule
  • v1.8.3 re-train the model with new annotation filter (only leave contact_web not trained)
  • v2.0.0 app changed to parallel mode with one process for preprocessing and two processes for de-identification
  • v2.0.3 fix several bugs
  • v2.0.4 resume from where the previous run left off (handle stop and continue)
  • v2.3.0 update to maximize performance
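The rule-based matching for web URLs and emails mentioned in the list above can be pictured with a minimal sketch like the following. The regular expressions, the function name find_rule_based_phi, and the EMAIL and URL labels are assumptions for illustration only; the actual rules live in rule_based_phi_detection.py and may differ.

```python
import re

# Deliberately simple, hypothetical patterns; real-world rules usually need
# more care (for example, trimming punctuation that trails a URL).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")

RULES = (("EMAIL", EMAIL_RE), ("URL", URL_RE))


def find_rule_based_phi(text):
    """Return sorted (start, end, label) spans for every rule match in text."""
    spans = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


note = "Contact jdoe@example.org or see https://example.org/portal today."
for start, end, label in find_rule_based_phi(note):
    print(label, note[start:end])
```

Keeping the rules in a single table means a new rule is one extra (label, pattern) pair, which is one possible way to keep such a module easy to extend.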

TODO list:

  • refactor the current implementation using classes (2.x.x)
  • make preprocessing and prediction asynchronous, so that prediction does not block preprocessing of the next partition (1.x.x)
  • load a config file to override the default config (does not apply to model selection because we do not want to have multiple models in one docker image; we will release different docker images with different models)
  • API for accepting self-defined regex of PHIs (x.1.x)
  • allow using different PHI tags and features (x.x.1)
  • allow user-defined PHI replacement style (x.x.1)
  • training new model function (x.1.x)
  • fine tuning model function (x.1.x)
  • batch process notes to avoid OOM issue (x.1.x)
  • option to skip discrete features to speed up preprocessing (performance may be sacrificed) (x.x.1)
  • give options for BERT, XLNet or BiLSTM-CRFs architecture (issue 2) (1.x.x)
  • code robustness: error handling (x.1.x)
  • Tensorflow AVX version to enhance speed
  • migrate from py2 to py3 and switch to pytorch_HOBI_NER package as backend (1.x.x)

issues:

  1. encoding errors cause problems during the feature generation step, so the affected files are not processed and de-identified (done)
  2. integration of BERT and XLNet requires py3 env
  3. logging with root config (production vs. debug; which information should be exposed to the user) (done)
  4. Tensorflow AVX version to enhance speed
  5. The model's performance on the "OTHER" tag is too low. Since most instances are special words like "MyUFHealth", we can handle them in the rule-based processing.
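One way the rule-based handling proposed in issue 5 could look is a small term list matched literally against the note text. This is only a sketch under stated assumptions: the function name find_special_terms, the SPECIAL_TERMS tuple, and the exact matching behavior are hypothetical and are not part of the current code.

```python
import re

# Hypothetical list of institution-specific words to tag as "OTHER".
SPECIAL_TERMS = ("MyUFHealth",)


def find_special_terms(text, terms=SPECIAL_TERMS, label="OTHER"):
    """Return (start, end, label) spans for literal occurrences of the terms."""
    if not terms:
        return []
    # Longer terms come first so they win over shorter terms sharing a prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


note = "Please log in to MyUFHealth for results."
print(find_special_terms(note))
```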

AUTHORS

  • Yonghui Wu
  • Xi Yang
