UF HOBI NLP GROUP

NLP package for de-identification of EHR notes

implementation detail:

Linux environment
Python 2.7
TensorFlow
BiLSTM-CRF architecture similar to Lample et al.
supports discrete features as embeddings
supports fine-tuning based on a pre-trained model

what does the code do:

run an NER task to detect PHIs in EHR notes
replace PHIs with de-identified information
fine-tune the base model for different institutions
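
The replacement step can be illustrated with a minimal sketch. It assumes the NER step yields non-overlapping character spans of the form (start, end, phi_type); the function name `replace_phi` and the `keep_types` parameter are hypothetical and not taken from the package. The `[**PHI_TYPE**]` tag style and the option to keep selected PHI types follow the descriptions in the progress notes of this README.

```python
def replace_phi(text, spans, keep_types=()):
    """Replace each (start, end, phi_type) span with a [**PHI_TYPE**] tag.

    Assumption: spans are character offsets into text and do not overlap.
    PHI types listed in keep_types are left untouched.
    """
    out = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        out.append(text[cursor:start])           # text before the span
        if phi_type in keep_types:
            out.append(text[start:end])          # keep the original surface form
        else:
            out.append("[**{}**]".format(phi_type.upper()))
        cursor = end
    out.append(text[cursor:])                    # trailing text after the last span
    return "".join(out)


note = "Patient John Smith was seen on 03/14/2019."
spans = [(8, 18, "name"), (31, 41, "date")]      # hypothetical NER output

replaced_all = replace_phi(note, spans)
replaced_keep_date = replace_phi(note, spans, keep_types=("date",))
print(replaced_all)        # Patient [**NAME**] was seen on [**DATE**].
print(replaced_keep_date)  # Patient [**NAME**] was seen on 03/14/2019.
```

Building the output from left to right avoids the offset shifting that happens when spans are replaced in place.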

How to use

run a task using experiment.sh in the shell directory as an example
the output is in brat format, so the results can be visualized using the brat tool (https://brat.nlplab.org/)
the application is packaged as a docker image (please see README-docker.md for usage)
please build the docker image according to the current development version
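
For readers unfamiliar with the brat format, the following is a minimal sketch of how detected PHI spans could be written as brat text-bound annotations (one `T<id>`, type, start and end offsets, and the covered text per line, separated by tabs). The function name `to_brat_ann` and the span layout are assumptions for illustration, not the package's own output code, and spans are assumed not to cross line breaks.

```python
def to_brat_ann(text, spans):
    """Render (start, end, phi_type) spans as brat standoff .ann lines.

    Each line has the form: T<id> TAB <type> <start> <end> TAB <covered text>
    """
    lines = []
    for idx, (start, end, phi_type) in enumerate(sorted(spans), start=1):
        lines.append("T{}\t{} {} {}\t{}".format(
            idx, phi_type, start, end, text[start:end]))
    return "\n".join(lines)


note = "Patient John Smith was seen on 03/14/2019."
spans = [(8, 18, "NAME"), (31, 41, "DATE")]  # hypothetical NER output

ann = to_brat_ann(note, spans)
print(ann)
```

Saving the note as `note.txt` and this output as `note.ann` in the same brat data directory is the usual way to view the annotations in the brat tool.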

Production Model Directory

only one model (for UF IDR) is shipped in the docker image to reduce the image size

model_1: rnner_kb model
model_2: rnner model
model_3: rnner_attn model
model_4: rnner_kb_attn model

requirements

disk space of at least 3 times the size of the input data
tensorflow==1.7.0
spacy==2.0.16
ftfy==4.4.3
spacy language model: en_core_web_md

install python packages using

pip install -r requirements.txt

and install the spacy language model using

python -m spacy download en_core_web_md
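
The disk space requirement can be checked before a run. The sketch below is an illustration only: the helper names `dir_size_bytes` and `has_enough_space` are hypothetical, and it uses `shutil.disk_usage`, which is available in Python 3 only, so it would not run unchanged under the Python 2.7 environment listed above.

```python
import os
import shutil
import tempfile


def dir_size_bytes(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def has_enough_space(input_dir, output_dir, factor=3):
    """True if output_dir's filesystem has at least factor x input size free."""
    needed = factor * dir_size_bytes(input_dir)
    free = shutil.disk_usage(output_dir).free
    return free >= needed


# Small demonstration on a temporary directory holding one 100-byte file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "note.txt"), "w") as fh:
        fh.write("x" * 100)
    size = dir_size_bytes(tmp)
    ok = has_enough_space(tmp, tmp)

print(size, ok)
```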

Development Version:

v2.3.0

Current progress:

evaluation (0.0.5)
batch evaluation (0.1.0) (released)
use discrete knowledge features as embeddings
PHIs are replaced with MIMIC-III style: [**PHI_TYPE**] (0.1.1) (released)
pure python implementation done: app.py is the entry point for running the whole application; the *.sh scripts now only serve as the app entry for docker (1.1.1)
all string operations are using unicode (1.2.1)
logging global set up (1.2.2)
using global config to control file locations; logging and other shared parameters (1.2.3)
rule base matching function for web url and email (new rules can be integrated in the rule_based_phi_detection.py) (1.2.4)
code efficiency improvement: merge the tokenization, normalization and BIO-feature generation in one pass in memory without intermediate IO in the preprocess (1.3.4)
skip empty files and exclude those from the results (1.3.5)
predict_batch save one IO operation; add multi-processing to bio_output (1.4.6)
new models support: rnner, rnner_kb and rnner_kb_conv (1.5.6)
update production model: using newly trained models on 500 UF notes; using MODEL_FLAG to control which (rnner vs. rnner_kb) model to use (1.5.7)
add new rnner packages: rnner_attn and rnner_kb_attn. Both implemented the self-attention mechanism. (1.5.8)
add an API that takes a configuration of PHI types to keep during PHI replacement, so the user can control which PHI types are or are not removed (1.6.8). Also add a usage readme to explain the parameters that the user can tune.
add robustness optimization: work iteratively by processing chunks. (1.7.8)
release rnner_attention model instead of rnner model (v1.7.8)
release new model trained with location_other, other, location pobox tags; tested training preprocessing without problem (v1.8.0)
v1.8.1 fix encoding
v1.8.2 fix rule
v1.8.3 re-train the model with new annotation filter (only leave contact_web not trained)
v2.0.0 app changed to parallel mode with one process for preprocessing and two processes for de-identification.
v2.0.3 fix several bugs
v2.0.4 resume from where processing stopped (handle stop and continue)
v2.3.0 update to maximize performance
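
The chunked processing (1.7.8) and stop-and-continue behavior (v2.0.4) listed above can be sketched as follows. This is a minimal illustration under assumptions, not the code in app.py: the function names, the `chunk_size` parameter, and the idea of passing a list of already-finished file names are all hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def pending_files(input_names, done_names):
    """Input files that have no finished output yet, in sorted order."""
    done = set(done_names)
    return [name for name in sorted(input_names) if name not in done]


def run(input_names, done_names, process_chunk, chunk_size=2):
    """Process only unfinished files, one chunk at a time.

    Working in chunks bounds memory use, and skipping finished files lets
    an interrupted run continue from where it stopped.
    """
    for chunk in chunked(pending_files(input_names, done_names), chunk_size):
        process_chunk(chunk)


names = ["d.txt", "a.txt", "c.txt", "b.txt"]
already_done = ["a.txt"]        # e.g. outputs found from an earlier, interrupted run
processed = []
run(names, already_done, processed.append, chunk_size=2)
print(processed)                # [['b.txt', 'c.txt'], ['d.txt']]
```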

TODO list:

refactor the current implementation using class (2.x.x)
make preprocess and predict asynchronous processes, so that predict does not block preprocessing of the next partition. (1.x.x)
load a config file to overwrite the default config (does not apply to model selection because we do not want to have multiple models in one docker image. We will release different docker images with different models)
API for accepting self-defined regex of PHIs (x.1.x)
allow using different PHI tags and features (x.x.1)
allow self define PHI replacement style (x.x.1)
training new model function (x.1.x)
fine tuning model function (x.1.x)
batch process notes to avoid OOM issue (x.1.x)
option to skip discrete features to speed up pre-processing (performance may be sacrificed) (x.x.1)
give options for BERT, XLNet or BiLSTM-CRFs architecture (issue 2) (1.x.x)
code robustness: error handling (x.1.x)
Tensorflow AVX version to enhance speed
migrate from py2 to py3 and switch to pytorch_HOBI_NER package as backend (1.x.x)
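
The rule-based matching for web URLs and emails (1.2.4) and the planned API for self-defined regexes can be pictured with the sketch below. It is an assumption-laden illustration, not the contents of rule_based_phi_detection.py: the regular expressions are deliberately simple, and the tag name `CONTACT_EMAIL` and the `RULES` table are hypothetical (`CONTACT_WEB` and `OTHER` mirror tag names mentioned in this README).

```python
import re

# Deliberately simple patterns for illustration; real notes need stricter rules.
RULES = [
    ("CONTACT_EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("CONTACT_WEB", re.compile(r"(?:https?://|www\.)[^\s<>\"']+")),
    # Keyword rule for special words that the model tags poorly.
    ("OTHER", re.compile(r"\bMyUFHealth\b")),
]


def rule_based_phi(text, rules=RULES):
    """Return sorted (start, end, phi_type) spans matched by the regex rules."""
    spans = []
    for phi_type, pattern in rules:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), phi_type))
    return sorted(spans)


text = "Log in to MyUFHealth at https://example.org/portal or email help@example.org."
spans = rule_based_phi(text)
for start, end, phi_type in spans:
    print(phi_type, text[start:end])
```

Keeping the rules in a plain list of (tag, pattern) pairs means a new rule is one added entry, which is one possible way to accept user-defined regexes later.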

issues:

encoding problems during the feature generation step cause files not to be processed and de-identified (done)
integration of BERT and XLNet requires py3 env
logging with root config (in production vs debug) (which information should be exposed to user) (done)
Tensorflow AVX version to enhance speed
"OTHER" tag performance is too low for the model. Since most instances are special words like "MyUFHealth", we can handle them in the rule-based processing.

AUTHORS

Yonghui Wu
Xi Yang

