60 changes: 59 additions & 1 deletion LICENSE
@@ -17,4 +17,62 @@
This project contains content developed by The MITRE Corporation. If this code
is used in a deployment or embedded within another project, it is requested
that you send an email to opensource@mitre.org in order to let us know where
this software is being used.

*****************************************************************************

The nlp_text_splitter utility uses the following sentence detection libraries:

*****************************************************************************

The WtP ("Where's the Point") sentence segmentation library falls under the MIT License:

https://github.com/bminixhofer/wtpsplit/blob/main/LICENSE

MIT License

Copyright (c) 2024 Benjamin Minixhofer

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

*****************************************************************************

The spaCy Natural Language Processing library falls under the MIT License:

The MIT License (MIT)

Copyright (C) 2016-2024 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
50 changes: 50 additions & 0 deletions detection/nlp_text_splitter/README.md
@@ -0,0 +1,50 @@
# Overview

This directory contains the source code, test examples, and installation script
for the OpenMPF NlpTextSplitter tool, which uses the WtP and spaCy libraries
to detect sentences in a given chunk of text.

# Background

Our primary motivation for creating this tool was to find a lightweight, accurate
sentence detection capability to support a large variety of text processing tasks
including translation and tagging.

Through preliminary investigation, we identified the [WtP library ("Where's the
Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
detection model](https://spacy.io/models) as candidates for identifying sentence
breaks in large sections of text.

WtP models are trained to split multilingual text into sentences without requiring an
input language tag. The disadvantage is that the most accurate WtP models need ~3.5
GB of GPU memory. spaCy, on the other hand, has a single multilingual sentence detection
model that appears to work better for splitting English text in certain cases.
Unfortunately, this model lacks support for handling Chinese punctuation.

# Installation

To install this tool, run `./install.sh`. By default, this sets up a CPU-only PyTorch
installation and installs the `wtp-bert-mini` WtP model and the `xx_sent_ud_sm`
spaCy model.

Please note that several customizations are supported:

- `--text-splitter-dir|-t <path_to_src>`: This parameter specifies where the
source code is located relative to the installation script. In general,
since the installation script and source code are both located here, it's not
necessary to update this parameter unless the user is running the `install.sh`
script from a different directory.

- `--gpu|-g`: Add this parameter to the installation command to
set up a PyTorch installation with CUDA (GPU) libraries.

- `--wtp-models-dir|-m <wtp-models-dir>`: Add this parameter to
change the default WtP model installation directory
(default: `/opt/wtp/models`).

- `--install-wtp-model|-w <model-name>`: Add this parameter to specify
additional WtP models for installation. This parameter can be provided
multiple times to install more than one model.

- `--install-spacy-model|-s <model-name>`: Add this parameter to specify
additional spaCy models for installation. This parameter can be provided
multiple times to install more than one model.
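
The repeatable options above can be illustrated with a minimal sketch. It assumes the same accumulation pattern that `install.sh` uses: each `-w` value is appended to a list that already contains the default `wtp-bert-mini` model. The function name `collect_wtp_models` is hypothetical and is not part of the installation script, and the model names in the example call are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of install.sh): shows how repeated
# -w / --install-wtp-model values accumulate on top of the default model.
collect_wtp_models() {
    local models=("wtp-bert-mini")   # default model installed by install.sh
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --install-wtp-model | -w )
                shift                # move to the option's value
                models+=("$1")       # append it to the list
                ;;
        esac
        shift                        # advance to the next argument
    done
    echo "${models[*]}"
}

# Illustrative model names; check that they exist before installing them.
collect_wtp_models -w wtp-canine-s-1l -w wtp-canine-s-3l
# prints: wtp-bert-mini wtp-canine-s-1l wtp-canine-s-3l
```

Unlike this sketch, the real script first normalizes its arguments with `getopt`, which also rejects an option that is missing its value. An equivalent real invocation would look like `./install.sh --gpu -w <model-name> -w <model-name>`.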
168 changes: 168 additions & 0 deletions detection/nlp_text_splitter/install.sh
@@ -0,0 +1,168 @@
#!/usr/bin/env bash

#############################################################################
# NOTICE #
# #
# This software (or technical data) was produced for the U.S. Government #
# under contract, and is subject to the Rights in Data-General Clause #
# 52.227-14, Alt. IV (DEC 2007). #
# #
# Copyright 2024 The MITRE Corporation. All Rights Reserved. #
#############################################################################

#############################################################################
# Copyright 2024 The MITRE Corporation #
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
# You may obtain a copy of the License at #
# #
# http://www.apache.org/licenses/LICENSE-2.0 #
# #
# Unless required by applicable law or agreed to in writing, software #
# distributed under the License is distributed on an "AS IS" BASIS, #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. #
# See the License for the specific language governing permissions and #
# limitations under the License. #
#############################################################################

set -o errexit -o pipefail

main() {
    if ! options=$(getopt --name "$0" \
            --options t:gm:w:s: \
            --longoptions text-splitter-dir:,gpu,wtp-models-dir:,install-wtp-model:,install-spacy-model: \
            -- "$@"); then
        print_usage
    fi
    eval set -- "$options"
    local wtp_models_dir=/opt/wtp/models
    local wtp_models=("wtp-bert-mini")
    local spacy_models=("xx_sent_ud_sm")
    while true; do
        case "$1" in
        --text-splitter-dir | -t )
            shift
            local text_splitter_dir=$1
            ;;
        --gpu | -g )
            local gpu_enabled=true
            ;;
        --wtp-models-dir | -m )
            shift
            wtp_models_dir=$1
            ;;
        --install-wtp-model | -w )
            shift
            wtp_models+=("$1")
            ;;
        --install-spacy-model | -s )
            shift
            spacy_models+=("$1")
            ;;
        -- )
            shift
            break
            ;;
        esac
        shift
    done

    install_text_splitter "$text_splitter_dir"
    install_py_torch "$gpu_enabled"
    download_wtp_models "$wtp_models_dir" "${wtp_models[@]}"
    download_spacy_models "${spacy_models[@]}"
}


install_text_splitter() {
    local text_splitter_dir=$1
    if [[ ! $text_splitter_dir ]]; then
        text_splitter_dir=$(dirname "$(realpath "${BASH_SOURCE[0]}")")
    fi

    echo "Installing text splitter from source directory: $text_splitter_dir"
    pip3 install "$text_splitter_dir"
}


install_py_torch() {
    local gpu_enabled=$1
    local torch_package='torch~=2.3'
    if [[ $gpu_enabled ]]; then
        echo "Installing GPU enabled PyTorch."
        pip3 install "$torch_package"
    else
        echo "Installing CPU only version of PyTorch."
        # networkx is a dependency of PyTorch, but the version of networkx in the PyTorch package
        # index requires Python 3.9. networkx needs to be installed in a separate command so that
        # pip can get networkx from PyPI.
        pip3 install 'networkx~=3.1'
        pip3 install "$torch_package" --index-url https://download.pytorch.org/whl/cpu
    fi
}


download_wtp_models() {
    local wtp_models_dir=$1
    shift
    local model_names=("$@")
    setup_wtp_models_dir "$wtp_models_dir"

    for model_name in "${model_names[@]}"; do
        echo "Downloading the $model_name model to $wtp_models_dir."
        local wtp_model_dir="$wtp_models_dir/$model_name"
        python3 -c \
            "from huggingface_hub import snapshot_download; \
            snapshot_download('benjamin/$model_name', local_dir='$wtp_model_dir')"
    done
}

setup_wtp_models_dir() {
    local wtp_models_dir=$1

    if [[ ! $REQUESTS_CA_BUNDLE ]]; then
        export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
    fi

    if ! mkdir --parents "$wtp_models_dir"; then
        echo "ERROR: Failed to create the $wtp_models_dir directory."
        exit 3
    fi

    if [[ ! -w "$wtp_models_dir" ]]; then
        echo -n "ERROR: The model directory, \"$wtp_models_dir\", is not writable by the current user. "
        echo "The permissions on \"$wtp_models_dir\" must be modified."
        exit 4
    fi
}

download_spacy_models() {
    for model_name in "$@"; do
        echo "Downloading the $model_name spaCy model."
        python3 -m spacy download "$model_name"
    done
}


print_usage() {
    echo
    echo "Usage:
$0 [--text-splitter-dir|-t <path_to_src>] [--gpu|-g] [--wtp-models-dir|-m <wtp-models-dir>] [--install-wtp-model|-w <model-name>]* [--install-spacy-model|-s <model-name>]*
Options
    --text-splitter-dir, -t <path>:    Path to text splitter source code. (defaults to the
                                       same directory as this script)
    --gpu, -g:                         Install the GPU version of PyTorch
    --wtp-models-dir, -m <path>:       Path where WtP models will be stored.
                                       (defaults to /opt/wtp/models)
    --install-wtp-model, -w <name>:    Name of a WtP model to install in addition to wtp-bert-mini.
                                       This option can be provided more than once to specify
                                       multiple models.
    --install-spacy-model, -s <name>:  Name of a spaCy model to install in addition to
                                       xx_sent_ud_sm. This option can be provided more than once
                                       to specify multiple models.
"
    exit 1
}

main "$@"