
Add nlp_text_splitter package #78

Merged

hhuangMITRE merged 12 commits into develop from feat/text-split-pkg on May 7, 2024

Conversation

@hhuangMITRE (Contributor) commented Apr 30, 2024

@jrobble (Member) left a comment


Reviewed 7 of 9 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @brosenberg42 and @hhuangMITRE)


detection/nlp_text_splitter/install.sh line 1 at r2 (raw file):

#!/usr/bin/env bash

Needs header comments.


detection/nlp_text_splitter/install.sh line 7 at r2 (raw file):

main() {
    if ! options=$(getopt --name "$0"  \
            --options t:gm:w:b: \

Should this be --options t:gm:w:s: ?

Specifically, the b: should be replaced with s: since that's the short version of --install-spacy-model | -s.


detection/nlp_text_splitter/install.sh line 13 at r2 (raw file):

    fi
    eval set -- "$options"
    local models_dir=/opt/wtp/models

Call this wtp_models_dir for clarity.


detection/nlp_text_splitter/install.sh line 130 at r2 (raw file):

                                       same directory as this script)
    --gpu, -g:                         Install the GPU version of PyTorch
    --models-dir, -m <path>:           Path where WTP models will be stored.

Call this --wtp-models-dir for clarity. Update in the README as well.

@hhuangMITRE hhuangMITRE requested a review from jrobble May 2, 2024 16:05
@hhuangMITRE (Contributor, Author) left a comment


Reviewable status: 5 of 9 files reviewed, 4 unresolved discussions (waiting on @brosenberg42 and @jrobble)


detection/nlp_text_splitter/install.sh line 1 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Needs header comments.

Fixed, thank you! I've updated the headers in this file to 2024. I can update the other headers from 2023->2024 as well.


detection/nlp_text_splitter/install.sh line 7 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Should this be --options t:gm:w:s: ?

Specifically, the b: should be replaced with s: since that's the short version of --install-spacy-model | -s.

Yes, it should be s: instead of b:. I fixed it, thanks!
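The fixed option parsing can be sketched as follows (a simplified, hypothetical version of the script, not the actual install.sh; the variable names in the case arms are illustrative). The point is that each long option taking an argument needs a matching short letter in `--options`, so `--install-spacy-model | -s` must be declared as `s:` rather than `b:`:

```shell
#!/usr/bin/env bash
# Simplified sketch of the corrected getopt call: the short-option string
# t:gm:w:s: now matches the long options (the old "b:" matched nothing).
parse_args() {
    local options
    options=$(getopt --name "$0" \
            --options t:gm:w:s: \
            --longoptions text-splitter-dir:,gpu,wtp-models-dir:,install-wtp-model:,install-spacy-model: \
            -- "$@") || return 1
    eval set -- "$options"
    while true; do
        case "$1" in
            -t|--text-splitter-dir)   text_splitter_dir=$2; shift 2 ;;
            -g|--gpu)                 gpu=true;             shift ;;
            -m|--wtp-models-dir)      wtp_models_dir=$2;    shift 2 ;;
            -w|--install-wtp-model)   wtp_model=$2;         shift 2 ;;
            -s|--install-spacy-model) spacy_model=$2;       shift 2 ;;
            --) shift; break ;;
        esac
    done
}

parse_args --install-spacy-model en_core_web_sm --gpu
echo "$spacy_model gpu=$gpu"
```

With the old `b:`, `-s` was not in the short-option string at all, so `getopt` would reject `-s <model>` even though `--install-spacy-model <model>` worked.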


detection/nlp_text_splitter/install.sh line 13 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Call this wtp_models_dir for clarity.

Done. I've also updated the function call to setup_wtp_models_dir.


detection/nlp_text_splitter/install.sh line 130 at r2 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Call this --wtp-models-dir for clarity. Update in the README as well.

Done.


detection/nlp_text_splitter/README.md line 40 at r4 (raw file):

  setup a PyTorch installation with CUDA (GPU) libraries.

- `--wtp-models-dir |-m <wtp-models-dir >`: Add this parameter to

Refactored to wtp-models-dir

@jrobble (Member) commented May 3, 2024

detection/nlp_text_splitter/install.sh line 34 at r5 (raw file):

    if ! options=$(getopt --name "$0"  \
            --options t:gm:w:s: \
            --longoptions text-splitter-dir:,gpu,wtp-models-dir :,install-wtp-model:,install-spacy-model: \

This space before the colon (in `wtp-models-dir :,install-wtp-model`) made it so that I could not do a GPU build no matter what I provided for BUILD_TYPE.

I fixed this in a commit.
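The failure mode can be reproduced in isolation (a hypothetical minimal example, not the actual script): because the `--longoptions` value in install.sh is an unquoted word, the stray space splits it into two shell words, so `getopt` never registers the long options after `wtp-models-dir` and treats the leftover fragment as a positional parameter.

```shell
# Hypothetical minimal reproduction (not the actual install.sh): the stray
# space splits the unquoted --longoptions value into two shell words.
good='gpu,wtp-models-dir:,install-wtp-model:'
bad='gpu,wtp-models-dir :,install-wtp-model:'

# Corrected string: --gpu is normalized as an option before the "--".
getopt --options g --longoptions "$good" -- --gpu

# Broken string (left unquoted on purpose to mimic the word split): getopt
# sees ":,install-wtp-model:" as a positional parameter, and --gpu lands
# after the "--" separator instead of being parsed as an option, so the
# script's option loop never sees it.
getopt --options g --longoptions $bad -- --gpu
```

This is why no BUILD_TYPE value could produce a GPU build: `--gpu` was never recognized as an option in the first place.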

@jrobble (Member) commented May 4, 2024

detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 105 at r6 (raw file):

            self._default_lang = default_lang

        if self._model_name == wtp_model_name:

Not only do you need to check if the model name is the same, you need to also check if the model setting is the same (CPU vs. GPU).

I ran into a situation where, on first run with SENTENCE_MODEL_CPU_ONLY=false, the component logged `Using cached model: wtp-bert-mini` but I saw no GPU activity.

@hhuangMITRE (Contributor, Author) left a comment


Reviewable status: 4 of 9 files reviewed, 5 unresolved discussions (waiting on @brosenberg42 and @jrobble)


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 105 at r6 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Not only do you need to check if the model name is the same, you need to also check if the model setting is the same (CPU vs. GPU).

I ran into a situation where, on first run with SENTENCE_MODEL_CPU_ONLY=false, the component logged `Using cached model: wtp-bert-mini` but I saw no GPU activity.

Thanks for catching that; I tested the canine series of models and missed this case. I've made the requested changes and rebuilt the images just to make sure. I've tested the text-splitter with GPU servers and confirmed that caching works when updating a CPU wtp-bert-mini to use CUDA/GPU.

Confirmed that works; bert-mini used about 800 MB of GPU memory.
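The caching fix discussed above can be sketched as follows (a hypothetical illustration, not the component's actual code; `ModelCache` and `loader` are made-up names): the cache key must include the device setting as well as the model name, so that flipping SENTENCE_MODEL_CPU_ONLY forces a reload instead of reusing the CPU-resident model.

```python
# Hypothetical sketch (not the component's actual code): key the model cache
# on both the model name and the device setting, so a CPU->GPU switch loads
# a fresh model rather than reusing the cached CPU one.
class ModelCache:
    def __init__(self):
        self._model = None
        self._key = None  # (model_name, use_gpu)

    def get(self, model_name, use_gpu, loader):
        key = (model_name, use_gpu)
        if self._key != key:  # name OR device changed -> load fresh
            self._model = loader(model_name, use_gpu)
            self._key = key
        return self._model
```

Checking only `model_name` reproduces the bug jrobble saw: the second request with `use_gpu=True` would still return the CPU-resident model and log it as cached.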

@jrobble (Member) left a comment


Reviewed 3 of 4 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)

@hhuangMITRE hhuangMITRE merged commit 23be9c4 into develop May 7, 2024
@hhuangMITRE hhuangMITRE deleted the feat/text-split-pkg branch May 7, 2024 13:52

Labels

enhancement (New feature or request)

3 participants