Add StopWordFilter #78

Closed
rth wants to merge 6 commits into main from stop-word-filter

rth (Owner) commented Jun 14, 2020

Add the StopWordFilter struct to filter stop words, as an example of a TokenProcessor trait implementation that takes in an iterator and returns an iterator of strings (following the discussion in #21).
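To make the iterator-in, iterator-out shape concrete, here is a minimal sketch. It is not the code in this PR: the `TokenProcessor` signature, the `process` method name, and the `new` constructor are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical trait shape: consume an iterator of tokens and
/// return an iterator of tokens. The signature is an assumption.
pub trait TokenProcessor {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = &'a str> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a;
}

/// Drops every token that appears in a user-supplied stop word set.
pub struct StopWordFilter {
    stop_words: HashSet<String>,
}

impl StopWordFilter {
    /// Hypothetical constructor taking an explicit stop word list.
    pub fn new(words: &[&str]) -> Self {
        StopWordFilter {
            stop_words: words.iter().map(|w| w.to_string()).collect(),
        }
    }
}

impl TokenProcessor for StopWordFilter {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = &'a str> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a,
    {
        // Lazily keep only tokens that are not in the stop word set.
        // Matching is exact and case-sensitive in this sketch.
        Box::new(tokens.filter(move |tok| !self.stop_words.contains(*tok)))
    }
}
```

Because the result is itself an iterator, such a filter could be chained after a tokenizer without collecting intermediate vectors.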

TODO

  • decide what the default stop word list should be: either take an English stop word list from somewhere (e.g. spaCy), or ask users to explicitly provide one.

rth mentioned this pull request Jun 14, 2020
joshlk (Collaborator) commented Jun 15, 2020

> decide what should be the default stop word list: either take an English stop word list from somewhere (e.g. spaCy), or ask users to explicitly provide one.

I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.

How about having separate default lists for different languages, such as:

StopWordFilter::default("en")

Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).
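A rough sketch of how per-language defaults keyed by two-letter (ISO 639-1) codes might look. Everything here is hypothetical: the `for_language` and `is_stop_word` names are made up for illustration, and the word lists are tiny placeholders rather than real stop word lists.

```rust
use std::collections::HashSet;

// Placeholder lists for illustration only; real lists would be much longer.
const EN: &[&str] = &["a", "an", "and", "the", "of"];
const FR: &[&str] = &["le", "la", "les", "et", "de"];

pub struct StopWordFilter {
    stop_words: HashSet<&'static str>,
}

impl StopWordFilter {
    /// Hypothetical constructor: look up a built-in list by language code.
    /// Returns `None` for languages without a bundled list, so callers
    /// have to handle the unsupported case explicitly.
    pub fn for_language(code: &str) -> Option<Self> {
        let words = match code {
            "en" => EN,
            "fr" => FR,
            _ => return None,
        };
        Some(StopWordFilter {
            stop_words: words.iter().copied().collect(),
        })
    }

    /// Exact, case-sensitive membership test.
    pub fn is_stop_word(&self, token: &str) -> bool {
        self.stop_words.contains(token)
    }
}
```

The sketch uses a name other than `default` because Rust's `Default::default` takes no arguments, so an inherent `default("en")` would be easy to confuse with the trait method.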

rth (Owner, Author) commented Jun 15, 2020

> I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.

Absolutely. It's just that there is no clear consensus on what a stop word list should include or exclude, and when one is provided people tend to use it without thinking too much (see e.g. this paper). I agree we can include stop word lists for a few common world languages.

> Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).

+1

joshlk (Collaborator) commented Jun 15, 2020

Interesting paper. It might be worth including a standard stop word list from spaCy, with a note in the documentation that refers to the paper.

rth closed this Nov 8, 2025