I think it's useful to include sensible defaults such as a stop word list. It lets people experiment with the package more quickly. But it's also important not to be centric to English or European languages. How about having separate default preferences for different languages? Such as: Additionally, I think it would be sensible to identify languages throughout the package using ISO 639-1 two-letter codes (e.g. en, fr, de ...).
Absolutely. It's just that there is no clear consensus on what a stop word list should include or exclude, and when one is provided people tend to use it without thinking too much (see e.g. this paper). I agree we can include stop word lists for a few common world languages.
+1
Interesting paper. It might be worth including a standard stop word list from spaCy, but with a note in the documentation that refers to the paper.
Add the StopWordFilter struct to filter stop words, as an example of a TokenProcessor trait implementation that takes in an iterator and returns an iterator of strings (following the discussion in #21).

TODO
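For readers following along, here is a minimal sketch of what such a filter could look like. Only the names StopWordFilter and TokenProcessor come from the description above. The trait's method signature, the `for_language` constructor keyed by ISO 639-1 code, the `demo` helper, and the tiny word lists are all assumptions made for illustration, not the package's actual API or a recommended stop word list.

```rust
use std::collections::HashSet;

/// Hypothetical trait shape: the description only says a processor
/// "takes in an iterator and returns an iterator of strings".
pub trait TokenProcessor {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = String> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a;
}

/// Drops every token that appears in the stop word set (case-insensitive).
pub struct StopWordFilter {
    stop_words: HashSet<String>,
}

impl StopWordFilter {
    /// Build a filter from an explicit, caller-chosen word list.
    pub fn new(words: &[&str]) -> Self {
        StopWordFilter {
            stop_words: words.iter().map(|w| w.to_lowercase()).collect(),
        }
    }

    /// Hypothetical per-language defaults keyed by ISO 639-1 code.
    /// The lists are deliberately tiny placeholders, not a standard list.
    pub fn for_language(code: &str) -> Option<Self> {
        match code {
            "en" => Some(Self::new(&["a", "and", "of", "the"])),
            "fr" => Some(Self::new(&["de", "et", "la", "le"])),
            _ => None, // no default: the caller must supply a list
        }
    }
}

impl TokenProcessor for StopWordFilter {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = String> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a,
    {
        Box::new(
            tokens
                // Keep only tokens whose lowercase form is not a stop word.
                .filter(move |tok| !self.stop_words.contains(&tok.to_lowercase()))
                .map(|tok| tok.to_string()),
        )
    }
}

/// Small usage helper: filter a whitespace-tokenized English sentence.
pub fn demo() -> Vec<String> {
    let filter = StopWordFilter::for_language("en").expect("known language code");
    let kept: Vec<String> = filter
        .process("The cat and the hat".split_whitespace())
        .collect();
    kept
}
```

Returning `None` for an unknown language code, rather than silently falling back to English, is one way to keep the defaults from being English-centric and to make the choice of list explicit, in line with the concern raised in the discussion.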