Add Sesotho Stemmer #33
sesotho/voc.txt
Outdated
| Sebata | ||
| pholosa | ||
| Rongoa | ||
| Mothusi |
I think if you lower-case the capitalised words in this file that'll fix the CI failures (it's arguably a bug that e.g. the Ada version of stemwords doesn't appear to, but I'm not sure I know enough Ada to fix that).
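A minimal sketch of the lower-casing step, assuming the vocabulary has already been read into a list of words (the function name `lowercase_vocabulary` is hypothetical, not part of the repository's scripts). The expected output in output.txt would need regenerating afterwards, since the order and content of the two files must stay in step.

```python
def lowercase_vocabulary(words):
    """Return the vocabulary with every entry lower-cased, order preserved."""
    return [word.lower() for word in words]


if __name__ == "__main__":
    # The four entries quoted in the diff context above.
    sample = ["Sebata", "pholosa", "Rongoa", "Mothusi"]
    for word in lowercase_vocabulary(sample):
        print(word)
```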
sesotho/voc.txt
Outdated
| qetoa | ||
| bitsoa | ||
| boikarabelo | ||
| boikakaso No newline at end of file |
GitHub doesn't show it here for some reason, but judging from the CI test failures there's a missing newline character on this final line.
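One way to check for and repair this, sketched in Python; `ensure_trailing_newline` is a hypothetical helper and not something the repository provides.

```python
def ensure_trailing_newline(text):
    """Append a newline if non-empty text does not already end with one."""
    if text and not text.endswith("\n"):
        return text + "\n"
    return text


if __name__ == "__main__":
    # The last two entries from the diff, with the final newline missing.
    broken = "boikarabelo\nboikakaso"
    fixed = ensure_trailing_newline(broken)
    print(repr(fixed))
```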
sesotho/voc.txt
Outdated
| @@ -0,0 +1,77 @@ | |||
| motho | |||
| batho | |||
This seems much too short a sample vocabulary. What's appropriate depends on the language (highly inflected languages will have a lot more forms of each stem so a longer list is appropriate, and some languages just have larger vocabularies - e.g. English). To give an idea:
4000 nepali/voc.txt
20895 norwegian/voc.txt
21386 basque/voc.txt
21642 french/voc.txt
23830 danish/voc.txt
28378 spanish/voc.txt
29881 hungarian/voc.txt
29996 serbian/voc.txt
30738 swedish/voc.txt
32016 portuguese/voc.txt
35033 german/voc.txt
35495 italian/voc.txt
42603 lovins/voc.txt
42603 porter/voc.txt
42649 english/voc.txt
45670 dutch_porter/voc.txt
45670 dutch/voc.txt
48897 catalan/voc.txt
49785 russian/voc.txt
49962 finnish/voc.txt
57470 estonian/voc.txt
62566 polish/voc.txt
64586 indonesian/voc.txt
65118 hindi/voc.txt
84456 esperanto/voc.txt
86105 lithuanian/voc.txt
87642 romanian/voc.txt
90727 greek/voc.txt
96325 turkish/voc.txt
99996 yiddish/voc.txt
101780 armenian/voc.txt
186280 irish/voc.txt
443271 tamil/voc.txt
9196214 arabic/voc.txt.gz
So I'd probably expect 10000+ (the nepali list is probably shorter than is ideal).
The arabic list was apparently made by generating all theoretically possible forms, so it contains a lot of words that will never occur in practice. It's really much too long, which causes practical problems such as GitHub not supporting such large files (hence the need to store it gzipped) and the tests taking ages for slower target languages. There's #31 to try to address this.
If we can't find a usable Sesotho word list, we have scripts to extract a list of all words used more than N times from a Wikipedia dump, and there is a Sesotho Wikipedia. It only has 1554 pages currently, but that's still going to be better than just 77 words.
I tried this and there are 11512 words which occur more than once - here's a list by frequency:
It looks like English words start to creep in more towards the end (though maybe some are actually loanwords). Having a few common foreign words isn't a big problem (a stemmer is likely to encounter such words in real use, so how it deals with them is of some interest) but it may well be that the threshold for inclusion should be more than 2.
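To illustrate the thresholding idea, here is a rough Python sketch that counts words in plain text and keeps those occurring more than N times, most frequent first. It is not the project's actual extraction script: the function name `frequent_words`, the tokenisation rule (runs of Unicode letters, lower-cased) and the tie-breaking order are all assumptions, and a real Wikipedia dump would need its markup stripped first.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def frequent_words(text, n):
    """Return (word, count) pairs for words occurring more than n times.

    Pairs are ordered by descending count, then alphabetically.
    """
    counts = Counter(match.group(0).lower() for match in WORD_RE.finditer(text))
    kept = [(word, count) for word, count in counts.items() if count > n]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    sample = "motho batho motho Motho batho lefatše"
    for word, count in frequent_words(sample, 1):
        print(count, word)
```

Raising `n` trades coverage for cleanliness: a higher threshold drops more rare foreign words but also more rare genuine forms.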
sesotho/voc.txt
Outdated
| @@ -0,0 +1,77 @@ | |||
| motho | |||
We standardly sort voc.txt files which makes it easier to work with them (at least when they're large). You can run scripts/sort-by-voc sesotho which will sort this file and reorder output.txt to match.
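For anyone curious what that amounts to, the effect can be sketched in Python as sorting the two files in lockstep. This is only an illustration of the idea and not the script itself: `sort_paired` is a hypothetical name, the stems below are placeholders rather than real stemmer output, and the sort uses Python's default code-point ordering, which may differ from what the script does.

```python
def sort_paired(voc_lines, output_lines):
    """Sort vocabulary entries and reorder the expected outputs to match."""
    if len(voc_lines) != len(output_lines):
        raise ValueError("voc.txt and output.txt must have the same number of lines")
    # Sort on the vocabulary word only, carrying each expected stem along with it.
    pairs = sorted(zip(voc_lines, output_lines), key=lambda pair: pair[0])
    return [word for word, _ in pairs], [stem for _, stem in pairs]


if __name__ == "__main__":
    voc = ["motho", "batho"]
    out = ["stem-1", "stem-2"]  # placeholder stems, not real output
    sorted_voc, sorted_out = sort_paired(voc, out)
    print(sorted_voc, sorted_out)
```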
…s, sorted the words
Added the necessary files as requested.