Add Sesotho Stemmer #33

MrLebjane · 2025-11-27T10:48:41Z

Added the necessary files as requested.

ojwb · 2025-12-06T22:45:39Z

sesotho/voc.txt

+Sebata
+pholosa
+Rongoa
+Mothusi


I think if you lower-case the capitalised words in this file that'll fix the CI failures (it's arguably a bug that e.g. the Ada version of stemwords doesn't appear to, but I'm not sure I know enough Ada to fix that).
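As a rough illustration of the lower-casing step, here is a minimal Python sketch. The function name `lowercase_words` is hypothetical and not part of the repository's tooling. It also drops duplicates that lower-casing may create (an assumption about what is wanted here); if any entries are dropped, output.txt would need regenerating to stay in step.

```python
def lowercase_words(words):
    """Lower-case each word and drop duplicates, keeping first-seen order."""
    lowered = (word.lower() for word in words)
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(lowered))


if __name__ == "__main__":
    # The four words quoted in the review comment above.
    for word in lowercase_words(["Sebata", "pholosa", "Rongoa", "Mothusi"]):
        print(word)
```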

ojwb · 2025-12-06T22:57:50Z

sesotho/voc.txt

+qetoa
+bitsoa
+boikarabelo
+boikakaso


GitHub doesn't show it here for some reason, but judging from the CI test failures there's a missing newline character on this final line.
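A quick way to check for (and repair) a missing final newline, sketched in Python. The helper name `ensure_trailing_newline` is made up for this example; it works on the raw bytes of the file so that no other content is touched.

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Return data unchanged if it is empty or already ends with a newline,
    otherwise return it with a single newline appended."""
    if data and not data.endswith(b"\n"):
        return data + b"\n"
    return data


if __name__ == "__main__":
    # The last line quoted above, without its terminating newline.
    print(ensure_trailing_newline(b"boikarabelo\nboikakaso"))
```

To apply it to a file, read the file in binary mode, pass the bytes through the helper, and write the result back only if it changed.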

ojwb · 2025-12-06T23:05:28Z

sesotho/voc.txt

@@ -0,0 +1,77 @@
+motho
+batho


This seems much too short a sample vocabulary. What's appropriate depends on the language (highly inflected languages will have a lot more forms of each stem so a longer list is appropriate, and some languages just have larger vocabularies - e.g. English). To give an idea:

   4000 nepali/voc.txt
  20895 norwegian/voc.txt
  21386 basque/voc.txt
  21642 french/voc.txt
  23830 danish/voc.txt
  28378 spanish/voc.txt
  29881 hungarian/voc.txt
  29996 serbian/voc.txt
  30738 swedish/voc.txt
  32016 portuguese/voc.txt
  35033 german/voc.txt
  35495 italian/voc.txt
  42603 lovins/voc.txt
  42603 porter/voc.txt
  42649 english/voc.txt
  45670 dutch_porter/voc.txt
  45670 dutch/voc.txt
  48897 catalan/voc.txt
  49785 russian/voc.txt
  49962 finnish/voc.txt
  57470 estonian/voc.txt
  62566 polish/voc.txt
  64586 indonesian/voc.txt
  65118 hindi/voc.txt
  84456 esperanto/voc.txt
  86105 lithuanian/voc.txt
  87642 romanian/voc.txt
  90727 greek/voc.txt
  96325 turkish/voc.txt
  99996 yiddish/voc.txt
 101780 armenian/voc.txt
 186280 irish/voc.txt
 443271 tamil/voc.txt
9196214 arabic/voc.txt.gz

So I'd probably expect 10000+ (the Nepali list is probably shorter than is ideal).

The Arabic list was apparently made by generating all theoretically possible forms, so it contains a lot of words that will never occur in practice. It's really much too long, which causes practical problems: GitHub doesn't support such large files (hence the need to store it gzipped), and the tests take ages for slower target languages. There's #31 to try to address this.

If we can't find a usable Sesotho word list, we have scripts to extract a list of all words used more than N times from a Wikipedia dump, and there is a Sesotho Wikipedia. It only has 1554 pages currently, but that's still going to be better than just 77 words.

I tried this and there are 11512 words which occur more than once - here's a list by frequency:

st.freq.txt

It looks like English words start to creep in more towards the end (though maybe some are actually loanwords). Having a few common foreign words isn't a big problem (a stemmer is likely to encounter such words in real use, so how it deals with them is of some interest) but it may well be that the threshold for inclusion should be more than 2.
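To make the thresholding idea concrete, here is a small Python sketch of counting word frequencies and keeping only words that reach a minimum count. It is not the project's actual extraction script: the function name `frequent_words`, the tokenising regular expression, and the lower-casing are all assumptions chosen for illustration, and a real Wikipedia dump would need its markup stripped first.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def frequent_words(text, min_count):
    """Return (word, count) pairs for words seen at least min_count times,
    ordered by descending count and then alphabetically."""
    counts = Counter(m.group(0).lower() for m in WORD_RE.finditer(text))
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


if __name__ == "__main__":
    sample = "motho batho motho the batho motho"
    for word, n in frequent_words(sample, min_count=2):
        print(n, word)
```

Raising `min_count` is the knob discussed above: "more than once" corresponds to `min_count=2`, and a stricter cut-off would filter out more of the rare foreign words at the tail of the list.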

ojwb · 2025-12-06T23:09:04Z

sesotho/voc.txt

@@ -0,0 +1,77 @@
+motho


We standardly sort voc.txt files which makes it easier to work with them (at least when they're large). You can run scripts/sort-by-voc sesotho which will sort this file and reorder output.txt to match.
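For anyone curious what "sort this file and reorder output.txt to match" amounts to, here is a minimal Python sketch of the idea. It is not the implementation of scripts/sort-by-voc; the function name `sort_paired` and the plain code-point sort order are assumptions. The key point is that line i of voc.txt and line i of output.txt travel together.

```python
def sort_paired(voc_lines, output_lines):
    """Sort vocabulary lines and apply the same reordering to output lines."""
    if len(voc_lines) != len(output_lines):
        raise ValueError("voc and output must have the same number of lines")
    # Sort on the vocabulary word only; sorted() is stable, so equal words
    # keep their original relative order.
    pairs = sorted(zip(voc_lines, output_lines), key=lambda pair: pair[0])
    return [voc for voc, _ in pairs], [out for _, out in pairs]


if __name__ == "__main__":
    # Placeholder stems, purely illustrative.
    voc, out = sort_paired(["motho", "batho"], ["stem-of-motho", "stem-of-batho"])
    print(voc)
    print(out)
```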

Added neccessary files for sesotho stemmer

12c9f4e

ojwb reviewed Dec 6, 2025

added larger sample of words, added the newline, removed capital word…

b104447

…s, sorted the words
