@MrLebjane

Added the necessary files as requested.

sesotho/voc.txt Outdated
Sebata
pholosa
Rongoa
Mothusi
Member

I think lower-casing the capitalised words in this file will fix the CI failures. (It's arguably a bug that e.g. the Ada version of stemwords doesn't appear to lower-case its input, but I'm not sure I know enough Ada to fix that.)
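As a rough sketch of that fix (not part of the project's tooling; the function names here are made up for illustration), lower-casing the whole file while keeping one word per line could look like this:

```python
def lowercase_vocab(text: str) -> str:
    """Return the vocabulary text with every word lower-cased.

    Line endings are preserved, so each word stays on its own line.
    """
    return "".join(line.lower() for line in text.splitlines(keepends=True))


def lowercase_vocab_file(path: str) -> None:
    """Rewrite the file at `path` in place with lower-cased contents.

    newline="" stops Python from translating line endings on read or write.
    """
    with open(path, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(lowercase_vocab(text))
```

For example, `lowercase_vocab_file("sesotho/voc.txt")` would rewrite the file from this PR in place.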

sesotho/voc.txt Outdated
qetoa
bitsoa
boikarabelo
boikakaso No newline at end of file
Member

GitHub doesn't show it here for some reason, but judging from the CI test failures the final line is missing its newline character.
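A minimal sketch of checking for and repairing this, working on raw bytes so nothing else in the file changes (the helper names are hypothetical, not something the repository provides):

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Append a newline if non-empty data does not already end with one."""
    if data and not data.endswith(b"\n"):
        return data + b"\n"
    return data


def fix_file(path: str) -> bool:
    """Add the missing final newline to `path`; return True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    fixed = ensure_trailing_newline(original)
    if fixed != original:
        with open(path, "wb") as f:
            f.write(fixed)
        return True
    return False
```

Running `fix_file("sesotho/voc.txt")` a second time would return `False`, since the function leaves files that already end in a newline untouched.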

sesotho/voc.txt Outdated
@@ -0,0 +1,77 @@
motho
batho
Member

This seems much too short for a sample vocabulary. What's appropriate depends on the language: highly inflected languages have many more forms of each stem, so a longer list is appropriate, and some languages (e.g. English) simply have larger vocabularies. To give an idea:

    4000 nepali/voc.txt
   20895 norwegian/voc.txt
   21386 basque/voc.txt
   21642 french/voc.txt
   23830 danish/voc.txt
   28378 spanish/voc.txt
   29881 hungarian/voc.txt
   29996 serbian/voc.txt
   30738 swedish/voc.txt
   32016 portuguese/voc.txt
   35033 german/voc.txt
   35495 italian/voc.txt
   42603 lovins/voc.txt
   42603 porter/voc.txt
   42649 english/voc.txt
   45670 dutch_porter/voc.txt
   45670 dutch/voc.txt
   48897 catalan/voc.txt
   49785 russian/voc.txt
   49962 finnish/voc.txt
   57470 estonian/voc.txt
   62566 polish/voc.txt
   64586 indonesian/voc.txt
   65118 hindi/voc.txt
   84456 esperanto/voc.txt
   86105 lithuanian/voc.txt
   87642 romanian/voc.txt
   90727 greek/voc.txt
   96325 turkish/voc.txt
   99996 yiddish/voc.txt
  101780 armenian/voc.txt
  186280 irish/voc.txt
  443271 tamil/voc.txt
 9196214 arabic/voc.txt.gz

So I'd probably expect 10000+ words (the Nepali list is probably shorter than is ideal).

The Arabic list was apparently made by generating all theoretically possible forms, so it contains a lot of words that will never occur in practice. It's really much too long, which causes practical problems: GitHub doesn't support such large files (hence the need to store it gzipped), and the tests take ages for slower target languages. There's #31 to try to address this.

If we can't find a usable Sesotho word list, we have scripts to extract a list of all words used more than N times from a Wikipedia dump, and there is a Sesotho Wikipedia. It only has 1554 pages currently, but that's still going to be better than just 77 words.
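To illustrate the idea only (this is not the project's actual script, and it takes plain text rather than a real Wikipedia dump), here is a sketch of extracting all words used more than N times. The tokenisation is an assumption: a word is any run of Unicode letters, so apostrophes and hyphens split words, and everything is lower-cased before counting.

```python
import re
from collections import Counter

# Assumed tokenisation: runs of Unicode letters (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text: str) -> Counter:
    """Count lower-cased words in `text`."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


def words_used_more_than(text: str, n: int) -> list[str]:
    """Return the sorted list of words occurring more than `n` times."""
    counts = count_words(text)
    return sorted(word for word, count in counts.items() if count > n)
```

A real run would first need the article text pulled out of the dump's markup, which this sketch leaves out.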

Member

I tried this and there are 11512 words which occur more than once - here's a list by frequency:

st.freq.txt

It looks like English words start to creep in more towards the end (though maybe some are actually loanwords). Having a few common foreign words isn't a big problem (a stemmer is likely to encounter such words in real use, so how it deals with them is of some interest), but it may well be that the threshold for inclusion should be more than 2.
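One way to pick a threshold is to see how the list size shrinks as the threshold rises. The sketch below assumes the counts are already loaded into a word-to-count mapping; the layout of st.freq.txt isn't shown here, so parsing it is left out, and the function name is hypothetical.

```python
def vocab_size_by_threshold(
    counts: dict[str, int], thresholds: list[int]
) -> dict[int, int]:
    """Map each threshold t to the number of words occurring at least t times."""
    return {
        t: sum(1 for count in counts.values() if count >= t)
        for t in thresholds
    }
```

With a threshold of 2 this counts the words that occur more than once, so comparing it against 3, 5 or 10 gives a feel for how quickly the rarer (and possibly foreign) words drop out.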

sesotho/voc.txt Outdated
@@ -0,0 +1,77 @@
motho
Member

We normally keep voc.txt files sorted, which makes them easier to work with (at least when they're large). You can run scripts/sort-by-voc sesotho, which will sort this file and reorder output.txt to match.
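For anyone curious what "reorder output.txt to match" involves, here is a sketch of the idea. It is not the implementation of scripts/sort-by-voc: it assumes line i of output.txt is the expected stem for line i of voc.txt, and it sorts by plain code-point order, which may differ from the collation the real script uses.

```python
def sort_pairs(
    voc_lines: list[str], out_lines: list[str]
) -> tuple[list[str], list[str]]:
    """Sort the words and carry each expected output along with its word."""
    if len(voc_lines) != len(out_lines):
        raise ValueError("voc.txt and output.txt must have the same number of lines")
    # Sort on the word only; sorted() is stable, so duplicate words keep their order.
    pairs = sorted(zip(voc_lines, out_lines), key=lambda pair: pair[0])
    return [word for word, _ in pairs], [stem for _, stem in pairs]
```

Sorting the two files independently would break the pairing, which is why the words and their expected outputs have to be moved together.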
