Arabic test data size #31

@ojwb

The voc.txt for Arabic is much larger than for any other language, at 149MB. The next largest is tamil/voc.txt, which is only 14MB. In fact, the total for the other 32 languages is only 34MB.

This causes some practical problems:

It means the testsuite takes much longer to run. This is compounded by the Arabic stemmer being one of the more complex, so it also does more work per word than most. It is particularly a problem in target languages where the generated stemmers are slower (notably Python), where crunching through such a large test vocabulary takes quite a long time. This has resulted in the THIN_FACTOR mechanism in the snowball testsuite, which means we only actually test one word in N (3 by default) for Arabic with Python.

GitHub imposes a limit of 100MB on the size of a single file in a git repo, so currently we have gzipped voc.txt for Arabic. That requires special handling in the testsuite in the snowball repo. This also makes ad-hoc testing a bit more complicated - pulling out words matching a pattern with text processing tools needs an extra zcat or similar step.
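
To make the two workarounds above concrete, here is a minimal Python sketch of reading a vocabulary file that may or may not be gzipped, with optional thinning. The names `iter_voc` and `grep_voc` are hypothetical, and it assumes that thinning means keeping every Nth line; that is only one reading of "one word in N" and is not necessarily how the snowball testsuite selects words.

```python
import gzip
import re
from itertools import islice


def iter_voc(path, thin_factor=1):
    """Yield words (one per line) from path, keeping one line in thin_factor.

    Assumption: files ending in .gz are gzip-compressed UTF-8 text,
    anything else is plain UTF-8 text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        # islice with a step keeps lines 0, N, 2N, ...
        for line in islice(f, 0, None, thin_factor):
            yield line.rstrip("\n")


def grep_voc(path, pattern):
    """Return the words in path that match the regular expression pattern."""
    rx = re.compile(pattern)
    return [word for word in iter_voc(path) if rx.search(word)]
```

With a helper like this, `grep_voc("arabic/voc.txt.gz", pattern)` would replace the zcat-plus-grep pipeline, and `iter_voc(path, 3)` would mimic the default thinning for Python.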

I'd previously vaguely assumed this was just down to Arabic having a lot of different forms, but seeing the factor of 10 compared to the next language made me dig into this a bit more and I found this about the list that voc.txt comes from:

Arabic word list for spell checking containing 9 million Arabic words. The words are automatically generated from the AraComLex open-source finite state transducer. The entire list is validated against Microsoft Word spell checker.

If I read that correctly, the list is essentially an attempt to mechanically generate all possible forms of all Arabic words. That means it presumably contains many forms which are technically valid Arabic words, but will never be used in practice. That's actually unhelpful for evaluating the stemmer: if someone proposes a change to the Arabic stemming algorithm, how it affects the stemming of words which never occur in practice is irrelevant, and how it affects really rare words is much less important than how it affects words in common usage.

For other languages, the vocabularies are typically generated by frequency analysis of a large text corpus, and a threshold frequency is chosen to try to capture a vocabulary of words we might care about.
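
As a rough illustration of that approach (not the actual script used for the other languages), the sketch below counts tokens in a text and keeps those at or above a threshold frequency. The function name `vocabulary`, the `\w+` tokenisation, and the choice of U+0621..U+064A as "letters from the Arabic alphabet" are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: treat U+0621..U+064A as the Arabic alphabet for this sketch.
# The real scripts may use a different definition.
ARABIC_ONLY = re.compile(r"[\u0621-\u064A]+")
TOKEN = re.compile(r"\w+")


def vocabulary(text, min_count=1):
    """Return the sorted Arabic-only words occurring at least min_count times."""
    counts = Counter(
        token for token in TOKEN.findall(text) if ARABIC_ONLY.fullmatch(token)
    )
    return sorted(word for word, n in counts.items() if n >= min_count)
```

Raising `min_count` is the knob that trades vocabulary size against coverage of rare words.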

My conclusion is that it would be better all-round to significantly reduce the size of the Arabic test vocabulary. Cutting it by at least 1/3 (149MB down to under 100MB) would eliminate the need to gzip, but we should probably cut it by rather more than that.

I downloaded a dump of ar.wikipedia.org and produced a list of all words formed only of letters from the Arabic alphabet which occur at least once, which is 1.8 million words (so only 20% of the size of the current vocabulary):

$ scripts/wikipedia-dump-to-freq /mnt/data/scratch/wikipedia-dumps/arwiki-latest-pages-articles.xml.bz2 1 arabic > ar.freq
[...]
$ wc -l ar.freq
1845730 ar.freq

Curiously the overlap seems to be only 616,907 words:

$ scripts/freq-to-voc < ar.freq > ar1.voc
$ gzip -dc arabic/voc.txt.gz|LANG=C sort -u|LANG=C comm -12 - ar1.voc > arcommon.voc
$ wc -l arcommon.voc 
616907 arcommon.voc
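
For cross-checking, the same overlap can be computed with sets instead of `sort -u` and `comm -12`. This is only a sketch: `read_words` and `overlap` are hypothetical names, and it assumes both files hold one UTF-8 word per line. If ar1.voc has one line per entry in ar.freq, then 616,907 of 1,845,730 is roughly 33%.

```python
import gzip


def read_words(path):
    """Read a one-word-per-line file (optionally .gz) into a set."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def overlap(path_a, path_b):
    """Return (sorted common words, fraction of path_b's words also in path_a)."""
    words_a = read_words(path_a)
    words_b = read_words(path_b)
    common = words_a & words_b
    fraction = len(common) / len(words_b) if words_b else 0.0
    return sorted(common), fraction
```

Calling `overlap("arabic/voc.txt.gz", "ar1.voc")` would give the shared words and the share of the Wikipedia-derived list that the existing vocabulary covers.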

The threshold of a single occurrence in ar.wikipedia.org likely means ar1.voc includes a lot of typos, and a list intended for spellchecking may not have many (or any) proper nouns. Even so, it surprises me that 2/3 of the words in Arabic Wikipedia are apparently not in a seemingly very comprehensive spellchecker dictionary.

I wondered if some of the difference was due to Unicode normalisation, but both lists seem to be unchanged by converting to NFC form.
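
A check along those lines can be sketched with Python's standard `unicodedata` module; the helper name `non_nfc_words` is hypothetical. It returns the words that NFC normalisation would change, so an empty result means the list is already in NFC form.

```python
import unicodedata


def non_nfc_words(words):
    """Return the words that are not already in Unicode NFC form."""
    return [w for w in words if unicodedata.normalize("NFC", w) != w]
```

For example, alef followed by combining maddah (U+0627 U+0653) would be reported, because NFC composes it to U+0622, whereas a word already using U+0622 would not.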

@assem-ch I'd really appreciate your input on this. Happy to provide you with the files from my testing so far and/or to run additional tests if you can think of anything useful.
