The original corpus in the paper was processed using spacy version 0.99. Using a newer spacy version produces a much larger triplet file (over 11TB, compared to ~900GB for the original file). For now, the possible solutions are:
- Use spacy version 0.99 - install using:

  ```
  pip install spacy==0.99
  python -m spacy.en.download all --force
  ```

- Limit parse_wikipedia.py to a specific vocabulary, as in LexNET (see the sketch below this list).
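  For the second option, here is a minimal sketch of what a vocabulary filter could look like. The `vocab.txt` file, the tab-separated `x / y / path` triplet layout, and the function names are assumptions for illustration only, not part of the actual parse_wikipedia.py:

  ```python
  # Hypothetical sketch: keep only triplets whose terms appear in a given vocabulary.
  # Assumes vocab.txt has one lowercased term per line and the triplet file is
  # tab-separated as x<TAB>y<TAB>path; adjust to the actual file formats.

  def load_vocabulary(path):
      """Load the set of allowed terms, one per line."""
      with open(path, encoding='utf-8') as f:
          return {line.strip().lower() for line in f if line.strip()}

  def keep_triplet(x, y, vocabulary):
      """Keep an (x, path, y) triplet only if both terms are in the vocabulary."""
      return x.lower() in vocabulary and y.lower() in vocabulary

  if __name__ == '__main__':
      vocabulary = load_vocabulary('vocab.txt')
      # Stream the triplet file line by line so it never has to fit in memory.
      with open('triplets.txt', encoding='utf-8') as infile, \
              open('triplets_filtered.txt', 'w', encoding='utf-8') as outfile:
          for line in infile:
              parts = line.rstrip('\n').split('\t')
              if len(parts) != 3:
                  continue
              x, y, _path = parts
              if keep_triplet(x, y, vocabulary):
                  outfile.write(line)
  ```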
I'm working on figuring out what changed in the newer spacy version, and on writing a memory-efficient version of parse_wikipedia.py, in case the older spacy version is the buggy one and the number of paths should in fact be much larger.
Thanks @christos-c for finding this bug!