The original corpus in the paper was processed using spacy version 0.99. Using a newer spacy version produces a much larger triplet file (over 11TB, compared to ~900GB for the original file). For now, the possible solutions are:
- Use spacy version 0.99 - install using:

  ```
  pip install spacy==0.99
  python -m spacy.en.download all --force
  ```

- Limit parse_wikipedia.py to a specific vocabulary, as in LexNET (see the sketch below this list).
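  For the second option, here is a minimal sketch of what a vocabulary filter could look like. The `vocab.txt` file, the tab-separated `x / y / path` triplet layout, and the function names are assumptions for illustration only, not part of the actual parse_wikipedia.py:

  ```python
  # Hypothetical sketch: keep only triplets whose terms appear in a given vocabulary.
  # Assumes vocab.txt has one lowercased term per line and the triplet file is
  # tab-separated as x<TAB>y<TAB>path; adjust to the actual file formats.

  def load_vocabulary(path):
      """Load the set of allowed terms, one per line."""
      with open(path, encoding='utf-8') as f:
          return {line.strip().lower() for line in f if line.strip()}

  def keep_triplet(x, y, vocabulary):
      """Keep an (x, path, y) triplet only if both terms are in the vocabulary."""
      return x.lower() in vocabulary and y.lower() in vocabulary

  if __name__ == '__main__':
      vocabulary = load_vocabulary('vocab.txt')
      # Stream the triplet file line by line so it never has to fit in memory.
      with open('triplets.txt', encoding='utf-8') as infile, \
              open('triplets_filtered.txt', 'w', encoding='utf-8') as outfile:
          for line in infile:
              parts = line.rstrip('\n').split('\t')
              if len(parts) != 3:
                  continue
              x, y, _path = parts
              if keep_triplet(x, y, vocabulary):
                  outfile.write(line)
  ```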
I'm working on figuring out what changed in the newer spacy version, and on writing a memory-efficient version of parse_wikipedia.py, in case the older spacy version is the buggy one and the number of paths should in fact be much larger.
Thanks @christos-c for finding this bug!