parse_wikipedia.py produces a very large file with a newer version of spacy #2

@vered1986

The original corpus in the paper was processed with spacy version 0.99. Running parse_wikipedia.py with a newer spacy version produces a much larger triplet file (over 11TB, versus ~900GB for the original). For now, the possible workarounds are:

  1. Use spacy version 0.99 - install using:
    pip install spacy==0.99
    python -m spacy.en.download all --force

  2. Limit parse_wikipedia.py to a specific vocabulary as in LexNET.
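
As a rough illustration of option 2, here is a minimal sketch of vocabulary filtering. It is not the code from parse_wikipedia.py or LexNET: the function names `load_vocabulary` and `filter_triplets` are hypothetical, and it assumes each triplet is an `(x, y, path)` tuple and that the vocabulary is given as one term per line.

```python
def load_vocabulary(lines):
    """Build a set of terms from an iterable of lines (one term per line).

    Hypothetical helper: the real vocabulary format may differ.
    """
    return {line.strip() for line in lines if line.strip()}


def filter_triplets(triplets, vocabulary):
    """Lazily yield only the (x, y, path) triplets whose two terms are both
    in the vocabulary, so the full triplet list is never held in memory."""
    for x, y, path in triplets:
        if x in vocabulary and y in vocabulary:
            yield x, y, path


if __name__ == "__main__":
    vocab = load_vocabulary(["cat\n", "animal\n", "\n"])
    # Made-up triplets, for illustration only.
    sample = [("cat", "animal", "X/is/Y"), ("cat", "xyz", "X/and/Y")]
    for triplet in filter_triplets(sample, vocab):
        print("\t".join(triplet))
```

Filtering while the triplets are being generated, rather than after the full file is written, keeps the output size bounded by the vocabulary instead of by the corpus.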

I'm investigating what changed in the newer spacy version and writing a memory-efficient version of parse_wikipedia.py, in case the older spacy version is the buggy one and the number of paths should in fact be much larger.

Thanks @christos-c for finding this bug!
