DocBin.to_bytes fails with a "ValueError: bytes object is too large" #5219

@kuppulur

Description

How to reproduce the behaviour

I am trying to train a sense2vec model from scratch on a corpus of around 5 million sentences (lines/docs). Since the first step is to parse the corpus and create a `.spacy` file, I ran the parse script on it, and the code crashes at `doc_bin_bytes = doc_bin.to_bytes()` with a `ValueError` saying the bytes object is too large. Can somebody help me with this issue? Thanks
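One way to sidestep the size limit (a sketch, not an official fix) is to avoid serializing all 5 million docs into a single `DocBin` and instead shard the corpus into several smaller `.spacy` files. The batching helper below is plain Python; the commented usage assumes an `nlp` pipeline and a `texts` iterable, and the shard size of 100,000 is an arbitrary choice:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Applying this in the parse script (sketch; `nlp` and `texts` are assumed):
#
#     from spacy.tokens import DocBin
#
#     for i, batch in enumerate(batched(texts, 100_000)):
#         doc_bin = DocBin()
#         for doc in nlp.pipe(batch):
#             doc_bin.add(doc)
#         with open(f"corpus_{i}.spacy", "wb") as f:
#             f.write(doc_bin.to_bytes())  # each shard stays far below the limit
```

Each shard can later be loaded with `DocBin().from_bytes(...)` individually, so no single bytes object ever needs to hold the whole corpus.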

Your Environment

  • Operating System: Ubuntu 16.04.6 LTS
  • Python Version Used: 3.5.2
  • spaCy Version Used: 2.2.4
  • Environment Information:

Metadata


Labels

  • feat / doc — Feature: Doc, Span and Token objects
  • feat / serialize — Feature: Serialization, saving and loading
  • resolved — The issue was addressed / answered
