
Improve handling of duplicate documents during write_documents() #1069

@tholor

Description


Is your feature request related to a problem? Please describe.
In #1000 we introduced a hashing mechanism that derives the document ID from the document content (by default, the text).

The current behaviour:

If you call write_documents() with a batch of documents and one of them already exists in the document store (determined by its ID), the request fails:

  • In-memory store - raises ValueError
  • Elasticsearch-based store - raises BulkIndexError
  • SQL-based store - raises IntegrityError because of the UNIQUE constraint on the ID column
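To make the duplicate scenario concrete, here is a minimal sketch of content-based ID generation. The helper name and the choice of md5 are illustrative assumptions; the exact hash used in #1000 may differ:

```python
import hashlib

def content_based_id(text: str) -> str:
    # Illustrative only: derive a deterministic document ID from the text,
    # similar in spirit to the hashing introduced in #1000.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

doc_a = {"text": "Haystack is an NLP framework."}
doc_b = {"text": "Haystack is an NLP framework."}  # same content, e.g. re-ingested file

id_a = content_based_id(doc_a["text"])
id_b = content_based_id(doc_b["text"])
# Identical content yields the identical ID, so writing both documents
# triggers the duplicate-ID errors listed above.
assert id_a == id_b
```

Because the ID is a pure function of the content, re-running an ingestion pipeline over the same files inevitably produces duplicates, which is what makes the error handling below relevant.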

Describe the solution you'd like
Add an argument handle_duplicates to write_documents() that allows users to change the behavior in case of duplicates:

  • Ignore duplicates (with a warning) - default option
  • Overwrite if the document already exists
  • Fail on duplicates

In all cases, write_documents() should return the IDs of the documents that were actually inserted (likewise via the REST API), so the user knows what was written and what was skipped.

Ideally, we also make the thrown exceptions more consistent across document stores.
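The proposed behavior could look roughly like the sketch below. This is not Haystack's actual implementation; the dict-backed store, the option strings, and the function signature are illustrative assumptions for discussion:

```python
import logging

logger = logging.getLogger(__name__)

def write_documents(store: dict, documents: list, handle_duplicates: str = "ignore") -> list:
    """Hypothetical sketch: `store` stands in for a document store, mapping ID -> document."""
    written_ids = []
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in store:
            if handle_duplicates == "fail":
                raise ValueError(f"Duplicate document ID: {doc_id}")
            if handle_duplicates == "ignore":
                logger.warning("Skipping duplicate document %s", doc_id)
                continue
            # handle_duplicates == "overwrite": fall through and replace the stored document
        store[doc_id] = doc
        written_ids.append(doc_id)
    # Returning only the inserted IDs tells the caller what was written vs. skipped.
    return written_ids

store = {"a": {"id": "a", "text": "old"}}
new_docs = [{"id": "a", "text": "new"}, {"id": "b", "text": "other"}]
assert write_documents(store, new_docs) == ["b"]  # duplicate "a" skipped with a warning
```

Raising a single, store-agnostic exception type in the "fail" case (here ValueError) is also one way to address the consistency point above.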
