
Improve handling of duplicate documents during write_documents() #1069

@tholor

Description


Is your feature request related to a problem? Please describe.
In #1000 we introduced a hashing mechanism that derives the document ID from the document content (by default, the text).

The current behaviour:

If you call write_documents() with a batch of documents and one of them already exists in the document store (determined by its ID), the request fails:

  • In-memory store - raises ValueError
  • Elasticsearch-based store - raises BulkIndexError
  • SQL-based store - raises IntegrityError because of the UNIQUE constraint on the ID column
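To make the duplicate scenario concrete, here is a minimal sketch of content-based ID generation. The helper name and the choice of md5 are illustrative assumptions; the exact hash used in #1000 may differ:

```python
import hashlib

def content_based_id(text: str) -> str:
    # Illustrative only: derive a deterministic document ID from the text,
    # similar in spirit to the hashing introduced in #1000.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

doc_a = {"text": "Haystack is an NLP framework."}
doc_b = {"text": "Haystack is an NLP framework."}  # same content, e.g. re-ingested file

id_a = content_based_id(doc_a["text"])
id_b = content_based_id(doc_b["text"])
# Identical content yields the identical ID, so writing both documents
# triggers the duplicate-ID errors listed above.
assert id_a == id_b
```

Because the ID is a pure function of the content, re-running an ingestion pipeline over the same files inevitably produces duplicates, which is what makes the error handling below relevant.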

Describe the solution you'd like
Add an argument handle_duplicates to write_documents() that allows users to change the behavior in case of duplicates:

  • Ignore duplicates (with a warning) - default option
  • Overwrite if the document already exists
  • Fail on duplicates

In all cases, write_documents() should return the IDs of the documents that were actually inserted (likewise via the REST API), so the user knows what was written and what was skipped.

Ideally, we also make the thrown exceptions more consistent across document stores.
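The proposed behavior could look roughly like the sketch below. This is not Haystack's actual implementation; the dict-backed store, the option strings, and the function signature are illustrative assumptions for discussion:

```python
import logging

logger = logging.getLogger(__name__)

def write_documents(store: dict, documents: list, handle_duplicates: str = "ignore") -> list:
    """Hypothetical sketch: `store` stands in for a document store, mapping ID -> document."""
    written_ids = []
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in store:
            if handle_duplicates == "fail":
                raise ValueError(f"Duplicate document ID: {doc_id}")
            if handle_duplicates == "ignore":
                logger.warning("Skipping duplicate document %s", doc_id)
                continue
            # handle_duplicates == "overwrite": fall through and replace the stored document
        store[doc_id] = doc
        written_ids.append(doc_id)
    # Returning only the inserted IDs tells the caller what was written vs. skipped.
    return written_ids

store = {"a": {"id": "a", "text": "old"}}
new_docs = [{"id": "a", "text": "new"}, {"id": "b", "text": "other"}]
assert write_documents(store, new_docs) == ["b"]  # duplicate "a" skipped with a warning
```

Raising a single, store-agnostic exception type in the "fail" case (here ValueError) is also one way to address the consistency point above.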
