Is your feature request related to a problem? Please describe.
In #1000 we introduced a hashing mechanism that derives a document's ID from its content (default: the text).
The current behaviour:
If you call write_documents() with a batch of documents and one of them already exists in the document store (determined by the ID), the request fails:
- Memory store - raises ValueError
- ES-based store - raises BulkIndexError
- SQL-based store - raises IntegrityError because of the UNIQUE constraint
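The root cause is the content-based ID from #1000: two documents with identical content hash to the same ID, so re-writing either one collides in every store. A minimal sketch of that behavior (the function name `content_id` and the use of SHA-256 are illustrative, not the actual Haystack implementation):

```python
import hashlib

def content_id(text: str) -> str:
    # Illustrative content-based ID: identical text always yields the same ID,
    # which is what triggers the duplicate errors listed above.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two documents with the same text collide on ID:
assert content_id("hello world") == content_id("hello world")
```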
Describe the solution you'd like
Add an argument handle_duplicates to write_documents() that allows users to change the behavior in case of duplicates:
- Ignore duplicates (with a warning) - default option
- Overwrite if the document exists
- Fail on duplicate
In all cases, write_documents() should return the IDs of the inserted documents (same for the REST API), so the user knows what was written and what was skipped.
Ideally, we also make the thrown exceptions more consistent across document stores.
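A rough sketch of how the proposed argument could behave, using a plain dict as a stand-in store. The option names ("ignore", "overwrite", "fail") and the shared `DuplicateDocumentError` exception are assumptions for illustration, not the final API:

```python
import warnings

class DuplicateDocumentError(Exception):
    """Hypothetical single exception type all stores could raise on duplicates."""

def write_documents(store: dict, documents: list, handle_duplicates: str = "ignore") -> list:
    """Write documents into `store`; return IDs actually inserted."""
    written = []
    for doc in documents:
        if doc["id"] in store:
            if handle_duplicates == "fail":
                raise DuplicateDocumentError(f"Duplicate document id: {doc['id']}")
            if handle_duplicates == "ignore":
                warnings.warn(f"Skipping duplicate document id: {doc['id']}")
                continue
            # "overwrite" falls through and replaces the stored document
        store[doc["id"]] = doc
        written.append(doc["id"])
    return written  # caller can diff against the input to see what was skipped
```

Returning the written IDs from every store (and from the REST API) would also give a natural place to unify the exception type across backends.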