Add options for handling duplicate documents (skip, fail, overwrite) #1088
tholor merged 10 commits into deepset-ai:master from akkefa:handle_duplicate_documents
@tholor Please review this pull request.
Nice progress... I think we're almost there :)
Left a few minor comments.
One bigger thing that I was wondering about: Speed of handle_duplicate_documents. I don't believe a single get_documents_by_id() is that expensive, but we need to make sure that it doesn't become a bottleneck for indexing. We can check this in our next benchmarking and iterate on it in case it's really an issue.
:param documents: A list of Haystack Document objects.
:param duplicate_documents: Handle duplicates document based on parameter options.
                            Parameter options : ( 'skip','overwrite','fail')
                            skip (default option): Ignore the duplicates documents
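For readers following the thread, here is a minimal sketch of how the three options in the docstring above could behave. It is not the code from this PR: the plain-dict store, the dict-shaped documents, the `write_documents` signature, and `DuplicateDocumentError` are all hypothetical, and the `duplicate_documents` argument is left without a default on purpose, since the default is what the comments here are deciding.

```python
from typing import Dict, List

VALID_POLICIES = ("skip", "overwrite", "fail")


class DuplicateDocumentError(ValueError):
    """Hypothetical error raised when a duplicate ID is found under 'fail'."""


def write_documents(
    store: Dict[str, dict],
    documents: List[dict],
    duplicate_documents: str,
) -> List[str]:
    """Write documents into `store`, a plain dict keyed by document ID.

    Returns the IDs that were actually written.
    Assumption: duplicates are detected by ID only, and duplicates
    inside the incoming batch itself are not treated specially.
    """
    if duplicate_documents not in VALID_POLICIES:
        raise ValueError(f"duplicate_documents must be one of {VALID_POLICIES}")

    # One batched membership check for all incoming IDs,
    # instead of one store lookup per document.
    incoming_ids = [doc["id"] for doc in documents]
    existing_ids = {doc_id for doc_id in incoming_ids if doc_id in store}

    # 'fail': abort before anything is written.
    if existing_ids and duplicate_documents == "fail":
        raise DuplicateDocumentError(
            f"Documents already exist: {sorted(existing_ids)}"
        )

    written = []
    for doc in documents:
        if doc["id"] in existing_ids and duplicate_documents == "skip":
            continue  # 'skip': leave the stored version untouched
        store[doc["id"]] = doc  # new document, or 'overwrite' of an existing one
        written.append(doc["id"])
    return written
```

With a store that already holds document "a", writing "a" and "b" under 'skip' stores only "b"; under 'overwrite' both are written and "a" is replaced, including its meta fields; under 'fail' nothing is written and the error is raised.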
Wondering if we want to make skip or overwrite the default option here. Any thoughts @lalitpagaria @oryx1729?
In my view, 'overwrite' would be a good default. The user gets a warning, and their latest/updated docs are still written to the doc store without any extra changes on their side.
With the skip option, they can't move forward until they delete the old documents.
From the comment above, I assume overwrite will be the default option.
I think overwrite could be a good default value. I assume it'd overwrite the meta fields if they're different from the original?
Haystack still has room for improvement in both speed and architecture. Once the criteria for the duplicate-handling functionality are approved, we will look into optimization.
tholor left a comment
Awesome. I think this is ready to be merged. Thanks for addressing the comments so fast 👍
Proposed changes:
Status (please check what you already did):