Using text hash as id to prevent document duplication #1000
tholor merged 11 commits into deepset-ai:master from lalitpagaria:doc_hash
Conversation
- `train_filename`: training filename
- `dev_filename`: development set filename, file to be used by model in eval step of training
- `test_filename`: test set filename, file to be used by model in test step after training
- `max_sample`: maximum number of input samples to convert. Can be used for debugging a smaller dataset.
Not sure why these changes are in this PR. I have not added these.
Thanks, @lalitpagaria. I will review it later today. BTW, this is #1000 🎉
tholor
left a comment
I like the implementation. Looks easy and solid.
One thing that is not clear to me yet: what's the behaviour if I now add a second "duplicate" document with the same hash? Will the existing doc be replaced with the new one, or will the new doc be ignored? Is this behavior consistent across all document stores? We can probably cover it by adding a few additional test cases and, ideally, a warning message in write_documents() to inform users about duplicates.
(Sorry didn't have time today to test the behavior myself)
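A minimal pytest sketch of the kind of duplicate test being suggested. The asserted outcome (only one document survives) is an assumption, since that is exactly the open question, and the import path is assumed to match the Haystack version of this era:

```python
# Sketch of a duplicate-write test; assumes duplicates collapse to one stored
# document, which is only one of the options (ignore vs. replace) under discussion.
from haystack.document_store.memory import InMemoryDocumentStore


def test_write_documents_with_duplicate_text():
    document_store = InMemoryDocumentStore()
    docs = [
        {"text": "Berlin is the capital of Germany."},
        {"text": "Berlin is the capital of Germany."},  # same text -> same hash id
    ]
    document_store.write_documents(docs)
    # With hash-based ids, the second write collides with the first.
    assert len(document_store.get_all_documents()) == 1
```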
I have added a test for that. But there is a consistency issue -
Even though I added the test, I think it is better to make the memory store consistent as well, i.e. throw an exception when writing a document with a duplicate id. Check here to see what happens when we write documents with duplicate ids: https://github.com/deepset-ai/haystack/runs/2442736307
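A rough sketch of the "throw an exception on duplicate ids" option for the in-memory store; `DuplicateDocumentError` and `InMemoryStoreSketch` are hypothetical names for illustration, not existing Haystack classes:

```python
class DuplicateDocumentError(ValueError):
    """Hypothetical error raised when a document id is written twice."""


class InMemoryStoreSketch:
    def __init__(self):
        self._documents = {}  # id -> document, stand-in for the real store's dict

    def write_documents(self, documents):
        for doc in documents:
            if doc["id"] in self._documents:
                # Mirror the unique-key constraint that the SQL-backed store enforces.
                raise DuplicateDocumentError(f"Document with id '{doc['id']}' already exists.")
            self._documents[doc["id"]] = doc
```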
Thanks for checking the behavior and adding the test. We should definitely make it consistent across doc stores. I think just throwing an exception is not an ideal user experience here. Imagine I add 100 docs in a batch via write_documents(). IMO it would be nicer to issue a warning that includes the problematic duplicate document and make sure that the rest of the documents get indexed correctly. What do you think @lalitpagaria @oryx1729 ?
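A sketch of the warn-and-continue behavior proposed here; `get_document_by_id` does exist on Haystack document stores, but treating it as the duplicate check (and passing docs as dicts with a precomputed `"id"`) is this sketch's assumption:

```python
import logging

logger = logging.getLogger(__name__)


def write_documents_skip_duplicates(store, documents):
    """Index what we can; warn about duplicates instead of failing the whole batch."""
    written_ids = []
    for doc in documents:
        if store.get_document_by_id(doc["id"]) is not None:
            logger.warning("Skipping duplicate document with id '%s': %s", doc["id"], doc)
            continue
        store.write_documents([doc])
        written_ids.append(doc["id"])
    return written_ids
```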
I agree with you @tholor.
In all cases write_documents should return the inserted document ids (the same goes for the REST API), so the user knows what was written and what was skipped. We can tackle this in a separate PR, as the resulting changes will be bigger; handling this in the ES, memory, and SQL based stores would also differ.
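A sketch of what the proposed return value could feed into on the REST side, so the caller learns what was written versus skipped; the response shape is illustrative, not the actual API:

```python
from typing import Dict, List


def build_write_response(requested_ids: List[str], inserted_ids: List[str]) -> Dict[str, List[str]]:
    """Illustrative REST response reporting inserted vs. skipped document ids."""
    return {
        "inserted": inserted_ids,
        "skipped": [doc_id for doc_id in requested_ids if doc_id not in inserted_ids],
    }


print(build_write_response(["a1", "b2", "c3"], ["a1", "b2"]))
# {'inserted': ['a1', 'b2'], 'skipped': ['c3']}
```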
I am fine with tackling the above behavior in a separate PR. However, let's at least make sure that the DocumentStores have consistent behavior in the meantime. So I'd suggest:
I agree this makes sense -
Implementing this is not easy, as Elasticsearch always returns a bulk exception even in the case of a network error during the transaction. Similarly, the SQL store can throw other constraint errors during the write_documents function.
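For context, the Elasticsearch Python client's bulk helper can be told not to raise on per-document failures, which is one way to surface duplicates while still indexing the rest; the index name and host below are placeholders:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

client = Elasticsearch(hosts=["localhost"])  # placeholder host

actions = [
    {"_op_type": "create", "_index": "document", "_id": "a1", "_source": {"text": "foo"}},
    {"_op_type": "create", "_index": "document", "_id": "a1", "_source": {"text": "foo"}},  # duplicate id
]

# With raise_on_error=False the helper returns (success_count, error_list)
# instead of raising BulkIndexError, so partial failures can be inspected.
success, errors = bulk(client, actions, raise_on_error=False)
print(success, errors)  # the duplicate "create" shows up as a version_conflict error
```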
tholor
left a comment
Ready to merge.
We will address the aforementioned limitations in a separate PR.
Proposed changes:
This PR is in response to a Slack discussion.
New Haystack users often encounter duplicate answers in their results. This happens because the same text passage gets ingested multiple times, and Haystack generates a new id via uuid on each ingestion. To prevent this, we will use a hash of the text as the document id. We also provide a way to customize the id: users can pass values such as the cleaned text, a file hash, a paragraph number, or a page number to be included in the hash. The fast hashing algorithm MurmurHash is used to generate a 128-bit hash; see the sketch below.
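A sketch of the hash-based id generation described above, using the `mmh3` package for 128-bit MurmurHash; the exact set of fields mixed into the hash here is illustrative:

```python
import mmh3


def generate_doc_id(text: str, id_hash_keys: list = None) -> str:
    # Hash the text plus any user-supplied values (cleaned text, file hash,
    # paragraph or page number) so that re-ingesting the same passage
    # deterministically yields the same id.
    final_text = text + " ".join(id_hash_keys or [])
    return "{:02x}".format(mmh3.hash128(final_text, signed=False))


# Re-ingesting identical text produces the same id -> the duplicate is detectable.
assert generate_doc_id("Berlin is the capital.") == generate_doc_id("Berlin is the capital.")
```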