Different optimizations for report #197
smacker merged 10 commits into src-d:master from smacker:speedup_report
Conversation
Python stylecheck is failing, but it is also failing on master (weird, all PRs passed it before; most probably something changed in the linter). So it shouldn't block this PR.
```scala
var bucket = List[Int]()
```

```scala
getHashValues(hashTable).foreach { case FileHash(sha1, value) =>
  val elId = elementIds(sha1)
```
The assumption here is that an element id appearing in a non-first hashtable is always present in the first hashtable, right?
Yes. WMH produces one long hash per file. Then we split it into "bands", so each file has one row per hashtable, each holding a partial hash.
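As a rough illustration of that banding idea, here is a minimal Python sketch (the function name and band size are assumptions, not gemini's actual code): one long signature per file is cut into fixed-size slices, and each slice becomes the file's partial hash in one hashtable.

```python
def split_into_bands(signature, band_size):
    # Split one long WMH signature into fixed-size bands; band i is the
    # file's partial hash in hashtable i.
    n_bands = len(signature) // band_size
    return [tuple(signature[i * band_size:(i + 1) * band_size])
            for i in range(n_bands)]

bands = split_into_bands(list(range(12)), band_size=4)
assert len(bands) == 3  # 3 hashtables, one partial hash per file in each
```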
- makes it faster
- allows passing a number of buckets that doesn't match the number of cc

Signed-off-by: Maxim Sukharev <max@smacker.ru>
A bucket with only 1 element means that the element isn't connected to anything. Currently such elements are filtered only when we build the graph, but we can remove them much earlier, which improves performance a lot.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
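A minimal sketch of the early filtering described in this commit, assuming buckets are plain lists of element ids (the real code is Scala; this is an illustration, not the PR's implementation):

```python
def drop_singleton_buckets(buckets):
    # A bucket with a single element connects nothing, so it can be
    # dropped long before the graph is built.
    return [bucket for bucket in buckets if len(bucket) > 1]

assert drop_singleton_buckets([[1, 2, 3], [4], [5, 6]]) == [[1, 2, 3], [5, 6]]
```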
It repeats a little bit of code for the first hashtable, but it is more performant because it loops only once, both for building the elementIds map and for bucket generation.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
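The idea, as a hedged Python sketch (the actual code is Scala; names here are assumptions): a single pass over the first hashtable fills both the id map and the buckets, instead of looping over it twice.

```python
from collections import defaultdict

def first_hashtable_pass(file_hashes):
    # file_hashes: iterable of (sha1, band_hash) pairs from the first hashtable.
    element_ids = {}             # sha1 -> dense element id
    buckets = defaultdict(list)  # band_hash -> ids of files sharing it
    for sha1, band_hash in file_hashes:
        el_id = element_ids.setdefault(sha1, len(element_ids))
        buckets[band_hash].append(el_id)
    return element_ids, list(buckets.values())
```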
Signed-off-by: Maxim Sukharev <max@smacker.ru>
The order of keys in a map is random, but the Python code relies on indexes as element ids.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
Previous commits introduced filtering of elements that appear in only one bucket, but that breaks the Python logic.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
Because Python now receives only elements that appear in more than one bucket, it's possible that bucket ids and element ids in Scala will collide.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
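One way to avoid such a collision is to keep the two id ranges disjoint; a minimal sketch of that scheme (the offsetting approach is an assumption, not necessarily what the PR does):

```python
def bucket_node_id(bucket_index, n_elements):
    # Element ids occupy 0..n_elements-1; bucket ids start right after,
    # so the two ranges can never overlap.
    return n_elements + bucket_index

assert bucket_node_id(0, n_elements=100) == 100
```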
Find dups in Scala instead of running a new query for each hash.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
I have reworked the PR a lot. Please take another pass. (Sorry, Marvin.)

P.S. On the test dataset, report time went down from 60+ hours (I stopped it on the 3rd day of running) to 15 minutes.
@smacker unfortunately I barely know what
```python
for el_id, bucket in id_to_buckets:
    indices[pos:(pos + len(bucket))] = bucket
    pos += len(bucket)
    indptr[el_id + 1:] = pos
```
Maybe there is a performance gain we can squeeze here by avoiding writing all the way to the end of the array on every iteration.
Something like this (I did not test this, please double-check):
```python
prev_el_id = 0
prev_pos = 0
for el_id, bucket in id_to_buckets:
    indices[pos:(pos + len(bucket))] = bucket
    pos += len(bucket)
    # fill indptr only up to the current element id, not to the end
    indptr[prev_el_id + 1:el_id + 1] = prev_pos
    prev_el_id = el_id
    prev_pos = pos
indptr[prev_el_id + 1:] = prev_pos
```
Thanks! It's a good idea. I rewrote it a little differently and added one more test to be sure it works correctly for an edge case.
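For context, a self-contained sketch of how the incremental fill behaves on the edge case of gaps and trailing empty elements (toy data; `id_to_buckets` and the shapes are assumptions, not the PR's test):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Element 1 has no bucket, and element 3 is a trailing empty element --
# the case the final `indptr[prev_el_id + 1:] = prev_pos` line covers.
id_to_buckets = [(0, [0, 1]), (2, [1])]
n_elements, n_buckets = 4, 2

indices = np.zeros(sum(len(b) for _, b in id_to_buckets), dtype=np.int64)
indptr = np.zeros(n_elements + 1, dtype=np.int64)

pos, prev_el_id, prev_pos = 0, 0, 0
for el_id, bucket in id_to_buckets:
    indices[pos:(pos + len(bucket))] = bucket
    pos += len(bucket)
    indptr[prev_el_id + 1:el_id + 1] = prev_pos
    prev_el_id, prev_pos = el_id, pos
indptr[prev_el_id + 1:] = prev_pos

assert list(indptr) == [0, 2, 2, 3, 3]
matrix = csr_matrix((np.ones(len(indices)), indices, indptr),
                    shape=(n_elements, n_buckets))
```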
Signed-off-by: Maxim Sukharev <max@smacker.ru>
Signed-off-by: Maxim Sukharev <max@smacker.ru>
@se7entyse7en thanks anyway for your valuable review. Carlos knows the internals of gemini better and he approved the code, so I think we are good to merge now without you going deep into the details.
Please look at each commit message for details.