Skip to content

Different optimizations for report#197

Merged
smacker merged 10 commits intosrc-d:masterfrom
smacker:speedup_report
Feb 14, 2019
Merged

Different optimizations for report#197
smacker merged 10 commits intosrc-d:masterfrom
smacker:speedup_report

Conversation

@smacker
Copy link
Copy Markdown
Contributor

@smacker smacker commented Feb 11, 2019

Please look at each commit message for details.

@smacker
Copy link
Copy Markdown
Contributor Author

smacker commented Feb 11, 2019

Python stylecheck is failing but it also failing on master (weird, all prs passed it. Most probably something changed in the linter). So it shouldn't block this pr.

Comment thread src/main/python/community-detector/community_detector.py Outdated
var bucket = List[Int]()

getHashValues(hashTable).foreach { case FileHash(sha1, value) =>
val elId = elementIds(sha1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption here is that it's not possible that there's an element id in the non-first hashtable that is actually not present in the first hashtable, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. WMH produces one long hash for a file. Then we split it to "bands" so each file has number of hashtables rows with partial hash in it.

- makes it faster
- allows to pass number of buckets that doesn't much number of cc

Signed-off-by: Maxim Sukharev <max@smacker.ru>
Bucket with only 1 element means that element isn't connected to
anything. Currently such elements are filtered only when we build graph
but we can remove it much earlier which would improve performance a lot.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
It repeats a little bit of code for the first hashtable but more
performant because it loops only once both for building elementsIds map
and for buckets generation.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
Signed-off-by: Maxim Sukharev <max@smacker.ru>
Order of keys in map is random. But python code relies on indexes as
element id.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
previous commits introduced filtering of elements that appear only in
one bucket. But it breaks python logic.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
because now python receives only elements appeared in more than 1 bucket
it's possible that bucket id and element ids in scala will collide.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
find dups in scala instead of a new query for each hash

Signed-off-by: Maxim Sukharev <max@smacker.ru>
@smacker
Copy link
Copy Markdown
Contributor Author

smacker commented Feb 13, 2019

I have reworked the PR a lot. Please, another pass. (sorry Marvin)
//cc @se7entyse7en @carlosms

P.S. On test dataset report time went down from 60+ hours (I stopped it on 3rd day of running) to 15 minutes.

Comment thread src/main/python/community-detector/community_detector.py Outdated
Comment thread src/main/python/community-detector/community_detector.py Outdated
Comment thread src/main/python/community-detector/community_detector.py Outdated
Comment thread src/main/python/community-detector/community_detector.py Outdated
Comment thread src/main/scala/tech/sourced/gemini/Report.scala Outdated
@smacker smacker changed the title Diffrerent optimizations for report Different optimizations for report Feb 13, 2019
@se7entyse7en
Copy link
Copy Markdown

@smacker unfortunately I barely know what gemini does, so as in the previous reviews I just reviewed things that are narrower in scope. Maybe it worth for me taking some time to also take a deeper look at gemini for future reviews, but I'm afraid that if I do this now it would take much time and would delay these PRs for the POC.

Copy link
Copy Markdown
Contributor

@carlosms carlosms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏

Comment thread src/main/python/community-detector/community_detector.py Outdated
for el_id, bucket in id_to_buckets:
indices[pos:(pos + len(bucket))] = bucket
pos += len(bucket)
indptr[el_id + 1:] = pos
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is a performance gain we can squeeze here, avoiding to write always up to the end of the array.
Something like this (did not test this, please double check)

    prev_el_id = 0
    prev_pos = 0

    for el_id, bucket in id_to_buckets:
        indices[pos:(pos + len(bucket))] = bucket
        pos += len(bucket)
        indptr[prev_el_id+2:el_id] = prev_pos

        prev_el_id = el_id
        prev_pos = post
    
    indptr[prev_el_id+1:] = prev_pos

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! It's a good idea. I re-wrote it a little bit different and added one more test to be sure it works correctly for an edge case.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
Signed-off-by: Maxim Sukharev <max@smacker.ru>
@smacker
Copy link
Copy Markdown
Contributor Author

smacker commented Feb 14, 2019

@se7entyse7en thanks anyway for your valuable review. Carlos knows internal of gemini better and he approved the code so I think we are good to merge now without you going deep into details.

@smacker smacker merged commit d4eb34f into src-d:master Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants