-
Notifications
You must be signed in to change notification settings - Fork 106
midx: verify: group objects by packfile to speed up object verification #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
midx: verify: group objects by packfile to speed up object verification #124
Conversation
|
On a repo with 3600 packfiles, run time went from 12 minutes to 26 seconds. |
7f7d840 to
81ba82c
Compare
midx.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, and follows a pattern that exists elsewhere in this file (ie midx_repack()). Clearly results in a huge perf win by avoiding opening/closing pack files all the time.
midx.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm reading this correctly, this is an (optional) optimization that will keep the pack files open to a minimum. I'm assuming without it, they would start being closed transparently as you reached some max threshold. Since you know they are sorted, makes sense to do the optimization here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the problem with max file descriptors that i fixed yesterday said we were holding 2000+ packfiles open when we started running out. fixing that caused us to still hold 2000+ open, but close and open packfiles as necessary to do the random access.
So yeah, this fix kinda eliminates the need for the previous fix. But i'm keeping that one in for now since it is harmless and just seems like the correct thing to do.
packfile.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wouldn't be needed without the optimization above but I don't see any problem making this public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. the new verify loop completely verifies all objects in one packfile before moving to the next packfile (because of the sort). But when we hit 2000+ packfiles in the directory, visiting the next packfile requires us to free up a fd, and
this triggers the LRU search in close_one_pack(). So by closing the previous packfile in that loop, we'll only have 1
packfile open and avoid all of the LRU searching (which is O(n^2)).
kewillford
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me and a great performance win. Would it be possible to multi-thread it now that we are only going through one pack at a time to speed it up even more?
jrbriggs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about the memory usage of the parallel data structure:
midx.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For very large repos, this could be asking for many GB, right? Should a failure to alloc cause a failure in midx validation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeff and I talked offline and I misunderstood the number of objects in play. We should ask for less than 500MB, which should be serviceable.
jrbriggs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
|
@kewillford Yeah, that was my thought too. Now that we have the loop looking at a single packfile at a time, we should be able to thread that. I'll save that for another day though. :-) |
Sort the set of objects by packfile so that only one packfile needs to be open at a time. This is a performance improvement. Previously, objects were verified in OID order. This essentially requires all packfiles to be open at the same time. If the number of packfiles exceeds the open file limit, packfiles would be closed and re-opened many times. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
81ba82c to
e7fe2d4
Compare
Log multi-pack-index sub-command (cmd_mode). Log number of objects and number of packfiles. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
e7fe2d4 to
4a4f2f4
Compare
midx: verify: group objects by packfile to speed up object verification
Sort the set of objects by packfile so that only one packfile needs
to be open at a time.
This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be closed and re-opened
many times.
Signed-off-by: Jeff Hostetler jeffhost@microsoft.com