@jeffhostetler

Sort the set of objects by packfile so that only one packfile needs
to be open at a time.

This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be closed and re-opened
many times.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
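
For readers following along, the reordering described above can be sketched roughly as follows. This is an illustration only, not the code from this change: `struct obj_pack_pair`, `cmp_by_pack()`, `sort_by_pack()` and `count_pack_switches()` are hypothetical names, and the sketch assumes each object can be described by its position in OID order plus an integer id of the packfile that holds it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pairing of an object's position with its packfile. */
struct obj_pack_pair {
	uint32_t pos;          /* position of the object in OID order */
	uint32_t pack_int_id;  /* integer id of the packfile holding it */
};

/* Order by packfile first; tie-break on position so the result is deterministic. */
static int cmp_by_pack(const void *a_, const void *b_)
{
	const struct obj_pack_pair *a = a_;
	const struct obj_pack_pair *b = b_;

	if (a->pack_int_id != b->pack_int_id)
		return a->pack_int_id < b->pack_int_id ? -1 : 1;
	if (a->pos != b->pos)
		return a->pos < b->pos ? -1 : 1;
	return 0;
}

/* Sort the pairs so that all objects from one packfile are adjacent. */
static void sort_by_pack(struct obj_pack_pair *pairs, size_t nr)
{
	qsort(pairs, nr, sizeof(*pairs), cmp_by_pack);
}

/* Count how often a walk over the array moves from one packfile to another. */
static size_t count_pack_switches(const struct obj_pack_pair *pairs, size_t nr)
{
	size_t i, switches = 0;

	for (i = 1; i < nr; i++)
		if (pairs[i].pack_int_id != pairs[i - 1].pack_int_id)
			switches++;
	return switches;
}
```

After sorting, a walk over the array switches packfiles at most once per packfile, whereas a walk in OID order may switch on nearly every object.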

@jeffhostetler
Author

On a repo with 3600 packfiles, run time went from 12 minutes to 26 seconds.

midx.c Outdated

Makes sense, and follows a pattern that exists elsewhere in this file (i.e., midx_repack()). Clearly results in a huge perf win by avoiding opening/closing pack files all the time.

midx.c Outdated

If I'm reading this correctly, this is an (optional) optimization that keeps the number of open pack files to a minimum. I'm assuming that without it, they would start being closed transparently once you reached some max threshold. Since you know they are sorted, it makes sense to do the optimization here.

Author

Yeah, the max file descriptor problem that I fixed yesterday showed that we were holding 2000+ packfiles open when we started running out. With that fix we still hold 2000+ open, but we close and re-open packfiles as necessary to do the random access.

So yeah, this fix kinda eliminates the need for the previous fix. But I'm keeping that one in for now since it is harmless and just seems like the correct thing to do.

packfile.c Outdated

This wouldn't be needed without the optimization above but I don't see any problem making this public.

Author

Right. The new verify loop completely verifies all objects in one packfile before moving to the next packfile (because of the sort). But when we hit 2000+ packfiles in the directory, visiting the next packfile requires us to free up an fd, and this triggers the LRU search in close_one_pack(). So by closing the previous packfile in that loop, we'll only have 1 packfile open and avoid all of the LRU searching (which is O(n^2)).
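
A rough model of that loop, for illustration only: the sketch below is not the code in midx.c or packfile.c. `struct fake_pack`, `struct walk_stats` and `walk_sorted_objects()` are hypothetical stand-ins, and opening or closing a pack is simulated with a flag instead of a real file descriptor. It assumes the input is already sorted by packfile and that every id is a valid index into `packs`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packfile handle. */
struct fake_pack {
	int is_open;
};

struct walk_stats {
	size_t opens;     /* how many times a pack had to be opened */
	size_t max_open;  /* peak number of packs open at the same time */
};

/*
 * Walk objects that are already sorted by pack id. When close_previous
 * is set, the pack we just finished is released before the next one is
 * opened, so no search for a victim pack is ever needed.
 */
static struct walk_stats walk_sorted_objects(const uint32_t *pack_ids, size_t nr,
					     struct fake_pack *packs,
					     int close_previous)
{
	struct walk_stats stats = { 0, 0 };
	size_t i, cur_open = 0;

	for (i = 0; i < nr; i++) {
		struct fake_pack *p = &packs[pack_ids[i]];

		/* Sorted input: once the pack id changes, the old pack is done. */
		if (close_previous && i > 0 && pack_ids[i] != pack_ids[i - 1]) {
			struct fake_pack *prev = &packs[pack_ids[i - 1]];
			if (prev->is_open) {
				prev->is_open = 0;
				cur_open--;
			}
		}

		if (!p->is_open) {
			p->is_open = 1;
			stats.opens++;
			cur_open++;
		}
		if (cur_open > stats.max_open)
			stats.max_open = cur_open;

		/* ... verify the object for entry i here ... */
	}
	return stats;
}
```

With `close_previous` set, the peak number of open packs stays at one no matter how many packfiles exist; without it, the peak grows with the number of packfiles until something else has to evict one.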

Member

@kewillford left a comment

Looks good to me and a great performance win. Would it be possible to multi-thread it now that we are only going through one pack at a time to speed it up even more?

Member

@jrbriggs left a comment

Question about the memory usage of the parallel data structure:

midx.c Outdated
Member

For very large repos, this could be asking for many GB, right? Should a failure to alloc cause a failure in midx validation?

Member

@jrbriggs Feb 28, 2019

Jeff and I talked offline and I misunderstood the number of objects in play. We should ask for less than 500MB, which should be serviceable.
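
As a back-of-the-envelope check on the allocation question, the size of such a parallel array can be estimated as below. This is a sketch under assumptions: the entry layout (two 32-bit fields, 8 bytes per object) and the names `struct obj_pack_pair` and `pairs_alloc_size()` are hypothetical, and the thread does not state the actual object count.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one entry in the parallel array: 8 bytes per object. */
struct obj_pack_pair {
	uint32_t pos;
	uint32_t pack_int_id;
};

/*
 * Compute the number of bytes needed for nr_objects entries.
 * Returns 0 on success, or -1 if the multiplication would overflow size_t,
 * so the caller can fail cleanly instead of under-allocating.
 */
static int pairs_alloc_size(size_t nr_objects, size_t *out_bytes)
{
	if (nr_objects > SIZE_MAX / sizeof(struct obj_pack_pair))
		return -1;
	*out_bytes = nr_objects * sizeof(struct obj_pack_pair);
	return 0;
}
```

Under that assumed layout, an illustrative 50 million objects would need 400,000,000 bytes, and a 500MB budget would cover roughly 62 million objects.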

Member

@jrbriggs left a comment

Approved.

@jeffhostetler
Author

@kewillford Yeah, that was my thought too. Now that we have the loop looking at a single packfile at a time, we should be able to thread that. I'll save that for another day though. :-)

Log multi-pack-index sub-command (cmd_mode).
Log number of objects and number of packfiles.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>