When trying to beat git verify-pack
Attempt 1
I remembered timings on a cold cache suggesting that git needs around 5:50 min to run verify-pack on the Linux kernel pack. However, it turns out that in a more controlled environment git is still considerably faster than us, even though we use an LRU cache and make quite efficient use of multiple cores.

Observation
Git uses a streaming pack approach which is optimized to apply deltas in the inverse direction, from a base towards the objects that depend on it. It works by
- decompressing all deltas
- applying all deltas that depend on a base, recursively (thus avoiding having to decompress deltas multiple times)
We work using a memory-mapped file, which is optimized for random access but won't be very fast for this kind of workload.
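The difference between the two approaches can be sketched with a toy model. This is only an illustration under loud assumptions: the `PACK` layout, the function names, and the "delta appends text to its base" rule are all made up for the example and do not model git's real delta format or our actual code. It counts how often an entry has to be decompressed in each style.

```python
from collections import defaultdict

# Toy pack (hypothetical layout): entry id -> (base id or None, payload).
# A "delta" here just appends text to its base; real git deltas are
# copy/insert instructions, which this sketch does not model.
PACK = {
    "a": (None, "base"),
    "b": ("a", "+1"),
    "c": ("b", "+2"),
    "d": ("b", "+3"),
}


def decompress(payload, counter):
    """Stand-in for zlib inflation; counts how often it is called."""
    counter[0] += 1
    return payload


def resolve_per_object(pack):
    """Random-access style: walk each object's delta chain back to its base."""
    counter = [0]
    out = {}
    for oid in pack:
        chain = []
        cur = oid
        while cur is not None:
            base, payload = pack[cur]
            chain.append(decompress(payload, counter))
            cur = base
        out[oid] = "".join(reversed(chain))
    return out, counter[0]


def resolve_base_first(pack):
    """Streaming style: decompress each entry once, apply deltas from the base down."""
    counter = [0]
    children = defaultdict(list)
    for oid, (base, _) in pack.items():
        children[base].append(oid)
    out = {}
    stack = [(oid, "") for oid in children[None]]
    while stack:
        oid, base_data = stack.pop()
        data = base_data + decompress(pack[oid][1], counter)
        out[oid] = data
        stack.extend((child, data) for child in children[oid])
    return out, counter[0]
```

On this toy pack both functions produce the same objects, but the per-object walk decompresses 9 times while the base-first walk decompresses 4 times, once per entry. A cache reduces the gap for the per-object style, but it cannot remove it for chains that do not fit.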
How to fix
Wait until we have implemented a streaming pack as well, then try again. That should give us the same algorithmic benefits, possibly paired with more efficient memory handling.
Git for some reason limits itself to 3 threads for this, whereas we do benefit from having more threads, so we could be faster for that reason alone.
The streaming (indexing) phase of reading a pack can be parallelised if the pack is on disk, and it should be easy to implement if the index data structure itself is thread-safe (though it might not be worth the complexity or memory overhead, let's see).
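To make the thread-count point concrete, here is a minimal sketch of resolving independent delta trees on a pool with a configurable number of workers. It assumes that trees rooted at different bases do not depend on each other, and it uses a hypothetical toy pack in which a delta simply appends text to its base. `PACK`, `resolve_tree` and `resolve_parallel` are made-up names for illustration, not part of git or of our code.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy pack (hypothetical layout): entry id -> (base id or None, payload).
PACK = {
    "a": (None, "base-a"),
    "b": ("a", "+1"),
    "c": ("b", "+2"),
    "x": (None, "base-x"),
    "y": ("x", "+9"),
}


def build_children(pack):
    """Map each base id to the entries that are deltas against it."""
    children = defaultdict(list)
    for oid, (base, _) in pack.items():
        children[base].append(oid)
    return dict(children)  # plain dict: workers only read from it


def resolve_tree(pack, children, root):
    """Resolve one base and everything that depends on it."""
    out = {}
    stack = [(root, "")]
    while stack:
        oid, base_data = stack.pop()
        data = base_data + pack[oid][1]
        out[oid] = data
        stack.extend((child, data) for child in children.get(oid, ()))
    return out


def resolve_parallel(pack, threads=4):
    """Resolve every delta tree, one task per base, on `threads` workers."""
    children = build_children(pack)
    roots = children.get(None, [])
    resolved = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each worker returns its own dict; merging happens on this thread,
        # so no shared mutable structure needs to be thread-safe.
        for partial in pool.map(lambda r: resolve_tree(pack, children, r), roots):
            resolved.update(partial)
    return resolved
```

Calling `resolve_parallel(PACK, threads=3)` mimics a fixed limit, while a larger value lets more trees be processed at once. In CPython the GIL prevents a real speedup for CPU-bound work like this, so the sketch shows only the structure, not the performance.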