This is not so much a bug as a set of general observations and recommendations about using this chunker. I'm putting it here since this chunker is widely copied by other chunker implementations in other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go; apologies if this is annoying. It comes out of some testing and analysis of chunker algorithms I did here:
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Observations related to this FastCDC implementation are:
- The `average` argument is not really the average block length. The actual average is very close to `minimum + average/2` (for reasons explained below; the sketch after this list makes the arithmetic concrete).
- This implementation changes the FastCDC "transition point" to `max(0, average - 1.5*minimum)`. This means the transition point is less than `minimum` unless `minimum < 0.4*average`.
- FastCDC gets most of its speed advantage from "cut-point-skipping" past `minimum`, and recommends `minimum = 0.5*average` or even `minimum = average`. This means that for the FastCDC-recommended `minimum` settings that give good speed, the first mask is never used and only the second mask is used.
- When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a normal exponential chunker with `average` set to half the value. An exponential chunker's actual average block length (excluding truncation from `maximum`) is `minimum + average`, hence this chunker's actual average being `minimum + average/2` (provided `maximum` is set large enough to minimize truncations).
- Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when you compare using the same actual average and minimum chunk sizes.
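To make the arithmetic above concrete, here is a minimal sketch in Python (purely illustrative; the function names are my own and not part of this library or its API) that evaluates the transition point formula and the approximate actual average for a few `minimum` settings:

```python
def transition_point(minimum, average):
    # Transition point as used by this implementation.
    return max(0, average - 1.5 * minimum)

def expected_average(minimum, average):
    # Approximate actual average chunk size, ignoring truncation by maximum.
    return minimum + average / 2

average = 64 * 1024
for minimum in (0, average // 4, average // 2, average - 1):
    t = transition_point(minimum, average)
    # The first mask only matters if the transition point lies past minimum,
    # i.e. if minimum < 0.4*average; otherwise cut-point-skipping jumps over it.
    print(f"minimum={minimum:6d}  transition={t:8.0f}  "
          f"first_mask_used={t > minimum}  "
          f"actual_average~={expected_average(minimum, average):8.0f}")
```

Running this shows the first mask dropping out of use once `minimum` reaches `0.4*average`, and the expected actual average sitting at `minimum + average/2` throughout.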
With all of that in mind, based on my testing, the best minimum/average/maximum settings for this chunker for optimal deduplication with good speed are `minimum=average/2` and `maximum>2.5*average`, giving an actual average chunk length of `average`. For faster chunking speed with slightly reduced deduplication, use `minimum=average` and `maximum>3*average`, giving an actual average chunk length of `1.5*average`. Unfortunately this chunker throws an error if `minimum >= average`, so it doesn't let you do that; however, you can get away with setting `minimum = average - 1`.
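As a rough illustration of those recommendations (again just a sketch of the arithmetic, not this library's API; the helper names are hypothetical), here is how the settings could be derived from a desired actual average chunk size:

```python
def dedup_settings(target):
    # Best deduplication with good speed: minimum=average/2, maximum>2.5*average.
    # Actual average ~= minimum + average/2 = average, so average = target.
    average = target
    minimum = average // 2
    maximum = int(2.5 * average) + 1
    return minimum, average, maximum

def fast_settings(target):
    # Faster chunking, slightly reduced deduplication: minimum=average,
    # maximum>3*average. Actual average ~= 1.5*average, so average = target/1.5.
    average = int(target / 1.5)
    minimum = average - 1  # minimum must be strictly less than average here.
    maximum = 3 * average + 1
    return minimum, average, maximum

print(dedup_settings(64 * 1024))
print(fast_settings(64 * 1024))
```

The `average - 1` in the second helper is just the workaround mentioned above for the `minimum >= average` error check.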