This is not so much a bug as a set of general observations and recommendations about using this chunker. I'm putting it here since this chunker is widely copied by other chunker implementations in other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go; apologies if this is annoying. It comes out of some testing and analysis of chunker algorithms I did here:
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Observations related to this FastCDC implementation are:
- The `average` argument is not really the average block length. The actual average is very close to `minimum + average/2` (for reasons explained below; the sketch after this list makes the arithmetic concrete).
- This implementation changes the FastCDC "transition point" to `max(0, average - 1.5*minimum)`. This means the transition point is less than `minimum` unless `minimum < 0.4*average`.
- FastCDC gets most of its speed advantage from "cut-point-skipping" past `minimum`, and recommends `minimum = 0.5*average` or even `minimum = average`. This means that for the FastCDC-recommended `minimum` settings that give good speed, the first mask is never used and only the second mask is used.
- When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a normal exponential chunker with `average` set to half the value. An exponential chunker's actual average block length (excluding truncation from `maximum`) is `minimum + average`, hence this chunker's actual average being `minimum + average/2` (provided `maximum` is set large enough to minimize truncations).
- Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when you compare using the same actual average and minimum chunk sizes.
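To make the arithmetic above concrete, here is a minimal sketch in Python (purely illustrative; the function names are my own and not part of this library or its API) that evaluates the transition point formula and the approximate actual average for a few `minimum` settings:

```python
def transition_point(minimum, average):
    # Transition point as used by this implementation.
    return max(0, average - 1.5 * minimum)

def expected_average(minimum, average):
    # Approximate actual average chunk size, ignoring truncation by maximum.
    return minimum + average / 2

average = 64 * 1024
for minimum in (0, average // 4, average // 2, average - 1):
    t = transition_point(minimum, average)
    # The first mask only matters if the transition point lies past minimum,
    # i.e. if minimum < 0.4*average; otherwise cut-point-skipping jumps over it.
    print(f"minimum={minimum:6d}  transition={t:8.0f}  "
          f"first_mask_used={t > minimum}  "
          f"actual_average~={expected_average(minimum, average):8.0f}")
```

Running this shows the first mask dropping out of use once `minimum` reaches `0.4*average`, and the expected actual average sitting at `minimum + average/2` throughout.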
With all of that in mind, based on my testing, the best minimum/average/maximum settings for this chunker for optimal deduplication with good speed are `minimum=average/2` and `maximum>2.5*average`, giving an actual average chunk length of `average`. For faster chunking speed with slightly reduced deduplication, use `minimum=average` and `maximum>3*average`, giving an actual average chunk length of `1.5*average`. Unfortunately this chunker throws an error if `minimum >= average`, so it doesn't let you do that; however, you can get away with setting `minimum = average - 1`.
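As a rough illustration of those recommendations (again just a sketch of the arithmetic, not this library's API; the helper names are hypothetical), here is how the settings could be derived from a desired actual average chunk size:

```python
def dedup_settings(target):
    # Best deduplication with good speed: minimum=average/2, maximum>2.5*average.
    # Actual average ~= minimum + average/2 = average, so average = target.
    average = target
    minimum = average // 2
    maximum = int(2.5 * average) + 1
    return minimum, average, maximum

def fast_settings(target):
    # Faster chunking, slightly reduced deduplication: minimum=average,
    # maximum>3*average. Actual average ~= 1.5*average, so average = target/1.5.
    average = int(target / 1.5)
    minimum = average - 1  # minimum must be strictly less than average here.
    maximum = 3 * average + 1
    return minimum, average, maximum

print(dedup_settings(64 * 1024))
print(fast_settings(64 * 1024))
```

The `average - 1` in the second helper is just the workaround mentioned above for the `minimum >= average` error check.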