better handling of repeated chunks to speed up extracting sparse files

When chunking sparse files, the chunker converges on an "idle tone" for runs of zeroes of roughly two chunks or more, i.e. it keeps emitting the same chunk over and over.

When extracting, these chunks are fetched over and over again, and also decrypted, checked, etc., which makes extraction slower than it has to be.

Suggestions:
1. An LRUCache (chunk-id,) -> (length,) whose express purpose is to store all-zero chunks when --sparse is used. This needs a bit of work in extract_file and in the DownloadPipeline. As usual, preload_ids may make this harder to implement (hence this issue, so it doesn't get buried in my stack of notes). If we figure out #1665, this shouldn't be hard, since the problem description regarding preload is basically the same.
2. An entirely different way to do this would be to make it work transparently in DownloadPipeline, by collapsing runs of the same chunk ID and noting the number of repetitions (i.e. run-length coding), then yielding the repeated chunks _locally_. On second thought, this may be a much better implementation path.
   
   **Preload still has to be considered**, but on the plus side this works for any kind of repetition, not just zeroes or sparse files, and DownloadPipeline generally feels like a more apt abstraction layer for this optimization.
   
   Preload may be solvable differently than in #1665: do the same RLE already in fetch_many, so that the preload is never submitted for repeated chunks in the first place.
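
A minimal sketch of suggestion 2, to make the idea concrete. This is not the actual DownloadPipeline code: `collapse_runs`, `fetch_many_rle` and the `fetch_one` callback are hypothetical names, and `fetch_one` stands in for whatever fetches, decrypts and verifies a single chunk. Preload is not modelled here.

```python
from itertools import groupby
from typing import Callable, Iterable, Iterator, Tuple


def collapse_runs(chunk_ids: Iterable[bytes]) -> Iterator[Tuple[bytes, int]]:
    """Run-length encode consecutive identical chunk IDs as (chunk_id, count)."""
    for chunk_id, run in groupby(chunk_ids):
        yield chunk_id, sum(1 for _ in run)


def fetch_many_rle(
    chunk_ids: Iterable[bytes],
    fetch_one: Callable[[bytes], bytes],  # hypothetical: fetch + decrypt + check one chunk
) -> Iterator[bytes]:
    """Yield one chunk per input ID, but call fetch_one only once per run."""
    for chunk_id, count in collapse_runs(chunk_ids):
        data = fetch_one(chunk_id)
        # Repetitions within a run are served locally, without another fetch.
        for _ in range(count):
            yield data
```

With this shape the caller still sees exactly one chunk per ID in the original order, so extract_file would not need to know about the optimization. Only consecutive repeats are collapsed; a chunk ID that reappears after a different chunk is fetched again.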


better handling of repeated chunks to speed up extracting sparse files #1678
