Conversation
|
What are the scenarios in which this prefetching might be left unused/wasted? |
The decoder only prefetches memory regions that are about to be read as match sources. That being said, in some cases, when a match's offset is really small, the source bytes were produced only moments earlier. That should not matter much, because it means this memory region is very fresh and almost certainly still in cache. So we could say that, in this case, the unconditional prefetching was "useless": it brought in nothing new. |
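The answer above notes that zstd issues the prefetch unconditionally, and that it is merely redundant when the offset is tiny. As a hypothetical alternative (not zstd's actual code; the names `maybe_prefetch_match` and `CACHELINE` are invented for illustration), a decoder could guard the prefetch on the offset:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Illustrative only: when the match offset is smaller than a cache line,
 * the source bytes were just written and are certainly still in cache,
 * so prefetching them fetches nothing new. */
static inline void maybe_prefetch_match(const uint8_t *op, size_t offset)
{
    if (offset >= CACHELINE) {
        /* Far source: the line may be cold, so the prefetch is worthwhile. */
        __builtin_prefetch(op - offset);
    }
    /* Small offset: the source overlaps recently written data; skip. */
}
```

Whether the branch is worth it is an open question: a wasted prefetch costs roughly one instruction, while a mispredicted branch can cost more, which is one plausible reason to keep the prefetch unconditional.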
This PR uses a deeper prefetching pipeline, increased from 4 to 8 slots.
This change substantially improves decompression speed when there are long-distance offsets.
Example with enwik9 compressed at level 22:
gcc-9 : 947 -> 1039 MB/s
clang-10: 884 -> 946 MB/s
I also checked the "cold dictionary" scenario, with largeNbDicts, and found a smaller benefit, around ~2% (measurements are more noisy for this scenario).
This is a follow-up to #2547,
though it's separate because in this case the benefits are much more clear-cut.
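The "prefetch pipeline" idea described above can be sketched as follows. This is a simplified illustration, not zstd's implementation: the names `Seq`, `decode_matches`, and `PF_DEPTH` are invented, and real decoders must also handle overlapping copies (offset < length) byte-wise rather than with `memmove`. The point is that each iteration prefetches the match source `PF_DEPTH` sequences ahead, so by the time that sequence is decoded its cache line has had time to arrive.

```c
#include <stddef.h>
#include <string.h>

#define PF_DEPTH 8  /* deeper pipeline: 8 slots instead of 4 */

typedef struct { size_t offset; size_t length; } Seq;

/* Copy matches back-to-back into dst, prefetching PF_DEPTH sequences
 * ahead of the one currently being decoded. */
static void decode_matches(char *dst, size_t pos, const Seq *seqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DEPTH < n) {
            /* Issue the prefetch early. The destination position of the
             * future sequence is approximated by the current one; for a
             * sketch this is close enough to warm the right lines. */
            __builtin_prefetch(dst + pos - seqs[i + PF_DEPTH].offset);
        }
        /* Non-overlapping copy for simplicity (assumes offset >= length). */
        memmove(dst + pos, dst + pos - seqs[i].offset, seqs[i].length);
        pos += seqs[i].length;
    }
}
```

Going from 4 to 8 slots simply widens the gap between the prefetch and the dependent load, which matters most when offsets are long and the source lines are genuinely cold.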