
faster speed for decompressSequencesLong #2614

Merged: Cyan4973 merged 1 commit into dev from dlong8 on May 5, 2021
Conversation

@Cyan4973 (Contributor) commented May 5, 2021

Speed is improved by using a deeper prefetching pipeline, increased from 4 to 8 slots.

This change substantially improves decompression speed when there are long-distance offsets.
Example with enwik9 compressed at level 22:
gcc-9: 947 -> 1039 MB/s
clang-10: 884 -> 946 MB/s

I also checked the "cold dictionary" scenario, with largeNbDicts,
and found a smaller benefit, around 2%
(measurements are noisier for this scenario).

This is a follow-up to #2547, though it's kept separate because in this case the benefits are much more clear-cut.
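To illustrate the idea, here is a minimal, hypothetical sketch of a prefetch pipeline of the kind the PR describes: a prefetch cursor runs a fixed number of sequences ahead of the execute cursor, issuing a hint for each future match source so the memory is arriving while older sequences are being executed. All names (`Seq`, `decodePipelined`, etc.) are illustrative, not zstd's; zstd hides `__builtin_prefetch`-style hints behind its own portability macros.

```c
#include <stddef.h>
#include <string.h>

#define PIPELINE_SLOTS 8   /* this PR deepens the pipeline from 4 to 8 */

typedef struct {
    size_t litLength;    /* literals to copy before the match */
    size_t matchLength;  /* bytes to copy from earlier output */
    size_t offset;       /* backward distance of the match source */
} Seq;

/* Execute one sequence: copy literals, then the (possibly overlapping) match. */
static size_t executeSeq(unsigned char* out, size_t oPos,
                         const unsigned char* lit, size_t* litPos, const Seq* s)
{
    memcpy(out + oPos, lit + *litPos, s->litLength);
    *litPos += s->litLength;
    oPos += s->litLength;
    for (size_t i = 0; i < s->matchLength; i++)  /* byte-wise: offset may be < length */
        out[oPos + i] = out[oPos + i - s->offset];
    return oPos + s->matchLength;
}

/* Prefetch the match source of sequence s, whose output will start at pos. */
static size_t prefetchSeq(const unsigned char* out, size_t pos, const Seq* s)
{
    pos += s->litLength;
    if (s->offset <= pos)  /* source lies within already-planned output */
        __builtin_prefetch(out + pos - s->offset, 0 /* read */, 1);
    return pos + s->matchLength;  /* advance the prefetch cursor */
}

/* Pipelined driver: the prefetch cursor runs PIPELINE_SLOTS sequences
 * ahead of the execute cursor, hiding memory latency for far offsets. */
static size_t decodePipelined(unsigned char* out,
                              const unsigned char* lit,
                              const Seq* seqs, size_t nbSeq)
{
    size_t oPos = 0, litPos = 0, prefetchPos = 0;
    size_t const primed = nbSeq < PIPELINE_SLOTS ? nbSeq : PIPELINE_SLOTS;

    for (size_t i = 0; i < primed; i++)          /* fill the pipeline */
        prefetchPos = prefetchSeq(out, prefetchPos, &seqs[i]);

    for (size_t i = 0; i < nbSeq; i++) {
        oPos = executeSeq(out, oPos, lit, &litPos, &seqs[i]);
        if (i + PIPELINE_SLOTS < nbSeq)          /* keep the pipeline full */
            prefetchPos = prefetchSeq(out, prefetchPos, &seqs[i + PIPELINE_SLOTS]);
    }
    return oPos;
}
```

A deeper pipeline gives each prefetch more time to complete before its data is needed, which is why going from 4 to 8 slots helps most when offsets are long (far memory, high latency).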

@senhuang42

What are the scenarios in which this prefetching might be left unused/wasted?

@Cyan4973 (Contributor, Author) commented May 5, 2021

What are the scenarios in which this prefetching might be left unused/wasted?

The decoder only prefetches memory regions
that are actually going to be copied later for match operations.

That being said, in some cases, when a match's offset is really small,
the source region may not yet be filled (at the time prefetching is issued).

That should not matter much because it means this memory region is very fresh,
hence likely already in L1, and would have been in L1 anyway, even without prefetching.

So we could say that, in this case, the unconditional prefetching was "useless".
However, branching on the speculative presence of a memory region within L1
is way more expensive than merely prefetching unconditionally inside the hot loop.
So it's preferable to always prefetch, even when a memory region is likely already present in L1.
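To make that trade-off concrete, here is a small, purely illustrative sketch (not zstd's code) contrasting the unconditional hint used in the hot loop with a hypothetical branchy variant that tries to skip "probably already in L1" regions. Both functions return whether a hint was issued, just to make the behavioral difference observable; the L1 size threshold is an assumption for illustration.

```c
#include <stddef.h>

/* Unconditional: always issue the hint. Prefetching a line that is
 * already in L1 is effectively free, and the hot loop stays branch-free. */
static int prefetchAlways(const unsigned char* out, size_t pos, size_t offset)
{
    __builtin_prefetch(out + pos - offset, 0, 1);
    return 1;  /* a hint is issued every time */
}

/* Hypothetical alternative: skip the hint for small (recent) offsets,
 * guessing the source is still in L1. The guess saves almost nothing
 * when right, and the data-dependent branch mispredicts on compressed
 * data, costing far more than a redundant prefetch would. */
static int prefetchIfFar(const unsigned char* out, size_t pos, size_t offset)
{
    size_t const l1SizeGuess = 32 * 1024;  /* assumed L1 data-cache size */
    if (offset <= l1SizeGuess)
        return 0;  /* skipped: source assumed resident */
    __builtin_prefetch(out + pos - offset, 0, 1);
    return 1;
}
```

This is why the decoder keeps the `prefetchAlways`-style behavior: a redundant hint is cheap, while a speculative branch in the hottest loop is not.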

@Cyan4973 Cyan4973 merged commit fed8589 into dev May 5, 2021
@Cyan4973 Cyan4973 deleted the dlong8 branch December 9, 2021 00:13
