
[RFC] Rebalance compression levels#2692

Merged
senhuang42 merged 3 commits into facebook:dev from senhuang42:rebalance_clevel on Aug 6, 2021

Conversation

@senhuang42

@senhuang42 senhuang42 commented Jun 1, 2021

This PR rebalances some of the cParams for the middle compression levels to smooth out the ratio/speed curve a little. The methodology: let paramgrill generate some initial candidates, then tune manually against the shape of the old curve, keeping memory usage mostly in check and keeping wLog monotonically increasing. The file used was silesia.tar (I will check against other files too, but this is the main one for tuning).

Currently, this only changes the > 256KB params.

I've provided some figures for how ratio and speed changes as we move through the compression levels.

>256K
Old curve 1.4.9:

 4#silesia.tar       : 211957760 ->  65507109 (3.236), 178.1 MB/s ,1030.8 MB/s 
 5#silesia.tar       : 211957760 ->  63995214 (3.312), 100.8 MB/s ,1011.5 MB/s R: +2.3%, speed: -43%
 6#silesia.tar       : 211957760 ->  62897118 (3.370),  80.3 MB/s ,1027.2 MB/s R: +1.2%, speed: -20%
 7#silesia.tar       : 211957760 ->  61341367 (3.455),  58.2 MB/s ,1100.3 MB/s R: +2.5%, speed: -27.5%
 8#silesia.tar       : 211957760 ->  60761795 (3.488),  46.9 MB/s ,1129.9 MB/s R: +0.9%, speed: -19.5%
 9#silesia.tar       : 211957760 ->  60191375 (3.521),  35.4 MB/s ,1139.7 MB/s R: +0.9%, speed: -25%
10#silesia.tar       : 211957760 ->  59537903 (3.560),  30.5 MB/s ,1130.5 MB/s R: +1.1%, speed: -14%
11#silesia.tar       : 211957760 ->  59244498 (3.578),  25.0 MB/s ,1131.5 MB/s R: +0.5%, speed: -18%
12#silesia.tar       : 211957760 ->  58778847 (3.606),  17.3 MB/s ,1142.3 MB/s R: +0.78%, speed: -30%
13#silesia.tar       : 211957760 ->  58144787 (3.645),  12.6 MB/s ,1126.6 MB/s R: +1%, speed: -29%

Ratio/speed for 1.5.0 release (for reference):

 4#silesia.tar       : 211957760 ->  65507109 (3.236), 179.5 MB/s , 984.0 MB/s 
 5#silesia.tar       : 211957760 ->  63807763 (3.322), 125.9 MB/s , 972.6 MB/s 
 6#silesia.tar       : 211957760 ->  62981592 (3.365), 120.7 MB/s , 993.3 MB/s 
 7#silesia.tar       : 211957760 ->  61485353 (3.447),  85.1 MB/s ,1060.9 MB/s 
 8#silesia.tar       : 211957760 ->  60918801 (3.479),  67.8 MB/s ,1088.6 MB/s 
 9#silesia.tar       : 211957760 ->  59932279 (3.537),  55.5 MB/s ,1096.2 MB/s 
10#silesia.tar       : 211957760 ->  59299234 (3.574),  51.6 MB/s ,1091.6 MB/s 
11#silesia.tar       : 211957760 ->  59157938 (3.583),  47.2 MB/s ,1090.5 MB/s 
12#silesia.tar       : 211957760 ->  58644580 (3.614),  36.8 MB/s ,1104.6 MB/s

Proposed ratio/speed:

 4#silesia.tar       : 211957760 ->  65507109 (3.236), 176.7 MB/s , 982.9 MB/s 
 5#silesia.tar       : 211957760 ->  62473463 (3.393), 118.4 MB/s ,1067.1 MB/s R: +4.8%, speed: -33%
 6#silesia.tar       : 211957760 ->  61461315 (3.449),  90.2 MB/s ,1008.8 MB/s R: +1.6%, speed: -24%
 7#silesia.tar       : 211957760 ->  60459438 (3.506),  71.7 MB/s ,1069.5 MB/s R: +1.2%, speed: -21%
 8#silesia.tar       : 211957760 ->  59989973 (3.533),  57.9 MB/s ,1083.3 MB/s R: +0.7%, speed: -20%
 9#silesia.tar       : 211957760 ->  59707605 (3.550),  51.7 MB/s ,1093.6 MB/s R: +0.4%, speed: -11%
10#silesia.tar       : 211957760 ->  59157938 (3.583),  46.7 MB/s ,1090.4 MB/s R: +0.9%, speed: -10%
11#silesia.tar       : 211957760 ->  58644580 (3.614),  37.1 MB/s ,1098.2 MB/s R: +0.8% speed: -21%
12#silesia.tar       : 211957760 ->  58590098 (3.618),  35.0 MB/s ,1098.8 MB/s R: +0.1% speed: -6%
13#silesia.tar       : 211957760 ->  58093348 (3.649),  12.3 MB/s ,1118.1 MB/s

Notes:

  • The main issue with the new proposed curve is that levels 11 and 12 are nearly identical. I'll have to play around a bit more with the parameters to distribute the gains more evenly (paramgrill was mostly useful for levels 5-8), but I'm putting this PR up now for comments and suggestions.
  • I've also toyed around with LDM and block splitting, but LDM actually hurts compression ratio when wLog ~= 22, and we don't want to artificially increase wLog to 27 just for the sake of LDM.
  • The speed dropoff from level 12 to 13 is not really possible to bridge - row hash doesn't get much slower even at extreme parameter settings. I think this mostly signals that btlazy2 has some room for improvement, speed-wise.

The smaller-srcSize parameter tables also get some slight modifications to smooth out the curve and fix redundant levels. There is less of an emphasis on shifting towards compression ratio there, though, since that shift was already present.

<=256K
1.5.0:

 3#silesia.tar       : 211957760 ->  69920109 (3.031), 171.5 MB/s , 922.1 MB/s 
 4#silesia.tar       : 211957760 ->  68090094 (3.113), 124.2 MB/s , 994.1 MB/s 
 5#silesia.tar       : 211957760 ->  67413125 (3.144), 118.1 MB/s ,1012.2 MB/s 
 6#silesia.tar       : 211957760 ->  66219074 (3.201),  83.9 MB/s ,1058.9 MB/s 
 7#silesia.tar       : 211957760 ->  65481211 (3.237),  67.0 MB/s ,1000.1 MB/s 
 8#silesia.tar       : 211957760 ->  65039023 (3.259),  53.7 MB/s ,1026.1 MB/s 
 9#silesia.tar       : 211957760 ->  64715335 (3.275),  45.5 MB/s ,1041.0 MB/s 
10#silesia.tar       : 211957760 ->  64715335 (3.275),  45.5 MB/s ,1041.0 MB/s 
11#silesia.tar       : 211957760 ->  64254016 (3.299),  17.8 MB/s ,1055.5 MB/s

new proposal:

 3#silesia.tar       : 211957760 ->  69920109 (3.031), 171.4 MB/s , 921.7 MB/s (dfast)
 4#silesia.tar       : 211957760 ->  67558139 (3.137), 118.3 MB/s ,1012.1 MB/s 
 5#silesia.tar       : 211957760 ->  67086391 (3.159), 113.0 MB/s ,1024.2 MB/s 
 6#silesia.tar       : 211957760 ->  66786627 (3.174), 101.6 MB/s ,1033.8 MB/s 
 7#silesia.tar       : 211957760 ->  65987628 (3.212),  76.1 MB/s , 980.5 MB/s 
 8#silesia.tar       : 211957760 ->  65481211 (3.237),  67.0 MB/s ,1001.3 MB/s 
 9#silesia.tar       : 211957760 ->  65039023 (3.259),  53.7 MB/s ,1027.3 MB/s 
10#silesia.tar       : 211957760 ->  64715335 (3.275),  45.5 MB/s ,1042.2 MB/s 
11#silesia.tar       : 211957760 ->  64254016 (3.299),  17.8 MB/s ,1056.9 MB/s (btlazy2)

<=128K
1.5.0:

 4#silesia.tar       : 211957760 ->  71259831 (2.974), 168.4 MB/s , 910.2 MB/s  (dfast)
 5#silesia.tar       : 211957760 ->  69243649 (3.061), 115.8 MB/s , 905.5 MB/s 
 6#silesia.tar       : 211957760 ->  67767332 (3.128),  78.1 MB/s , 964.5 MB/s 
 7#silesia.tar       : 211957760 ->  67305272 (3.149),  65.4 MB/s , 988.4 MB/s 
 8#silesia.tar       : 211957760 ->  66948973 (3.166),  56.3 MB/s ,1004.1 MB/s 
 9#silesia.tar       : 211957760 ->  66689276 (3.178),  49.2 MB/s ,1016.5 MB/s 
10#silesia.tar       : 211957760 ->  66689276 (3.178),  49.2 MB/s ,1016.6 MB/s
11#silesia.tar       : 211957760 ->  66344747 (3.195),  20.1 MB/s ,1027.4 MB/s (btlazy2)

new proposed:

 4#silesia.tar       : 211957760 ->  71259831 (2.974), 168.4 MB/s , 910.4 MB/s (dfast)
 5#silesia.tar       : 211957760 ->  68812273 (3.080), 108.7 MB/s , 920.2 MB/s 
 6#silesia.tar       : 211957760 ->  68494934 (3.095),  98.6 MB/s , 931.4 MB/s 
 7#silesia.tar       : 211957760 ->  67375761 (3.146),  69.7 MB/s , 980.9 MB/s 
 8#silesia.tar       : 211957760 ->  67088863 (3.159),  62.9 MB/s , 993.8 MB/s 
 9#silesia.tar       : 211957760 ->  66948973 (3.166),  56.3 MB/s ,1004.4 MB/s 
10#silesia.tar       : 211957760 ->  66689276 (3.178),  49.2 MB/s ,1016.8 MB/s 
11#silesia.tar       : 211957760 ->  66344747 (3.195),  20.1 MB/s ,1027.4 MB/s (btlazy2)

Params <= 16K are unchanged, since we don't use the row matchfinder by default for those levels.
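For context, the size classes used above (>256K, <=256K, <=128K, <=16K) correspond to per-bucket default cParams tables. A minimal sketch of how a bucket is selected: the bucket expression mirrors the tableID computation in lib/compress/zstd_compress.c, while the function and macro names around it are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_KB (1u << 10)

/* Hedged sketch: zstd keeps one default cParams table per srcSize bucket.
 * Bucket 0: >256K, 1: <=256K, 2: <=128K, 3: <=16K.
 * The sum-of-comparisons trick mirrors the tableID computation in
 * lib/compress/zstd_compress.c; the surrounding code is illustrative. */
static unsigned cparams_tableID(size_t srcSizeHint) {
    return (srcSizeHint <= 256 * SKETCH_KB)
         + (srcSizeHint <= 128 * SKETCH_KB)
         + (srcSizeHint <=  16 * SKETCH_KB);
}
```

Each comparison contributes 1 when it holds, so the result lands directly on the smallest matching size class without any branching.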

@Cyan4973
Contributor

Cyan4973 commented Jun 1, 2021

Thanks for this rebalancing @senhuang42 , which already looks like a nice improvement.

I agree that the main issue is that compression speed (and ratio) plateaus at level 11, leaving a big gap to level 13.
I also find the decreasing slope a bit too gentle; we should have bigger speed differences between levels. But this is likely a side-effect of the row-hash speed limitation.

I'm wondering if this speed limitation is related to maximum row size.
A larger rowLog should cost more cpu.
Would it improve compression ratio, and by how much, that's another question.

Finally, any potential opportunity to bring level 4 to greedy?

@senhuang42
Author

senhuang42 commented Jun 1, 2021

I'm wondering if this speed limitation is related to maximum row size.
A larger rowLog should cost more cpu.
Would it improve compression ratio, and by how much, that's another question.

Yeah, I think this is definitely worth taking a look at (though it would require additional refactoring of the row-hash code to support 64-entry rows). I'm not sure how much it can improve ratio, though, since cLevel 13 == btlazy2, with its much more powerful search, still only improves ratio by less than 1% over cLevel 12.

Finally, any potential opportunity to bring level 4 to greedy?

greedy at max speed settings (searchLog=1, hLog=18) is still worse than dfast in both speed and ratio:

 4#silesia.tar       : 211957760 ->  65507109 (3.236), 179.5 MB/s , 983.5 MB/s 
 5#silesia.tar       : 211957760 ->  65605004 (3.231), 129.8 MB/s , 960.2 MB/s

@senhuang42 senhuang42 closed this Jun 1, 2021
@senhuang42 senhuang42 reopened this Jun 1, 2021
@senhuang42 senhuang42 force-pushed the rebalance_clevel branch 3 times, most recently from bac275b to 6140a64 on June 3, 2021 08:30
@senhuang42
Author

senhuang42 commented Jun 3, 2021

I've added support for 64-entry rows to further assist the rebalancing effort. Now the compression ratio gap between max row-hash settings and btlazy2 is pretty much bridged, and the speed gap is smaller.

Note that this PR is now rebased on top of #2681. The idea is to get #2681 merged first; then we can make further adjustments to how the 64-entry rows are integrated in this PR, after the adjustments/refactors to some row-hash internal APIs in #2681 land. So as of now, the code that adds 64-entry rows in this PR is just a WIP that exists to have a working version.

>256K:

 5#silesia.tar       : 211957760 ->  62473463 (3.393), 106.2 MB/s ,1006.6 MB/s 
 6#silesia.tar       : 211957760 ->  61461315 (3.449),  89.8 MB/s ,1010.1 MB/s 
 7#silesia.tar       : 211957760 ->  60459438 (3.506),  71.2 MB/s ,1069.4 MB/s 
 8#silesia.tar       : 211957760 ->  59989973 (3.533),  60.1 MB/s ,1083.9 MB/s 
 9#silesia.tar       : 211957760 ->  59707605 (3.550),  50.4 MB/s ,1094.0 MB/s 
10#silesia.tar       : 211957760 ->  59157938 (3.583),  45.7 MB/s ,1091.2 MB/s 
11#silesia.tar       : 211957760 ->  58644580 (3.614),  37.0 MB/s ,1105.4 MB/s 
12#silesia.tar       : 211957760 ->  58243087 (3.639),  27.2 MB/s ,1116.7 MB/s

<=256K:

 4#silesia.tar       : 211957760 ->  67558139 (3.137), 115.8 MB/s ,1012.8 MB/s 
 5#silesia.tar       : 211957760 ->  66836666 (3.171), 103.3 MB/s ,1033.0 MB/s 
 6#silesia.tar       : 211957760 ->  66219074 (3.201),  85.7 MB/s ,1060.9 MB/s 
 7#silesia.tar       : 211957760 ->  65481211 (3.237),  69.4 MB/s ,1001.6 MB/s 
 8#silesia.tar       : 211957760 ->  65039023 (3.259),  54.6 MB/s ,1027.9 MB/s 
 9#silesia.tar       : 211957760 ->  64715335 (3.275),  46.7 MB/s ,1043.0 MB/s 
10#silesia.tar       : 211957760 ->  64498612 (3.286),  36.2 MB/s ,1053.8 MB/s

<=128K:

 5#silesia.tar       : 211957760 ->  69243649 (3.061), 114.2 MB/s , 906.1 MB/s 
 6#silesia.tar       : 211957760 ->  67767332 (3.128),  81.1 MB/s , 965.3 MB/s 
 7#silesia.tar       : 211957760 ->  67305272 (3.149),  66.0 MB/s , 989.2 MB/s 
 8#silesia.tar       : 211957760 ->  66948973 (3.166),  57.1 MB/s ,1005.0 MB/s 
 9#silesia.tar       : 211957760 ->  66689276 (3.178),  50.5 MB/s ,1017.3 MB/s 
10#silesia.tar       : 211957760 ->  66525512 (3.186),  40.2 MB/s ,1025.9 MB/s

@senhuang42 senhuang42 force-pushed the rebalance_clevel branch 12 times, most recently from 7a58f13 to 746ac8d on June 11, 2021 11:47
@senhuang42 senhuang42 marked this pull request as ready for review June 12, 2021 13:21
#define ZSTD_ROW_HASH_CACHE_MASK (ZSTD_ROW_HASH_CACHE_SIZE - 1)

- typedef U32 ZSTD_VecMask; /* Clarifies when we are interacting with a U32 representing a mask of matches */
+ typedef U64 ZSTD_VecMask; /* Clarifies when we are interacting with a U64 representing a mask of matches */
Contributor

what is the performance impact on 32-bit builds?

Author

Something like a ~15% regression.

rebalanced 32-bit:
 5#silesia.tar       : 211957760 ->  62473463 (3.393),  72.5 MB/s , 596.1 MB/s 
 6#silesia.tar       : 211957760 ->  61461315 (3.449),  61.0 MB/s , 596.9 MB/s 
 7#silesia.tar       : 211957760 ->  60459438 (3.506),  47.6 MB/s , 629.6 MB/s 
 8#silesia.tar       : 211957760 ->  59989973 (3.533),  38.0 MB/s , 638.0 MB/s 
 9#silesia.tar       : 211957760 ->  59707605 (3.550),  35.9 MB/s , 644.7 MB/s 
10#silesia.tar       : 211957760 ->  59157938 (3.583),  33.5 MB/s , 644.0 MB/s 
11#silesia.tar       : 211957760 ->  58644580 (3.614),  25.3 MB/s , 652.9 MB/s 
12#silesia.tar       : 211957760 ->  58243087 (3.639),  18.6 MB/s , 660.0 MB/s

dev 32-bit with same params:
 5#silesia.tar       : 211957760 ->  62473463 (3.393),  78.8 MB/s , 593.5 MB/s 
 6#silesia.tar       : 211957760 ->  61461315 (3.449),  70.1 MB/s , 594.7 MB/s 
 7#silesia.tar       : 211957760 ->  60459438 (3.506),  52.9 MB/s , 627.3 MB/s 
 8#silesia.tar       : 211957760 ->  59989973 (3.533),  48.0 MB/s , 635.8 MB/s 
 9#silesia.tar       : 211957760 ->  59707605 (3.550),  42.4 MB/s , 642.4 MB/s 
10#silesia.tar       : 211957760 ->  59157938 (3.583),  39.6 MB/s , 641.8 MB/s 
11#silesia.tar       : 211957760 ->  58644580 (3.614),  31.8 MB/s , 650.4 MB/s 
12#silesia.tar       : 211957760 ->  58644580 (3.614),  31.6 MB/s , 650.6 MB/s

Contributor

I don't care about 32-bit builds... but branching on rowEntries in ZSTD_VecMask_next() would probably recover most of the performance in the 16/32-entry case. In the 64-entry case, on 32-bit platforms, it would be faster to process each 32-bit half separately (which doesn't really fit the current ZSTD_VecMask_next() abstraction).

Contributor

matches &= (matches - 1) could be pulled into its own function ... one that branches on rowEntries and sizeof(size_t)
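The two review suggestions above can be sketched together: a pair of helpers that branch on rowEntries, so 16/32-entry rows stay on cheap 32-bit operations and 64-entry masks are handled half by half on narrow targets. The function names and structure are illustrative only, not zstd's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t ZSTD_VecMask;

/* Index of the lowest set bit. Precondition: m != 0.
 * Hypothetical sketch: branching on rowEntries keeps 16/32-entry rows on
 * 32-bit arithmetic, which is cheap even on 32-bit builds. */
static unsigned VecMask_next(ZSTD_VecMask m, unsigned rowEntries) {
    if (rowEntries <= 32) {
        /* Mask fits in 32 bits. */
        uint32_t lo = (uint32_t)m;
        unsigned n = 0;
        while ((lo & 1) == 0) { lo >>= 1; ++n; }  /* portable stand-in for a 32-bit ctz */
        return n;
    } else {
        /* 64-entry rows: on 32-bit platforms it is faster to inspect each
         * 32-bit half separately. */
        uint32_t const lo = (uint32_t)m;
        if (lo != 0) return VecMask_next(lo, 32);
        return 32 + VecMask_next((uint32_t)(m >> 32), 32);
    }
}

/* "matches &= matches - 1" pulled into its own helper: clear the lowest set
 * bit, branching on rowEntries so narrow rows never touch the upper half. */
static ZSTD_VecMask VecMask_pop(ZSTD_VecMask m, unsigned rowEntries) {
    if (rowEntries <= 32) {
        uint32_t const lo = (uint32_t)m;
        return (ZSTD_VecMask)(lo & (lo - 1));
    }
    return m & (m - 1);
}
```

In real code the 32-bit path would use a hardware ctz intrinsic rather than the portable loop; the point is only that the branch lets 32-bit builds avoid 64-bit mask arithmetic entirely for 16/32-entry rows.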

@senhuang42 senhuang42 force-pushed the rebalance_clevel branch 2 times, most recently from 9fede99 to 3c7fb73 on June 14, 2021 16:25
@terrelln
Contributor

What's the status of this PR?

@senhuang42
Author

What's the status of this PR?

Added the minor optimization for 32-bit mode: level 5 silesia.tar speed goes from 72 to 77 MB/s, so the 32-bit regression is now only 2-3%, and level 12 is 16% faster.

@senhuang42 senhuang42 merged commit 6a25804 into facebook:dev Aug 6, 2021
@sebres
Contributor

sebres commented Aug 30, 2021

Unfortunately, this rebalancing causes a massive performance "regression" on HTML/XML/JSON and other highly compressible text files (due to their repetitive structure, tags, etc.), in exchange for a marginal increase in compression ratio.
Here is an example of how it looks on a 131 MB XML file:

+ $ _x64-before-gh-2692/zstd --long -b5 -e12 -i5s -T0 dblp.xml
- $ _x64--after-gh-2692/zstd --long -b5 -e12 -i5s -T0 dblp.xml
+  5#dblp.xml          : 131 MiB -> 21.0 MiB (6.240),  284.2 MB/s, 1259.1 MB/s
-  5#dblp.xml          : 131 MiB -> 20.4 MiB (6.438),  264.7 MB/s, 1313.9 MB/s |  -7.37% 
+  6#dblp.xml          : 131 MiB -> 20.4 MiB (6.439),  270.7 MB/s, 1268.1 MB/s
-  6#dblp.xml          : 131 MiB -> 19.8 MiB (6.624),  164.9 MB/s, 1305.3 MB/s | -64.16% 
+  7#dblp.xml          : 131 MiB -> 19.6 MiB (6.708),  230.1 MB/s, 1381.4 MB/s
-  7#dblp.xml          : 131 MiB -> 19.2 MiB (6.821),  183.0 MB/s, 1397.5 MB/s | -25.74% 
+  8#dblp.xml          : 131 MiB -> 19.3 MiB (6.792),  190.1 MB/s, 1418.5 MB/s
-  8#dblp.xml          : 131 MiB -> 19.0 MiB (6.887),  137.7 MB/s, 1409.0 MB/s | -38.05% 
+  9#dblp.xml          : 131 MiB -> 19.0 MiB (6.901),  153.1 MB/s, 1439.7 MB/s
-  9#dblp.xml          : 131 MiB -> 18.7 MiB (7.027),  133.0 MB/s, 1357.8 MB/s | -15.11% 
+ 10#dblp.xml          : 131 MiB -> 18.7 MiB (7.027),  133.7 MB/s, 1338.6 MB/s
- 10#dblp.xml          : 131 MiB -> 18.4 MiB (7.147),  117.2 MB/s, 1259.0 MB/s | -14.08% 
+ 11#dblp.xml          : 131 MiB -> 18.4 MiB (7.147),  117.7 MB/s, 1254.1 MB/s
- 11#dblp.xml          : 131 MiB -> 18.1 MiB (7.232),   86.6 MB/s, 1261.8 MB/s | -35.91% 
+ 12#dblp.xml          : 131 MiB -> 18.1 MiB (7.232),   87.6 MB/s, 1263.7 MB/s
- 12#dblp.xml          : 131 MiB -> 18.0 MiB (7.300),   57.9 MB/s, 1275.6 MB/s | -51.30% 

Especially at levels 6 and 12, it is more than 50% slower now.
Even single-threaded, it is still obvious:

+ $ _x64-before-gh-2692/zstd --long -b6,12 -i3s -T1 dblp.xml
- $ _x64--after-gh-2692/zstd --long -b6,12 -i3s -T1 dblp.xml
+  6#dblp.xml          : 131 MiB -> 20.1 MiB (6.531),  108.7 MB/s, 1178.9 MB/s
-  6#dblp.xml          : 131 MiB -> 19.3 MiB (6.798),   77.0 MB/s, 1107.1 MB/s | -41.17%
+ 12#dblp.xml          : 131 MiB -> 17.5 MiB (7.487),   34.5 MB/s, 1211.3 MB/s
- 12#dblp.xml          : 131 MiB -> 17.4 MiB (7.547),   24.2 MB/s, 1217.9 MB/s | -42.56%

# without `--long`:
+ $ _x64-before-gh-2692/zstd -b12 -i3s -T1 dblp.xml
- $ _x64--after-gh-2692/zstd -b12 -i3s -T1 dblp.xml
+ 12#dblp.xml          : 131 MiB -> 18.5 MiB (7.108),   39.9 MB/s, 1318.1 MB/s
- 12#dblp.xml          : 131 MiB -> 18.3 MiB (7.173),   26.6 MB/s, 1289.1 MB/s | -50.00%

@Cyan4973
Contributor

Cyan4973 commented Aug 30, 2021

For inter-version speed comparison, prefer comparing current dev with v1.4.9.

v1.5.0 is more of an anomaly that seriously impacted the shape of the speed curve (while generally preserving the same compression ratio as v1.4.9 for a given level).
We are trying to get back to a "normal" speed curve, compared to previous versions of zstd, where a compression level corresponds more or less to a speed budget into which we try to cram as much compression ratio as possible.
For a simple example, note how the "new" level 11 is essentially identical to v1.5.0 level 12. Yet, compared to v1.4.9, it's both faster and compresses more.

If the speed is not fast enough, just lower the compression level, typically by one notch.
It's normal for compression ratio to "plateau" once the "easy" part of the job is done. With the new algorithm, we plateau much faster, so it's reasonable to select a lower compression level to benefit from higher speed while only marginally impacting compression ratio.

I don't have your file to reproduce exactly, but we do have silesia/xml, which is public and seems to belong to a similar category.
Indeed, the compression ratio is fairly high for this type of file, so conclusions do not necessarily extend to other types of files.
Anyway, here is a comparison, on a Core i7-9700K (turbo off), in increasing order of compression ratio.

Version Level C.Speed C.Ratio
v1.4.9 5 225 MB/s 8.715
v1.4.9 6 191 MB/s 9.200
dev 5 188 MB/s 9.569
v1.4.9 7 131 MB/s 9.681
dev 6 156 MB/s 9.824
v1.4.9 8 105 MB/s 9.965
dev 7 124 MB/s 10.04
v1.4.9 9 77.7 MB/s 10.22
dev 8 103 MB/s 10.28
dev 9 91.5 MB/s 10.30
v1.4.9 10 73.6 MB/s 10.32
dev 10 85.0 MB/s 10.32
v1.4.9 11 70.2 MB/s 10.32
dev 11 68.9 MB/s 10.53
v1.4.9 12 50.3 MB/s 10.53
dev 12 55 MB/s 10.68

As can be seen, dev compresses more, or compresses faster, or both, compared to v1.4.9, and generally does so at a lower compression level.
For this specific example, I would recommend settling at level 8, because the compression ratio benefits tend to be small past that point.
For v1.4.9, my recommendation would have been level 9, though note that it compresses both less and slower at that point on the graph.

@sebres
Contributor

sebres commented Aug 30, 2021

v1.5.0 is more like an anomaly, that has seriously impacted the shape of the speed curve

Well, based on my tests, this PR introduced an even worse "anomaly" than 1.5.0... Please take a careful look at levels 5-8 in my excerpt above.
Let's say we need a target compression speed near 200 MB/s (for such file types): previously one could use level 8 (or certainly level 7) to reach that, but since this PR was merged, only level 5 is suitable (because the next level is abruptly 60% slower).
Not to mention the strange speed gap at level 6 now (164.9 MB/s), which is actually slower than level 7 (183.0 MB/s)!

I don't have your file to reproduce exactly

Here you go - https://dblp.org/xml/release/ (I used an excerpt of 130MB).

Indeed, the compression ratio is fairly high for this type of file, so conclusions do not necessarily extend to other types of files.

Sure... but exactly these types of files often need good speed (with ratio being less interesting, because it doesn't change very fast, or, as you correctly said, it reaches the "plateau" faster than other files do).
And I wanted to provide an example where a "regression" from this kind of level rebalancing cannot simply be eliminated by switching down to the previous compression level (here it takes a drop of up to 3 levels) to get back to the same speed.

The --fast option is almost unusable here for that purpose, because even with --fast=1 the CR is simply too inefficient.
Possibly a --speed=... parameter, or some mix of -# with --fast=# (and partially --adapt[=min=#,max=#]), would be good here, because there is currently no other way to regulate the speed, short of fine tuning with --zstd=....

@senhuang42
Author

senhuang42 commented Aug 30, 2021

That's interesting. I've benchmarked the first 2709 MB of the dblp.xml corpus, with the following results.
Also, it's a little confusing, but --fast=# settings are basically negative compression levels. If you want a bit less compression ratio with better speed, try levels 3-4 rather than the --fast flag. For example, level 4 on the dblp.xml corpus is around 80-100% faster than level 5, with a 10% ratio regression.

commit ae131282 (right before this PR):
6#dblp.xml          : 2.65 GiB -> 443 MiB (6.122),  184.1 MB/s, 1421.5 MB/s
7#dblp.xml          : 2.65 GiB -> 424 MiB (6.392),  110.6 MB/s, 1434.0 MB/s

after rebalance PR:
5#dblp.xml          : 2.65 GiB -> 435 MiB (6.233),  152.1 MB/s, 1485.2 MB/s
6#dblp.xml          : 2.65 GiB -> 424 MiB (6.391),  121.4 MB/s, 1418.5 MB/s 
7#dblp.xml          : 2.65 GiB -> 413 MiB (6.553),  104.8 MB/s, 1663.0 MB/s 

So at least in my benchmarking, the new level 6 is equivalent to the old level 7 in compressibility but faster, and the new level 7 compresses better than the old level 7 at a just slightly slower speed. Indeed, we still don't have a perfectly smooth transition from level 4 to 5, but that is the step-function price you pay for switching to an entirely new algorithm.

And of course, the file you are benchmarking matters a lot in how the curve looks - we tune ours based on common corpuses like enwik, silesia, http_archive, gh_users, etc., so we can't expect the tradeoffs (or the cliff between levels 4 and 5) to look exactly the same on an entirely different file. But even then, it seems to me that it's working mostly fine for this particular case.

Also, the new level 11 is almost identical to the old level 12, so if you desire that exact compressibility/speed tradeoff, use 11 instead of 12. The new 12 is just stronger than the old one.

Let's say we need a target compression speed near to 200 MB/s (for such a file types)

If your use case absolutely requires the ability to hit these exact speeds, then that pushes it into an "advanced scenario" where it makes sense to use a more customized configuration. In that case, you could use the --zstd= flag and re-use the exact parameters found here, from before this PR:

https://github.com/senhuang42/zstd/blob/539b3aab9be79d2b8b0537633ed51ca4214ed40f/lib/compress/zstd_compress.c#L6182

@sebres
Contributor

sebres commented Aug 31, 2021

I'm seeing that the new level 6 is equivalent to the old level 7 in compressibility but faster.

But you also see a large distance between the new and old levels in compression speed. Anyway, the difference between CR 6.3 and 6.1 is almost negligible (ca. 4% of the target size), whereas the difference in speed, 121.4 vs 184.1 MB/s, is significant (more than 50%). And a switch to level 4 is not an option, because the resulting CR "degradation" would also be substantial.
And I guess your test was single-threaded? Multi-threaded, the gap is even sharper.

Also, it's a little confusing, but --fast=# are basically negative compression levels.

Sure, and I was well aware of that; that's why I tried the mildest possible setting, --fast=1, and as already said it is not good enough due to its poor CR. The curves neither cross each other, nor is there a gentle transition between --fast=1 and -1.
In practice it would be good to have --fast=# accept negative numbers, or even the aforementioned --speed=# parameter focused on compression speed (with higher precedence for it).

If you want a bit less compression ratio with better speed, try levels 3-4 rather than the --fast flag.

It was not about what I can try, but about which characteristics zstd gets by default. I know what I could do here.
Again, I see a somewhat uglier "anomaly" than before this PR...
Trying to illustrate it with a chart comparing before/after for -b4 -e14 -T4 (upper lines) and -b4 -e14 -T1 (lower lines):
(chart image)
I think it's obvious, isn't it?
Calculated in percent (as size saving), the whole chart stretches even further to the right (at least its left part), and the whole X-axis spans only 83% to 86% (so we are talking about roughly 1% of size saving for an almost twofold performance "degradation" at several levels). Below are both charts zoomed to levels 5-10, with percent on the X-axis and logarithmic speed on the Y-axis:

zstd -b4 -e14 -i3s -T4 dblp.xml (zoomed to l. 5-10)

(chart image)

zstd -b4 -e14 -i3s -T1 dblp.xml (zoomed to l. 5-10)

(chart image)

So the anomaly of v1.5.0 that Yann mentioned is still there, but in my opinion it has become more abnormal than before, at least as far as the level-to-CS/CR distribution is concerned.

I am aware that this is basically true only for XML files like this one (and probably other text files), but such files are often the typical "victims" of compression tools (much more often than silesia.tar-like content).
By the same token, considering silesia.tar alone for rebalancing purposes may not be fully correct either (or at least not sufficient), isn't it?

re-use the exact parameters found here ...

Basically, it is almost enough to reduce the slog parameter (decrease it to the previous values), at least for the levels that kept the same strategy.
Anyway, a test with --zstd=slog=$prevslog[$level] shows almost the same performance as 1.5.0, as well as a smoother CS/CR distribution, in my case.

@senhuang42
Author

senhuang42 commented Aug 31, 2021

But also you see a large distance between new and old levels regarding the compression speed.

First, I appreciate the chart visualizing the differences - it's definitely helpful and presents the findings in an easily digestible way. However, I still can't reproduce level 7 being faster than 6.

And yes, in this particular case, we see that the cliff is larger. You make really great points about the issues around the cliff, and maybe we should take a closer look at corpuses with more structured data. But also consider that all of this is happening on this one particular file. The search and matchfinding algorithm is entirely different between levels 4 and 5. We can make a best effort to smooth out the difference, but it's impossible to maintain the exact same "cliff magnitude" across very different file classes.

It was not about what I can try, but which characteristic values zstd would get by default

If this is your main concern, rather than getting a specific performance profile for your specific use case, then I think we necessarily have to expand this discussion beyond the scope of this single file. Compared to 1.4.9 in general, the cliff from 4 to 5 is still smaller, and compression ratio is a lot better.

Compared to just before this PR, the cliff is a tiny bit larger, but compression ratio is better. Consider cLevels as speed targets: the purpose of this PR was to bring 1.5.0 zstd more in line with the 1.4.9 speed targets, so slowing down level 5 in favor of more compression ratio serves that goal.

I am aware that this is basically true only for xml-file like that (or probably other text files), but such files are often typical "victims" of compression tools (much often than silesia.tar similar stuff). As well as the considering of silesia.tar solely for rebalance purposes may be not fully correct too (rather not sufficient enough), isn't it?

Definitely: using silesia.tar as the "ultimate benchmarking corpus" won't produce ideal results, but it gives us an okay idea of overall performance (and has translated fairly well into production results across a massive fleet that compresses a wide variety of things). Above I presented silesia.tar results, but as mentioned, we also sanity-check against enwik, http_archive, and gh_users, the latter two of which contain more structured data (though their smaller file sizes make us hit different parameter sets).

Basically it is almost enough to reduce slog parameter (decrease it to previous values), at least for levels that kept the same strategy.
Anyway a test with --zstd=slog=$prevslog[$level] shows almost the same performance (like 1.5.0) as well as more smooth CS/CR distribution in my case.

Yeah, that makes sense - slog is a very significant performance factor for this new algorithm. This PR generally bumps up slog values that were < 4, because if slog < 4 we are essentially wasting a bit of memory (due to the nature of the new algorithm), so keeping slog that low makes the ratio/speed trade-off worse just to retain a bit more speed.

And it appears that on this file in particular, having more searches is particularly costly without as much benefit.
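The slog < 4 point can be made concrete with a small sketch, under the assumption stated above that rows come only in 16/32/64-entry sizes (the function name is hypothetical, not zstd's API):

```c
#include <assert.h>

/* Illustrative sketch: the row matchfinder stores 2^rowLog candidate tags
 * per hash row, and only 16/32/64-entry rows exist, so the effective rowLog
 * derived from searchLog is clamped to [4, 6]. A searchLog below 4 therefore
 * still allocates a 16-entry row -- the memory is spent either way -- which
 * is why this PR bumps slog values that were < 4. */
static unsigned rowLog_fromSearchLog(unsigned searchLog) {
    if (searchLog < 4) return 4;  /* minimum: 16 entries per row */
    if (searchLog > 6) return 6;  /* maximum: 64 entries per row (added in this PR) */
    return searchLog;
}
```

So slog values of 1-3 all pay for the same 16-entry row as slog=4, while searching less of it, which is the "wasted memory" referred to above.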

@sebres
Contributor

sebres commented Aug 31, 2021

However, I still can't reproduce level 7 being faster than 6.

Hmm... how did you compile it? I used GCC 11.1 with -Ofast -march=core2 (so it would be able to run on older Xeons), but I also tried native targets (Haswell and Skylake) on my i5/i7 machines, with similar results.
Maybe it is simply some obscure case affecting the strategy switch between ZSTD_greedy and ZSTD_lazy at levels 6 and 7 that cannot be optimized well in this particular case (e.g. some extra cache washout resulting from the slog growth - note that this is more evident multi-threaded).

Anyway, I'll try to reproduce this with other files, run more tests (also with silesia.tar and other material from my test cases), and provide a summary later.

@sebres
Contributor

sebres commented Aug 31, 2021

Here are the results with the charts for silesia.tar...

+ _x64-before-gh-2692/zstd -b4 -e14 -i3s -T4 silesia.tar
 4#silesia.tar       : 202 MiB -> 62.5 MiB (3.233),  241.1 MB/s,  891.8 MB/s
 5#silesia.tar       : 202 MiB -> 60.9 MiB (3.319),  206.3 MB/s,  834.2 MB/s
 6#silesia.tar       : 202 MiB -> 60.1 MiB (3.362),  197.7 MB/s,  877.7 MB/s
 7#silesia.tar       : 202 MiB -> 58.7 MiB (3.443),  171.9 MB/s,  940.2 MB/s
 8#silesia.tar       : 202 MiB -> 58.1 MiB (3.477),  143.5 MB/s,  962.5 MB/s
 9#silesia.tar       : 202 MiB -> 57.2 MiB (3.532),  111.1 MB/s,  965.8 MB/s
10#silesia.tar       : 202 MiB -> 56.6 MiB (3.570),   96.4 MB/s,  940.2 MB/s
11#silesia.tar       : 202 MiB -> 56.5 MiB (3.579),   89.1 MB/s,  933.1 MB/s
12#silesia.tar       : 202 MiB -> 56.0 MiB (3.609),   65.4 MB/s,  945.3 MB/s
13#silesia.tar       : 202 MiB -> 55.4 MiB (3.646),   22.8 MB/s,  982.4 MB/s
14#silesia.tar       : 202 MiB -> 55.0 MiB (3.676),   19.8 MB/s,  956.3 MB/s

- _x64--after-gh-2692/zstd -b4 -e14 -i3s -T4 silesia.tar
 4#silesia.tar       : 202 MiB -> 62.5 MiB (3.233),  237.5 MB/s,  809.9 MB/s
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.389),  172.7 MB/s,  822.1 MB/s
 6#silesia.tar       : 202 MiB -> 58.7 MiB (3.442),  105.7 MB/s,  904.3 MB/s
 7#silesia.tar       : 202 MiB -> 57.8 MiB (3.499),  127.6 MB/s,  961.7 MB/s
 8#silesia.tar       : 202 MiB -> 57.3 MiB (3.525),   93.6 MB/s,  968.4 MB/s
 9#silesia.tar       : 202 MiB -> 57.0 MiB (3.545),   99.4 MB/s,  978.1 MB/s
10#silesia.tar       : 202 MiB -> 56.5 MiB (3.579),   88.5 MB/s,  929.4 MB/s
11#silesia.tar       : 202 MiB -> 56.0 MiB (3.609),   57.2 MB/s,  942.1 MB/s
12#silesia.tar       : 202 MiB -> 55.6 MiB (3.634),   43.2 MB/s,  954.1 MB/s
13#silesia.tar       : 202 MiB -> 55.4 MiB (3.646),   22.9 MB/s,  988.8 MB/s
14#silesia.tar       : 202 MiB -> 55.0 MiB (3.676),   19.7 MB/s,  970.8 MB/s

+ _x64-before-gh-2692/zstd -b4 -e14 -i3s -T1 silesia.tar
 4#silesia.tar       : 202 MiB -> 62.5 MiB (3.236),  158.1 MB/s,  881.3 MB/s
 5#silesia.tar       : 202 MiB -> 60.9 MiB (3.322),  108.7 MB/s,  852.8 MB/s
 6#silesia.tar       : 202 MiB -> 60.1 MiB (3.365),  103.9 MB/s,  875.0 MB/s
 7#silesia.tar       : 202 MiB -> 58.6 MiB (3.447),   77.3 MB/s,  942.1 MB/s
 8#silesia.tar       : 202 MiB -> 58.1 MiB (3.479),   58.7 MB/s,  968.1 MB/s
 9#silesia.tar       : 202 MiB -> 57.2 MiB (3.536),   46.8 MB/s,  982.0 MB/s
10#silesia.tar       : 202 MiB -> 56.6 MiB (3.574),   40.0 MB/s,  839.3 MB/s
11#silesia.tar       : 202 MiB -> 56.4 MiB (3.583),   35.5 MB/s,  890.6 MB/s
12#silesia.tar       : 202 MiB -> 55.9 MiB (3.614),   26.1 MB/s,  919.6 MB/s
13#silesia.tar       : 202 MiB -> 55.4 MiB (3.648),   8.45 MB/s,  991.7 MB/s
14#silesia.tar       : 202 MiB -> 54.9 MiB (3.681),   7.10 MB/s,  971.3 MB/s

- _x64--after-gh-2692/zstd -b4 -e14 -i3s -T1 silesia.tar
 4#silesia.tar       : 202 MiB -> 62.5 MiB (3.236),  156.9 MB/s,  882.4 MB/s
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.393),   93.9 MB/s,  900.4 MB/s
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   70.4 MB/s,  914.0 MB/s
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.506),   58.2 MB/s,  954.1 MB/s
 8#silesia.tar       : 202 MiB -> 57.2 MiB (3.533),   46.2 MB/s,  965.5 MB/s
 9#silesia.tar       : 202 MiB -> 56.9 MiB (3.550),   42.4 MB/s,  953.7 MB/s
10#silesia.tar       : 202 MiB -> 56.4 MiB (3.583),   38.4 MB/s,  936.3 MB/s
11#silesia.tar       : 202 MiB -> 55.9 MiB (3.614),   29.0 MB/s,  945.5 MB/s
12#silesia.tar       : 202 MiB -> 55.5 MiB (3.639),   20.8 MB/s,  955.5 MB/s
13#silesia.tar       : 202 MiB -> 55.4 MiB (3.648),   8.46 MB/s,  989.1 MB/s
14#silesia.tar       : 202 MiB -> 54.9 MiB (3.681),   7.23 MB/s,  968.9 MB/s
Chart for silesia.tar, before (green) / after (red), MT (-T4) and ST (-T1). Y axis: logarithmic C.Speed in MB/s; X axis: linear C.Ratio and Size-Saving in %...

[chart image]

* dashed gray line shows your result from #2692 (comment) for comparison.

It is interesting that multi-threaded level 6 is still slower than level 7 (and the same for 8 vs. 9) in the new rebalanced variant, whereas single-threaded it looks "normal". Something seems to confuse them drastically in the -T4 case (tested on an 8-core machine under 50% load, with no parasitic load). Maybe some longer lock causes it in the case of the enlarged slog.

Adding the --long parameter (thus increasing the wlog to 27) shows a similar outcome and makes it simply more obvious:

Results for --long silesia.tar...
+ _x64-before-gh-2692/zstd -b4 -e14 -i3s --long -T4 silesia.tar
 4#silesia.tar       : 202 MiB -> 62.2 MiB (3.251),  222.9 MB/s,  905.8 MB/s
 5#silesia.tar       : 202 MiB -> 60.8 MiB (3.322),  199.4 MB/s,  884.0 MB/s
 6#silesia.tar       : 202 MiB -> 59.9 MiB (3.377),  192.2 MB/s,  899.0 MB/s
 7#silesia.tar       : 202 MiB -> 58.5 MiB (3.454),  166.1 MB/s,  946.8 MB/s
 8#silesia.tar       : 202 MiB -> 57.9 MiB (3.489),  139.1 MB/s,  960.8 MB/s
 9#silesia.tar       : 202 MiB -> 57.1 MiB (3.539),  106.1 MB/s,  958.9 MB/s
10#silesia.tar       : 202 MiB -> 56.6 MiB (3.574),   93.0 MB/s,  927.8 MB/s
11#silesia.tar       : 202 MiB -> 56.2 MiB (3.597),   83.8 MB/s,  892.2 MB/s
12#silesia.tar       : 202 MiB -> 55.7 MiB (3.626),   61.9 MB/s,  902.9 MB/s
13#silesia.tar       : 202 MiB -> 55.4 MiB (3.648),   23.7 MB/s,  952.7 MB/s
14#silesia.tar       : 202 MiB -> 54.8 MiB (3.687),   19.6 MB/s,  911.9 MB/s

- _x64--after-gh-2692/zstd -b4 -e14 -i3s --long -T4 silesia.tar
 4#silesia.tar       : 202 MiB -> 62.2 MiB (3.251),  213.5 MB/s,  840.0 MB/s
 5#silesia.tar       : 202 MiB -> 59.8 MiB (3.383),  178.8 MB/s,  910.1 MB/s
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.449),  106.2 MB/s,  904.0 MB/s
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),  120.8 MB/s,  903.8 MB/s
 8#silesia.tar       : 202 MiB -> 57.3 MiB (3.528),   89.2 MB/s,  914.7 MB/s
 9#silesia.tar       : 202 MiB -> 56.6 MiB (3.574),   89.4 MB/s,  940.8 MB/s
10#silesia.tar       : 202 MiB -> 56.2 MiB (3.597),   82.4 MB/s,  874.9 MB/s
11#silesia.tar       : 202 MiB -> 55.7 MiB (3.626),   57.9 MB/s,  907.5 MB/s
12#silesia.tar       : 202 MiB -> 55.4 MiB (3.649),   38.3 MB/s,  910.7 MB/s
13#silesia.tar       : 202 MiB -> 55.4 MiB (3.648),   23.6 MB/s,  940.4 MB/s
14#silesia.tar       : 202 MiB -> 54.8 MiB (3.687),   19.4 MB/s,  912.6 MB/s

+ _x64-before-gh-2692/zstd -b4 -e14 -i3s --long -T1 silesia.tar
 4#silesia.tar       : 202 MiB -> 61.6 MiB (3.281),  115.3 MB/s,  835.1 MB/s
 5#silesia.tar       : 202 MiB -> 60.4 MiB (3.349),   94.8 MB/s,  849.9 MB/s
 6#silesia.tar       : 202 MiB -> 59.6 MiB (3.390),   86.6 MB/s,  857.7 MB/s
 7#silesia.tar       : 202 MiB -> 58.2 MiB (3.470),   62.8 MB/s,  849.9 MB/s
 8#silesia.tar       : 202 MiB -> 57.7 MiB (3.502),   52.6 MB/s,  948.9 MB/s
 9#silesia.tar       : 202 MiB -> 56.7 MiB (3.566),   43.3 MB/s,  919.2 MB/s
10#silesia.tar       : 202 MiB -> 56.2 MiB (3.595),   38.5 MB/s,  897.4 MB/s
11#silesia.tar       : 202 MiB -> 56.0 MiB (3.610),   35.3 MB/s,  882.3 MB/s
12#silesia.tar       : 202 MiB -> 55.5 MiB (3.640),   28.1 MB/s,  900.2 MB/s
13#silesia.tar       : 202 MiB -> 55.1 MiB (3.667),   8.22 MB/s,  903.7 MB/s
14#silesia.tar       : 202 MiB -> 54.7 MiB (3.698),   6.96 MB/s,  901.9 MB/s

- _x64--after-gh-2692/zstd -b4 -e14 -i3s --long -T1 silesia.tar
 4#silesia.tar       : 202 MiB -> 61.6 MiB (3.281),  114.0 MB/s,  836.8 MB/s
 5#silesia.tar       : 202 MiB -> 59.2 MiB (3.414),   83.9 MB/s,  886.9 MB/s
 6#silesia.tar       : 202 MiB -> 58.2 MiB (3.472),   62.9 MB/s,  844.4 MB/s
 7#silesia.tar       : 202 MiB -> 57.2 MiB (3.534),   53.4 MB/s,  909.5 MB/s
 8#silesia.tar       : 202 MiB -> 56.8 MiB (3.557),   44.0 MB/s,  940.0 MB/s
 9#silesia.tar       : 202 MiB -> 56.2 MiB (3.595),   38.4 MB/s,  905.9 MB/s
10#silesia.tar       : 202 MiB -> 56.0 MiB (3.610),   35.2 MB/s,  888.2 MB/s
11#silesia.tar       : 202 MiB -> 55.5 MiB (3.640),   27.0 MB/s,  887.3 MB/s
12#silesia.tar       : 202 MiB -> 55.2 MiB (3.663),   19.3 MB/s,  898.8 MB/s
13#silesia.tar       : 202 MiB -> 55.1 MiB (3.667),   8.20 MB/s,  913.3 MB/s
14#silesia.tar       : 202 MiB -> 54.7 MiB (3.698),   6.86 MB/s,  890.5 MB/s
Chart for --long silesia.tar, before (green) / after (red), MT (-T4) and ST (-T1). Y axis: logarithmic C.Speed in MB/s; X axis: linear C.Ratio and Size-Saving in %...

[chart image]

Anyway, considering the whole picture independently of the --long parameter, the old variant looks preferably distributed to me and illustrates better-balanced distances between levels.
At least until this strange bottleneck in the new variant gets eliminated somehow algorithmically.

sebres added a commit to sebres/zstd that referenced this pull request Aug 31, 2021
the calculation of rowLog in e411040 was not implemented correctly: it was always 4 no matter how large `slog` is; now it is [4 .. 6] depending on `slog`
sebres added a commit to sebres/zstd that referenced this pull request Aug 31, 2021
the calculation of rowLog in e411040 was not implemented correctly: it was growing unrestricted with `slog`; now it is [4 .. 6] depending on `slog`
@senhuang42
Author

Thanks for the benchmarks!
A few notes:

  1. I notice decent speed perturbations in your level 4 measurements, which should not differ between builds. I suggest using CPU shielding to get better stability in your results, or running more trials and looking at the average, median, and p90 results.
  2. Benchmarking compression levels with --long for the sake of comparison generally isn't too informative, since it changes the wlog to a fixed value for all levels; we're therefore not really exercising the compression levels anyway.
  3. We optimize zstd for single-threaded performance in the library (--single-thread is different from -T1: -T1 uses the multithreaded code path, but with 1 worker). Ultimately we expect multithreaded performance to have different performance characteristics, and we can't optimally tune for both.
  4. I still cannot reproduce your speed regressions on an i9-9900K @ 3.6 GHz, or an MBP 15" i9. The speed gap from 6 to 7 tightens with -T2 and -T1, but level 7 is still not slower on my end. This is using gcc with make -j zstd. Try some different compilers and machines as well.
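On Linux, a minimal form of the shielding suggested in point 1 is to pin the benchmark to fixed cores with taskset; the core numbers below are arbitrary examples:

```shell
# Pin the benchmark to cores 2 and 3 so the scheduler does not migrate
# it mid-run; pick cores that are otherwise idle on your machine.
taskset -c 2,3 ./zstd -b4 -e14 -i3s -T1 silesia.tar
```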

Finally:
Compression levels are just a heuristic. Advanced use-cases tend to require more work for the user. If your use-case is advanced enough to require specific speed/ratio targets and the defaults are not satisfactory, then you can use the advanced parameters. You can use paramgrill on your specific corpus/file/cpu and get the best results tuned specifically to that file on your CPU. That would actually probably be the best bet for your case, creating a set of params best suited for dblp.xml.
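As a sketch of that advanced-parameter route, a full set of cparams can be pinned from the CLI; every value below is an arbitrary placeholder, not a tuned recommendation:

```shell
# Force explicit compression parameters instead of a level; the short
# names map to windowLog, chainLog, hashLog, searchLog, minMatch and
# strategy. All numbers here are illustrative only.
zstd --zstd=wlog=23,clog=23,hlog=22,slog=5,mml=4,strat=5 -f dblp.xml

# The same override works with the built-in benchmark:
zstd -b --zstd=wlog=23,clog=23,hlog=22,slog=5 dblp.xml
```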

@sebres
Contributor

sebres commented Sep 1, 2021

I notice decent speed perturbations on your level 4 measurements which should not be different. I suggest using cpu shielding to get better stability in your results.

Regarding final results, sure, you are right... But for interim tests it is almost negligible, as long as the results are reasonably stable with a small measurement error (e.g. the level 4 values: 241/238 is ~1%, which is surely an acceptable tolerance, and 158/157 is even ~0.5%), especially when repeated many times with the same tendency.
And I noticed the same in your results for level 4 (e.g. 179.5/176.7 ~ 1.5%), so I suggest... well, you know :)
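The tolerance estimates quoted above can be reproduced with a one-line helper; this is a hedged sketch of the arithmetic, not anything from the zstd codebase:

```c
#include <math.h>

/* Relative difference between two speed measurements, in percent of
 * their mean -- the rough measurement-error estimate discussed above. */
static double rel_diff_pct(double a, double b)
{
    return 100.0 * fabs(a - b) / ((a + b) / 2.0);
}
```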

Ultimately we expect multithreaded performance to have different performance characteristics, and we can't optimally tune for both.

Yes, but rebalancing parameters that previously showed good characteristics in both single- and multi-threaded modes to new values with a known degradation of multithreaded performance would probably not be reasonable, would it?

By the way, would it be advisable to introduce some additional rules or another variant of ZSTD_defaultCParameters for multithreaded compression?

I still cannot reproduce your speed regressions on an i9-9900K

Well, one difference on my side (i5-6500) would be the CPU cache size: 6 MB (i5) vs. 16 MB (i9). A larger CPU cache means fewer cache misses and washouts, fewer cache refreshes on inter-thread switches thanks to the larger number of pages in the TLB, etc. Typically it results in less frequent memory access; but in the other case, an inappropriate parameter could cause O(1) to rapidly degrade to O(n), or O(n) to O(n^2), especially if the algorithm uses memory in a way unsuited to the cache granularity, for instance.
Another difference is how I compile it (-Ofast and -march on my side vs. -O2 on yours): my executable is more optimized than yours, so I guess it is generally much faster (and therefore able to show the bottleneck much more clearly).

Compression levels are just a heuristic. Advanced use-cases tend to require more work for the user.

We speak about defaults all the time, so let us concentrate on this, please.

That would actually probably be the best bet for your case, creating a set of params best suited for dblp.xml.

I'm not trying to find an optimal parameter set for dblp.xml-like files. Not to mention that the results for silesia.tar, as you can see above, are almost the same.
Additionally, the new values are neither reasonably better at solving the known "anomaly" of v1.5.0, nor do they really show normally distributed level characteristics (CS/CR).
To be more precise, here are the results of your own measurements from your first comments for silesia.tar (rebalanced values from #2692 (comment)):

Chart for silesia.tar, v.1.4.9 (blue), v.1.5.0 (green) and rebalanced (red). Y axis: logarithmic C.Speed in MB/s; X axis: linear C.Ratio and Size-Saving in %...

[chart image]

I could by no stretch of the imagination see any real benefit of the new rebalanced values compared to v1.5.0, except that the new l.6-l.8 are slightly better than the old l.7-l.9, and for the other levels vice versa. And all that on your CPU (while on others, like mine, the picture changes), let alone other weird things like the now-enormous distance between l.4 and l.5, etc.

@Cyan4973
Contributor

Cyan4973 commented Sep 1, 2021

I presume you are looking for "regularity" on the horizontal axis, which is more natural to read.
In which case, swap your axis, to have log(speed) on horizontal, and compression ratio on vertical.
You'll notice that rebalanced does a slightly better job at spreading compression levels across more regular speed distances.
It's not perfect, and doesn't replicate v1.4.9, first because v1.4.9 is itself an approximation, and second because the new algorithm can't be as slow as the old one, leaving a gap at the end, between levels 12 and 13.

Regarding level 4, the distance around this level is big because level 4 is a bit "too fast". We would prefer a slower yet more powerful level 4, but we were unable to find good parameters for this specific speed range. Level 4 is essentially a "souped up" level 3, consuming more memory, which ends up a little slower and stronger, but nowhere near the distance we would ideally want level 4 to cover. When a better setup exists, it will take this spot. In the meantime, we do what we can with what we have.

@sebres
Contributor

sebres commented Sep 1, 2021

I presume you are looking for "regularity" on the horizontal axis, which is more natural to read.

No. Basically I was confused by the speed regressions bothering me at several levels; then I noticed that the levels are not normally distributed along the speed axis (no matter whether vertical or horizontal). Moreover, it shows drastic bottlenecks in MT mode, which also don't help the search for regularity.

I could rotate it however you want; it changes nothing for me - I neither see a "slightly better job" here, nor do I think the distances are more "regular" now.

leaving a gap at the end, between levels 12 and 13.

Well, then the rebalancing for levels 5-10 (or 11) would probably not really be necessary, would it?

I understand what you mean, but my primary issue is that I cannot reproduce @senhuang42's picture; I almost always see pictures like in #2692 (comment), which are pretty irregular.
