SpinLock.TryEnter fail fast for timeout 0 by benaadams · Pull Request #6944 · dotnet/coreclr

benaadams · 2016-08-26T22:19:54Z

Previously, a timeout of 0 would Interlocked.Add to register as a waiter and then CAS-spin to unregister immediately afterwards; now it exits before registering as a waiter, so it skips both.
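A minimal sketch of the fail-fast ordering described above, assuming a toy lock rather than the real SpinLock internals. `ToySpinLock`, its fields, and `WaiterRegistrations` are hypothetical names used only for illustration. The point is the order of operations: one acquire attempt, then the zero-timeout exit, and only then any waiter bookkeeping.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical toy lock for illustration; this is NOT the coreclr SpinLock code.
public sealed class ToySpinLock
{
    private int _owned;          // 0 = free, 1 = held
    private int _waiters;        // threads currently registered as waiting
    private int _registrations;  // how often a waiter was registered (illustration only)

    public int WaiterRegistrations => Volatile.Read(ref _registrations);

    // millisecondsTimeout: 0 = do not wait, negative = wait forever.
    public bool TryEnter(int millisecondsTimeout)
    {
        // Fast path: a single attempt to take the lock.
        if (Interlocked.CompareExchange(ref _owned, 1, 0) == 0)
            return true;

        // Fail fast: with a zero timeout there is nothing to wait for,
        // so return before touching the waiter count at all.
        if (millisecondsTimeout == 0)
            return false;

        // Slow path: register as a waiter, spin, then unregister.
        Interlocked.Increment(ref _waiters);
        Interlocked.Increment(ref _registrations);
        try
        {
            var elapsed = Stopwatch.StartNew();
            var spinner = new SpinWait();
            while (millisecondsTimeout < 0 || elapsed.ElapsedMilliseconds < millisecondsTimeout)
            {
                if (Interlocked.CompareExchange(ref _owned, 1, 0) == 0)
                    return true;
                spinner.SpinOnce();
            }
            return false;
        }
        finally
        {
            Interlocked.Decrement(ref _waiters);
        }
    }

    public void Exit() => Volatile.Write(ref _owned, 0);
}
```

With the lock already held, `TryEnter(0)` on this toy type returns false without performing either interlocked update of the waiter count, which is the saving the change is after.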

Added a Thread.SpinWait(1) when the thread didn't yield.

Changed the spin calculation to do the same thing (fairly sure) but in fewer operations per iteration.

Changed the exit mechanism to use break rather than inline returns, as it generates less asm for popping the registers; it makes the code less clear though :-/

Improved handling for high waiter count; though it would be in the billion range so hopefully would never be hit on a spinwait anyway (otherwise something has gone very wrong).

Moved yielding to the start of the yield loop; as if you've got there you've already just tried to acquire the lock.

Moved the time check after spinning into the if; if you've skipped spinning you haven't really done anything yet, so there is no point in checking whether you have timed out by >= 1 millisecond (and a zero timeout fast-paths at the start).

Changed the spin type thresholds to be powers of 2 and changed the %/idiv/mod to &. Which means: Sleep(1) moved from 40 -> 64; Sleep(0) moved from 10 -> 16
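A rough sketch of why power-of-two thresholds allow the modulo to become a mask: for a power of two n, `x % n == 0` and `(x & (n - 1)) == 0` agree, so the division can be dropped. The type and constant names below are hypothetical, the values are the 64 and 16 quoted above, and the three-way choice between Sleep(1), Sleep(0) and Yield is an assumption about the shape of the yield loop, not a copy of it.

```csharp
using System;

// Hypothetical illustration; names and structure are assumptions, not the PR's code.
public static class BackoffSketch
{
    public enum Backoff { Yield, SleepZero, SleepOne }

    private const int SleepOneFrequency = 64;   // was 40 before the change
    private const int SleepZeroFrequency = 16;  // was 10 before the change

    // Threshold test written with the remainder operator (needs a division).
    public static Backoff WithModulo(int yieldsSoFar)
    {
        if (yieldsSoFar % SleepOneFrequency == 0) return Backoff.SleepOne;
        if (yieldsSoFar % SleepZeroFrequency == 0) return Backoff.SleepZero;
        return Backoff.Yield;
    }

    // Same test written with a bit mask; valid because both thresholds are powers of 2.
    public static Backoff WithMask(int yieldsSoFar)
    {
        if ((yieldsSoFar & (SleepOneFrequency - 1)) == 0) return Backoff.SleepOne;
        if ((yieldsSoFar & (SleepZeroFrequency - 1)) == 0) return Backoff.SleepZero;
        return Backoff.Yield;
    }
}
```

The mask form only works for power-of-two thresholds, which is why the thresholds had to move from 40 and 10 to 64 and 16.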

Added comments throughout, also corrected some strange spellings.

Trims the asm from 1107 bytes of instructions to 890; and jit local vars from 31 to 24 (more loc, less tmp and less cse)

Some of the rearrangement also changes indentation, so viewing the diff with &w=1 (ignore whitespace) is better for a compare.

Passes corefx tests

1M iterations (single thread, uncontended but with the lock already held) of the following code:

bool lockTaken = false;
var s = new SpinLock(false);
s.Enter(ref lockTaken);

method	pre (ms)	post (ms)	improvement
s.TryEnter(0, ref lockTaken)	24.55	5.95	x 4.1
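For anyone wanting to reproduce a measurement of this shape, here is a hedged sketch of a timing loop. It is not the harness used for the numbers above; `TryEnterZeroBench` and `Measure` are hypothetical names. It uses a fresh flag for each call because SpinLock.TryEnter requires `lockTaken` to be false on entry.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical micro-benchmark sketch; not the harness behind the figures above.
public static class TryEnterZeroBench
{
    // Returns elapsed milliseconds for `iterations` TryEnter(0) calls on a held lock.
    public static double Measure(int iterations)
    {
        bool lockTaken = false;
        var s = new SpinLock(false);   // false: thread-owner tracking disabled
        s.Enter(ref lockTaken);        // hold the lock for the whole run

        int acquired = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            bool taken = false;        // must be false on entry, otherwise TryEnter throws
            s.TryEnter(0, ref taken);  // expected to fail immediately: the lock is held
            if (taken) acquired++;
        }
        sw.Stop();

        s.Exit();
        if (acquired != 0)
            throw new InvalidOperationException("TryEnter(0) unexpectedly acquired a held lock.");
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Calling `TryEnterZeroBench.Measure(1000000)` corresponds to the 1M iterations in the table; absolute timings will depend on the machine and runtime.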

Adding threadpool perf timings to gist and will post highlights

@stephentoub follow up on #6911

benaadams · 2016-08-26T23:19:11Z

Results using https://github.com/benaadams/ThreadPoolTaskTesting
results gist

Individual results vary due to thread timing and GC, and TrySteal uses a random factor, so the chart trends are more significant than any single number. (A bunch of different sections follow; I am grouping them together by type.)

SubTask Chain Return 
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
SubTask Chain Return      615.388 k     1.027 M     4.280 M     3.964 M     4.547 M
- Depth    2              548.552 k   540.726 k     3.375 M     3.340 M     4.557 M
- Depth   16              545.361 k     1.323 M     4.195 M     4.469 M     5.397 M
- Depth   64              247.245 k     1.787 M     3.118 M     4.696 M     5.510 M
- Depth  512              172.549 k   803.055 k     3.405 M     5.498 M     5.731 M

Post
SubTask Chain Return      702.488 k     1.629 M     4.319 M     4.314 M     4.336 M
- Depth    2              731.134 k     1.957 M     4.530 M     4.665 M     4.651 M
- Depth   16              975.378 k     2.413 M     4.836 M     5.498 M     5.534 M
- Depth   64              972.742 k     2.541 M     5.291 M     5.552 M     5.603 M
- Depth  512              989.954 k     2.578 M     5.277 M     5.585 M     5.684 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
SubTask Chain Return      960.999 k     1.179 M     3.479 M     4.361 M     4.614 M
- Depth    2                1.108 M     1.281 M     3.736 M     4.268 M     4.941 M
- Depth   16                1.303 M     1.548 M     3.988 M     5.475 M     5.774 M
- Depth   64                1.385 M     1.575 M     4.064 M     5.335 M     5.815 M
- Depth  512                1.095 M     2.810 M     4.828 M     5.739 M     5.781 M

Post
SubTask Chain Return      904.786 k   990.460 k     3.251 M     4.159 M     4.386 M
- Depth    2                1.027 M     1.232 M     3.657 M     4.464 M     4.834 M
- Depth   16                1.232 M     1.475 M     3.901 M     5.385 M     5.763 M
- Depth   64                1.053 M     2.075 M     4.172 M     5.384 M     5.827 M
- Depth  512                1.175 M     2.851 M     4.735 M     5.590 M     5.749 M

Some regression

i7 4 core 8 HT
Pre
SubTask Chain Return      615.606 k   813.288 k     4.670 M     5.121 M     5.612 M
- Depth    2              696.861 k   871.519 k     5.813 M     5.705 M     5.986 M
- Depth   16              354.586 k   746.973 k     5.190 M     6.362 M     6.378 M
- Depth   64              435.186 k   778.422 k     5.410 M     5.533 M     6.383 M
- Depth  512              464.164 k   730.233 k     5.205 M     6.814 M     6.632 M

Post
SubTask Chain Return      604.834 k     1.133 M     5.359 M     5.641 M     5.724 M
- Depth    2              691.719 k     1.281 M     6.097 M     5.827 M     6.237 M
- Depth   16              724.503 k     1.617 M     6.529 M     6.639 M     6.847 M
- Depth   64              736.150 k     1.650 M     6.758 M     6.930 M     6.841 M
- Depth  512              824.113 k     1.765 M     6.943 M     6.858 M     6.964 M

Improvements across the board

i7 4 core 8 HT - 500 min thread
Pre
SubTask Chain Return      552.093 k   655.964 k     2.293 M     4.387 M     5.167 M
- Depth    2              597.881 k   699.831 k     3.099 M     5.356 M     5.658 M
- Depth   16              690.066 k   769.582 k     4.011 M     5.486 M     6.395 M
- Depth   64              749.703 k   800.779 k     3.956 M     5.607 M     6.452 M
- Depth  512              645.235 k     1.495 M     5.354 M     6.382 M     6.429 M

Post
SubTask Chain Return      618.259 k   619.092 k     2.446 M     4.590 M     5.327 M
- Depth    2              652.892 k   711.321 k     2.991 M     5.413 M     5.714 M
- Depth   16              741.617 k   827.076 k     3.781 M     5.636 M     6.414 M
- Depth   64              769.178 k   783.729 k     4.486 M     6.226 M     6.475 M
- Depth  512              687.981 k     1.434 M     6.388 M     6.554 M     6.634 M

General Improvements

benaadams · 2016-08-26T23:32:43Z

SubTask Chain Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
SubTask Chain Awaited     433.582 k   698.576 k     2.355 M     2.459 M     2.470 M
- Depth    2              447.282 k   754.095 k     2.417 M     2.626 M     2.730 M
- Depth   16              384.794 k   933.787 k     2.202 M     2.670 M     2.930 M
- Depth   64              453.963 k     1.106 M     2.340 M     2.778 M     2.870 M
- Depth  512              507.166 k   960.942 k     2.625 M     2.510 M     2.855 M

Post
SubTask Chain Awaited     472.009 k     1.241 M     2.416 M     2.618 M     2.664 M
- Depth    2              565.431 k     1.511 M     2.638 M     2.813 M     2.817 M
- Depth   16              792.927 k     1.761 M     3.002 M     3.317 M     3.271 M
- Depth   64              775.420 k     1.820 M     3.179 M     3.169 M     3.346 M
- Depth  512              873.529 k     1.876 M     2.920 M     3.179 M     3.107 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
SubTask Chain Awaited     603.821 k   702.970 k     1.794 M     2.377 M     2.729 M
- Depth    2              653.331 k   713.702 k     1.871 M     2.693 M     2.895 M
- Depth   16              739.070 k   797.390 k     2.197 M     2.793 M     3.229 M
- Depth   64              797.874 k     1.210 M     2.484 M     3.003 M     3.260 M
- Depth  512              758.383 k     1.546 M     2.777 M     2.971 M     2.941 M

Post
SubTask Chain Awaited     596.226 k   665.390 k     1.788 M     2.294 M     2.687 M
- Depth    2              651.716 k   696.489 k     1.873 M     2.602 M     2.909 M
- Depth   16              729.769 k   924.021 k     2.329 M     2.935 M     3.347 M
- Depth   64              807.734 k     1.292 M     2.528 M     3.076 M     3.345 M
- Depth  512              769.498 k     1.687 M     2.701 M     2.878 M     3.099 M

Mixed

i7 4 core 8 HT
Pre
SubTask Chain Awaited     520.123 k   647.306 k     3.040 M     3.299 M     3.626 M
- Depth    2              462.123 k   666.259 k     3.143 M     3.554 M     3.626 M
- Depth   16              413.802 k   934.982 k     3.192 M     3.777 M     4.064 M
- Depth   64              454.067 k   792.816 k     3.330 M     3.584 M     4.014 M
- Depth  512              486.897 k   696.634 k     2.782 M     3.272 M     3.624 M

Post
SubTask Chain Awaited     503.086 k   857.364 k     3.309 M     3.326 M     3.528 M
- Depth    2              514.673 k   920.604 k     3.225 M     3.573 M     3.697 M
- Depth   16              546.228 k     1.257 M     3.673 M     3.908 M     3.981 M
- Depth   64              607.220 k     1.280 M     3.774 M     3.968 M     4.008 M
- Depth  512              709.936 k     1.463 M     3.235 M     3.242 M     3.478 M

Improvements across the board

i7 4 core 8 HT - 500 min thread
Pre
SubTask Chain Awaited     379.956 k   418.250 k     1.942 M     2.399 M     3.190 M
- Depth    2              397.497 k   436.692 k     1.760 M     2.355 M     3.455 M
- Depth   16              454.366 k   484.191 k     1.341 M     2.880 M     3.757 M
- Depth   64              449.875 k   889.808 k     1.931 M     3.119 M     3.819 M
- Depth  512              424.484 k   912.257 k     2.365 M     2.933 M     3.209 M

Post
SubTask Chain Awaited     382.641 k   398.446 k     2.056 M     2.400 M     3.197 M
- Depth    2              404.685 k   447.393 k     2.037 M     2.570 M     3.516 M
- Depth   16              462.104 k   490.836 k     1.285 M     2.884 M     3.795 M
- Depth   64              451.043 k   899.644 k     1.904 M     3.112 M     3.822 M
- Depth  512              429.539 k   946.724 k     2.314 M     2.982 M     3.358 M

Similar

benaadams · 2016-08-27T00:27:25Z

SubTask Fanout Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
SubTask Fanout Awaited    244.211 k   568.096 k     1.384 M     1.404 M     1.399 M
- Depth    2              522.159 k     1.053 M     1.722 M     1.722 M     1.735 M
- Depth   16                1.330 M     2.137 M     2.342 M     2.412 M     2.343 M
- Depth   64                1.789 M     2.281 M     2.487 M     2.507 M     2.488 M
- Depth  512                1.923 M     2.353 M     2.557 M     2.556 M     2.497 M

Post
SubTask Fanout Awaited    287.239 k   816.204 k     1.455 M     1.472 M     1.478 M
- Depth    2              598.324 k     1.310 M     1.832 M     1.863 M     1.856 M
- Depth   16                1.536 M     2.271 M     2.488 M     2.547 M     2.464 M
- Depth   64                1.856 M     2.472 M     2.686 M     2.679 M     2.690 M
- Depth  512                2.048 M     2.528 M     2.715 M     2.769 M     2.711 M

Some Improvement

i5 4 core no HT - 500 min thread
Pre
SubTask Fanout Awaited    346.810 k   380.862 k   953.233 k     1.236 M     1.453 M
- Depth    2              497.513 k   521.983 k     1.272 M     1.643 M     1.842 M
- Depth   16              906.260 k     1.209 M     2.179 M     2.458 M     2.404 M
- Depth   64                1.088 M     1.250 M     2.291 M     2.567 M     2.602 M
- Depth  512                1.121 M     1.516 M     2.430 M     2.605 M     2.593 M

Post
SubTask Fanout Awaited    346.806 k   382.608 k   833.971 k     1.251 M     1.400 M
- Depth    2              496.223 k   558.188 k     1.250 M     1.675 M     1.844 M
- Depth   16              931.701 k     1.165 M     2.191 M     2.485 M     2.483 M
- Depth   64                1.129 M     1.394 M     2.357 M     2.627 M     2.590 M
- Depth  512                1.184 M     1.450 M     2.441 M     2.672 M     2.619 M

Similar

i7 4 core 8 HT
Pre
SubTask Fanout Awaited    252.416 k   385.534 k     1.831 M     2.062 M     2.167 M
- Depth    2              395.468 k   844.312 k     2.556 M     2.589 M     2.632 M
- Depth   16                1.369 M     2.439 M     3.316 M     3.377 M     3.315 M
- Depth   64                1.763 M     2.797 M     3.457 M     3.510 M     3.420 M
- Depth  512                2.014 M     2.898 M     3.498 M     3.496 M     3.522 M

Post
SubTask Fanout Awaited    280.779 k   516.919 k     2.079 M     2.089 M     2.090 M
- Depth    2              418.349 k     1.039 M     2.415 M     2.519 M     2.543 M
- Depth   16                1.382 M     2.456 M     3.164 M     3.247 M     3.182 M
- Depth   64                1.872 M     2.783 M     3.271 M     3.317 M     3.284 M
- Depth  512                2.147 M     2.899 M     3.420 M     3.355 M     3.373 M

Mixed

i7 4 core 8 HT - 500 min thread
Pre
SubTask Fanout Awaited    229.300 k   240.043 k   839.135 k     1.444 M     1.925 M
- Depth    2              342.694 k   367.394 k     1.381 M     1.758 M     2.444 M
- Depth   16              683.237 k   750.378 k     2.494 M     2.881 M     3.200 M
- Depth   64              837.230 k   877.774 k     2.660 M     3.043 M     3.234 M
- Depth  512              887.270 k   912.580 k     2.848 M     2.973 M     3.092 M

Post
SubTask Fanout Awaited    238.340 k   257.873 k   900.434 k     1.510 M     1.946 M
- Depth    2              355.021 k   386.566 k     1.330 M     1.905 M     2.453 M
- Depth   16              687.540 k   750.731 k     2.414 M     2.978 M     3.074 M
- Depth   64              797.330 k   951.923 k     2.581 M     3.058 M     3.229 M
- Depth  512              877.081 k     1.113 M     2.784 M     3.036 M     3.180 M

Similar

benaadams · 2016-08-27T00:47:38Z

Continuation Chain
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
Continuation Chain        150.844 k   396.001 k     1.376 M     1.445 M     1.489 M
- Depth    2              349.510 k   881.096 k     2.158 M     2.603 M     2.602 M
- Depth   16              806.582 k     2.819 M     6.803 M     7.105 M     7.040 M
- Depth   64                1.246 M     4.002 M     8.628 M     8.530 M     8.385 M
- Depth  512                1.139 M     4.659 M     8.354 M     8.039 M     7.982 M

Post
Continuation Chain        240.058 k   644.372 k     1.574 M     1.603 M     1.592 M
- Depth    2              395.840 k     1.098 M     2.767 M     2.743 M     2.753 M
- Depth   16              941.757 k     3.001 M     7.380 M     7.372 M     7.311 M
- Depth   64                1.121 M     3.823 M     8.920 M     8.903 M     8.787 M
- Depth  512                1.215 M     4.120 M     9.444 M     9.368 M     9.309 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
Continuation Chain        364.895 k   393.347 k   999.829 k     1.378 M     1.589 M
- Depth    2              609.613 k   665.320 k     2.180 M     2.565 M     2.754 M
- Depth   16                1.643 M     1.885 M     7.073 M     7.191 M     7.208 M
- Depth   64                2.159 M     2.604 M     8.177 M     8.539 M     8.534 M
- Depth  512                2.619 M     3.775 M     8.871 M     9.109 M     9.006 M

Post
Continuation Chain        371.949 k   402.795 k     1.008 M     1.358 M     1.599 M
- Depth    2              623.992 k   677.816 k     2.133 M     2.584 M     2.729 M
- Depth   16                1.662 M     1.931 M     7.047 M     7.287 M     7.277 M
- Depth   64                2.166 M     2.466 M     7.913 M     8.679 M     8.651 M
- Depth  512                2.672 M     3.365 M     9.000 M     9.216 M     9.143 M

Mixed/Improve

i7 4 core 8 HT
Pre
Continuation Chain        180.548 k   251.513 k     2.131 M     2.195 M     2.188 M
- Depth    2              297.147 k   421.529 k     2.925 M     3.684 M     3.681 M
- Depth   16              667.285 k     1.060 M     8.785 M     8.705 M     8.659 M
- Depth   64              801.636 k     1.205 M    10.172 M    10.028 M    10.014 M
- Depth  512              815.435 k     1.323 M     8.936 M    10.179 M     9.822 M

Post
Continuation Chain        197.906 k   413.404 k     2.048 M     2.080 M     2.129 M
- Depth    2              320.715 k   685.434 k     3.586 M     3.496 M     3.549 M
- Depth   16              740.831 k     1.540 M     8.183 M     8.144 M     8.146 M
- Depth   64              870.924 k     1.818 M     9.605 M     9.576 M     9.531 M
- Depth  512              916.890 k     1.912 M     9.919 M     9.955 M     9.755 M

Mostly Improved

i7 4 core 8 HT - 500 min thread
Pre
Continuation Chain        232.648 k   242.818 k   744.910 k     1.554 M     2.030 M
- Depth    2              383.585 k   399.078 k     1.434 M     2.808 M     3.432 M
- Depth   16              961.601 k     1.051 M     7.114 M     8.001 M     6.631 M
- Depth   64                1.255 M     1.441 M     7.812 M     9.026 M     9.600 M
- Depth  512                1.430 M     1.620 M     9.122 M     9.506 M     9.533 M

Post
Continuation Chain        241.044 k   260.005 k   923.966 k     1.604 M     2.068 M
- Depth    2              387.877 k   422.354 k     1.600 M     2.889 M     3.449 M
- Depth   16              964.811 k     1.063 M     5.804 M     7.894 M     8.224 M
- Depth   64                1.247 M     1.497 M     7.797 M     9.059 M     9.435 M
- Depth  512                1.440 M     1.360 M     9.178 M     9.643 M     9.509 M

Mostly Improved

benaadams · 2016-08-27T00:54:17Z

Continuation Fanout
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
Continuation Fanout       133.352 k   320.861 k   989.129 k     1.120 M     1.122 M
- Depth    2              202.924 k   631.553 k     1.605 M     1.741 M     1.805 M
- Depth   16              937.335 k     1.853 M     4.320 M     4.300 M     4.321 M
- Depth   64                1.845 M     2.452 M     5.397 M     5.258 M     5.210 M
- Depth  512                2.742 M     3.868 M     5.554 M     5.733 M     5.528 M

Post
Continuation Fanout       196.591 k   558.073 k     1.134 M     1.148 M     1.153 M
- Depth    2              395.617 k     1.006 M     1.784 M     1.786 M     1.835 M
- Depth   16                1.974 M     3.629 M     4.398 M     4.346 M     4.323 M
- Depth   64                3.368 M     5.087 M     5.390 M     5.388 M     5.313 M
- Depth  512                3.892 M     5.612 M     5.616 M     5.650 M     5.679 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
Continuation Fanout       265.336 k   292.083 k   553.332 k   934.454 k     1.145 M
- Depth    2              414.743 k   442.279 k     1.111 M     1.482 M     1.686 M
- Depth   16                1.184 M     1.555 M     3.781 M     4.061 M     4.146 M
- Depth   64                1.697 M     2.107 M     4.965 M     5.097 M     5.156 M
- Depth  512                1.639 M     2.521 M     5.220 M     5.438 M     5.410 M

Post
Continuation Fanout       259.659 k   283.807 k   604.670 k   910.191 k     1.120 M
- Depth    2              408.807 k   439.765 k     1.171 M     1.534 M     1.658 M
- Depth   16                1.162 M     1.525 M     3.936 M     4.244 M     4.193 M
- Depth   64                1.827 M     2.292 M     5.068 M     5.166 M     5.208 M
- Depth  512                1.777 M     2.850 M     5.278 M     5.101 M     5.470 M

Generally improved

i7 4 core 8 HT
Pre
Continuation Fanout       167.245 k   242.750 k     1.541 M     1.589 M     1.633 M
- Depth    2              189.065 k   411.591 k     2.151 M     2.442 M     2.546 M
- Depth   16                1.162 M     2.604 M     5.645 M     5.608 M     5.620 M
- Depth   64                1.992 M     4.694 M     6.578 M     6.514 M     6.515 M
- Depth  512                1.960 M     4.778 M     6.679 M     6.856 M     6.914 M

Post
Continuation Fanout       166.887 k   342.927 k     1.525 M     1.555 M     1.588 M
- Depth    2              279.829 k   676.619 k     2.276 M     2.369 M     2.422 M
- Depth   16                1.647 M     3.597 M     5.266 M     5.263 M     5.205 M
- Depth   64                1.899 M     4.429 M     6.073 M     6.068 M     6.041 M
- Depth  512                1.826 M     4.462 M     6.179 M     6.297 M     6.352 M

Some regression

i7 4 core 8 HT - 500 min thread
Pre
Continuation Fanout       176.889 k   189.812 k   633.173 k     1.062 M     1.457 M
- Depth    2              262.854 k   282.592 k     1.067 M     1.631 M     2.218 M
- Depth   16              871.006 k   888.298 k     4.070 M     4.947 M     5.228 M
- Depth   64                1.079 M     1.320 M     4.848 M     3.746 M     6.045 M
- Depth  512                1.029 M     1.168 M     5.116 M     6.286 M     5.959 M

Post
Continuation Fanout       178.117 k   190.339 k   533.092 k     1.079 M     1.481 M
- Depth    2              274.058 k   276.105 k     1.379 M     1.863 M     2.271 M
- Depth   16              837.715 k   988.226 k     4.059 M     4.942 M     5.195 M
- Depth   64                1.063 M     1.163 M     5.286 M     5.815 M     6.008 M
- Depth  512              964.998 k     1.463 M     3.813 M     5.905 M     5.899 M

Some regression

benaadams · 2016-08-27T01:02:58Z

Yield Chain Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
Yield Chain Awaited       726.850 k     1.657 M     4.014 M     4.041 M     4.072 M
- Depth    2              904.511 k     2.197 M     4.601 M     4.617 M     4.921 M
- Depth   16                1.316 M     4.273 M     5.977 M     6.086 M     5.955 M
- Depth   64                2.305 M     4.647 M     6.198 M     6.120 M     5.847 M
- Depth  512                2.646 M     5.246 M     6.135 M     5.922 M     4.847 M

Post
Yield Chain Awaited       734.307 k     1.774 M     4.025 M     4.005 M     4.078 M
- Depth    2                1.073 M     2.604 M     4.519 M     4.786 M     4.910 M
- Depth   16                2.008 M     4.589 M     5.910 M     5.947 M     5.968 M
- Depth   64                2.392 M     5.033 M     6.206 M     6.054 M     5.867 M
- Depth  512                2.520 M     5.350 M     6.130 M     5.859 M     4.844 M

Similar

i5 4 core no HT - 500 min thread
Pre
Yield Chain Awaited       726.850 k     1.657 M     4.014 M     4.041 M     4.072 M
- Depth    2              904.511 k     2.197 M     4.601 M     4.617 M     4.921 M
- Depth   16                1.316 M     4.273 M     5.977 M     6.086 M     5.955 M
- Depth   64                2.305 M     4.647 M     6.198 M     6.120 M     5.847 M
- Depth  512                2.646 M     5.246 M     6.135 M     5.922 M     4.847 M

Post
Yield Chain Awaited       734.307 k     1.774 M     4.025 M     4.005 M     4.078 M
- Depth    2                1.073 M     2.604 M     4.519 M     4.786 M     4.910 M
- Depth   16                2.008 M     4.589 M     5.910 M     5.947 M     5.968 M
- Depth   64                2.392 M     5.033 M     6.206 M     6.054 M     5.867 M
- Depth  512                2.520 M     5.350 M     6.130 M     5.859 M     4.844 M

Similar

i7 4 core 8 HT
Pre
Yield Chain Awaited       742.757 k     1.196 M     5.133 M     5.180 M     6.125 M
- Depth    2              895.473 k     1.750 M     4.453 M     5.342 M     7.016 M
- Depth   16                1.270 M     2.632 M     5.751 M     7.500 M     7.589 M
- Depth   64                1.389 M     3.062 M     7.408 M     7.289 M     7.189 M
- Depth  512                1.360 M     3.220 M     7.098 M     6.938 M     6.097 M

Post
Yield Chain Awaited       716.754 k     1.170 M     4.969 M     5.012 M     5.776 M
- Depth    2              864.528 k     1.750 M     4.586 M     5.095 M     6.584 M
- Depth   16                1.295 M     2.941 M     5.971 M     7.078 M     7.306 M
- Depth   64                1.324 M     3.187 M     6.644 M     6.965 M     6.947 M
- Depth  512                1.328 M     3.175 M     6.922 M     6.764 M     5.780 M

Similar

i7 4 core 8 HT - 500 min thread
Pre
Yield Chain Awaited       740.148 k   820.592 k     3.253 M     4.770 M     5.765 M
- Depth    2              885.823 k   934.844 k     3.401 M     5.006 M     6.691 M
- Depth   16                1.237 M     1.374 M     4.895 M     7.197 M     7.401 M
- Depth   64                1.763 M     2.115 M     7.189 M     7.115 M     6.289 M
- Depth  512                1.847 M     2.471 M     6.297 M     6.811 M     4.975 M

Post
Yield Chain Awaited       751.905 k   817.659 k     3.333 M     4.805 M     5.723 M
- Depth    2              892.455 k   965.113 k     3.250 M     4.981 M     6.758 M
- Depth   16                1.209 M     1.429 M     4.837 M     7.090 M     7.384 M
- Depth   64                1.763 M     2.034 M     6.819 M     6.962 M     6.982 M
- Depth  512                1.840 M     2.363 M     6.828 M     6.508 M     4.738 M

Similar

benaadams · 2016-08-27T01:15:30Z

Async Chain Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
Async Chain Awaited       667.010 k     1.066 M     2.698 M     3.290 M     3.427 M
- Depth    2                1.079 M     1.707 M     4.558 M     5.069 M     5.155 M
- Depth   16                2.025 M     4.175 M     7.862 M     8.132 M     8.082 M
- Depth   64                2.121 M     4.281 M     8.181 M     8.436 M     8.548 M
- Depth  512                2.224 M     4.229 M     6.627 M     6.726 M     8.136 M

Post
Async Chain Awaited       675.204 k     1.294 M     3.245 M     3.371 M     3.357 M
- Depth    2                1.053 M     1.957 M     4.487 M     4.737 M     4.955 M
- Depth   16                2.086 M     4.200 M     7.830 M     8.046 M     8.028 M
- Depth   64                2.121 M     4.311 M     8.262 M     8.673 M     8.622 M
- Depth  512                2.263 M     4.557 M     6.806 M     7.068 M     8.276 M

Slight improvement

i5 4 core no HT - 500 min thread
Pre
Async Chain Awaited       814.231 k   971.207 k     2.639 M     3.163 M     3.255 M
- Depth    2                1.178 M     1.366 M     3.844 M     4.842 M     5.055 M
- Depth   16                1.900 M     2.462 M     5.705 M     7.624 M     7.868 M
- Depth   64                2.091 M     4.300 M     7.063 M     8.118 M     8.244 M
- Depth  512                2.096 M     3.948 M     6.362 M     6.438 M     8.273 M

Post
Async Chain Awaited       820.718 k   930.799 k     2.407 M     3.241 M     3.443 M
- Depth    2                1.170 M     1.308 M     3.640 M     4.187 M     4.974 M
- Depth   16                1.885 M     2.278 M     6.228 M     7.428 M     7.967 M
- Depth   64                2.048 M     3.987 M     7.209 M     8.235 M     8.288 M
- Depth  512                2.098 M     4.025 M     6.611 M     6.647 M     8.018 M

Slight improvement

i7 4 core 8 HT
Pre
Async Chain Awaited       550.323 k   805.145 k     4.696 M     4.723 M     4.448 M
- Depth    2                1.222 M     1.511 M     6.661 M     6.695 M     6.721 M
- Depth   16                2.122 M     3.232 M    10.887 M    10.924 M    10.872 M
- Depth   64                2.626 M     3.743 M    11.674 M    11.688 M    11.672 M
- Depth  512                2.487 M     4.818 M    10.701 M    10.938 M    11.753 M

Post
Async Chain Awaited       550.662 k   923.651 k     4.446 M     4.398 M     4.502 M
- Depth    2                1.241 M     1.455 M     5.934 M     6.400 M     6.447 M
- Depth   16                2.172 M     3.229 M    10.462 M    10.646 M    10.709 M
- Depth   64                2.763 M     3.812 M    11.581 M    11.515 M    11.416 M
- Depth  512                2.499 M     4.697 M    10.836 M    10.934 M    11.635 M

Similar

i7 4 core 8 HT - 500 min thread
Pre
Async Chain Awaited       498.238 k   557.739 k     1.449 M     3.662 M     4.255 M
- Depth    2              687.844 k   719.451 k     3.449 M     5.298 M     6.306 M
- Depth   16                1.407 M     2.447 M    10.275 M    10.227 M    10.183 M
- Depth   64                2.388 M     3.425 M    10.658 M    11.123 M    10.963 M
- Depth  512                2.061 M     3.709 M    10.247 M    10.340 M    11.292 M

Post
Async Chain Awaited       525.209 k   485.608 k     1.659 M     2.565 M     4.148 M
- Depth    2              728.118 k   784.423 k     3.815 M     5.217 M     6.249 M
- Depth   16                1.411 M     2.478 M     9.182 M     9.489 M    10.042 M
- Depth   64                2.404 M     3.400 M    10.544 M    10.904 M    11.009 M
- Depth  512                2.086 M     3.764 M    10.349 M    10.510 M    11.325 M

Similar

benaadams · 2016-08-27T01:20:14Z

QUWI Local Queues
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x

i5 4 core no HT
Pre
QUWI Local Queues           9.919 M    10.878 M    11.594 M    11.769 M    12.569 M
- Depth    2                9.733 M     8.484 M     9.122 M     9.175 M     9.773 M
- Depth   16               10.259 M     9.908 M    10.153 M    10.251 M    10.111 M
- Depth   64               10.363 M    10.330 M    10.390 M    10.338 M    10.208 M
- Depth  512               10.390 M    10.358 M    10.417 M    10.313 M    10.399 M

Post
QUWI Local Queues          11.641 M    11.216 M    10.038 M    11.962 M    12.615 M
- Depth    2                9.851 M     8.609 M     9.129 M     9.589 M     9.479 M
- Depth   16               10.448 M    10.226 M    10.272 M    10.286 M    10.195 M
- Depth   64               10.410 M    10.358 M    10.319 M    10.491 M    10.393 M
- Depth  512               10.492 M    10.340 M    10.559 M    10.576 M    10.469 M

Generally improved

i5 4 core no HT - 500 min thread
Pre
QUWI Local Queues          12.616 M    11.392 M     9.527 M    11.639 M    12.307 M
- Depth    2                9.701 M     7.792 M     8.577 M     8.922 M     8.949 M
- Depth   16               10.014 M     9.941 M    10.122 M    10.106 M     8.767 M
- Depth   64               10.348 M    10.159 M    10.361 M    10.379 M    10.243 M
- Depth  512               10.506 M     9.424 M    10.246 M    10.377 M    10.432 M

Post
QUWI Local Queues          12.793 M    11.522 M    10.079 M    11.804 M    11.943 M
- Depth    2                9.877 M     8.447 M     8.950 M     9.430 M     8.963 M
- Depth   16               10.368 M    10.017 M    10.176 M    10.311 M    10.244 M
- Depth   64               10.311 M    10.502 M    10.431 M    10.472 M    10.334 M
- Depth  512               10.503 M    10.562 M    10.420 M    10.490 M    10.463 M

Generally improved

i7 4 core 8 HT
Pre
QUWI Local Queues           4.251 M     4.817 M     6.366 M     7.812 M     8.858 M
- Depth    2                4.504 M     5.076 M     5.963 M     7.407 M     8.038 M
- Depth   16                6.782 M     6.557 M     6.634 M     6.704 M     6.733 M
- Depth   64                6.767 M     6.618 M     6.809 M     6.769 M     6.827 M
- Depth  512                6.814 M     6.818 M     6.713 M     6.838 M     6.611 M

Post
QUWI Local Queues           3.932 M     4.810 M     5.890 M     8.106 M     8.346 M
- Depth    2                4.928 M     4.820 M     6.101 M     7.347 M     7.784 M
- Depth   16                6.756 M     6.596 M     6.554 M     6.729 M     6.624 M
- Depth   64                6.826 M     6.696 M     6.667 M     6.740 M     6.820 M
- Depth  512                6.765 M     6.799 M     6.663 M     6.723 M     6.725 M

Slight regression

i7 4 core 8 HT - 500 min thread
Pre
QUWI Local Queues           7.620 M     7.933 M     7.522 M     7.236 M     7.932 M
- Depth    2                7.036 M     7.566 M     6.363 M     7.225 M     6.275 M
- Depth   16                6.763 M     6.621 M     6.540 M     6.609 M     6.454 M
- Depth   64                6.736 M     6.719 M     6.671 M     6.699 M     6.725 M
- Depth  512                6.782 M     6.668 M     6.611 M     6.633 M     6.661 M

Post
QUWI Local Queues           8.912 M     8.746 M     6.623 M     7.563 M     8.078 M
- Depth    2                7.468 M     7.559 M     6.216 M     7.318 M     7.141 M
- Depth   16                6.645 M     6.554 M     6.501 M     6.686 M     6.607 M
- Depth   64                6.723 M     6.706 M     6.666 M     6.707 M     6.657 M
- Depth  512                6.766 M     6.517 M     6.552 M     6.635 M     6.477 M

Mixed

benaadams · 2016-08-27T01:55:19Z

Added some impressions to the before and after results for the effect on the threadpool; they're by eyeball, so take them with a pinch of salt.

Overall I think this is an improvement there as well.

Still haven't found what heavily impacts QUWI performance on HT (the last set of results; the second CPU is more powerful, but also has HT); my quest continues...

stephentoub · 2016-08-27T12:36:37Z


        // After how many yields, check the timeout
-        private const int TIMEOUT_CHECK_FREQUENCY = 10;
+        private const int TIMEOUT_CHECK_FREQUENCY_MASK = 16;


I'm a little concerned about these changes. I don't remember how much effort went into selecting the values initially, but these could have a real impact on usage, and issues that arise from such changes could be difficult to spot from limited microbenchmark-based testing. Lots of factors impact this, including number of cores, layout of cores, usage patterns, etc.
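For context on why the threshold moves from 10 to a power of two: the PR description says the `%` was replaced with `&`, which only gives the same "every Nth iteration" behaviour when N is a power of two and the mask is N - 1. A minimal sketch of that equivalence follows; the class, constants, and method names are hypothetical and are not taken from the SpinLock source, and how the PR's own constant is consumed is not shown in this excerpt.

```csharp
using System;

// Illustration only: hypothetical names, not the SpinLock implementation.
static class YieldCadence
{
    const int CheckFrequency = 16;             // assumed power-of-two threshold
    const int CheckMask = CheckFrequency - 1;  // low four bits set

    // Division-based test: true on every 16th yield.
    public static bool DueByModulo(int yields) => yields % CheckFrequency == 0;

    // Mask-based test: same answer for non-negative counts, without a division.
    public static bool DueByMask(int yields) => (yields & CheckMask) == 0;

    // Counts how many values in [0, limit) get the same answer from both tests.
    public static int CountAgreements(int limit)
    {
        int agreements = 0;
        for (int i = 0; i < limit; i++)
        {
            if (DueByModulo(i) == DueByMask(i)) agreements++;
        }
        return agreements;
    }
}
```

Calling `YieldCadence.CountAgreements(1000)` should report agreement on all 1000 values. The old threshold of 10 has no such mask, which is why the values had to change to use `&` at all.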

stephentoub · 2016-08-27T12:40:47Z

@benaadams, thanks for the obvious effort you've put into this. I have to say, though, I started looking through it, and I'm feeling uneasy about this change. There's a lot that's rolled up into it, when the initial goal here was just to remove the bulk of the additional work when a timeout of 0 was provided. The numbers you shared for throughput improvement in that case don't seem significantly different between the initial measurements, from when this was just a few lines changed, and now, when there are several hundred lines changed. I get nervous when such a low-level, threading-related type is changed in this manner. What's the bare minimum change necessary to achieve the bulk of the benefits? Other incremental changes could be considered on their own after that; I would prefer not to roll all such changes together.

cc: @kouvel, @ericeil

benaadams · 2016-08-27T13:24:30Z

Min changes would look something like #6952, though it still does extra work, such as checking whether the timeout has passed after a single CAS/Increment, which is unlikely to take >= 1ms (the minimum value for that test to fail).
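The fail-fast idea can be sketched on a toy lock as below. This is an assumption-laden illustration, not the code in either PR: `ToySpinLock`, its single `_state` field, and the use of `Stopwatch` and `SpinWait` are all choices made here for brevity, and the real type also tracks waiters and owner identity.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Toy lock for illustration only; this is not the SpinLock source.
sealed class ToySpinLock
{
    private int _state; // 0 = free, 1 = held

    // Negative timeouts are treated as "wait forever" in this sketch.
    public bool TryEnter(int millisecondsTimeout)
    {
        // Always make one attempt to take the lock.
        if (Interlocked.CompareExchange(ref _state, 1, 0) == 0) return true;

        // Fail fast: with a zero timeout, return before any waiter
        // bookkeeping, spinning, or clock reads take place.
        if (millisecondsTimeout == 0) return false;

        var elapsed = Stopwatch.StartNew();
        var spinner = new SpinWait();
        while (true)
        {
            spinner.SpinOnce();
            if (Interlocked.CompareExchange(ref _state, 1, 0) == 0) return true;

            // Only consult the clock after some waiting has actually happened.
            if (millisecondsTimeout > 0 && elapsed.ElapsedMilliseconds >= millisecondsTimeout)
            {
                return false;
            }
        }
    }

    public void Exit() => Volatile.Write(ref _state, 0);
}
```

With the lock already held, `TryEnter(0)` in this sketch costs one failed compare-and-swap and a branch, which is the shape of saving the zero-timeout path is after.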

stephentoub · 2016-08-27T13:33:25Z

> though it still is doing extra

And how does the throughput improvement from that for TryEnter(0, ...) compare to all of these changes?

benaadams · 2016-08-27T13:45:42Z

Probably close... added second change for overchecking the timeout.

Will run tests, though I imagine most of the gains were from the fail fast path.
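A comparison like the one being asked for could be reproduced with a small harness along these lines. It follows the setup in the PR description (a `SpinLock` created with owner tracking off, already held, then repeated `TryEnter(0, ...)` calls), but the class and method names are hypothetical and a single `Stopwatch` pass is only a rough measurement.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Rough timing sketch; names are hypothetical and results are machine dependent.
static class TryEnterBench
{
    // Returns elapsed milliseconds for `iterations` TryEnter(0) calls on a
    // lock that is already held; `acquired` counts unexpected successes.
    public static double MeasureHeldTryEnter(int iterations, out int acquired)
    {
        var s = new SpinLock(false); // thread owner tracking disabled
        bool held = false;
        s.Enter(ref held);

        acquired = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            bool lockTaken = false; // must be false before each call
            s.TryEnter(0, ref lockTaken);
            if (lockTaken) acquired++;
        }
        sw.Stop();

        s.Exit();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Running `TryEnterBench.MeasureHeldTryEnter(1_000_000, out int acquired)` against builds with and without each change would give figures comparable in spirit to the pre/post table in the description; `acquired` should stay at zero because the lock is held throughout.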

SpinLock.TryEnter fail fast for timeout 0

30c134f

dnfclas added the cla-already-signed label Aug 26, 2016

stephentoub reviewed Aug 27, 2016

benaadams mentioned this pull request Aug 27, 2016

SpinLock.TryEnter fail fast for timeout 0 #6952

Merged

benaadams closed this Aug 27, 2016

benaadams deleted the spinlock-failfast branch March 27, 2018 05:11

benaadams mentioned this pull request Jan 31, 2020

Monitor.TryEnter should fail fast for timeout 0 dotnet/runtime#6573

Closed
